Bio-Informatics R&D and IDM in the Next Decade

 

Co-Chairs: Alfonso F. Cardenas, UCLA and Mark Embrechts, Rensselaer Polytechnic Institute

 

 

Mission: Contribute to meeting Bioxx (biomedical, bioinformatics, biological, biodiversity, ...) challenges through collaboration and innovation in both the Bioxx and CS IDM areas.

 

Recommendations:

 

The importance, magnitude and relevance to our lives of the Bioxx challenges to which CS and IDM can and should contribute are evident by many measures. To further this contribution, we recommend the following directions and actions for IDM R&D in Bio-Informatics over the next several years:

 

 

Facilitate interdisciplinary R&D collaboration by

Improving the review team structure and process to include both domain and CS experts

Providing funding venues for interdisciplinary collaboration

Meeting of the minds workshops of Bioxx and IDM

Linking IDM scientists with Bioxx peers at the early stages of projects

 

The development of the digital patient, which poses major CS and IDM challenges, including

Logical integration of widely dispersed and heterogeneous information

Linking a variety of widely different autonomous files and data bases

Wireless communication of the mobile patient

Real time body sensors generating much information

A patient, which may be one being, several beings, part of an ecosystem, or an entire ecosystem

 

Modeling of Bioxx information

Atlases, classifications and taxonomies of beings, organs, and diseases

Conceptual modeling and process modeling of diverse and complex multimedia and their behavior

Flexibility in schema evolution while maintaining data independence

Automated and semi-automated means to discover the schema, which is often not known

 

Baseline data bases/libraries

Data bases, hypotheses, and R&D for comparison of approaches

Patient information and advice on diseases and treatment

 

Support for verification and quality control of simulation models

 

 

 

Data mining

Automated search of data for information, knowledge, wisdom, and innovation in large datasets.

Textual and non-textual structural pattern matching and pattern discovery

DNA mining and protein folding

Mining of molecular datasets

 

Multimedia displays and visualization

2D, 3D and virtual reality

Body channels and tissues

Interrelated biodiversity data bases with multi-dimensional and time-varying features

 

Quality control, metrics and benchmarks

 

Security, privacy, and ownership of dispersed and largely autonomous data

 

 

Details:

 

We recommend facilitating interdisciplinary collaboration as follows:

 

  1. Improve the review process and structure, so that the review team is composed of application domain experts attesting to the importance of the final results, discipline scientists attesting to the feasibility, and CS/IDM scientists capable of understanding, appreciating and passing fair judgment on all parts of the proposal.



  2. Set up incentives and infrastructure for interdisciplinary collaboration. Establish grant initiatives and/or budget line items among collaborating NSF groups, as well as between NSF and other agencies that are the primary sources of funds for the Bioxx areas, using a model whereby the Bioxx agency funds the science portion and NSF funds the CS portion. Incubation and exploratory grants should be supported.



  3. Organize meeting-of-the-minds workshops for potential Bioxx and IDM collaborators. We recommend workshops in the biological, biomedical and bioecosystems areas; one such workshop, supported by the NBII and NSF IDM, is already scheduled for summer 2000. It is important to engage collaborators in the early stages of envisioning the major Bioxx problems that define the IDM challenges. This avoids the frequent problem of collaboration openings that appear well into the life cycle of science-funded projects, primarily for IT implementations with little CS challenge or innovation.

 

 

The Digital Patient

 

A driving force requiring advances in many CS areas, including IDM, is the goal generally referred to as the digital patient. One of the challenges is to deal with a multitude of diverse, distributed and heterogeneous patient data. Conventional database technology can currently handle only a small part of it. There is demographic alphanumeric data in structured files or data bases, long text data such as transcriptions of voice dictations, audio data such as bird sound recordings, photographs, digital image data in a variety of modalities such as X-rays, MRIs and CT scans, molecular data, etc., all of which must be accessible as if it were in one or a few places.

 

We are witnessing advances whereby body sensors are being placed in a patient to gather a variety of streams of information, going beyond the current simpler and bulkier carry-on recording devices such as EKG recorders and other electromechanical recorders. The amount of data gathered by the increasing number of deployed sensors will be huge. There are issues of data hierarchies and of uploading and downloading, as some data will remain in the body data base and other data will be transmitted to a larger data base. Issues of data base security and reliability also arise.

 

Wireless mobile beings

 

Human beings are mobile, and thus we have to deal with the mobile digital patient. Data transfer to and from the mobile patient, and access to that data by human beings, has to be largely via wireless means, with reliability and security beyond the current state of the art.

 

Modeling of Bioxx information

 

Advances in multimedia data modeling and process modeling are needed to meet various Bioxx needs. Initial atlases, genomic classifications and various taxonomies of human elements and organs, and of animal and plant species, are being built by Bioxx scientists. The evolution of such organisms through the ages and their diseases are being investigated. However, much data that is complex, diverse and time dependent is now within reach of being digitized and placed on the web, posing challenges in conceptual modeling beyond today’s data base models and systems (e.g., modeling high-dimensional and large-volume data, and molecular structures).

 

The notion of “schema first,” followed by data access, does not work in many Bioxx applications in which most or all of the schema, such as the relationships among various data types, is unknown. The objective of many Bioxx challenges is to find out what the relevant types of data and their interrelationships are, that is, what the schema is. Thus, the challenge is to develop automated and semi-automated means for discovering the schema. Because Bioxx data evolves and scientific discovery is ongoing, schemas change, and further developments in schema evolution flexibility are needed to achieve more data independence.
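As an illustration of the simplest form of automated schema discovery, the following minimal sketch infers which fields occur in a collection of loosely structured records, which value types each field takes, and whether the field is optional. The function name `infer_schema` and the sample records are hypothetical and chosen only for illustration; discovering relationships among data types, as discussed above, would require considerably more than this.

```python
from collections import defaultdict


def infer_schema(records):
    """Infer field names, observed value types, and optionality from dict records."""
    types = defaultdict(set)   # field name -> set of type names seen
    counts = defaultdict(int)  # field name -> number of records containing the field
    for record in records:
        for field, value in record.items():
            types[field].add(type(value).__name__)
            counts[field] += 1
    total = len(records)
    return {
        field: {
            "types": sorted(types[field]),
            # A field is optional if at least one record lacks it.
            "optional": counts[field] < total,
        }
        for field in sorted(types)
    }


# Hypothetical records; the field names are assumptions, not taken from any real system.
records = [
    {"patient_id": 1, "age": 54, "mri_file": "scan_001.img"},
    {"patient_id": 2, "age": 61.5},
    {"patient_id": 3, "age": 47, "genotype": "AA"},
]

schema = infer_schema(records)
for field, description in schema.items():
    print(field, description)
```

Running the inference again as new records arrive gives a crude view of schema evolution: new fields appear, and fields that were mandatory become optional.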

 

Collaboration with other disciplines within CS is also recommended, including artificial intelligence, machine learning, and bio-engineering.

 

 

Access to multimedia based on multimedia features

 

Advances are needed in access to multimedia data based on multimedia and spatio-temporal features. Examples include similarity retrieval of a patient case (human, animal, ecosystem) against similar past cases through time, with known treatments or episodes and outcomes, and retrieval based on three-dimensional protein structures. Advances are also needed in visual access languages and interfaces; voice interfaces; access by multimedia features not expressed in simple alphanumeric terms (such as shape, audio, and interrelated features); handling of features that may be streams or time-dependent patterns; image segmentation and recognition, in collaboration with the image processing community; access methods; and access to distributed heterogeneous multimedia data residing in different types of storage technologies. Scalable algorithms and methods for these searches need to be incorporated into the data base system efficiently and seamlessly.
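A minimal sketch of similarity retrieval is given below, assuming each past case has already been reduced to a fixed-length numeric feature vector. The case records, feature values and the function name `most_similar` are hypothetical. The linear scan with Euclidean distance is chosen only for brevity; it does not scale to large collections and stands in for the access methods called for above.

```python
import heapq
import math


def most_similar(query, cases, k=2):
    """Return the k stored cases whose feature vectors are closest to the query.

    Distance is plain Euclidean distance; every stored case is scanned.
    """
    return heapq.nsmallest(
        k, cases, key=lambda case: math.dist(query, case["features"])
    )


# Hypothetical past cases with made-up feature vectors and outcomes.
cases = [
    {"case_id": "A", "features": (0.90, 0.10, 0.30), "outcome": "recovered"},
    {"case_id": "B", "features": (0.20, 0.80, 0.50), "outcome": "relapsed"},
    {"case_id": "C", "features": (0.85, 0.15, 0.35), "outcome": "recovered"},
]

nearest = most_similar((0.88, 0.12, 0.33), cases, k=2)
for case in nearest:
    print(case["case_id"], case["outcome"])
```

In a data base system the scan would be replaced by a multidimensional index, and the distance function would depend on the kind of feature (shape, audio, time series) being compared.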

 

Baseline data bases/libraries, and reputable advice

 

There is a need for publicly available baseline test databases and for challenge hypotheses and questions. Competing approaches and their relative merits can then be compared on such common test beds, which is usually not possible in current practice. It is estimated that about 25 percent of current web searching and use relates to health care, covering specific diseases and advice on prevention and treatment. The collection and availability of disease repositories for data and advice are very important. The sensitivity of such advice, and the need for it to be reputable, are key concerns.

 

Special attention might be given to the collection and maintenance of molecular data for the purposes of molecular data mining, molecular matching and potential future uses of molecular data, such as mining for molecules that turn specific genes on and off or using molecular data sets for the virtual design of pharmaceuticals.

 

IDM and simulation and model verification of Bioxx processes

 

In order to better understand the causality, diagnostic and mitigating aspects related to diseases and biomedical processes, elaborate simulation models with detailed mathematical models (e.g. differential equation models) are being called on. IDM issues related to the collection and matching of actual data vs. simulated data for model verifications arise. Some computer models require an enormous amount of computing time. Resulting data from simulation programs may be accumulated in a specialized database that could be subsequently accessed and used as a virtual reality type of substitute for the slow running program to provide an immediate answer, or be used as the basis for a data-driven expert system.
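The idea of answering from stored simulation results instead of rerunning a slow program can be sketched as follows. This is a minimal illustration under stated assumptions: the class `SimulationStore`, its methods, and the stand-in `slow_model` (a closed-form solution of a simple decay equation) are all hypothetical, and a lookup simply returns the output of the nearest stored parameter set if it lies within a given tolerance.

```python
import math


class SimulationStore:
    """Keeps outputs of past simulation runs, keyed by their input parameters."""

    def __init__(self):
        self._results = {}  # parameter tuple -> simulated output

    def record(self, params, output):
        self._results[tuple(params)] = output

    def lookup(self, params, tolerance):
        """Return the output of the nearest stored run, or None if none is close enough."""
        if not self._results:
            return None
        nearest = min(self._results, key=lambda stored: math.dist(stored, params))
        if math.dist(nearest, params) <= tolerance:
            return self._results[nearest]
        return None


def slow_model(dose, rate):
    # Stand-in for an expensive simulation: the value at t = 1 of the solution
    # of dy/dt = -rate * y with y(0) = dose.
    return dose * math.exp(-rate)


store = SimulationStore()
for dose in (1.0, 2.0):
    for rate in (0.1, 0.5):
        store.record((dose, rate), slow_model(dose, rate))

hit = store.lookup((2.0, 0.52), tolerance=0.05)   # close to a stored run
miss = store.lookup((5.0, 0.50), tolerance=0.05)  # no stored run nearby
print(hit, miss)
```

A miss indicates that the full simulation has to be run, after which its result can be recorded. The same store could also hold measured data alongside simulated data for model verification.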

 

Data mining

 

As the amount of collected data increases more than exponentially, the data mining process, that is, the procedure of converting data into information, into knowledge, into understanding, and possibly into wisdom and innovation, becomes a more daunting and challenging task. This is particularly true when data mining is applied to disease diagnostics and cause detection. The important tasks of sequencing and matching related to DNA mining and protein folding may also have one of the most significant and lasting impacts on society during this decade. Non-textual structural pattern matching and pattern discovery challenges occur in nearly every scientific field, e.g., 3D protein comparisons and graph structure comparisons. Approaches are needed to assist in finding what the schema is, if any, for a collection of data, particularly complex and diverse multimedia data, not just large volumes of data.



Multimedia displays and visualization

 

As the volume of collected multimedia data and the dimensionality and complexity of that data keep expanding rapidly, new and innovative methods for 2-D and 3-D visualization and virtual reality modeling emerge as important challenges. Examples are navigation through the channels of living organisms and less invasive visualization through body tissues, and visualization of large and interrelated biodiversity databases with multi-dimensional and time-varying characteristics, such as changes in species migration patterns and their tie-in to changes in ecological feature patterns.

 

Quality Control, Metrics and Benchmarks

 

Rapidly increasing amounts of data will make quality control an ever more important issue. Also crucial are awareness of, and the need for, defining quantifiable database quality control metrics.

 

The definition and availability of benchmark data and problems for a variety of domain- and method-specific areas are extremely important for evolving the proper metrics. The availability of challenge problems might be crucial for benchmarking the performance of various IDM-related methods and approaches. From these, domain-independent benchmark data may evolve for more generic innovations, such as data mining algorithms.

 

 

Security, privacy, and ownership

 

Issues related to the privacy and security of medical and biomedical information are extremely important in maintaining a free society with respect and rights for the individual. Large amounts of data collected, bought, or generated by industry raise proprietary issues, as well as issues of use and reuse. Certain industries and individuals protective of important data repositories they have collected might be more than willing to make valuable datasets available for research purposes or to society at large if ownership and proprietary issues are properly addressed. With the current state of web controls, many repositories are not accessible.

 

Bioxx and biodiversity and ecosystems

 

Much of the focus of bio-informatics research is on the IDM challenges of the medical, genomic, and molecular domains. There is also a need to address organismic- and ecosystem-scale datasets and the knowledge construction activities occurring within other biological domains. For example, knowledge about biodiversity and ecosystems, even though incomplete, is a vast and complex information domain involving millions of species, each of which is highly variable across individual organisms and populations. These species each have complex chemistries, physiologies, developmental cycles and behaviors, all resulting from more than three billion years of evolution. There are hundreds if not thousands of ecosystems, each comprising complex interactions among large numbers of species, and between those species and multiple abiotic factors.

 

A major computational challenge is in understanding and identifying the causes of population change. Studies of biodiversity cannot be divorced from the study of the ecosystems that support the flora and fauna of the planet, and the way in which these ecosystems are affected by such major factors as climatic change and industrial pollution. Thus, although localized studies are often possible and meaningful, some aspects and elements of biodiversity must be studied on a global scale. Perhaps more than in any other area, the range of data types with which biodiversity scientists wish to mesh their databases is very broad: geographical, geological, meteorological, chemical, genomic, biochemical, etc.

 

A concerted effort is needed so that the masses of data and information that are stored in the museums, libraries, and government agencies of this country, and that are generated daily by Mission to Planet Earth and other activities, can be put to good use. This will require extending research on mobile computing, modeling and simulation, multimedia data mining, visualization, and security and privacy issues to the biodiversity and ecosystems domain.