Bio-Informatics R&D and IDM in the Next Decade
Co-Chairs:
Alfonso F. Cardenas, UCLA and Mark Embrechts, Rensselaer Polytechnic Institute
Mission: Contribute to the achievement of
Bioxx (biomedical, bioinformatics, biological, biodiversity, ... ) challenges
through collaboration and innovation in both the Bioxx area and CS IDM area.
Recommendations:
The importance, magnitude and relevance to our lives of the Bioxx challenges to which CS and IDM may and should contribute are evident by many measures. To further this contribution, we recommend the following directions and actions for IDM R&D in Bio-Informatics over the next several years:
Facilitate interdisciplinary R&D collaboration by
Improving the review team structure and process to include both domain and CS experts
Providing funding venues for interdisciplinary collaboration
Meeting of the minds workshops of Bioxx and IDM
Linking IDM scientists with Bioxx peers at the early stages of projects
The development of the digital patient which has major CS and IDM challenges, including
Logical integration of widely dispersed and heterogeneous information
Linking a variety of widely different autonomous files and data bases
Wireless communication of the mobile patient
Real time body sensors generating much information
A patient, which may be one being, several beings, part of an ecosystem, or an entire ecosystem
Modeling of Bioxx information
Atlases, classifications and taxonomies of beings, organs, and diseases
Conceptual modeling and process modeling of diverse and complex multimedia and their behavior
Flexibility in schema evolution while maintaining data independence
Automated and semi-automated means to discover the schema, often not known
Test data bases and challenge hypotheses for comparing R&D approaches
Patient information and advice on diseases and treatment
Support for verification and quality control of simulation models
Data mining
Automated search of data for information, knowledge, wisdom, and innovation in large datasets.
Textual and non-textual structural pattern matching and pattern discovery
DNA mining and protein folding
Mining of molecular datasets
Multimedia displays and visualization
2D, 3D and virtual reality
Body channels and tissues
Interrelated biodiversity data bases with multi-dimensional and time-varying features
Quality control, metrics and benchmarks
Security, privacy, and ownership of dispersed and largely autonomous data
Details:
We recommend facilitating interdisciplinary collaboration as follows:
A driving force requiring advances in many CS areas, including IDM, is the goal generally referred to as the digital patient. One of the challenges is to deal with a multitude of diverse, distributed and heterogeneous patient data. Conventional database technology can now deal with only a small part of it. There is demographic alphanumeric data in structured files or data bases, long text data such as transcriptions of voice dictations, audio data such as bird sound recordings, photographs, digital image data in a variety of modalities such as X-rays, MRIs and CT scans, molecular data, etc., all of which must be accessible as if they were in one or a few places.
We are witnessing advances whereby body sensors are being placed in a patient to gather a variety of streams of information, going beyond the current simpler and bulkier carry-on recording devices such as EKG recorders and other electromechanical recorders. The amount of data gathered by the increasing number of sensors being deployed will be huge. There are issues of data hierarchies and of uploading and downloading, as some data will remain in the body data base and other data will be transmitted to a larger data base. Issues of data base security and reliability also arise.
Human beings are mobile, and thus we have to deal with the mobile digital patient. Data transfer to and from the mobile patient and access by human beings has to be largely via wireless means, and with reliability and security beyond the current state of the art.
Multimedia data modeling and process modeling advances are needed to meet various Bioxx needs. Initial atlases, genomic classifications and various taxonomies of human elements and organs, and of animal and plant species, are being built by Bioxx scientists. The evolution of such organisms through the ages and their diseases are being investigated. However, there is much data that is complex, diverse and time dependent and that now is within reach of digitizing and placing on the web, posing challenges in conceptual modeling beyond today’s data base models and systems (e.g., modeling high-dimensional and large volume data, molecular structures).
The notions of “schema first” and then data access do not work in many Bioxx applications in which most or all of the schema is unknown, such as relationships among various data types. The objective of many Bioxx challenges is to find out what the relevant types of data and their interrelationships are, that is, what the schema is. Thus, the challenge is to develop automated and semi-automated means for discovering the schema. Due to the evolutionary nature of Bioxx data and the on-going scientific discovery, schemas change and necessitate further developments in schema evolution flexibility to achieve more data independence.
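As a rough illustration of what semi-automated schema discovery could mean at its very simplest, the sketch below scans loosely structured records and reports which fields occur, with which value types, and how often. The function name `infer_schema`, the sample records, and the field names are all hypothetical; real Bioxx data would also require discovering relationships among data types, which this sketch does not attempt.

```python
from collections import defaultdict


def infer_schema(records):
    """Summarize the fields seen in a list of dict-like records.

    Returns {field: {"types": set of type names,
                     "coverage": fraction of records containing the field}}.
    """
    type_names = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for field, value in record.items():
            type_names[field].add(type(value).__name__)
            counts[field] += 1
    total = len(records)
    return {
        field: {"types": type_names[field], "coverage": counts[field] / total}
        for field in type_names
    }


# Hypothetical, hand-made records with no schema declared in advance.
records = [
    {"patient_id": 1, "age": 54, "note": "chest pain"},
    {"patient_id": 2, "age": 61.5},
    {"patient_id": 3, "note": "follow-up", "mri_path": "scan_003.dcm"},
]

schema = infer_schema(records)
for field, summary in sorted(schema.items()):
    print(field, sorted(summary["types"]), round(summary["coverage"], 2))
```

A field whose coverage is low or whose type set has more than one member is a candidate for review by a domain expert, which is where the semi-automated part would come in.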
Collaboration with other disciplines within CS is also recommended, including artificial intelligence, machine learning, and bio-engineering.
Advances are needed in the access to multimedia data based on multimedia and spatio-temporal features (e.g., similarity retrieval of a patient case (human, animal, ecosystem) to past similar cases through time, with known treatments or episodes and outcome; retrieval based on three-dimensional protein structures), visual access languages and interfaces, voice interfaces, access by multimedia features not expressed in simple alphanumeric terms (such as shape, audio, and interrelated features), handling features which may be streams or time dependent patterns, image segmentation/recognition in collaboration with the image processing community, access methods, and access of distributed heterogeneous multimedia data residing in different types of storage technologies. Scalable algorithms and methods for these searches need to be incorporated into the data base system in an efficient and seamless way.
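The similarity retrieval of past cases mentioned above can be sketched in its most basic form as a nearest-neighbor search over numeric feature vectors. This is only an illustration under strong assumptions: the case records, the three-element feature vectors, and the choice of Euclidean distance are hypothetical, and a linear scan like this one would not scale to the data volumes discussed here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar_cases(query_features, cases, k=2):
    """Return the k stored cases whose features are closest to the query."""
    ranked = sorted(
        cases, key=lambda case: euclidean(query_features, case["features"])
    )
    return ranked[:k]


# Hypothetical past cases; the features stand in for shape, signal, or
# other descriptors extracted from multimedia data.
past_cases = [
    {"case_id": "A", "features": [0.9, 0.1, 0.3], "outcome": "recovered"},
    {"case_id": "B", "features": [0.2, 0.8, 0.5], "outcome": "chronic"},
    {"case_id": "C", "features": [0.85, 0.15, 0.35], "outcome": "recovered"},
]

nearest = most_similar_cases([0.88, 0.12, 0.3], past_cases, k=2)
for case in nearest:
    print(case["case_id"], case["outcome"])
```

Replacing the linear scan with an index structure inside the data base system is exactly the kind of integration the paragraph above calls for.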
There is a need for publicly available baseline test databases and for challenge hypotheses and questions. Competing approaches and their relative merits can then be compared on such common test beds, which is usually not possible in current practice. It is estimated that about 25 percent of current web searching and use relates to health care, covering specific diseases and advice on prevention and treatment. The collection and availability of disease repositories for data and advice are very important. The sensitivity of this information and the need for reputable advice are key concerns.
Special attention might be drawn to the collection and maintenance of molecular data for the purposes of molecular data mining, molecular matching and potential future uses, such as mining for molecules that turn specific genes on and off or the virtual design of pharmaceuticals.
IDM and simulation and model verification of Bioxx processes
In order to better understand the causality, diagnostic and mitigating aspects related to diseases and biomedical processes, elaborate simulation models with detailed mathematical models (e.g. differential equation models) are being called on. IDM issues related to the collection and matching of actual data vs. simulated data for model verifications arise. Some computer models require an enormous amount of computing time. Resulting data from simulation programs may be accumulated in a specialized database that could be subsequently accessed and used as a virtual reality type of substitute for the slow running program to provide an immediate answer, or be used as the basis for a data-driven expert system.
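The idea of accumulating simulation results in a database so that later requests get an immediate answer can be sketched as follows. Everything here is an assumption made for illustration: the class name `SimulationStore`, the table layout, and the stand-in model (Euler integration of dy/dt = -decay_rate * y), which merely plays the role of a slow differential equation simulation.

```python
import sqlite3


def slow_simulation(decay_rate, steps=10000, t_end=1.0):
    """Stand-in for an expensive model: Euler steps of dy/dt = -decay_rate * y, y(0) = 1."""
    y = 1.0
    dt = t_end / steps
    for _ in range(steps):
        y += dt * (-decay_rate * y)
    return y


class SimulationStore:
    """Keeps simulation results in a database and reuses them on repeat requests."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE results (decay_rate REAL PRIMARY KEY, y_end REAL)"
        )
        self.runs = 0  # number of times the slow model was actually executed

    def get(self, decay_rate):
        row = self.conn.execute(
            "SELECT y_end FROM results WHERE decay_rate = ?", (decay_rate,)
        ).fetchone()
        if row is not None:
            return row[0]  # immediate answer from stored results
        self.runs += 1
        y_end = slow_simulation(decay_rate)
        self.conn.execute(
            "INSERT INTO results (decay_rate, y_end) VALUES (?, ?)",
            (decay_rate, y_end),
        )
        return y_end


store = SimulationStore()
first = store.get(0.5)   # runs the model and stores the result
second = store.get(0.5)  # answered from the database
print(first, second, store.runs)
```

This sketch only handles exact repeats of a parameter value. Answering nearby parameter values by interpolating between stored runs, and comparing stored simulated data against actual measurements, are the harder IDM questions raised in the text.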
As the amount of collected data increases more than exponentially, the data mining process and the procedure of converting data into information, into knowledge, into understanding and possibly into wisdom and innovation become more daunting and challenging tasks. This is particularly true when the data mining process is applied to disease diagnostics and cause detection. The important tasks of sequencing and matching related to DNA mining and protein folding may also have one of the most significant and lasting impacts on society during this decade. Non-textual structural pattern matching and pattern discovery challenges occur in nearly every scientific field, e.g., 3D protein comparisons and graph structure comparisons. Approaches are needed to assist in finding what the schema is, if any, for a collection of data, particularly complex and diverse multimedia data, not just large volumes of data.
As the collection of large amounts of multimedia data and the dimensionality and complexity of data keep expanding rapidly, new and innovative methods for 2-D and 3-D visualization and virtual reality modeling emerge as important challenges. Examples are the navigation through living organism channels and less invasive visualization through body tissues, and the visualization of large and interrelated biodiversity databases with multi-dimensional and time-varying characteristics, such as changes in species migration patterns and their tie-in to changes in ecological feature patterns.
Rapidly increasing amounts of data will make quality control an ever more important issue. Also crucial are the awareness and need for defining quantifiable quality control database metrics.
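As one example of what a quantifiable quality control metric could look like, the sketch below computes field completeness: the fraction of required fields that are actually filled in across a set of records. The metric choice, the function name `completeness`, and the sample specimen records are hypothetical illustrations, not metrics proposed by the workshop.

```python
def completeness(records, required_fields):
    """Fraction of required field slots that are present and not None."""
    if not records or not required_fields:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) is not None
    )
    return filled / (len(records) * len(required_fields))


# Hypothetical specimen records with one missing value.
specimens = [
    {"id": 1, "species": "Quercus alba"},
    {"id": 2, "species": None},
]

score = completeness(specimens, ["id", "species"])
print(score)
```

Tracking such a score over time, per source, would give one simple, comparable number for data quality.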
The definition and availability of benchmark data and problems for a variety of domain- and method-specific areas are extremely important for evolving the proper metrics. The availability of challenge problems might be crucial for benchmarking the performance of various IDM-related methods and approaches. From these, domain-independent benchmark data may evolve for more generic innovations, such as data mining algorithms.
Security, privacy, and ownership
Issues related to the privacy and security of medical and biomedical information are extremely important for maintaining a free society with respect and rights for the individual. Large amounts of data collected, bought, or generated by industry raise proprietary issues, as well as use and reuse issues. Certain industries and individuals protective of important data repositories they have collected might be more than willing to make valuable datasets available for research purposes or to society at large if ownership and proprietary issues are properly addressed. With the current state of web controls, many repositories are not accessible.
Much of the focus of bio-informatics research is on the IDM challenges of the medical, genomic, and molecular domains. There is also a need to address organismic- and ecosystem-scale datasets and the knowledge construction activities occurring within other biological domains. For example, knowledge about biodiversity and ecosystems, even though incomplete, is a vast and complex information domain involving millions of species, each of which is highly variable across individual organisms and populations. These species each have complex chemistries, physiologies, developmental cycles and behaviors, all resulting from more than three billion years of evolution. There are hundreds if not thousands of ecosystems, each comprising complex interactions among large numbers of species, and between those species and multiple abiotic factors.
A major computational challenge is in understanding and identifying the causes of population change. Studies of biodiversity cannot be divorced from the study of the ecosystems that support the flora and fauna of the planet, and the way in which these ecosystems are affected by such major factors as climatic change and industrial pollution. Thus, although localized studies are often possible and meaningful, some aspects and elements of biodiversity must be studied on a global scale. Perhaps more than in any other area, the range of data types with which biodiversity scientists wish to mesh their databases is very broad: geographical, geological, meteorological, chemical, genomic, biochemical, etc.
A concerted effort is needed so that the masses of data and information that are stored in the museums, libraries, and government agencies of this country, and that are generated daily by Mission to Planet Earth and other activities, can be put to good use. This will require extending research on mobile computing, modeling and simulation, multimedia data mining, visualization, and security and privacy issues to the biodiversity and ecosystems domain.