SYSTEMS ISSUES IN DATA MANAGEMENT AND INFORMATION RETRIEVAL

The goal of systems-related research under the auspices of IDM is to develop a set of software and technologies driven by the diverse needs of database users at the two ends of a spectrum. At one end, fueling this need, are large enterprises that provide Internet and Web-related services and E-commerce: content providers whose primary need is the scalability, performance, and reliability of database systems. At the other end are scientists engaged in scientific discovery, whose critical data management needs center on the functionality and extensibility of systems. The group recommends that IDM be cognizant that developing large-scale, comprehensive systems requires funding significantly above the current average IDM individual grant level.

The working group recommends the following funding themes:

THEME 1: Support for Active Digital Libraries (long-term IDM research)

The typical digital library paradigm tends to describe static and passive repositories of data. We need to extend this paradigm to enable ``active'' libraries where large amounts of diverse types of data (often from non-conventional ``value-added'' types of databases) are added frequently. We envision these systems to be very dynamic in nature and self-organizing:

* reconfigurable and extensible meta-data management: new meta-data structures that can efficiently describe large volumes of multimedia and GIS data.
* on-the-fly categorization and summarization of data: synopsis/abstraction/summarization of heterogeneous, overlapping data sets, including estimating, propagating, and communicating accuracy and uncertainty measures within those data sets.
* data validation, correction and update in VLDBs, and improved communication of the content of ``active'' libraries: A significant part of data loading from real-world data (e.g.
satellite imagery) consists of data validation and correction for ``dirty'' data. Systems should improve the automation of these time-consuming tasks.
* uncertainty management in data and its propagation: For data that have statistical quality measures (e.g. standard error indicators), these measures should be attached to the data and propagated by the system as manipulations of the data occur.
* integration of framework/models and system: research on data models and frameworks for system implementation and evaluation. The realization of a system can only be successful if it is based on a solid framework derived from careful requirements analysis and performance considerations.

THEME 2: System Support for Multiple Paradigms (short- and long-term IDM research, in collaboration with HCI and Visualization Programs)

New applications require combining multiple query and retrieval paradigms. Current query planning and optimization technology typically focuses on a single paradigm. The challenge is to develop extensible planners, optimizers, and cost models that can handle multiple paradigms.

* multiple query and access paradigms: querying can be over text, video, speech, temporal, geo-spatial, or other, yet-to-be-defined data types. Access and retrieval can be content based, keyword based, category based, or other.
* multiple representations of data: support can extend beyond multiple data types and structures to representations that can be dynamically reconfigured for efficiency.
* constraint databases and interfaces for multiple paradigms: Constraint databases provide a framework in which multiple types of temporal, geographic, CAD, or other spatio-temporal and scientific data can be efficiently represented. In this regard, we need to define expressive and efficient query and update languages as well as fast data access techniques, and to build user interfaces with data visualization that can be used by a wide class of people rather than by domain specialists only.
* multi-lingual IR (e.g. fonts): At least 100 languages are represented on the WWW. Standard representations for fonts and character sets exist for only a small number of languages, and even some major languages have several representations (e.g. Japanese and Chinese). A major issue is whether to provide this support at the client or at the text server end.

THEME 3: Performance Conscious Systems Building (short- and long-term IDM research, in collaboration with Networking Division and industry)

* integration of Web and database services: Fast routing of user requests/transactions to the appropriate data server in a large IP-based network under heavy traffic conditions is a core issue in developing effective e-business systems. Long network latencies will have to be addressed in the presence of multi-server operations that are globally dispersed.
* integration of process and management in databases: Closer integration of the supported application processes with the underlying data management capabilities should lead to more effective information systems.
* scalability, performance, reliability, and benchmarking of information infrastructure/systems: development should take conscious account of the performance limitations imposed by alternative designs. IDM should support benchmarking for performance measurement of multi-faceted data paradigms.
* data management on new networking options: The emergence of high-speed networks (gigabits/sec) that offer Quality-of-Service guarantees, coupled with ``infinitely'' fast processors and inexpensive active disks, is changing the way networked databases will be constructed. As dynamic fragmentation and automatic migration of data become a reality, databases will become failure-free even when multiple sites are unavailable. Handling of massive data updates and management of data consistency will have to be re-examined.
* permeation of database technology in network management: as network routers (both at the edge and at the core) start employing sophisticated techniques for flow and network management, they will increasingly rely on DBMSs to store historical and real-time information. Similarly, next-generation DNS and LDAP/LDUP will require close interaction between database and networking researchers.

THEME 4: Underpinning Software Technologies for Data Intensive Systems (short- and long-term IDM research, in collaboration with industry)

Object-oriented technology has shown great promise and has helped in building robust data services. In light of recent changes, there may be research opportunities in building and validating database services using Java and XML. Other enabling technologies include efficient database support for networking and software techniques for data intensive systems.

* market-competitive object-oriented persistent object systems: Providing persistence for objects of arbitrary user-defined types makes the technology applicable to non-standard applications. Research efforts on more general models and architectures for persistence are needed.
* full-fledged XML and Java database technology: Java and XML are elements of a major emerging software technology and already have an impact in the field. These two technologies could be further enhanced by incorporating multiple models and architectures for persistence as well as transaction technology.
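
The persistence and transaction items under Theme 4 can be pictured with a minimal sketch. It is an illustration only, not a design proposed by the working group: it uses Python's standard json and sqlite3 modules rather than Java or XML, and the names Observation, ObjectStore, register, put_all, and get are hypothetical. It assumes that every field of a stored object is JSON-compatible.

```python
import json
import sqlite3


class Observation:
    """Hypothetical user-defined type, used only for illustration."""

    def __init__(self, sensor, value):
        self.sensor = sensor
        self.value = value


class ObjectStore:
    """Stores registered user-defined objects as JSON rows in SQLite."""

    def __init__(self, path=":memory:"):
        self.types = {}  # type name -> class
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS objects "
                "(key TEXT PRIMARY KEY, type TEXT, body TEXT)"
            )

    def register(self, cls):
        """Make a user-defined class storable (fields must be JSON-compatible)."""
        self.types[cls.__name__] = cls

    def put_all(self, items):
        """Store all (key, object) pairs in one transaction: all or nothing."""
        # `with self.conn` commits on success and rolls back if an exception escapes.
        with self.conn:
            for key, obj in items:
                name = type(obj).__name__
                if name not in self.types:
                    raise TypeError(f"unregistered type: {name}")
                self.conn.execute(
                    "INSERT INTO objects (key, type, body) VALUES (?, ?, ?)",
                    (key, name, json.dumps(vars(obj))),
                )

    def get(self, key):
        """Rebuild the object stored under key, or return None if absent."""
        row = self.conn.execute(
            "SELECT type, body FROM objects WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        cls = self.types[row[0]]
        obj = cls.__new__(cls)  # bypass __init__, then restore the saved fields
        obj.__dict__.update(json.loads(row[1]))
        return obj
```

A caller would register each class once, write a batch with put_all, and read objects back with get. If any insert in a batch fails (for example, on a duplicate key), the whole batch is rolled back, which is the kind of transactional guarantee the last item above asks persistence layers to provide.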