NSF Grant IIS-9984296: CAREER: Evolving Heterogeneous Autonomous Databases

Principal Investigator

Sudarshan S. Chawathe
Department of Computer Science
University of Maryland, College Park
A.V. Williams Bldg.
College Park MD 20742-3255

301-405-3977
301-405-6707
chaw@cs.umd.edu
http://www.cs.umd.edu/~chaw/

Keywords

semistructured data
streaming data
XML

Project Summary

The heterogeneous, autonomous, and highly dynamic nature of databases on the Web renders most traditional data management techniques ineffective. Consequently, keeping abreast of relevant information as it evolves in this environment requires a large amount of human effort. The goal of this research is to develop, implement, and evaluate methods for data management in a heterogeneous, autonomous, and dynamic environment. The history of changes made to data is a major focus. Methods for efficiently discovering, storing, browsing, and querying changes are developed. In sharp contrast to traditional database techniques, these methods try to make the best possible use of the structure in data without requiring data to conform to a fixed structure or schema. The educational component includes redesigning the current database curriculum; development of new courses on XML, related Web technologies, and semistructured data; and experimenting with pedagogical tools such as teaching by program dissection (the use of the source code of a large program as an anchor for studying related topics). In addition to graduate students, undergraduate students are exposed to research by their work on the system prototype, which in turn benefits from their efforts. In summary, this research will bring the benefits of database technology (e.g., faster, easier, and more precise information access) to the Web by developing methods that cope with the heterogeneous, autonomous, and dynamic nature of Web databases.

In our effort to efficiently manage changes in semistructured data, we are working on methods for indexing changes to XML data. In the process, we have developed a method for context-sensitive indexing of XML (or similar) data, based on extending inverted files. This method is implemented in the Cextor system, which we have made freely available. Cextor is implemented as a context-sensitive search engine for XML. We have extended the standard boolean queries used by many Information Retrieval systems to include context specifiers. These context specifiers are similar in spirit to those used by some current search engines (e.g., title:databases). However, in contrast to earlier work on indexing such data, the context is data dependent and not known in advance. Further, Cextor does not stop at listing documents matching a given user query. Instead, it uses the contexts of search term occurrences in the documents forming a query result to group those documents into a tree structure. This tree structure is a convenient method for displaying very large search results, which occur frequently when querying the Web. Building on our earlier work in differencing hierarchical data, we have also developed a method for computing the difference between two streams of hierarchical data.

Our recent work has focused on streaming XML data. In this environment, each data item is presented to the system only once, in an order determined by the source. Revisiting a data item is possible only if it is buffered by the system. However, naive buffering methods rarely work well because of the limited buffer space relative to the size of the stream (which may be unbounded). We have implemented the XSQ query engine for evaluating XPath queries on such data. We are now focusing on performance improvements for XSQ and on generalizations to a wider class of queries.

Publications and Products

Sudarshan S. Chawathe, "Comparing hierarchical data in external memory", (1999). In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases. September 1999. Edinburgh, Scotland, U.K.

Feng Peng and Sudarshan S. Chawathe, "XPath Queries on Streaming Data", (2003). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). June 2003. San Diego, California.

Feng Peng and Sudarshan S. Chawathe, "XSQ: Streaming XPath Queries", (2003). Demonstration description. In Proceedings of the International Conference on Data Engineering. March 2003. Bangalore, India.

Feng Peng and Sudarshan S. Chawathe, "XSQ: A Streaming XPath Engine", (2003). Technical report. University of Maryland

Sudarshan S. Chawathe, "Managing Historical XML Data", (2003). Advances in Computers Bibliography: Volume 57, Chapter 3. Elsevier Science.

Sudarshan S. Chawathe, "Semistructured Data in Relational Databases", (2003). In Practical Handbook of Internet Computing. CRC Press.

Thomas Baby and Sudarshan S. Chawathe, "Context-Sensitive Search and Exploration of XML text", (2001). Technical Report. University of Maryland.

Sudarshan S. Chawathe, "Differencing Very Large Data Streams", (2002). Technical Report, University of Maryland.

Sudarshan S. Chawathe, "Tracking Hidden Groups Using Communications", (2003). In Proceedings of the NSF/NIJ Symposium on Intelligence and Security Informatics. June 2003. Lecture Notes in Computer Science (LNCS) 2665. Tucson, Arizona.

Akhil Gupta and Sudarshan S. Chawathe, "XML Stream-Skipping Using XHints", (2003). Technical report. University of Maryland.

Abheek Anand and Sudarshan S. Chawathe, "Disseminating Data in a Peer-to-Peer Network", (2003). Technical report. University of Maryland

Feng Peng and Sudarshan S. Chawathe, "Streaming Evaluation of XPath Queries with Subqueries", (2003). Technical report. University of Maryland

Project Impact

This project has generated new methods for detecting and managing changes in semistructured data, especially XML. The Cextor system is our implementation of a context-aware XML search engine. Another area of focus has been interactive exploration of semistructured data. Here, we have developed methods that provide very fast initial summaries of unknown datasets as well as permitting gradual increase in level of detail. The VQBD system (link below) implements some of these methods. The XSQ XPath query engine for streaming XML data is based on new automaton-based methods for XPath query evaluation. Work on a sister project has resulted in data mining methods implemented in the Seus system (link below).

Goals, Objectives and Targeted Activities

The goal of this research is to develop methods for effectively coping with the evolution of data, managing changes in not only the base data but also the metadata. Our work to date has focused on efficient indexing methods that will form the basis of our subsequent work. We have also worked on novel methods for interacting with large amounts of data. In the coming year, we plan to focus on integration of the Cextor and VQBD systems. Our work has led us to the problem of gradual discovery of data, which is conveniently highlighted by the following scenario: If we are presented with a large (say, 200MB) XML file about which we nothing, how do we go about exploring the data in the file? The key requirement here is that the initial exploration must be very quick. We have already developed some interesting solutions to this problem and plan to combine them and incorporate them in VQBD. In particular, this problem is closely related to the problem of determining frequent structures in data. We plan to explore connections to our work on that problem. These two problems may be viewed as ends of a spectrum ranging from very fast, but rough, methods to slower, more precise methods. We also plan to explore in more detail the stream processing of XML data, especially as it relates to differencing and management of changes. Our recent and continuing work has focused on efficient query evaluation methods for streaming XML data. Having developed a streaming XPath engine, we plan to look at the more general problem of pattern matching and summarization for such data.

Area Background

Traditionally, databases have been very good in managing changes to data as long as the metadata (schema) remains unchanged. Changing the schema of a database is typically very difficult. Methods for schema evolution assume that changes to schema are infrequent. On the other hand, data in a Web environment tends to change rapidly in both content and schema. Our work in this project builds on recent work on semistructured databases. The recent book by Abiteboul, Buneman, and Suciu is a good starting point for exploring this area.

Project Website

http://www.cs.umd.edu/projects/xsq/
XSQ is a XPath query engine for streaming XML.

Project Website

http://www.cs.umd.edu/projects/cextor/
Cextor is a context-aware XML search engine.

Project Website

http://www.cs.umd.edu/projects/vqbd/
VQBD is a system for exploration of semistructured XML data.

Project Website

http://www.cs.umd.edu/projects/seus/
Seus is our testbed software for interactive data mining of semistructured data.