NSF Grant IIS-0081860: Knowledge Discovery in Historical Semistructured Data

Principal Investigator

Sudarshan S. Chawathe
Department of Computer Science
University of Maryland, College Park
A.V. Williams Bldg.
College Park MD 20742-3255

301-405-3977
301-405-6707
chaw@cs.umd.edu
http://www.cs.umd.edu/~chaw/

Keywords

semistructured data
XML
data mining
visual interfaces

Project Summary

The vast majority of data used by scientists, engineers, and decision makers resides in a poorly structured collection of reports, memos, and other documents in a myriad of file formats. The increasing densities and falling prices of storage devices make it practical to store for perpetuity all such data that crosses a scientist's electronic desktop. The resulting Information History has the potential to serve as an intelligent assistant by detecting trends and patterns, suggesting potential collaborators, uncovering relevant documents and data from diverse sources, etc., resulting in dramatic increases in the effectiveness of information use. However, current technology, which focuses on either fully structured or completely unstructured databases, cannot be effectively adapted to extracting knowledge from a large historical semistructured database. The goal of the proposed research is to develop suitable formulations of the knowledge discovery problem for historical semistructured databases and to develop, implement, and evaluate solutions. The research thrusts are: (1) devising knowledge discovery operators for semistructured data and an algebra for them; (2) developing efficient methods for implementing these operators; (3) determining methods to incorporate structure incrementally and flexibly; and (4) incorporating differential processing. A Personal Information History Assistant application serves as a test-bed for this research.

In market-basket data mining, a critical step is the finding of frequent itemsets (sets of items that are frequently bought in a single market basket). An efficient method for finding such itemsets along with their frequencies (number of occurrences in the dataset) is essential for enabling subsequent analysis of the data. Analogously, we believe counting frequent structures in semistructured data is a critical step in mining semistructured data. By frequent structures, we mean graphs that are isomorphic to a large number of subgraphs of the graph representing the dataset. (As is common, we use a labeled graph to represent semistructured data.) We have developed methods for counting such frequent structures that can handle data that is two orders of magnitude larger than the data handled by prior methods of which we are aware. We have implemented our methods in a system called SEuS, which we have made freely available.

Most data mining methods are sensitive to input parameters that must be carefully selected for each dataset (e.g., the support threshold for market-basket analysis). While the task of selecting these parameters is rarely easy, it is especially difficult for semistructured data because of the lack of a schema or known regular structure. Our method, as implemented in SEuS, depends on a single parameter, called the structure complexity metric (SCM), which encodes the preferred trade-off between frequent (but smaller) structures and larger (but less frequent) structures. (This parameter is necessary, since otherwise a structure consisting of a single node will always be the most frequent.) An interesting feature of SEuS is that it responds to changes in the SCM in interactive time (few seconds), permitting a user to quickly explore the effects of different parameter settings.

Publications and Products

Shayan Ghazizadeh and Sudarshan S. Chawathe, "SEuS: Structure Extraction using Summaries", (2002). In Proceedings of the 5th International Conference on Discovery Science. November 2002. Lubeck, Germany. Lecture Notes in Computer Science (LNCS) 2534.

Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo., "VQBD: Exploring Semistructured Data", (2001). In Proceedings of the ACM SIGMOD International Conference on Management of Data. Santa Barbara, California.

Feng Peng and Sudarshan S. Chawathe, "XPath Queries on Streaming Data", (2003). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 9-12 June 2003. San Diego, California.

Feng Peng and Sudarshan S. Chawathe, "XSQ: A Streaming XPath Engine", (2003). Technical Report. University of Maryland.

Sudarshan S. Chawathe, "Tracking Moving Clutches in Streaming Graphs", (2002). Technical Report CS-TR-4376 (UMIACS-TR-2002-56), University of Maryland, College Park. May 2002.

Sudarshan S. Chawathe, "Managing Historical XML Data", (2003). Advances in Computers volume 57, chapter 3, pages 109-169. Elsevier Science.

Shayan Ghazizadeh, Sudarshan S. Chawathe, "SEuS: Structure Extraction using Summaries", (2001). Technical Report CS-TR-4376 (UMIACS-TR-2002-56), University of Maryland, College Park, Nov 2001.

Sudarshan S. Chawathe, "Semistructured Data in Relational Databases", (2003). In Practical Handbook of Internet Computing. CRC Press.

Feng Peng and Sudarshan S. Chawathe, "XSQ: Streaming XPath Queries", (2003). In Proceedings of the IEEE International Conference on Data Engineering (ICDE). 5-8 March 2003. Bangalore, India.

Sudarshan S. Chawathe, "Tracking Hidden Groups Using Communications", (2003). In Proceedings of the NSF/NIJ Symposium on Intelligence and Security Informatics. Lecture Notes in Computer Science (LNCS) 2665. 2-3 June 2003. Tucson, Arizona.

Abheek Anand and Sudarshan S. Chawathe, "Disseminating Data in a Peer-to-Peer Network", (2003). Technical Report, University of Maryland. June 2003.

Feng Peng and Sudarshan S. Chawathe, "Streaming Evaluation of XPath Queries with Subqueries", (2003). Technical Report. University of Maryland. June 2003.

Akhil Gupta and Sudarshan S. Chawathe, "Skipping Streams Using XHints", (2003). Technical Report. University of Maryland. June 2003.

We have developed the SEuS system described above and released it as free software (under terms of the GNU General Public License (GPL). The source code and documentation is on the Web at http://www.cs.umd.edu/projects/seus/.

A related system we have developed is VQBD: The VQBD system addresses the problem of exploring large, unknown datasets (in XML form) very rapidly. It was demonstrated at the SIGMOD 2001 conference and has also been publicly released under the GNU GPL: http://www.cs.umd.edu/projects/vqbd/

As part of a sister project, we have developed the XSQ XPath query engine for streaming XML: http://www.cs.umd.edu/projects/xsq/.

Project Impact

This project has generated new methods for mining semistructured data, with a focus on handling data whose structure is not well known or easily described. Our methods have been published as well as implemented in the Seus system (link below). Another area of focus has been interactive exploration of semistructured data. Here, we have developed methods that provide very fast initial summaries of unknown datasets as well as permitting gradual increase in level of detail. The VQBD system (link below) implements some of these methods. Our work on a related project has also resulted in the XSQ XPath query engine for streaming XML data. In addition to our publications, all the above software is freely available.

Goals, Objectives and Targeted Activities

The goal of this research is to develop methods for knowledge discovery in historical semistructured data. While knowledge discovery is typically an interactive process, the lack of reliable schema information makes user interaction critical for semistructured data. To this end, we have taken a interaction-driven approach to knowledge discovery and implemented a data exploration system (called VQBD) that we are using as the front-end for our continuing work. In the coming year, we plan build on our method for efficient discovery of frequent structures in a semistructured database. In particular, we will develop higher-level data mining methods that use this method as a primitive. We will also continue to investigate further improvements to our structure discovery method and will continue to implement and test our work in the context of VQBD.

Area Background

Traditionally, databases have been very good in managing changes to data as long as the metadata (schema) remains unchanged. Changing the schema of a database is typically very difficult. Methods for schema evolution assume that changes to schema are infrequent. On the other hand, data in a Web environment tends to change rapidly in both content and schema. Our work in this project builds on recent work on semistructured databases. The recent book by Abiteboul, Buneman, and Suciu is a good starting point for exploring this area. There is a vast body of work in knowledge discovery; the recent book by Han and Kamber is a good starting reference.

Project Website

http://www.cs.umd.edu/projects/seus/
Seus is our testbed software for interactive data mining of semistructured data.

Project Website

http://www.cs.umd.edu/projects/vqbd/
VQBD is a system for exploration of semistructured XML data.

Project Website

http://www.cs.umd.edu/projects/xsq/
XSQ is a XPath query engine for streaming XML.