The vast majority of data used by scientists, engineers, and decision makers resides in a poorly structured collection of reports, memos, and other documents in a myriad of file formats. The increasing densities and falling prices of storage devices make it practical to store for perpetuity all such data that crosses a scientist's electronic desktop. The resulting Information History has the potential to serve as an intelligent assistant by detecting trends and patterns, suggesting potential collaborators, uncovering relevant documents and data from diverse sources, etc., resulting in dramatic increases in the effectiveness of information use. However, current technology, which focuses on either fully structured or completely unstructured databases, cannot be effectively adapted to extracting knowledge from a large historical semistructured database. The goal of the proposed research is to develop suitable formulations of the knowledge discovery problem for historical semistructured databases and to develop, implement, and evaluate solutions. The research thrusts are: (1) devising knowledge discovery operators for semistructured data and an algebra for them; (2) developing efficient methods for implementing these operators; (3) determining methods to incorporate structure incrementally and flexibly; and (4) incorporating differential processing. A Personal Information History Assistant application serves as a test-bed for this research.
In market-basket data mining, a critical step is the finding of frequent itemsets (sets of items that are frequently bought in a single market basket). An efficient method for finding such itemsets along with their frequencies (number of occurrences in the dataset) is essential for enabling subsequent analysis of the data. Analogously, we believe counting frequent structures in semistructured data is a critical step in mining semistructured data. By frequent structures, we mean graphs that are isomorphic to a large number of subgraphs of the graph representing the dataset. (As is common, we use a labeled graph to represent semistructured data.) We have developed methods for counting such frequent structures that can handle data that is two orders of magnitude larger than the data handled by prior methods of which we are aware. We have implemented our methods in a system called SEuS, which we have made freely available.
Most data mining methods are sensitive to input parameters that must be carefully selected for each dataset (e.g., the support threshold for market-basket analysis). While the task of selecting these parameters is rarely easy, it is especially difficult for semistructured data because of the lack of a schema or known regular structure. Our method, as implemented in SEuS, depends on a single parameter, called the structure complexity metric (SCM), which encodes the preferred trade-off between frequent (but smaller) structures and larger (but less frequent) structures. (This parameter is necessary, since otherwise a structure consisting of a single node will always be the most frequent.) An interesting feature of SEuS is that it responds to changes in the SCM in interactive time (few seconds), permitting a user to quickly explore the effects of different parameter settings.
Shayan Ghazizadeh and Sudarshan S. Chawathe, "SEuS: Structure Extraction using Summaries", (2002). In Proceedings of the 5th International Conference on Discovery Science. November 2002. Lubeck, Germany. Lecture Notes in Computer Science (LNCS) 2534.
Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo., "VQBD: Exploring Semistructured Data", (2001). In Proceedings of the ACM SIGMOD International Conference on Management of Data. Santa Barbara, California.
Feng Peng and Sudarshan S. Chawathe, "XPath Queries on Streaming Data", (2003). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 9-12 June 2003. San Diego, California.
Feng Peng and Sudarshan S. Chawathe, "XSQ: A Streaming XPath Engine", (2003). Technical Report. University of Maryland.
Sudarshan S. Chawathe, "Tracking Moving Clutches in Streaming Graphs", (2002). Technical Report CS-TR-4376 (UMIACS-TR-2002-56), University of Maryland, College Park. May 2002.
Sudarshan S. Chawathe, "Managing Historical XML Data", (2003). Advances in Computers volume 57, chapter 3, pages 109-169. Elsevier Science.
Shayan Ghazizadeh, Sudarshan S. Chawathe, "SEuS: Structure Extraction using Summaries", (2001). Technical Report CS-TR-4376 (UMIACS-TR-2002-56), University of Maryland, College Park, Nov 2001.
Sudarshan S. Chawathe, "Semistructured Data in Relational Databases", (2003). In Practical Handbook of Internet Computing. CRC Press.
Feng Peng and Sudarshan S. Chawathe, "XSQ: Streaming XPath Queries", (2003). In Proceedings of the IEEE International Conference on Data Engineering (ICDE). 5-8 March 2003. Bangalore, India.
Sudarshan S. Chawathe, "Tracking Hidden Groups Using Communications", (2003). In Proceedings of the NSF/NIJ Symposium on Intelligence and Security Informatics. Lecture Notes in Computer Science (LNCS) 2665. 2-3 June 2003. Tucson, Arizona.
Abheek Anand and Sudarshan S. Chawathe, "Disseminating Data in a Peer-to-Peer Network", (2003). Technical Report, University of Maryland. June 2003.
Feng Peng and Sudarshan S. Chawathe, "Streaming Evaluation of XPath Queries with Subqueries", (2003). Technical Report. University of Maryland. June 2003.
Akhil Gupta and Sudarshan S. Chawathe, "Skipping Streams Using XHints", (2003). Technical Report. University of Maryland. June 2003.
We have developed the SEuS system described above and released it as free software (under terms of the GNU General Public License (GPL). The source code and documentation is on the Web at http://www.cs.umd.edu/projects/seus/.
A related system we have developed is VQBD: The VQBD system addresses the problem of exploring large, unknown datasets (in XML form) very rapidly. It was demonstrated at the SIGMOD 2001 conference and has also been publicly released under the GNU GPL: http://www.cs.umd.edu/projects/vqbd/
As part of a sister project, we have developed the XSQ XPath query engine for streaming XML: http://www.cs.umd.edu/projects/xsq/.
This project has generated new methods for mining semistructured data, with a focus on handling data whose structure is not well known or easily described. Our methods have been published as well as implemented in the Seus system (link below). Another area of focus has been interactive exploration of semistructured data. Here, we have developed methods that provide very fast initial summaries of unknown datasets as well as permitting gradual increase in level of detail. The VQBD system (link below) implements some of these methods. Our work on a related project has also resulted in the XSQ XPath query engine for streaming XML data. In addition to our publications, all the above software is freely available.