Sudarshan S. Chawathe's Publications

The following list includes pointers to abstracts in text format, papers in PDF format, and citations in BibTeX format. If you're looking for something that isn't here, please send me email, and I'll be glad to get you a copy.

If your browser supports CSS then hovering on a title should display the abstract; if not, the abstract is only a click away.

Low-Latency Indoor Localization Using Bluetooth Beacons Indoor localization refers to the task of determining the location of a traveler in spaces (such as large building complexes or airport terminals) using coordinates appropriate to those spaces (such as floor and room number or airport terminal and gate). Indoor localization using Bluetooth beacons is attractive because of the low cost and high spatial selectivity of Bluetooth devices. However, a significant drawback of the Bluetooth protocol for this application is the large delay incurred in the discovery phase of the protocol, which is the phase used for detecting beacons. These delays, of approximately 20 seconds, hamper the use of this localization method because typical walking speeds are likely to change the set of potentially visible beacons part-way through the discovery phase. We study the causes of these delays and propose methods for alleviating them for indoor localization applications. We formalize the key problem of finding a minimum-cost complete beacon-probing plan and present an algorithm for generating such plans. Sudarshan S. Chawathe. In Proceedings of the 12th International IEEE Conference on Intelligent Transportation Systems (ITSC), St. Louis, Missouri, October 2009. To appear.
[ abstract | paper | citation ]
Effective Whitelisting for Filesystem Forensics. Forensic analysis of the large filesystems commonly found on current computers requires an effective method for categorizing and prioritizing files in order to avoid overwhelming the investigator. A key technique for this purpose is whitelisting files, i.e., skipping the detailed analysis of files that match files in a well known reference collection of files. Effective use of this technique requires an efficient method to match files, detecting not only exact matches, but also near matches or approximate matches. This paper outlines the requirements for such matching, formalizes them as the bounded best match and approximate bounded near-match problems, and describes methods to solve these problems. In particular, the approximate bounded near-match problem is mapped to the problem of finding near neighbors in a high-dimensional metric space and solved using locality-sensitive hashing. Sudarshan S. Chawathe. In Proceedings of the 7th IEEE Intelligence and Security Informatics Conference (ISI), Richardson, Texas, June 2009.
[ abstract | paper | citation ]
Beacon placement for indoor localization using Bluetooth. We describe a method for determining the location of a mobile device, such as a handheld computer or mobile phone, in an indoor environment using Bluetooth beacons. Since it uses inexpensive commodity devices, this method is inexpensive to deploy. The limited range of Bluetooth reception is used to advantage. Another important advantage of this method is that it allows the mobile device to determine its location while remaining anonymous, unidentified to the beacons or other nearby devices. In such a deployment, an important design task is the placement of beacons. Signal propagation in indoor environments is complex, affected by factors such as floor-plans and duct-work, varying transmission and reflection properties of building materials and furniture, and interference from other devices. Therefore, the area from which a beacon is visible is very irregular and not well approximated by simple models such as ellipsoids. Our solution permits complex reception characteristics to be accurately modeled and provides a simple method for choosing beacon locations. Sudarshan S. Chawathe. In Proceedings of the 11th International IEEE Conference on Intelligent Transportation Systems (ITSC), Beijing, China, October 2008.
[ abstract | paper | citation ]
Using dead drops to improve data dissemination in very sparse equipped traffic. We study the dissemination of data in a vehicular network in which the density of equipped vehicles, or all vehicles, is very low, and in which there is no centrally connected infrastructure, such as a network of gateways. In this environment, a vehicle is rarely in communication range of another vehicle, so that protocols based on forwarding messages along a chain of vehicles in communication range are not effective. Similarly, the lack of centrally connected network infrastructure precludes protocols that use a fixed network of gateways to transfer messages. Our solution is based on using a collection of stationary, stand-alone dead drops that exchange data with vehicles that pass by them. Although the dead drops do not communicate among themselves or with a central network, their limited ability to store and forward data can improve the connectivity of a vehicular network. We use simulation-based experiments to demonstrate that the use of dead drops results in significant improvements in both the rate of data dissemination and the robustness to frequent changes in the set of participating vehicles. Sudarshan S. Chawathe. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Eindhoven, Netherlands, June 2008.
[ abstract | paper | citation ]
Marker-based localizing for indoor navigation. We present a method for assigning markers to locations to enable navigation by observing short sequences of markers. A motivating application is indoor navigation, where pedestrians can determine their location using the sequence of recently encountered markers as they walk along hallways in a building. While we may solve this problem simply by assigning a unique marker to each location, such a solution limits the granularity of localization due to limits on the number of distinct markers. We present a more efficient solution that uses a sequence of recently seen markers to determine locations. We demonstrate that our solution yields significant improvements over the naive solution even when the marker-sequences are quite short, consisting of only three of four markers each. Sudarshan S. Chawathe. In Proceedings of the 10th International IEEE Conference on Intelligent Transportation Systems (ITSC), Seattle, Washington, October 2007.
[ abstract | paper | citation ]
Protecting transportation infrastructure. Protecting transportation infrastructure is an important component of today's homeland security effort. This article highlights related technical challenges and provides a brief survey of several IT research streams in this important application area, along with two case studies. Daniel Zeng, Sudarshan S. Chawathe, Hua Huang, and Fei-Yue Wang. IEEE Intelligent Systems, September/October 2007.
[ abstract | paper | citation ]
Segment-based map matching. We address the problem of determining the path of a vehicle on a given vector map of roads, based on tracking data such as that obtained from onboard GPS receivers. We describe a method that is based on a piecewise matching of track segments to map features. A notable feature of our method is that it is applicable to a large class of existing methods. We discuss metrics for evaluating the output of map-matching methods and briefly describe our implementation of a map-matching system based on our methods. Sudarshan S. Chawathe. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Istanbul, Turkey, June 2007.
[ abstract | paper | citation ]
Interimistic Data Dissemination. We present low-overhead protocols for disseminating streaming data in an interimistic environment in which the availability of computers, networks, and other resources is very unpredictable. An important example of such an environment is the state of computing and communication resources immediately following a natural disaster or a similar disruptive event. A key characteristic of such environments is the continual change in network membership and topology. Our protocols are based on using the intentional and extensional overlap among the data requirements of network nodes. We describe our testbed, CoDD, that implements these protocols and provide an experimental evaluation of the techniques. Sudarshan S. Chawathe and Abheek Anand. Information Systems and e-Business Management (ISeB), 5(3):229--253, June 2007.
[ abstract | paper | citation ]
Organizing hot-spot police patrol routes. We address the problem of planning patrol routes to maximize coverage of important locations (hot spots) at minimum cost (length of patrol route). We model a road network using an edge-weighted graph in which edges represent streets, vertices represent intersections, and weights represent importance of the corresponding streets. We describe efficient methods that use this input to determine the most important patrol routes. In addition to the importance of streets (edge weights), important routes are affected by the topology of the road network. Our methods permit automation of a labor-intensive stage of the patrol-planning process and aid dynamic adjustment of patrol routes in response to changes in the input graph (as a result of a developing situation, for instance). Sudarshan S. Chawathe. In Proceedings of the 5th IEEE Intelligence and Security Informatics Conference (ISI), pages 78-85, New Brunswick, New Jersey, May 2007.
[ abstract | paper | citation ]
Strategic Web-Service Agreements. This paper addresses issues of strategy in Web-service composition, and inter-site collaboration in general. The results are useful for both human and artificial agents of a site who are responsible for determining Web-service agreements that are most profitable for that site. We discuss three specific questions in this area. (1) How should the profit resulting from a composition of Web-services be divided among the participants? (2) How can we counter the risk of service-providers misrepresenting their services in an attempt to gain a larger share of the profit? (3) Are stable configurations guaranteed or feasible when each service-provider's decisions on how to collaborate (permit compositions) are guided solely by the goal of maximizing its profit? Sudarshan S. Chawathe. In Proceedings of the 4th IEEE International Conference on Web Services (ICWS), Chicago, Illinois, September 2006.
[ abstract | paper | citation ]
Inter-Vehicle Data Dissemination in Sparse Equipped Traffic. We address the following question: How can data be disseminated in an inter-vehicle communications (IVC) network when the traffic of IVC-equipped vehicles is very sparse (e.g., one IVC-equipped vehicle every few minutes), using minimal extra-vehicular infrastructure? We present a data-dissemination scheme that uses wireless dead drops (dead letter boxes) as intermediaries between vehicles. These dead drops are simple transceivers with storage and are not connected to other dead drops, the Internet, or other networks. An important question is the placement of such dead drops. We formulate the corresponding optimization problem and prove that it is NP-hard. We present an optimal greedy approximation algorithm for solving the optimization problem. Our algorithm is based on mapping the dead-drop placement problem to the problem of computing a minimum-weight spanning sub-hypergraph. Sudarshan S. Chawathe. In Proceedings of the 9th IEEE International Conference on Intelligent Transportation Systems (ITSC), Toronto, Canada, September 2006.
[ abstract | paper | citation ]
Tracking Changes in Healthcare Documents. We present methods for monitoring a large, diverse, and autonomously modified collection of healthcare documents on the Web. Our methods do not require document-providers to offer any special services. They are based on explicating changes between document versions on a per-user basis by using differencing algorithms. These changes are presented to users in the context of the documents using special XML elements. In order to effectively browse changes in large document collections, we use a variable-resolution XML browser. A noteworthy feature of this browser is that it produces usable displays at any level of detail specified by a user. Sudarshan S. Chawathe. In Proceedings of the 19th IEEE International Conference on Computer-Based Medical Systems (CBMS), pages 137-142, Salt Lake City, Utah, June 2006.
[ abstract | paper | citation ]
Distributing the Cost of Securing a Transportation Infrastructure. We address the problem of fairly distributing the cost of system-wide improvements to the security of a transportation infrastructure over the beneficiaries. We present a framework that models transportation links and the emergence (magnitude and frequency) and propagation of threats. The cost-distribution is based on a weighted sum that characterizes the expected reduction in the vulnerability of a site as a result of the security improvements. Sudarshan S. Chawathe. In Proceedings of the 4th IEEE Intelligence and Security Informatics Conference (ISI), pages 596-601, San Diego, California, May 2006. (Also appears in the IEEE Intelligent Transporation Systems Society Newsletter,
[ abstract | paper | citation ]
Fair Policies for Travel on Neighborhood Streets. The residents of each street in a neighborhood can improve their travel times by forming agreements with the residents of other streets to permit mutual thoroughfare. However, this benefit comes with the cost of additional neighborhood traffic. The key problem addressed by this paper is that of determining a policy that is fair to each street's residents' desires to minimize their travel time by using neighborhood streets while also minimizing traffic on their street. We model this problem using a street graph and apply game theoretic methods in order to characterize solutions. Sudarshan S. Chawathe. In Proceedings of the 8th IEEE International Conference on Intelligent Transportation Systems (ITSC) , pages 1027-1032, Vienna, Austria, September 2005.
[ abstract | paper | citation ]
Differencing Data Streams. We present external-memory algorithms for differencing large hierarchical datasets. Our methods are especially suited to streaming data with bounded differences. For input sizes m and n and maximum output (difference) size e, the I/O, RAM, and CPU costs of our algorithm rdiff are, respectively, m+n, 4e+8, and O(MN). That is, given 4e+8 blocks of RAM, our algorithm performs no I/O operations other than those required to read both inputs. We also present a variant of the algorithm that uses only four blocks of RAM, with I/O cost 8me + 18m + n + 6e + 5 and CPU cost O(MN). Sudarshan S. Chawathe. In Proceedings of the International Database Engineering and Applications Symposium (IDEAS), pages 273-284,
[ abstract | paper | citation ]
XSQ: A Streaming XPath Engine. We have implemented and released the XSQ system for evaluating XPath queries on streaming XML data. XSQ supports XPath features such as multiple predicates, closures, and aggregation, which pose interesting challenges for streaming evaluation. Our implementation is based on using a hierarchical arrangement of augmented finite state automata. A design goal of XSQ is buffering data for the least amount of time possible. We present a detailed experimental study that characterizes the performance of XSQ and related systems, and that illustrates the performance implications of XPath features such as closures. Feng Peng and Sudarshan S. Chawathe. ACM Transactions on Database Systems (TODS), 30(2):577-623, June 2005.
[ abstract | paper | citation ]
Data Management in Interimistic Environments. We address the problem of managing information in an interimistic environment following a disruptive event that results in widespread failures of computers, networks, and other resources. We motivate the problem and discuss the design requirements. As a concrete example of such information management, we present our work on data dissemination in an interimistic environment in which both network membership and data characteristics are subject to continual change. Our methods use low-overhead protocols to determine a suitable method for routing a stream of data. We also present a brief experimental evaluation of our methods. Abheek Anand and Sudarshan S. Chawathe. In Proceedings of the Third Workshop on E-Business (WeB), Washington, D.C., December
[ abstract | paper | citation ]
Real-Time Traffic-Data Analysis. We describe a method for real-time monitoring of data generated by traffic sensors. We provide a system architecture and discuss three key components: (1) a streaming query processor that is used to reduce the volume of data; (2) a pattern-matching module that is used to detect when a developing traffic situation resembles one flagged earlier; and (3) an interface that efficiently displays the stream of sensor data in a user-configurable manner. Sudarshan S. Chawathe. In Proceedings of the 7th IEEE International Conference on Intelligent Transportation Systems (ITSC), pages 112-117, Washington, D.C., October 2004.
[ abstract | paper | citation ]
Managing RFID Data. Radio-Frequency Identification (RFID) technology enables sensors to efficiently and in- expensively track merchandise and other objects. The vast amount of data resulting from the proliferation of RFID readers and tags poses some interesting challenges for data management. We present a brief introduction to RFID technology and highlight a few of the data management challenges. Sudarshan S. Chawathe, Venkat Krishnamurthy, Sridhar Ramachandran, and Sanjay Sarma. In Proceedings of the International Converence on Very Large Data Bases (VLDB), pages 1189-1195, Toronto, Canada, August 2004.
[ abstract | paper | citation ]
Control of Personal Location Data. Several recent technological developments have contributed to the ability of individ- uals and organizations to monitor people at a level of detail previously reserved for science-fiction movies. Global-positioning system (GPS) devices, video cameras, RFID tags, sensors [19], and other devices that permit fine-grained monitoring of objects and people are now produced with mass-market efficiencies and very low prices. Concurrent improvements in the capabilities and prices of networking and mass storage devices have made the collection and perpetual storage of detailed monitoring data not only feasible, but quite inexpensive. In fact, in most cases there are few, if any, technological reasons for ever destroying data collected in this manner. Sudarshan S. Chawathe. In Proceedings of the Location Privacy Workshop . Schoodic Peninsula, Maine. August 2004.
[ abstract | paper | citation ]
Privacy-Preserving Inter-Database Operations. We present protocols for distributed computation of relational intersections and equi-joins such that each site gains no information about the tuples at the other site that do not intersect or join with its own tuples. Such protocols form the building blocks of distributed information systems that manage sensitive information, such as patient records and financial transactions, that must be shared in only a limited manner. We discuss applications of our protocols, outlining the ramifications of assumptions such as semi-honesty. In addition to improving on the efficiency of earlier protocols, our protocols are asymmetric, making them especially applicable to applications in which a low-powered client interacts with a server in a privacy-preserving manner. We present a brief experimental study of our protocols. Gang Liang and Sudarshan S. Chawathe. In Proceedings of the Second Symposium on Intelligence and Security Informatics (ISI) . Tucson, Arizona. June 2004.
[ abstract | paper | citation ]
Privacy-Preserving Inter-Database Operations. We present protocols for distributed computation of relational intersections and equi-joins such that each site gains no information about the tuples at the other site that do not intersect or join with its own tuples. Such protocols form the building blocks of distributed information systems that manage sensitive information, such as patient records and financial transactions, that must be shared in only a limited manner. We discuss applications of our protocols, outlining the ramifications of assumptions such as semi-honesty. In addition to improving on the efficiency of earlier protocols, our protocols are asymmetric, making them especially applicable to applications in which a low-powered client interacts with a server in a privacy-preserving manner. We present a brief experimental study of our protocols. Gang Liang and Sudarshan S. Chawathe. Technical Report CS-TR-4564 (UMIACS-TR-2004-09). Computer Science Department, University of Maryland, College Park, Maryland 20742. February 2004.
[ abstract | paper | citation ]
Skipping Streams with XHints. When streaming semi-structured data is processed by a well-designed query processor, parsing constitutes a significant portion of the running time. Further improvements in performance therefore require some method to overcome the high cost of parsing. We have designed a general-purpose mechanism by which a producer of streaming data may augment the data stream with hints that permit a downstream processor to skip parsing parts of the stream. Inserting such hints requires additional processing by the producer of data; however, the resulting stream is more valuable to consumers, making such processing worthwhile. In this paper, we focus on hints that are designed to improve the throughput of a streaming XML query engine. We present a set of hint schemes and describe how a query engine can improve its performance by taking advantage of the hints. Finally, we demonstrate the benefits of our approach using an experimental study. Akhil Gupta and Sudarshan S. Chawathe. Technical Report CS-TR-4566 (UMIACS-TR-2004-11) Computer Science Department, University of Maryland, College Park, Maryland 20742. February 2004.
[ abstract | paper | citation ]
Cooperative Data Dissemination in a Serverless Environment. We describe the design and implementation of CoDD, a system for cooperative data dissemination in a serverless environment. CoDD allows a set of loosely-coupled, distributed peer nodes to receive subsets of a data stream, by describing interests using subscription queries. CoDD maintains statistical information on the characteristics of data and queries, and uses it to organize nodes in a hierarchical, data-aware topology. Data is disseminated using overlays which are created to try to minimize the excess data flowing through the system, while maintaining low latency and satisfying fanout constraints. CoDD is designed to allow nodes to determine individual degrees of contribution to the system, and to adapt gracefully to temporal changes in the data distribution using an adaptive reorganization component. We present the results of our experimental evaluation of CoDD. Abheek Anand and Sudarshan S. Chawathe. Technical Report CS-TR-4562 (UMIACS-TR-2004-07) Computer Science Department, University of Maryland, College Park, Maryland 20742. February 2004.
[ abstract | paper | citation ]
Streaming XPath Subquery Evaluation. We describe a method for the streaming evaluation of XPath queries that have subqueries in pred- icates. Our method rewrites XPath queries into a set of predicate-free labeled linear-form expressions (LFEs). These LFEs are used to generate a push-down transducer that enables efficient management of a buffer and hierarchical index at runtime. To the best of our knowledge, our method is the first to support XPath subqueries in a streaming environment. Our method also provides optimal buffering, minimum-latency output, and optimal predicate evaluation. The method has been fully implemented and publicly released in the XSQ system. We present an experimental study of XSQ and related systems on both real-life and synthetic datasets, and investigate how subqueries and other features affect the performance of these systems. Feng Peng and Sudarshan S. Chawathe. Technical Report CS-TR-4376 (UMIACS-TR-2002-56). Computer Science Department, University of Maryland, College Park, Maryland 20742. January 2004.
[ abstract | paper | citation ]
Semistructured Data in Relational Databases. Semistructured data is data whose structure is irregular, incomplete, and frequently changing. Traditional database methods are ill-suited to storing and querying such data because they rely on a well-defined structure (schema). While such data occurs in several formats, our focus is on the XML format. We present several alternatives for storing semistructured data in relational databases and discuss their merits and limitations. We also discuss the methods used in commercial database management systems from IBM, Oracle, Microsoft, and Sybase. Throughout, we use a concrete running example to illustrate the features of semistructured data, explain the methods, clarify their features, and motivate further improvements. Sudarshan S. Chawathe. Chapter 25 in Practical Handbook of Internet Computing. Edited by Munindar P. Singh. CRC Press. 2004.
[ abstract | citation ]
Optimal Buffering for Streaming XPath Evaluation. We motivate and present a definition of optimal buffering for streaming evaluation of XPath queries. We consider a large fragment of XPath that includes multiple (correlated) subqueries and reverse (up the document tree) axes. We describe a method for XPath evaluation with optimal buffering. We present the results of an experimental evaluation of our methods based on our implementation, which is freely available. Feng Peng and Sudarshan S. Chawathe. Technical Report CS-TR-4376 (UMIACS-TR-2002-56). Computer Science Department, University of Maryland, College Park, Maryland 20742. January 2004.
[ abstract | paper | citation ]
XPath Queries on Streaming Data. We present the design and implementation of the XSQ system for querying streaming XML data using XPath 1.0. Using a clean design based on a hierarchical arrangement of pushdown transducers augmented with buffers, XSQ supports features such as multiple predicates, closures, and aggregation. XSQ not only provides high throughput, but is also memory efÂ£cient: It buffers only data that must be buffered by any streaming XPath processor. We also present an empirical study of the performance characteristics of XPath features, as embodied by XSQ and several other systems. Feng Peng and Sudarshan S. Chawathe. In Proceedings of the ACM SIGMOD International Conference on Management of Data. June 9-12, 2003. San Diego, California.
[ abstract | paper | citation ]
Tracking Hidden Groups using Communications. We address the problem of tracking a group of agents based on their communications over a network when the network devices used for communication (e.g., phones for telephony, IP addresses for the Internet) change continually. We present a system design and describe our work on its key modules. Our methods are based on detecting frequent patterns in graphs and on visual exploration of large amounts of raw and processed data using a zooming interface. Sudarshan S. Chawathe. In Proceedings of the NSF/NIJ Symposium on Intelligence and Security Informatics (ISI). Tucson, Arizona. June 2003.
[ abstract | paper | citation ]
Efficient Peer-to-Peer Searches Using Result-Caching. Existing peer-to-peer systems implement a single function well: data lookup. There is now a wealth of research describing how to reliably disseminate, and to later retrieve, data in a scalable and load-balanced manner. However, searching has received less attention. The current state of the art is to distribute inverted indexes in the name space. Intersection of distributed sets can be made more efficient by exchanging bloom filters prior to moving objects. This paper proposes an orthogonal and complementary technique: using result-caching to avoid duplicating work and data movement. The main contribution of the paper is a new data structure, the view tree, that can be used to efficiently store and retrieve such prior results. These results, which can also be thought of as materialized views, can then be used to efficiently answer future queries. Note that object attributes could either be derived from application semantics (e.g. meta-data from files in a filesystem) or computed via techniques such as latent semantic indexing. Bobby Bhattacharjee, Sudarshan Chawathe, Vijay Gopalkrishnan, Pete Keleher, and Bujor Silaghi. In Proceedings of The International Workshop on Peer-to-Peer Systems (IPTPS). Berkeley, California. February 2003.
[ abstract | paper | citation ]
Managing Historical Semistructured Data. The Extensible Markup Language (XML) provides a standard, text-based serialization for hierarchically structured tagged data. An important feature of XML is that the hierarchical tag structure is controlled by the source of the data, not the standard. This feature permits each database to be presented in the form that best suits it, and greatly simplifies the task of publishing data. Consequently, XML data has proliferated in both quantity and variety. The very flexibility in structure that simplifies data publishing also complicates the task of effective storage of XML data, since traditional methods rely on the presence of a rigid structure (schema). For historical databases, older versions of the database must remain accessible as the database evolves. Efficient data retrieval in the face of such an ever-growing data store presents additional challenges. In this chapter, we describe several methods for storing historical XML data, ranging from simple software layers on top of standard relational databases to solutions tailored specifically to historical XML. We discuss methods from the research literature and use a running example that aids in comparing the methods. Sudarshan S. Chawathe. Chapter 3 in Advances in Computers , Volume 57. Edited by Marvin Zelkowitz. Elsevier Science. 2003.
[ abstract | citation ]
Tracking Moving Clutches in Streaming Graphs. We address the problem of tracking groups of interacting entities as they move in a graph with vertices representing hosts or locations and edges representing interactions between hosts. The graph of interactions is modeled as a stream of edges (with the arrival of an edge signifying an interaction between the hosts it connects). This problem arises in applications such as tracking groups of fraudulent callers in a telephone network and tracking identities of malicious agents (programs or people) on a data network. We present a formalization of this problem and a streaming solution that uses bounded storage and provides real-time response. Our solution is based on maintaining, at each instant, an approximation of the streaming graph seen so far. We present empirical results to quantify the effectiveness of our solution. Sudarshan S. Chawathe. Technical Report CS-TR-4376 (UMIACS-TR-2002-56). Computer Science Department, University of Maryland, College Park, Maryland 20742. May 2002.
[ abstract | paper | citation ]
SEuS: Structure Extraction using Summaries. We study the problem of finding frequent structures in semistructured data (represented as a directed labeled graph). Frequent structures are graphs that are isomorphic to a large number of subgraphs in the data graph. Frequent structures form building blocks for visual exploration and data mining of semistructured data. We overcome the inherent computational complexity of the problem by using a summary data structure to prune the search space and to provide interactive feedback. We present an experimental study of our methods operating on real datasets. The implementation of our methods is capable of operating on datasets that are two to three orders of magnitude larger than those described in prior work. Shayan Ghazizadeh and Sudarshan S. Chawathe. In Proceedings of the The 5th International Conference on Discovery Science. Lubeck, Germany, November 2002.
[ abstract | paper | citation ]
Discovering Frequent Structures using Summaries. We study the problem of finding frequent structures in semistructured data (represented as a directed labeled graph). Frequent structures are graphs that are isomorphic to a large number of subgraphs in the data graph. Frequent structures form building blocks for visual exploration and data mining of semistructured data. We overcome the inherent computational complexity of the problem by using a summary data structure to prune the search space and to provide interactive feedback. We present an experimental study of our methods operating on real datasets. The implementation of our methods (which is freely available) is capable of operating on datasets that are two to three orders of magnitude larger than those described in prior work. Shayan Ghazizadeh and Sudarshan S. Chawathe. Technical Report CS-TR-4364. Computer Science Department, University of Maryland. College Park, Maryland. November 2001.
[ abstract | paper | citation ]
XSQ: Streaming XPath Queries. We demonstrate XSQ, a streaming XPath query processor. XSQ is the first system for querying XML data streams that supports XPath features such as closures, aggregations, and multiple predicates. XSQ is also very efficient. As an example, simply parsing an XML stream derived from the well-known DBLP dataset requires 297 seconds. To evaluate the XPath query //article[@key]//title/text() over it, XSQ requires only 84 seconds of additional time. The design of XSQ is based on a hierarchical arrangement of pushdown transducers augmented with queues; this clean design is easy to understand, prove correct, and expand to more complex queries. Feng Peng and Sudarshan S. Chawathe. In Proceedings of the IEEE International Conference on Data Engineering. Bangalore, India. March 2003. Demonstration description.
[ abstract | paper | citation ]
VQBD: Exploring Semistructured Data. The VQBD ("vee cubed") project addresses the following problem: What is the best way to explore an XML document of unknown structure and content? The VQBD system we demonstrate has the following notable features: (1) The sub-tasks of exploration, viz., visualization, querying, and browsing, are complementary. (2) The level of detail presented to the user scales smoothly over a wide range. (3) There is no required structure assumed of the data. (4) Any available explicit structure is effectively used. (5) Implicit structure is detected and used. (6) It is easy to impose and use structure at run time. Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo. In Proceedings of the ACM SIGMOD International Conference on Management of Data, page 603, Santa Barbara, California. May 2001. Demonstration description.
[ abstract | paper | citation ]
VQBD: Visualizing, Querying, and Browsing Semistructured Data. The VQBD project addresses the following problem: What is the best way to explore an XML document of unknown structure and content? We focus on XML documents that are too large to browse in their entirety, even with the assistance of pretty-printing software e.g., multi-megabyte or larger XML documents. In this context, we use the term data exploration to refer to the process by which a user gathers the information needed to use the data for a speci c purpose e.g., generating a report, writing queries, building user interfaces, writing applications. In a relational or object database, the schema e.g., table definitions, class definitions, integrity constraints, and stored procedures provides some of the information necessary for writing queries and applications. However, the schema is rarely sufficient for these tasks. Typically, one must probe and browse the database to discover data coverage, typical and exceptional values, and other information required to gain a better understanding of the database. In an XML environment, the need for such data exploration is much greater because it is quite likely that the XML data of interest is not accompanied by a schema. Indeed, much XML data is semistructured, meaning its structure is irregular, incomplete, and frequently changing. The rapid adoption of XML as a data exchange standard makes this semistructured data exploration problem increasingly important. Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo. Extended version of demonstration description. http://cs.umaine.edu/~chaw/.
[ abstract | paper | citation ]
Context-Sensitive Search and Exploration of XML Text. XML permits documents with arbitrary nested context (tag structure). We investigate how this context may be used to aid the task of searching and exploring XML text. We describe the design and implementation of the Cextor system, which includes a context-sensitive search engine. Cextor also includes a novel technique for organizing and exploring very large search results based on context. A distinguishing feature of this technique is that it does not assume search results are of modest size. Rather, it is designed to cope with search results that are potentially the size of the database. We present the results of an experimental evaluation of Cextor on data derived from the Web. Thomas Baby and Sudarshan S. Chawathe. Technical Report. Computer Science Department, University of Maryland, College Park, Maryland 20742. May 2001.
[ abstract | paper | citation ]
Describing and Manipulating XML Data. This paper presents a brief overview of data management using the Extensible Markup Language (XML). It presents the basics of XML and the DTDs used to constrain XML data, and describes metadata management using RDF. It also discusses how XML data is queried, referenced, and transformed using stylesheet language XSLT and referencing mechanisms XPath and XPointer. Sudarshan S. Chawathe. Bulletin of the IEEE Technical Committee on Data Engineering, 22(3):3-9, 1999.
[ abstract | paper | citation ]
Comparing hierarchical data in external memory. We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn+7m+5n, 6S, and O(MN + (M+N)S^1.5 ), where M and N are the input tree sizes, S is the block size, m = M/S, and n = N/S. This algorithm can make effective use of surplus RAM capacity to quadratically reduce I/O cost. We extend to trees the commonly used mapping from sequence comparison problems to shortest-path problems in edit graphs. Sudarshan S.Chawathe. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, pages 90-101, Edinburgh, Scotland, U.K., September 1999.
[ abstract | paper | citation ]
Managing historical semistructured data. Semistructured data may be irregular and incomplete and does not necessarily conform to a fixed schema. As with structured data, it is often desirable to maintain a history of changes to data, and to query over both the data and the changes. Representing and querying changes in semistructured data is more difficult than in structured data due to the irregularity and lack of schema. We present a model for representing changes in semistructured data and a language for querying over these changes. An important feature of our approach is that we represent and query changes directly as annotations on the affected data, instead of indirectly as the difference between database states. We describe the implementation of our model and query language. We present extensions that permit convenient snapshot-based access in our change-based data model. We also describe our design and implementation of a query subscription service that permits users to subscribe to changes in semistructured information sources. Sudarshan S. Chawathe, Serge Abiteboul, and Jennifer Widom. Theory and Practice of Object Systems, 5(3):143-162, August 1999.
[ abstract | paper | citation ]
Managing Change in Heterogeneous Autonomous Databases. Information relevant to a task at hand is often scattered across a collection of heterogeneous, autonomous databases. Individual databases in such a collection are owned and managed by independent, and often competing, entities that cooperate to only a limited extent. For example, the collection of databases used in the construction of a building includes databases owned by the architect, the construction company, the electrical contractor, and so on. Such autonomous database collections are also common on the Internet. For example, the collection of Internet databases with information about San Francisco consists of databases operated by several competing entities. Making effective use of such collections of heterogeneous, autonomous databases presents several challenges due to the absence of traditional database facilities such as locks, transactions, and standard query languages. In particular, understanding and controlling how such databases evolve is an important problem that traditional database techniques are ill-equipped to address. Sudarshan S. Chawathe. Ph.D. Thesis, Stanford University, 1999.
[ abstract | paper | citation ]
Representing and querying changes in semistructured data. Semistructured data may be irregular and incomplete and does not necessarily conform to a fixed schema. As with structured data, it is often desirable to maintain a history of changes to data, and to query over both the data and the changes. Representing and querying changes in semistructured data is more difficult than in structured data due to the irregularity and lack of schema. We present a model for representing changes in semistructured data and a language for querying over these changes. An important feature of our approach is that we represent and query changes directly as annotations on the affected data, instead of indirectly as the difference between database states. We describe the implementation of our model and query language. We also describe the design and implementation of a query subscription service that permits users to subscribe to changes in semistructured information sources. Sudarshan S. Chawathe, Serge Abiteboul, and Jennifer Widom. In Proceedings of the International Conference on Data Engineering , Orlando, Florida, February 1998.
[ abstract | paper | citation ]
Meaningful change detection in structured data. Detecting changes by comparing data snapshots is an important requirement for difference queries, active databases, and version and configuration management. In this paper we focus on detecting meaningful changes in hierarchically structured data, such as nested-object data. This problem is much more challenging than the corresponding one for relational or flat-file data. In order to describe changes better, we base our work not just on the traditional "atomic" insert, delete, update operations, but also on operations that move an entire sub-tree of nodes, and that copy an entire sub-tree. These operations allows us to describe changes in a semantically more meaningful way. Since this change detection problem is NP-hard, in this paper we present a heuristic change detection algorithm that yields close to "minimal" descriptions of the changes, and that has fewer restrictions than previous algorithms. Our algorithm is based on transforming the change detection problem to a problem of computing a minimum-cost edge cover of a bipartite graph. We study the quality of the solution produced by our algorithm, as well as the running time, both analytically and experimentally. Sudarshan S. Chawathe and Hector Garcia-Molina. In Proceedings of the ACM SIGMOD International Conference on Management of Data , pages 26-37, Tucson, Arizona, May 1997.
[ abstract | paper | extended version | citation ]
Representative objects: Concise representations of semistructured, hierarchical data. In this paper we introduce the representative object, which uncovers the inherent schema(s) in semistructured, hierarchical data sources and provides a concise description of the structure of the data. Semistructured data, unlike data stored in typical relational or object-oriented databases, does not have fixed schema that is known in advance and stored separately from the data. With the rapid growth of the World Wide Web, semistructured hierarchical data sources are becoming widely available to the casual user. The lack of external schema information currently makes browsing and querying these data sources inefficient at best, and impossible at worst. We show how representative objects make schema discovery efficient and facilitate the generation of meaningful queries over the data. Svetlozar Nestorov, Jeffrey D. Ullman, Janet Wiener, and Sudarshan S. Chawathe. In Proceedings of the International Conference on Data Engineering , pages 79-90, Birmingham, U.K., 1997.
[ abstract | paper | citation ]
Change Management in Heterogeneous, Semistructured Databases. S. Chawathe, V. Gossain, X. Liu, J. Widom, and S. Abiteboul. Technical Report, Stanford University Database Group, 1997. paper; citation;
An expressive model for comparing tree-structured data. S. Chawathe and H. Garcia-Molina. Technical Report, Stanford University Database Group, 1997. paper; citation.
Change detection in hierarchically structured information. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. In Proceedings of the ACM SIGMOD International Conference on Management of Data , pages 493-504, Montreal, Quebec, June 1996. paper; citation; Extended version.
A toolkit for constraint management in heterogeneous information systems. S. Chawathe, H. Garcia-Molina, and J. Widom. In Proceedings of the International Conference on Data Engineering , pages 56-65, New Orleans, Louisiana, 1996. paper; citation; Extended version.
A standard textual interchange format for the Object Exchange Model (OEM). R. Goldman, S. Chawathe, A. Crespo, and J. McHugh Technical Report, Stanford University Database Group, 1996. paper; citation.
Flexible constraint management for autonomous distributed databases. S. Chawathe, H. Garcia-Molina, and J. Widom. Invited paper. Bulletin of the IEEE Technical Committee on Data Engineering, 17(2):23-27, 1994. paper; citation.
The Tsimmis Project: Integration of Heterogeneous Information Sources. S. Chawathe, H. Garcia-Molina, J. Hammer, Y. Papakonstantinou, J. Ullman, and J. Widom. Invited paper. In Proceedings of 100th Anniversary Meeting of the Information Processing Society of Japan , pages 7-18, Tokyo, Japan, October 1994. paper; citation.
On index selection schemes for nested object hierarchies. S. Chawathe, M-S. Chen, and P. Yu. In Proceedings of the International Conference on Very Large Data Bases , pages 331-341, Santiago, Chile, September 1994. paper; citation; Extended version.

Sudarshan S. Chawathe