The following list includes pointers to
abstracts in text format, papers in PDF format, and citations in
BibTeX format. If you're looking for something that isn't here,
please send me email, and I'll be glad to get you a copy.
If your browser supports CSS then hovering on a
title should display the abstract; if not, the abstract is only a
click away.
-
Low-Latency Indoor Localization Using
Bluetooth Beacons Indoor localization
refers to the task of determining the location of a traveler in
spaces (such as large building complexes or airport terminals)
using coordinates appropriate to those spaces (such as floor and
room number or airport terminal and gate). Indoor localization
using Bluetooth beacons is attractive because of the low cost
and high spatial selectivity of Bluetooth devices. However, a
significant drawback of the Bluetooth protocol for this
application is the large delay incurred in the discovery phase
of the protocol, which is the phase used for detecting beacons.
These delays, of approximately 20 seconds, hamper the use of
this localization method because typical walking speeds are
likely to change the set of potentially visible beacons part-way
through the discovery phase. We study the causes of these
delays and propose methods for alleviating them for indoor
localization applications. We formalize the key problem of
finding a minimum-cost complete beacon-probing plan and present
an algorithm for generating such plans. Sudarshan
S. Chawathe. In Proceedings of the 12th International IEEE
Conference on Intelligent Transportation Systems (ITSC),
St. Louis, Missouri, October 2009. To appear.
-
Effective Whitelisting for Filesystem
Forensics.
Forensic analysis of the large filesystems commonly found on
current computers requires an effective method for
categorizing and prioritizing files in order to avoid
overwhelming the investigator. A key technique for this
purpose is whitelisting files, i.e., skipping the detailed
analysis of files that match files in a well known reference
collection of files. Effective use of this technique
requires an efficient method to match files, detecting not
only exact matches, but also near matches or approximate
matches. This paper outlines the requirements for such
matching, formalizes them as the bounded best match and
approximate bounded near-match problems, and describes
methods to solve these problems. In particular, the
approximate bounded near-match problem is mapped to the
problem of finding near neighbors in a high-dimensional
metric space and solved using locality-sensitive hashing.
Sudarshan S. Chawathe.
In Proceedings of the 7th IEEE Intelligence and Security
Informatics Conference (ISI), Richardson, Texas,
June 2009.
-
Beacon placement for indoor
localization using Bluetooth. We
describe a method for determining the location of a mobile
device, such as a handheld computer or mobile phone, in an
indoor environment using Bluetooth beacons. Since it uses
inexpensive commodity devices, this method is inexpensive to
deploy. The limited range of Bluetooth reception is used to
advantage. Another important advantage of this method is that
it allows the mobile device to determine its location while
remaining anonymous, unidentified to the beacons or other
nearby devices. In such a deployment, an important design
task is the placement of beacons. Signal propagation in
indoor environments is complex, affected by factors such as
floor-plans and duct-work, varying transmission and
reflection properties of building materials and furniture,
and interference from other devices. Therefore, the area from
which a beacon is visible is very irregular and not well
approximated by simple models such as ellipsoids. Our
solution permits complex reception characteristics to be
accurately modeled and provides a simple method for choosing
beacon locations. Sudarshan S. Chawathe. In
Proceedings of the 11th International IEEE Conference on
Intelligent Transportation Systems (ITSC), Beijing,
China, October 2008.
-
Using dead drops to improve data
dissemination in very sparse equipped traffic.
We study the dissemination of data in
a vehicular network in which the density of equipped
vehicles, or all vehicles, is very low, and in which there is
no centrally connected infrastructure, such as a network of
gateways. In this environment, a vehicle is rarely in
communication range of another vehicle, so that protocols
based on forwarding messages along a chain of vehicles in
communication range are not effective. Similarly, the lack of
centrally connected network infrastructure precludes
protocols that use a fixed network of gateways to transfer
messages. Our solution is based on using a collection of
stationary, stand-alone dead drops that exchange data with
vehicles that pass by them. Although the dead drops do not
communicate among themselves or with a central network, their
limited ability to store and forward data can improve the
connectivity of a vehicular network. We use simulation-based
experiments to demonstrate that the use of dead drops results
in significant improvements in both the rate of data
dissemination and the robustness to frequent changes in the
set of participating vehicles. Sudarshan S.
Chawathe. In Proceedings of the IEEE Intelligent Vehicles
Symposium (IV), Eindhoven, Netherlands, June 2008.
-
Marker-based localizing for indoor
navigation. We present a method
for assigning markers to locations to enable navigation by
observing short sequences of markers. A motivating
application is indoor navigation, where pedestrians can
determine their location using the sequence of recently
encountered markers as they walk along hallways in a
building. While we may solve this problem simply by assigning
a unique marker to each location, such a solution limits the
granularity of localization due to limits on the number of
distinct markers. We present a more efficient solution that
uses a sequence of recently seen markers to determine
locations. We demonstrate that our solution yields
significant improvements over the naive solution even when
the marker-sequences are quite short, consisting of only
three of four markers each. Sudarshan S. Chawathe.
In Proceedings of the 10th International IEEE Conference
on Intelligent Transportation Systems (ITSC), Seattle,
Washington, October 2007.
-
Protecting transportation
infrastructure. Protecting
transportation infrastructure is an important component of
today's homeland security effort. This article highlights
related technical challenges and provides a brief survey of
several IT research streams in this important application
area, along with two case studies. Daniel Zeng,
Sudarshan S. Chawathe, Hua Huang, and Fei-Yue Wang. IEEE
Intelligent Systems, September/October 2007.
-
Segment-based map matching.
We address the problem of determining
the path of a vehicle on a given vector map of roads, based
on tracking data such as that obtained from onboard GPS
receivers. We describe a method that is based on a piecewise
matching of track segments to map features. A notable feature
of our method is that it is applicable to a large class of
existing methods. We discuss metrics for evaluating the
output of map-matching methods and briefly describe our
implementation of a map-matching system based on our
methods. Sudarshan S. Chawathe. In Proceedings
of the IEEE Intelligent Vehicles Symposium (IV),
Istanbul, Turkey, June 2007.
-
Interimistic Data
Dissemination. We present
low-overhead protocols for disseminating streaming data in an
interimistic environment in which the availability of
computers, networks, and other resources is very
unpredictable. An important example of such an environment is
the state of computing and communication resources
immediately following a natural disaster or a similar
disruptive event. A key characteristic of such environments
is the continual change in network membership and topology.
Our protocols are based on using the intentional and
extensional overlap among the data requirements of network
nodes. We describe our testbed, CoDD, that implements these
protocols and provide an experimental evaluation of the
techniques. Sudarshan S. Chawathe and Abheek
Anand. Information Systems and e-Business Management
(ISeB), 5(3):229--253, June 2007.
-
Organizing hot-spot police patrol
routes. We address the problem of
planning patrol routes to maximize coverage of important
locations (hot spots) at minimum cost (length of patrol
route). We model a road network using an edge-weighted graph
in which edges represent streets, vertices represent
intersections, and weights represent importance of the
corresponding streets. We describe efficient methods that use
this input to determine the most important patrol routes. In
addition to the importance of streets (edge weights),
important routes are affected by the topology of the road
network. Our methods permit automation of a labor-intensive
stage of the patrol-planning process and aid dynamic
adjustment of patrol routes in response to changes in the
input graph (as a result of a developing situation, for
instance). Sudarshan S. Chawathe. In
Proceedings of the 5th IEEE Intelligence and Security
Informatics Conference (ISI), pages 78-85, New
Brunswick, New Jersey, May 2007.
-
Strategic Web-Service
Agreements. This paper addresses
issues of strategy in Web-service composition, and inter-site
collaboration in general. The results are useful for both
human and artificial agents of a site who are responsible for
determining Web-service agreements that are most profitable
for that site. We discuss three specific questions in this
area. (1) How should the profit resulting from a composition
of Web-services be divided among the participants? (2) How
can we counter the risk of service-providers misrepresenting
their services in an attempt to gain a larger share of the
profit? (3) Are stable configurations guaranteed or feasible
when each service-provider's decisions on how to collaborate
(permit compositions) are guided solely by the goal of
maximizing its profit? Sudarshan S. Chawathe. In
Proceedings of the 4th IEEE International Conference on
Web Services (ICWS), Chicago, Illinois, September 2006.
-
Inter-Vehicle Data Dissemination
in Sparse Equipped Traffic.
We address the following question: How can data be disseminated
in an inter-vehicle communications (IVC) network when the
traffic of IVC-equipped vehicles is very sparse (e.g., one
IVC-equipped vehicle every few minutes), using minimal
extra-vehicular infrastructure? We present a data-dissemination
scheme that uses wireless dead drops (dead letter boxes) as
intermediaries between vehicles. These dead drops are simple
transceivers with storage and are not connected to other dead
drops, the Internet, or other networks. An important question
is the placement of such dead drops. We formulate the
corresponding optimization problem and prove that it is NP-hard.
We present an optimal greedy approximation algorithm for solving
the optimization problem. Our algorithm is based on mapping the
dead-drop placement problem to the problem of computing a
minimum-weight spanning sub-hypergraph. Sudarshan
S. Chawathe. In Proceedings of the 9th IEEE International
Conference on Intelligent Transportation Systems (ITSC),
Toronto, Canada, September 2006.
-
Tracking Changes in Healthcare
Documents. We present methods for
monitoring a large, diverse, and autonomously modified
collection of healthcare documents on the Web. Our methods do
not require document-providers to offer any special services.
They are based on explicating changes between document
versions on a per-user basis by using differencing
algorithms. These changes are presented to users in the
context of the documents using special XML elements. In order
to effectively browse changes in large document collections,
we use a variable-resolution XML browser. A noteworthy
feature of this browser is that it produces usable displays
at any level of detail specified by a user.
Sudarshan S. Chawathe. In Proceedings of the 19th IEEE
International Conference on Computer-Based Medical Systems
(CBMS), pages 137-142, Salt Lake City, Utah, June 2006.
-
Distributing the Cost of Securing
a Transportation Infrastructure. We address the problem of fairly distributing the
cost of system-wide improvements to the security of a
transportation infrastructure over the beneficiaries. We
present a framework that models transportation links and the
emergence (magnitude and frequency) and propagation of
threats. The cost-distribution is based on a weighted sum
that characterizes the expected reduction in the
vulnerability of a site as a result of the security
improvements. Sudarshan S. Chawathe. In
Proceedings of the 4th IEEE Intelligence and Security
Informatics Conference (ISI), pages 596-601, San Diego,
California, May 2006. (Also appears in the IEEE
Intelligent Transporation Systems Society Newsletter,
-
Fair Policies for Travel on
Neighborhood Streets. The
residents of each street in a neighborhood can improve their
travel times by forming agreements with the residents of
other streets to permit mutual thoroughfare. However, this
benefit comes with the cost of additional neighborhood
traffic. The key problem addressed by this paper is that of
determining a policy that is fair to each street's residents'
desires to minimize their travel time by using neighborhood
streets while also minimizing traffic on their street. We
model this problem using a street graph and apply game
theoretic methods in order to characterize
solutions. Sudarshan S. Chawathe. In
Proceedings of the 8th IEEE International Conference on
Intelligent Transportation Systems (ITSC) , pages
1027-1032, Vienna, Austria, September 2005.
-
Differencing Data Streams.
We present external-memory algorithms
for differencing large hierarchical datasets. Our methods are
especially suited to streaming data with bounded differences.
For input sizes m and n and maximum output (difference) size
e, the I/O, RAM, and CPU costs of our algorithm rdiff are,
respectively, m+n, 4e+8, and O(MN). That is, given 4e+8
blocks of RAM, our algorithm performs no I/O operations other
than those required to read both inputs. We also present a
variant of the algorithm that uses only four blocks of RAM,
with I/O cost 8me + 18m + n + 6e + 5 and CPU cost
O(MN). Sudarshan S. Chawathe. In Proceedings
of the International Database Engineering and Applications
Symposium (IDEAS), pages 273-284,
-
XSQ: A Streaming XPath
Engine. We have implemented and
released the XSQ system for evaluating XPath queries on
streaming XML data. XSQ supports XPath features such as
multiple predicates, closures, and aggregation, which pose
interesting challenges for streaming evaluation. Our
implementation is based on using a hierarchical arrangement
of augmented finite state automata. A design goal of XSQ is
buffering data for the least amount of time possible. We
present a detailed experimental study that characterizes the
performance of XSQ and related systems, and that illustrates
the performance implications of XPath features such as
closures. Feng Peng and Sudarshan S. Chawathe.
ACM Transactions on Database Systems (TODS),
30(2):577-623, June 2005.
-
Data Management in Interimistic
Environments. We address the
problem of managing information in an interimistic
environment following a disruptive event that results in
widespread failures of computers, networks, and other
resources. We motivate the problem and discuss the design
requirements. As a concrete example of such information
management, we present our work on data dissemination in an
interimistic environment in which both network membership and
data characteristics are subject to continual change. Our
methods use low-overhead protocols to determine a suitable
method for routing a stream of data. We also present a brief
experimental evaluation of our methods. Abheek
Anand and Sudarshan S. Chawathe. In Proceedings of the
Third Workshop on E-Business (WeB), Washington, D.C.,
December
-
Real-Time Traffic-Data
Analysis. We describe a method for
real-time monitoring of data generated by traffic sensors. We
provide a system architecture and discuss three key
components: (1) a streaming query processor that is used to
reduce the volume of data; (2) a pattern-matching module that
is used to detect when a developing traffic situation
resembles one flagged earlier; and (3) an interface that
efficiently displays the stream of sensor data in a
user-configurable manner. Sudarshan S. Chawathe.
In Proceedings of the 7th IEEE International Conference
on Intelligent Transportation Systems (ITSC), pages
112-117, Washington, D.C., October 2004.
-
Managing RFID Data.
Radio-Frequency Identification (RFID)
technology enables sensors to efficiently and in- expensively
track merchandise and other objects. The vast amount of data
resulting from the proliferation of RFID readers and tags
poses some interesting challenges for data management. We
present a brief introduction to RFID technology and highlight
a few of the data management challenges. Sudarshan
S. Chawathe, Venkat Krishnamurthy, Sridhar Ramachandran, and
Sanjay Sarma. In Proceedings of the International
Converence on Very Large Data Bases (VLDB), pages
1189-1195, Toronto, Canada, August 2004.
-
Control of Personal Location
Data. Several recent technological
developments have contributed to the ability of individ- uals
and organizations to monitor people at a level of detail
previously reserved for science-fiction movies.
Global-positioning system (GPS) devices, video cameras, RFID
tags, sensors [19], and other devices that permit
fine-grained monitoring of objects and people are now
produced with mass-market efficiencies and very low prices.
Concurrent improvements in the capabilities and prices of
networking and mass storage devices have made the collection
and perpetual storage of detailed monitoring data not only
feasible, but quite inexpensive. In fact, in most cases there
are few, if any, technological reasons for ever destroying
data collected in this manner. Sudarshan S.
Chawathe. In Proceedings of the Location Privacy
Workshop . Schoodic Peninsula, Maine. August 2004.
-
Privacy-Preserving Inter-Database
Operations. We present protocols
for distributed computation of relational intersections and
equi-joins such that each site gains no information about the
tuples at the other site that do not intersect or join with
its own tuples. Such protocols form the building blocks of
distributed information systems that manage sensitive
information, such as patient records and financial
transactions, that must be shared in only a limited manner.
We discuss applications of our protocols, outlining the
ramifications of assumptions such as semi-honesty. In
addition to improving on the efficiency of earlier protocols,
our protocols are asymmetric, making them especially
applicable to applications in which a low-powered client
interacts with a server in a privacy-preserving manner. We
present a brief experimental study of our
protocols. Gang Liang and Sudarshan S. Chawathe.
In Proceedings of the Second Symposium on Intelligence
and Security Informatics (ISI) . Tucson, Arizona. June
2004.
-
Privacy-Preserving
Inter-Database Operations. We
present protocols for distributed computation of relational
intersections and equi-joins such that each site gains no
information about the tuples at the other site that do not
intersect or join with its own tuples. Such protocols form
the building blocks of distributed information systems that
manage sensitive information, such as patient records and
financial transactions, that must be shared in only a limited
manner. We discuss applications of our protocols, outlining
the ramifications of assumptions such as semi-honesty. In
addition to improving on the efficiency of earlier protocols,
our protocols are asymmetric, making them especially
applicable to applications in which a low-powered client
interacts with a server in a privacy-preserving manner. We
present a brief experimental study of our
protocols. Gang Liang and Sudarshan S. Chawathe.
Technical Report CS-TR-4564 (UMIACS-TR-2004-09). Computer
Science Department, University of Maryland, College Park,
Maryland 20742. February 2004.
-
Skipping Streams with
XHints. When streaming
semi-structured data is processed by a well-designed query
processor, parsing constitutes a significant portion of the
running time. Further improvements in performance therefore
require some method to overcome the high cost of parsing. We
have designed a general-purpose mechanism by which a producer
of streaming data may augment the data stream with hints that
permit a downstream processor to skip parsing parts of the
stream. Inserting such hints requires additional processing
by the producer of data; however, the resulting stream is
more valuable to consumers, making such processing
worthwhile. In this paper, we focus on hints that are
designed to improve the throughput of a streaming XML query
engine. We present a set of hint schemes and describe how a
query engine can improve its performance by taking advantage
of the hints. Finally, we demonstrate the benefits of our
approach using an experimental study. Akhil Gupta
and Sudarshan S. Chawathe. Technical Report CS-TR-4566
(UMIACS-TR-2004-11) Computer Science Department, University
of Maryland, College Park, Maryland 20742. February 2004.
-
Cooperative Data Dissemination
in a Serverless Environment. We
describe the design and implementation of CoDD, a system for
cooperative data dissemination in a serverless environment.
CoDD allows a set of loosely-coupled, distributed peer nodes
to receive subsets of a data stream, by describing interests
using subscription queries. CoDD maintains statistical
information on the characteristics of data and queries, and
uses it to organize nodes in a hierarchical, data-aware
topology. Data is disseminated using overlays which are
created to try to minimize the excess data flowing through
the system, while maintaining low latency and satisfying
fanout constraints. CoDD is designed to allow nodes to
determine individual degrees of contribution to the system,
and to adapt gracefully to temporal changes in the data
distribution using an adaptive reorganization component. We
present the results of our experimental evaluation of
CoDD. Abheek Anand and Sudarshan S. Chawathe.
Technical Report CS-TR-4562 (UMIACS-TR-2004-07) Computer
Science Department, University of Maryland, College Park,
Maryland 20742. February 2004.
-
Streaming XPath Subquery
Evaluation. We describe a method
for the streaming evaluation of XPath queries that have
subqueries in pred- icates. Our method rewrites XPath queries
into a set of predicate-free labeled linear-form expressions
(LFEs). These LFEs are used to generate a push-down
transducer that enables efficient management of a buffer and
hierarchical index at runtime. To the best of our knowledge,
our method is the first to support XPath subqueries in a
streaming environment. Our method also provides optimal
buffering, minimum-latency output, and optimal predicate
evaluation. The method has been fully implemented and
publicly released in the XSQ system. We present an
experimental study of XSQ and related systems on both
real-life and synthetic datasets, and investigate how
subqueries and other features affect the performance of these
systems. Feng Peng and Sudarshan S. Chawathe.
Technical Report CS-TR-4376 (UMIACS-TR-2002-56). Computer
Science Department, University of Maryland, College Park,
Maryland 20742. January 2004.
-
Semistructured Data in
Relational Databases. Semistructured data is data whose structure is
irregular, incomplete, and frequently changing. Traditional
database methods are ill-suited to storing and querying such
data because they rely on a well-defined structure (schema).
While such data occurs in several formats, our focus is on
the XML format. We present several alternatives for storing
semistructured data in relational databases and discuss their
merits and limitations. We also discuss the methods used in
commercial database management systems from IBM, Oracle,
Microsoft, and Sybase. Throughout, we use a concrete running
example to illustrate the features of semistructured data,
explain the methods, clarify their features, and motivate
further improvements. Sudarshan S. Chawathe.
Chapter 25 in Practical Handbook of Internet
Computing. Edited by Munindar P. Singh. CRC Press. 2004.
-
Optimal Buffering for Streaming
XPath Evaluation. We motivate and
present a definition of optimal buffering for streaming
evaluation of XPath queries. We consider a large fragment of
XPath that includes multiple (correlated) subqueries and
reverse (up the document tree) axes. We describe a method for
XPath evaluation with optimal buffering. We present the
results of an experimental evaluation of our methods based on
our implementation, which is freely available.
Feng Peng and Sudarshan S. Chawathe. Technical Report
CS-TR-4376 (UMIACS-TR-2002-56). Computer Science Department,
University of Maryland, College Park, Maryland 20742. January
2004.
-
XPath Queries on Streaming
Data. We present the design and
implementation of the XSQ system for querying streaming XML
data using XPath 1.0. Using a clean design based on a
hierarchical arrangement of pushdown transducers augmented
with buffers, XSQ supports features such as multiple
predicates, closures, and aggregation. XSQ not only provides
high throughput, but is also memory ef£cient: It
buffers only data that must be buffered by any streaming
XPath processor. We also present an empirical study of the
performance characteristics of XPath features, as embodied by
XSQ and several other systems. Feng Peng and
Sudarshan S. Chawathe. In Proceedings of the ACM SIGMOD
International Conference on Management of Data. June
9-12, 2003. San Diego, California.
-
Tracking Hidden Groups using
Communications. We address the
problem of tracking a group of agents based on their
communications over a network when the network devices used
for communication (e.g., phones for telephony, IP addresses
for the Internet) change continually. We present a system
design and describe our work on its key modules. Our methods
are based on detecting frequent patterns in graphs and on
visual exploration of large amounts of raw and processed data
using a zooming interface. Sudarshan S. Chawathe.
In Proceedings of the NSF/NIJ Symposium on Intelligence
and Security Informatics (ISI). Tucson, Arizona. June
2003.
-
Efficient Peer-to-Peer
Searches Using Result-Caching. Existing peer-to-peer systems implement a single
function well: data lookup. There is now a wealth of research
describing how to reliably disseminate, and to later
retrieve, data in a scalable and load-balanced manner.
However, searching has received less attention. The current
state of the art is to distribute inverted indexes in the
name space. Intersection of distributed sets can be made more
efficient by exchanging bloom filters prior to moving
objects. This paper proposes an orthogonal and complementary
technique: using result-caching to avoid duplicating work and
data movement. The main contribution of the paper is a new
data structure, the view tree, that can be used to
efficiently store and retrieve such prior results. These
results, which can also be thought of as materialized views,
can then be used to efficiently answer future queries. Note
that object attributes could either be derived from
application semantics (e.g. meta-data from files in a
filesystem) or computed via techniques such as latent
semantic indexing. Bobby Bhattacharjee, Sudarshan
Chawathe, Vijay Gopalkrishnan, Pete Keleher, and Bujor
Silaghi. In Proceedings of The International Workshop on
Peer-to-Peer Systems (IPTPS). Berkeley, California.
February 2003.
-
Managing Historical
Semistructured Data. The
Extensible Markup Language (XML) provides a standard,
text-based serialization for hierarchically structured tagged
data. An important feature of XML is that the hierarchical
tag structure is controlled by the source of the data, not
the standard. This feature permits each database to be
presented in the form that best suits it, and greatly
simplifies the task of publishing data. Consequently, XML
data has proliferated in both quantity and variety. The very
flexibility in structure that simplifies data publishing also
complicates the task of effective storage of XML data, since
traditional methods rely on the presence of a rigid structure
(schema). For historical databases, older versions of the
database must remain accessible as the database evolves.
Efficient data retrieval in the face of such an ever-growing
data store presents additional challenges. In this chapter,
we describe several methods for storing historical XML data,
ranging from simple software layers on top of standard
relational databases to solutions tailored specifically to
historical XML. We discuss methods from the research
literature and use a running example that aids in comparing
the methods. Sudarshan S. Chawathe. Chapter 3 in
Advances in Computers , Volume 57. Edited by Marvin
Zelkowitz. Elsevier Science. 2003.
-
Tracking Moving Clutches in
Streaming Graphs. We address the
problem of tracking groups of interacting entities as they
move in a graph with vertices representing hosts or locations
and edges representing interactions between hosts. The graph
of interactions is modeled as a stream of edges (with the
arrival of an edge signifying an interaction between the
hosts it connects). This problem arises in applications such
as tracking groups of fraudulent callers in a telephone
network and tracking identities of malicious agents (programs
or people) on a data network. We present a formalization of
this problem and a streaming solution that uses bounded
storage and provides real-time response. Our solution is
based on maintaining, at each instant, an approximation of
the streaming graph seen so far. We present empirical results
to quantify the effectiveness of our solution.
Sudarshan S. Chawathe. Technical Report CS-TR-4376
(UMIACS-TR-2002-56). Computer Science Department, University
of Maryland, College Park, Maryland 20742. May 2002.
-
SEuS: Structure Extraction
using Summaries. We study the
problem of finding frequent structures in semistructured data
(represented as a directed labeled graph). Frequent
structures are graphs that are isomorphic to a large number
of subgraphs in the data graph. Frequent structures form
building blocks for visual exploration and data mining of
semistructured data. We overcome the inherent computational
complexity of the problem by using a summary data structure
to prune the search space and to provide interactive
feedback. We present an experimental study of our methods
operating on real datasets. The implementation of our methods
is capable of operating on datasets that are two to three
orders of magnitude larger than those described in prior
work. Shayan Ghazizadeh and Sudarshan S. Chawathe.
In Proceedings of the The 5th International Conference on
Discovery Science. Lubeck, Germany, November 2002.
-
Discovering Frequent Structures
using Summaries. We study the
problem of finding frequent structures in semistructured data
(represented as a directed labeled graph). Frequent
structures are graphs that are isomorphic to a large number
of subgraphs in the data graph. Frequent structures form
building blocks for visual exploration and data mining of
semistructured data. We overcome the inherent computational
complexity of the problem by using a summary data structure
to prune the search space and to provide interactive
feedback. We present an experimental study of our methods
operating on real datasets. The implementation of our methods
(which is freely available) is capable of operating on
datasets that are two to three orders of magnitude larger
than those described in prior work. Shayan
Ghazizadeh and Sudarshan S. Chawathe. Technical Report
CS-TR-4364. Computer Science Department, University of
Maryland. College Park, Maryland. November 2001.
-
XSQ: Streaming XPath
Queries. We demonstrate XSQ, a
streaming XPath query processor. XSQ is the first system for
querying XML data streams that supports XPath features such
as closures, aggregations, and multiple predicates. XSQ is
also very efficient. As an example, simply parsing an XML
stream derived from the well-known DBLP dataset requires 297
seconds. To evaluate the XPath query
//article[@key]//title/text() over it, XSQ requires only 84
seconds of additional time. The design of XSQ is based on a
hierarchical arrangement of pushdown transducers augmented
with queues; this clean design is easy to understand, prove
correct, and expand to more complex queries. Feng
Peng and Sudarshan S. Chawathe. In Proceedings of the
IEEE International Conference on Data Engineering.
Bangalore, India. March 2003. Demonstration description.
-
VQBD: Exploring Semistructured
Data. The VQBD ("vee cubed")
project addresses the following problem: What is the best way
to explore an XML document of unknown structure and
content? The VQBD system we demonstrate has the
following notable features: (1) The sub-tasks of exploration,
viz., visualization, querying, and browsing, are
complementary. (2) The level of detail presented to the user
scales smoothly over a wide range. (3) There is no required
structure assumed of the data. (4) Any available explicit
structure is effectively used. (5) Implicit structure is
detected and used. (6) It is easy to impose and use structure
at run time. Sudarshan S. Chawathe, Thomas Baby,
and Jihwang Yeo. In Proceedings of the ACM SIGMOD
International Conference on Management of Data, page
603, Santa Barbara, California. May 2001. Demonstration
description.
-
VQBD: Visualizing, Querying,
and Browsing Semistructured Data. The VQBD project addresses the following problem:
What is the best way to explore an XML document of unknown
structure and content? We focus on XML documents that are too
large to browse in their entirety, even with the assistance
of pretty-printing software e.g., multi-megabyte or larger
XML documents. In this context, we use the term data
exploration to refer to the process by which a user gathers
the information needed to use the data for a speci c purpose
e.g., generating a report, writing queries, building user
interfaces, writing applications. In a relational or object
database, the schema e.g., table definitions, class
definitions, integrity constraints, and stored procedures
provides some of the information necessary for writing
queries and applications. However, the schema is rarely
sufficient for these tasks. Typically, one must probe and
browse the database to discover data coverage, typical and
exceptional values, and other information required to gain a
better understanding of the database. In an XML environment,
the need for such data exploration is much greater because it
is quite likely that the XML data of interest is not
accompanied by a schema. Indeed, much XML data is
semistructured, meaning its structure is irregular,
incomplete, and frequently changing. The rapid adoption of
XML as a data exchange standard makes this semistructured
data exploration problem increasingly important.
Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo. Extended
version of demonstration description. http://cs.umaine.edu/~chaw/.
-
Context-Sensitive Search and
Exploration of XML Text. XML
permits documents with arbitrary nested context (tag
structure). We investigate how this context may be used to
aid the task of searching and exploring XML text. We describe
the design and implementation of the Cextor system, which
includes a context-sensitive search engine. Cextor also
includes a novel technique for organizing and exploring very
large search results based on context. A distinguishing
feature of this technique is that it does not assume search
results are of modest size. Rather, it is designed to cope
with search results that are potentially the size of the
database. We present the results of an experimental
evaluation of Cextor on data derived from the Web.
Thomas Baby and Sudarshan S. Chawathe. Technical Report.
Computer Science Department, University of Maryland, College
Park, Maryland 20742. May 2001.
-
Describing and
Manipulating XML Data. This paper
presents a brief overview of data management using the
Extensible Markup Language (XML). It presents the basics of
XML and the DTDs used to constrain XML data, and describes
metadata management using RDF. It also discusses how XML data
is queried, referenced, and transformed using stylesheet
language XSLT and referencing mechanisms XPath and
XPointer. Sudarshan S. Chawathe. Bulletin of
the IEEE Technical Committee on Data Engineering,
22(3):3-9, 1999.
-
Comparing hierarchical data in
external memory. We present an
external-memory algorithm for computing a minimum-cost edit
script between two rooted, ordered, labeled trees. The I/O,
RAM, and CPU costs of our algorithm are, respectively,
4mn+7m+5n, 6S, and O(MN + (M+N)S^1.5 ), where M and N are the
input tree sizes, S is the block size, m = M/S, and n = N/S.
This algorithm can make effective use of surplus RAM capacity
to quadratically reduce I/O cost. We extend to trees the
commonly used mapping from sequence comparison problems to
shortest-path problems in edit graphs. Sudarshan
S.Chawathe. In Proceedings of the Twenty-fifth
International Conference on Very Large Data Bases, pages
90-101, Edinburgh, Scotland, U.K., September 1999.
-
Managing historical
semistructured data. Semistructured data may be irregular and
incomplete and does not necessarily conform to a fixed
schema. As with structured data, it is often desirable to
maintain a history of changes to data, and to query over both
the data and the changes. Representing and querying changes
in semistructured data is more difficult than in structured
data due to the irregularity and lack of schema. We present a
model for representing changes in semistructured data and a
language for querying over these changes. An important
feature of our approach is that we represent and query
changes directly as annotations on the affected data, instead
of indirectly as the difference between database states. We
describe the implementation of our model and query language.
We present extensions that permit convenient snapshot-based
access in our change-based data model. We also describe our
design and implementation of a query subscription service
that permits users to subscribe to changes in semistructured
information sources. Sudarshan S. Chawathe, Serge
Abiteboul, and Jennifer Widom. Theory and Practice of
Object Systems, 5(3):143-162, August 1999.
-
Managing Change in Heterogeneous
Autonomous Databases. Information
relevant to a task at hand is often scattered across a
collection of heterogeneous, autonomous databases. Individual
databases in such a collection are owned and managed by
independent, and often competing, entities that cooperate to
only a limited extent. For example, the collection of
databases used in the construction of a building includes
databases owned by the architect, the construction company,
the electrical contractor, and so on. Such autonomous
database collections are also common on the Internet. For
example, the collection of Internet databases with
information about San Francisco consists of databases
operated by several competing entities. Making effective use
of such collections of heterogeneous, autonomous databases
presents several challenges due to the absence of traditional
database facilities such as locks, transactions, and standard
query languages. In particular, understanding and controlling
how such databases evolve is an important problem that
traditional database techniques are ill-equipped to
address. Sudarshan S. Chawathe. Ph.D. Thesis,
Stanford University, 1999.
-
Representing and querying changes
in semistructured data. Semistructured data may be irregular and
incomplete and does not necessarily conform to a fixed
schema. As with structured data, it is often desirable to
maintain a history of changes to data, and to query over both
the data and the changes. Representing and querying changes
in semistructured data is more difficult than in structured
data due to the irregularity and lack of schema. We present a
model for representing changes in semistructured data and a
language for querying over these changes. An important
feature of our approach is that we represent and query
changes directly as annotations on the affected data, instead
of indirectly as the difference between database states. We
describe the implementation of our model and query language.
We also describe the design and implementation of a query
subscription service that permits users to subscribe to
changes in semistructured information sources.
Sudarshan S. Chawathe, Serge Abiteboul, and Jennifer Widom.
In Proceedings of the International Conference on Data
Engineering , Orlando, Florida, February 1998.
-
Meaningful change detection in
structured data. Detecting changes
by comparing data snapshots is an important requirement for
difference queries, active databases, and version and
configuration management. In this paper we focus on detecting
meaningful changes in hierarchically structured data, such as
nested-object data. This problem is much more challenging
than the corresponding one for relational or flat-file data.
In order to describe changes better, we base our work not
just on the traditional "atomic" insert, delete, update
operations, but also on operations that move an entire
sub-tree of nodes, and that copy an entire sub-tree. These
operations allows us to describe changes in a semantically
more meaningful way. Since this change detection problem is
NP-hard, in this paper we present a heuristic change
detection algorithm that yields close to "minimal"
descriptions of the changes, and that has fewer restrictions
than previous algorithms. Our algorithm is based on
transforming the change detection problem to a problem of
computing a minimum-cost edge cover of a bipartite graph. We
study the quality of the solution produced by our algorithm,
as well as the running time, both analytically and
experimentally. Sudarshan S. Chawathe and Hector
Garcia-Molina. In Proceedings of the ACM SIGMOD
International Conference on Management of Data , pages
26-37, Tucson, Arizona, May 1997.
-
Representative objects:
Concise representations of semistructured, hierarchical
data. In this paper we introduce
the representative object, which uncovers the inherent
schema(s) in semistructured, hierarchical data sources and
provides a concise description of the structure of the data.
Semistructured data, unlike data stored in typical relational
or object-oriented databases, does not have fixed schema that
is known in advance and stored separately from the data. With
the rapid growth of the World Wide Web, semistructured
hierarchical data sources are becoming widely available to
the casual user. The lack of external schema information
currently makes browsing and querying these data sources
inefficient at best, and impossible at worst. We show how
representative objects make schema discovery efficient and
facilitate the generation of meaningful queries over the
data. Svetlozar Nestorov, Jeffrey D. Ullman, Janet
Wiener, and Sudarshan S. Chawathe. In Proceedings of the
International Conference on Data Engineering , pages
79-90, Birmingham, U.K., 1997.
- Change Management in Heterogeneous,
Semistructured Databases. S. Chawathe, V. Gossain, X.
Liu, J. Widom, and S. Abiteboul. Technical Report, Stanford
University Database Group, 1997. paper; citation;
- An expressive model for comparing
tree-structured data. S. Chawathe and
H. Garcia-Molina. Technical Report, Stanford University
Database Group, 1997. paper;
citation.
- Change detection in hierarchically
structured information. S. Chawathe,
A. Rajaraman, H. Garcia-Molina, and J. Widom. In
Proceedings of the ACM SIGMOD International Conference on
Management of Data , pages 493-504, Montreal, Quebec, June
1996. paper; citation; Extended version.
- A toolkit for constraint management in
heterogeneous information systems. S. Chawathe,
H. Garcia-Molina, and J. Widom. In Proceedings of
the International Conference on Data Engineering , pages
56-65, New Orleans, Louisiana, 1996. paper; citation; Extended version.
- A standard textual interchange format for
the Object Exchange Model (OEM). R. Goldman, S. Chawathe,
A. Crespo, and J. McHugh Technical Report, Stanford University
Database Group, 1996. paper;
citation.
- Flexible constraint management for
autonomous distributed databases. S. Chawathe,
H. Garcia-Molina, and J. Widom. Invited paper.
Bulletin of the IEEE Technical Committee on Data
Engineering, 17(2):23-27, 1994. paper; citation.
- The Tsimmis Project: Integration of
Heterogeneous Information Sources. S. Chawathe,
H. Garcia-Molina, J. Hammer,
Y. Papakonstantinou, J. Ullman, and J. Widom.
Invited paper. In Proceedings of 100th Anniversary Meeting
of the Information Processing Society of Japan , pages
7-18, Tokyo, Japan, October 1994. paper; citation.
- On index selection schemes for nested object
hierarchies. S. Chawathe, M-S. Chen, and
P. Yu. In Proceedings of the International Conference
on Very Large Data Bases , pages 331-341, Santiago, Chile,
September 1994. paper;
citation; Extended version.
Sudarshan S. Chawathe