-
Human Occupancy From Few Diverse
Sensors. Sudarshan S. Chawathe. In Proceedings of the IEEE
World AI IoT Congress (AIIoT 2024), Seattle, Washington, May29-31
2024.
This paper addresses the task of estimating the number of human
occupants in rooms using data from a small number of sensors,
of different kinds, placed in those rooms. In some environments,
such as homes, offices, and small stores, the maximum possible
number of concurrent human occupants is small (say, 10) and
therefore the task may be fruitfully modeled as classification
problem. Using robust human-understandable classification
methods yields classifiers that provide higher accuracies than
those in prior work.
Occupancy Estimation; Smart Buildings; Classification; Data
Integration; Data Visualization.
-
Feeding a Chariot of Fire and Lightning:
Pricing Fuel-Electric Vehicle Operations. Sudarshan S.
Chawathe. In Proceedings of the 32nd Annual Harold W. Borns
Symposium, April10-11 2024.
Vehicles that may be run using combinations of fuel and
electricity present the obvious questions of which combinations
are to be preferred for given metrics (such as environmental or
financial costs) and trip characteristics (such as length,
topography, and weather). This note frames this problem,
outlines some challenges, and proposes a data-drvien solution
that avoids the difficulties of model-based approaches.
-
Pragmatic Domestic Electrical Load
Disaggregation. Sudarshan S. Chawathe. In Proceedings of
the 14th IEEE Annual Computing and Communication Workshop and
Conference (IEEE CCWC 2023), Las Vegas, Nevada, March8-11 2023.
To appear.
This paper studies methods for determining the major electrical
loads that contribute to aggregate electric energy consumption
for a household or similar unit. It provides an alternate
formulation of a well studied general problem and a framework
and prototype implementation to address it. The focus is on
methods that do not require any instrumentation or data beyond
hourly (or similar low frequency) records of aggregate energy
consumption, as is often easily available from power utility
companies due to the increasing prevalence of smart meters. As
well, the focus is on pragmatic approaches that are likely to
provide useful information for typical household electricity
consumption in contrast to methods more suited to industrial
environments. Another notable feature is that disaggregation is
performed not centrally at the utility company or similar
entity with data from a large number of households but instead
in a distributed and independent manner at each household. This
feature provides two key benefits: (1) It permits the injection
of information known to a household but not (easily) by others
in order to simplify the problem. (2) It provides better
privacy protections for such data.
Non-Intrusive Load Monitoring (NILM); Electrical Load
Disaggregation; Smart Meters; Data Integration. Data
Visualization.
-
Classification of Small Molecules
Regulating Circadian Rhythm. Sudarshan S. Chawathe. In
Proceedings of the 13th IEEE Annual Ubiquitous Computing,
Electronics, and Mobile Communication Conference (IEEE UEMCON
2022), New York, NY, October26-29 2022.
The process of drug discovery using in silico methods often
produces datasets with a very large number of attributes
(fields) per instance (record). Automated classification of such
data on properties such as toxicity provides significant benefits
for drug design but must cope effectively with the large number
of attributes and the relatively small number of instances.
This paper studies this problem in the context of a dataset,
from prior work, used to discover promising small molecules for
controlling circadian rhythm in humans. By identifying a
suitable small subset of the attributes that are effective for
this classification task, experimental results indicate
accuracies that compare very favorably with prior work on the
same data.
Small Molecules; Scientific Data; Data Integration; Data
Visualization; Classification.
-
Automated Determination of Mushroom
Edibility Using an Augmented Dataset. Sudarshan S.
Chawathe. In Proceedings of the IEEE World AI IoT Congress (AIIoT
2022), Seattle, Washington, June6-9 2022.
This paper studies methods and datasets for automated
classification of mushrooms as edible or poisonous based on
easily observable properties such as colors, textures, and
dimensions of mushroom parts. The focus is on data-intensive
methods that build upon recent work that has led to an
augmented database of mushroom features. This dataset is
studied in detail with the goal of explicating properties and
easing further use of the dataset by others. The merit of the
database features for the classification task is quantified using
several metrics. Results quantify the accuracy and efficiency of
classification using all and only a few of the features.
Mushroom Database; Classification and Taxonomy; Scientific Data;
Data Integration; Machine Learning.
-
Optical Features for Automated
Determination of Agricultural Product Varieties. Sudarshan
S. Chawathe. In Proceedings of the IEEE World AI IoT Congress
(AIIoT 2022), Seattle, Washington, June6-9 2022.
This paper studies methods to determine varieties of
agricultural specimens using features extracted from optical
images generated by low-cost commodity hardware and simple,
efficient algorithms. It presents a framework for this and some
related tasks of agricultural informatics, with a focus on
data-intensive aspects. It describes a system implementation
that permits such data to be iteratively and interactively
explored and studied while also permitting efficient programmatic
access. The core classification problem of determining a raisin
variety is studied experimentally and the quantitative results
are competitive with prior work. Some of the methods generate
simple, human-understandable classifiers, of which a few
examples are presented. Data exploration and visualization is
implemented using self-organizing maps (SOMs) and several
examples of useful visualizations are described.
Agricultural Informatics; Data Exploration and Visualization;
Self-Organizing Maps (SOMs); Classification; Machine Learning.
-
Predicting Bicycle Package Delivery
Demand Using Historical Spatiotemporal Data. Sudarshan S.
Chawathe. In Proceedings of the 12th IEEE Annual Computing and
Communication Workshop and Conference (IEEE CCWC 2022), Las
Vegas, Nevada, January26-29 2022.
The primary task addressed by this paper is the prediction of
current, or near future, demand for package deliveries at a
location using spatiotemporal historical records for that
location and for others near it. This work adopts a data-driven
approach and describes methods for exploring and visualizing
such datasets in order to gain a better understanding of the
domain and to select appropriate specific methods for tasks such
as demand prediction and location identification. As a concrete
example, the paper uses such a dataset recently provided by the
Pedal Me service in London.
Demand Prediction; Package Deliveries; Logistics;
Visualization; Data Exploration; Data Science.
-
Classification of Dry Beans Using Image
Features. Sudarshan S. Chawathe. In Proceedings of the
12th IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2021), New York, NY,
December1-4 2021.
This paper presents human-understandable methods for automated
classification of dry beans using features extracted from
optical images. It presents a detailed study of these features
in the context of classification by examining their merits and
the effect of using a reduced feature set. It also presents the
results of constructing self-organizing maps (SOMs) for these
features. An important result is that classification limited to
human-understandable methods for this task does not incur any
penalty in accuracy and comes with the benefit of significantly
lower computational costs. Another result is that SOMs applied
to this data provide a useful visualization that invites
further study.
Agricultural Informatics; Dry Beans; Classification;
Self-Organizing Maps (SOM).
-
Inferring Human Activity Using Wearable
Sensors. Sudarshan S. Chawathe. In Proceedings of the 12th
IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2021), New York, NY,
December1-4 2021.
This paper presents methods that use data from wearable
sensors, such as those found in low-cost commodity hardware, to
infer the human activity (such as reading or walking)
corresponding to the sensor readings. A related task is the
identification of individuals based on the same data. The
classification accuracy of the methods used in this work is
higher than earlier work using the same dataset. Further, a
significant reduction in the number of sensor data streams
produces only a very small impact on this accuracy, which is a
feature of practical significance due to implications for
network bandwidth and energy budgets in such systems.
Human Activity Recognition (HAR); Wearable Sensors; Sensor
Data; Classification; Machine Learning.
-
Human Identification by Gait Using
Body-Worn Sensors. Sudarshan S. Chawathe. In Global
Conference on Artificial Intelligence and Applications (GCAIA
2021), Jaipur, Rajasthan, India, September8-10 2021.
This paper studies methods for identifying human individuals
and gender using gait-related features as measured by sensors
worn on the body. A recently published dataset due to prior
work is used to study the effectiveness of well established and
efficient methods for such identifications. The dataset is based
on experiments with 16 participants wearing sensors that are
part of a widely used gait-sensing platform. The accuracies of
the best of these methods compare favorably with those reported
by prior work. Since the records in the dataset are
characterized by a very large number of fields (323 attributes
per record), methods for attribute selection are of particular
interest and are also studied. The underlying implementation is
briefly described, with a focus on some data management
challenges posed by the large number of attributes. A notable
result is that prediction accuracies of several competitive
methods are not diminished even when the number of attributes
is reduced very drastically using attribute-selection methods
based on metrics such as ReliefF and Symmetrical Uncertainty.
Human Gait; Body-Worn Sensors; Data Management; Classification;
Machine Learning.
-
Using Data from In-Vehicle Recommender
Systems to Predict Traveler Characteristics. Sudarshan S.
Chawathe. In Global Conference on Artificial Intelligence and
Applications (GCAIA 2021), Jaipur, Rajasthan, India,
September8-10 2021.
In-vehicle recommender systems may be used to present travelers
with offers (such as coupons) customized by location, history,
and other contextual information. Such systems both utilize and
augment a dataset that records which offers are accepted and
under what contextual conditions. This paper studies the use of
such datasets to make predictions on whether a coupon presented
to a traveler with some known characteristics and in a certain
context relative to travel parameters is likely to be accepted.
It also studies the use of such data to infer traveler
characteristics based on coupon acceptance and related data.
This work emphasizes the use of simple and understandable
(explainable, for humans) models whose examination is likely
not only to provide greater confidence in predictions but also
to permit design of offers customized to elicit desired
responses and information from travelers. Using a recently
published dataset due to prior work, these methods are studied
experimentally both quantitatively and qualitatively (by
examining a few concrete models).
In-Vehicle Recommender Systems; Intelligent Transportation
Systems; Data Science; Classification; Machine Learning.
-
Epidemiological Spatiotemporal Data
Exploration and Prediction. Sudarshan S. Chawathe. In IEEE
World AI IoT Congress (AIIoT 2021), Seattle, Washington, May10-13
2021.
This paper addresses epidemiological spatiotemporal datasets
such as those reporting the number of cases of infectious
diseases over time and by geographical location. It studies
methods for exploratory data analysis as well as prediction of
future cases based on prior data. It emphasizes methods that
provide explainable predictions, such as those based on rules
and decision trees. These methods are studied in the context of
a recently published dataset of weekly Chickenpox cases in
Hungarian counties over a 10-year period. As noted in prior
work, this dataset exhibits several features, such as
seasonality and heteroskedasticity, that make the prediction
task especially challenging. This paper describes some results
of an experimental study of both the exploratory and predictive
aspects.
Spatiotemporal Data; Data Exploration; Self-Organizing Maps
(SOMs); Prediction; Machine Learning.
-
Explainable Predictions of Industrial
Emissions. Sudarshan S. Chawathe. In IEEE International
IOT, Electronics and Mechatronics Conference (IEMTRONICS 2021),
Toronto, Canada, April21-24 2021.
Predictive emission monitoring systems for gas turbines are
important in the power generation industry. A key task in this
context these systems is the prediction of flue gas emissions
using process and environmental measurements that are easier to
obtain. This paper presents methods for such predictions with
an emphasis on explainability. A notable result is that despite
the potential restrictions imposed by this emphasis, the
numerical accuracy compares very favorably with prior work that
uses models that are more difficult to explain.
Predictive Emission Monitoring Systems; Exhaust Emissions
Prediction; Gas Turbines; CO; NOx; Machine Learning.
-
Data Structures for Ordered Short
Character-Sequences. Sudarshan S. Chawathe. In Proceedings
of the 11th IEEE Annual Computing and Communication Workshop and
Conference (IEEE CCWC 2021), Las Vegas, Nevada, January27-30
2021.
A lexicon, or dictionary of key-value pairs, is a general
abstraction that is widely used in diverse areas of computer
science, notably compilers and database systems. The primary
operations of interest on such lexicons are membership testing
and extraction of a value associated with a key appearing in
the lexicon. This paper focuses on the special case of ordered
lexicons with keys that are short sequences of characters. An
important motivating application is the representation of the
large and growing lexicon of emoji in the Unicode standard. It
presents space-efficient data structures for some specialized but
practically significant cases. In particular, the methods take
advantage of contiguous sequences of keys in the lexicon to
yield a very highly compressed representation while maintaining
efficiency in lookup operations.
Data Structures; Lexicons; Unicode; Emoji.
-
Analyzing Auction Data for Anomalous
Bidding. Sudarshan S. Chawathe. In Proceedings of the 11th
IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2020), New York, NY,
October28-31 2020.
Online auctions as exemplified by sites such as ebay.com are
responsible for very large volumes of transactions and monetary
value. Their growth has also led to a growth in fraudulent
activities in these markets. This paper studies transaction
data from such auctions with the goal of using it to detect
anomalous and potentially fraudulent bidding. To that end, it
explores several approaches based on classification, clustering,
and visualization. The quantitative results signal very high
accuracy in classification but their promise is tempered by some
limitations of the experimental dataset. Clustering and
visualizations using self-organizing maps (SOMs) is found to be
more effective for this data than clustering using more
conventional methods such as k-means. In particular, the SOMs
reveal several interesting relationships among the dataset's
attributes and their correlations to anomalous bidding.
Online Auctions; Fraud Detection; Classification; Clustering;
Visualization; Machine Learning; Self-Organizing Maps (SOMs).
-
Estimating Predicate Selectivities in a
NoSQL Database Service. Sudarshan S. Chawathe. In
Proceedings of the 11th IEEE Annual Ubiquitous Computing,
Electronics, and Mobile Communication Conference (IEEE UEMCON
2020), New York, NY, October28-31 2020.
An estimate of the number of items in a database that satisfy
an equality or range predicate is useful for several tasks,
such as cost-based query optimization, provisioning of system
resources, and determining the financial costs of using database
services. In a traditional database system, such estimates are
computed and used internally by the system and have been well
studied. In contrast, such estimates have not received much
attention in the context of a cloud-based database service,
where they must be computed by the application that uses the
service using only the limited features of the interface
provided by the service. This paper motivates and formulates
the selectivity-estimation problem for database services. It
describes the characteristics of this problem that distinguish
it from the analogous problem in traditional database systems.
It outlines some subproblems and methods to address them. It
provides a method for estimating selectivities based on random
sampling along with some experimental results.
Cloud Databases; Database Services; Cost Estimation; DynamoDB.
-
Mining Bike-Share Data. Sudarshan
S. Chawathe. In Proceedings of the IEEE International Smart
Cities Conference (IEEE ISC2 2020), September28 2020.
This paper studies methods for processing bike-share datasets
for the purpose of extracting information that can assist
riders, bike-share program designers, city planners, and
others. Bike-share datasets describe how shared bicycles are
used in an urban environment. They vary considerably in
composition and coverage but typically include information such
as the locations (bicycle racks) of origin and destination,
timestamps, and identifiers for bicycles and riders. This paper
provides methods for visualizing such data in a manner that
distills useful patterns and for using the data to predict
usage. In order to overcome the difficulty in generating
meaningful clusters using conventional methods, it presents a
novel method of clustering that uses graph condensations. It
describes an experimental study of these methods using a
publicly available dataset from a popular bike-share program.
Bikeshare; Transportation; Data Analysis.
-
Using Accelerometers in Mobile Phones to
Estimate Blood Alcohol Levels. Sudarshan S. Chawathe. In
Proceedings of the IEEE International Smart Cities Conference
(IEEE ISC2 2020), September28 2020.
This paper studies methods for determining the blood alcohol
content of individuals by using data from commodity
accelerometers in mobile phones carried on person. A significant
challenge is that such data is very noisy and often irregular
(many large gaps) as well. This paper provides a detailed
analysis of a recently released dataset of accelerometer traces
and associated readings of transdermal alcohol content (TAC).
It describes a set of features extracted from the raw
accelerometer traces that are effective for the task of
determining TAC levels. It presents results of an experimental
study of regression methods that use these features to predict
TAC levels from accelerometer traces as well as of
classification methods that predict whether the person carrying
the mobile phone has TAC levels above given thresholds.
Alcohol Consumption; Accelerometers; Regression; Classification.
-
Diagnostic Classification Using Hepatitis
C Tests. Sudarshan S. Chawathe. In IEEE International IOT,
Electronics and Mechatronics Conference (IEMTRONICS 2020),
Vancouver, Canada, September9-12 2020.
This paper describes methods for automated classification of
individuals by Hepatitis C medical category using data from a
series of commonly used diagnostic tests. The methods are
evaluated experimentally using a publicly available dataset
from prior work. The accuracy of some methods compares
favorably with similar results reported in prior work. In
addition to quantitative results on prediction accuracy,
training and testing times, and model sizes, the paper includes
a detailed look at some concrete representative classifiers
generated by a few of the competitive methods, permitting a
human domain expert to further study the models and classifiers.
Medical Informatics; Classification.
-
Detecting Physical Activities Using
Body-Worn Accelerometers. Sudarshan S. Chawathe. In IEEE
International IOT, Electronics and Mechatronics Conference
(IEMTRONICS 2020), Vancouver, Canada, September9-12 2020.
This paper addresses the task of using data from accelerometers
attached to a person's body to determine the kind of physical
activity being performed by that person. The activities of
interest are routine ones such as sitting, walking up a flight
of stairs, walking, and jogging. The paper describes methods
for segmenting the time-series data from accelerometers and for
extracting features that are effective for determining
activities when used in conjunction with well established
classification algorithms. These methods are implemented in a
prototype that is used to evaluate their effectiveness on a
publicly available dataset of tagged accelerometer traces. The
prototype also provides intuitive visualizations of the
accelerometer traces, allowing a human expert to gain a better
understanding of both the dataset and the predictions from the
classifiers. Although the methods in this paper use fewer and
simpler features extracted from the raw accelerometer data,
they provide higher accuracies when compared to those reported
in prior work on the experimental dataset.
Human Physical Movement; Activities of Daily Living;
Accelerometers; Classification.
-
Characterizing Shoulder Implants in X-Ray
Images. Sudarshan S. Chawathe. In Global Conference on
Artificial Intelligence and Applications (GCAIA 2020), Jaipur,
Rajasthan, India, September8-10 2020.
This paper studies methods for characterizing shoulder implants
in X-ray images of the shoulder, upper arm, and chest region.
The task of characterizing in this context entails sub-tasks
such as detecting the presence of an implant, segmenting it
from the rest of the image, determining its orientation
relative to the image and other features such as bones,
detecting shape features of the implant, and determining
properties of the implant (e.g., manufacturer, model). The task
is complicated due to the proximity and similarity of bones and
other objects as well as due to potentially low image contrast,
spurious edges, and other artifacts. This paper describes the
challenges and outlines and evaluates solutions using a
recently published dataset of 597 X-ray images of shoulder
implants.
Medical Imaging; Medical Informatics; Image Processing;
Classification.
-
Human-Understandable Classifiers for COPD
From Biosensor Data. Sudarshan S. Chawathe. In Global
Conference on Artificial Intelligence and Applications (GCAIA
2020), Jaipur, Rajasthan, India, September8-10 2020.
This work addresses the task of analyzing data from a biometric
sensor operating on saliva samples in order to predict the
sample-donor's status in regard to COPD (Chronic Obstructive
Pulmonary Disease). It emphasizes the use of
human-understandable classification methods, such as those based
on a small number of rules. Using recently published data, it
studies the characteristics of such biosensor data and presents
some concrete results in that context. It also summarizes the
results of an experimental evaluation of such methods on this
dataset.
Medical Diagnostics; Classification; Medical Informatics;
Machine Learning.
-
Index-Selection for Minimizing Costs of a
NoSQL Cloud Database. Sudarshan S. Chawathe. In
Proceedings of the 17th International Conference on Economics of
Grids, Clouds, Systems and Services (GECON 2020), Izola,
Slovenia, September15-17 2020. Springer LNCS.
The index-selection problem in database systems is that of
determining a set of indexes (data-access paths) that minimizes
the costs of database operations. Although this problem has
received significant attention in the context of relational
database systems, the established methods and tools do not
translate easily to the context of modern non-relational
database systems (so-called NoSQL systems) that are widely used
in cloud and grid computing, and in particular systems such as
DynamoDB from Amazon Web Services. Although the index-selection
problem in these contexts appears simple at first glance, due to
the very limited indexing features, this simplicity is
deceptive because the non-relational nature of these databases
and indexes permits more complex indexing schemes to be
expressed. This paper motivates and describes the
index-selection problem for NoSQL databases, and DynamoDB in
particular. It motivates and outlines a cost model to capture
the specific monetary costs associated with database operations
in this context. The cost model has not only been carefully
checked for consistency using the system documentation but also
been verified using actual usage costs in a live DynamoDB
instance.
Cloud Computing; Cost Model; Index Selection; DynamoDB; NoSQL
Databases; Physical Database Design.
-
Organizing and Compressing Collections of
Files Using Differences. Sudarshan S. Chawathe. In
Proceedings of the 24th International Database Engineering and
Applications Symposium (IDEAS 2020), Incheon/Seoul, South Korea,
August12-18 2020. ACM.
A collection of related files often exhibits strong similarities
among its constituents. These similarities, and the dual
differences, may be used for both compressing the collection and
for organizing it in a manner that reveals human-readable
structure and relationships. This paper studies methods for
such organizing and compression of file collections using
differences and presents the results of an experimental
evaluation on a well known public dataset.
File Collections; Differencing; Compression.
-
Mining Frequent Differences in File
Collections. Sudarshan S. Chawathe. In Proceedings of the
Ninth IEEE International Workshop on Data Integration and Mining
(DIM-2020), Las Vegas, Nevada, August11-13 2020. IEEE. In
conjunction with IEEE 21st International Conference on
Information Reuse and Integration for Data Science (IRI 2020).
Collections of textual files, or documents, with substantial
inter-document similarities are common in diverse domains. A
practically significant class of such similarities, and the dual
differences, are well characterized by edit scripts, or
colloquially diffs, that use a simple sequence model for
documents. The study of such diffs provides valuable insights
into the inter-document relationships within a collection and
can guide data integration within and across collections. This
paper describes a framework for such study that is based on
frequently occurring inter-document differences. It motivates
and defines a general problem of mining frequent differences and
outlines some specific instances. It presents the design and
implementation of a prototype system for interactively
discovering and visualizing frequent differences. A notable
feature of this method is its use of difference-components, or
deltas, to bootstrap the discovery of interesting structure in
file collections. The paper describes a preliminary experimental
evaluation of the method and implementation on a widely used
corpus of file-collections.
File Collections; Differencing; Data Mining; Data Integration.
-
Efficient File Collections for Embedded
Devices. Sudarshan S. Chawathe. In Proceedings of the 8th
Workshop on Communications in Critical Embedded Systems (WoCCES
2020), Rennes, France, July7 2020. IEEE.
This paper studies methods for efficiently transferring and
storing collections of related files in embedded devices and
other environments with limitations on storage, network, and
energy use. Files in collections based on purpose (e.g., system
configurations) or other aspects often exhibit substantial
inter-file similarities. These similarities may be used to
achieve significant reductions in the network resources required
for transferring or updating the collection, as well as for the
storage resources required on the embedded devices on which it
is stored.
Embedded Devices; File Collections; Compression.
-
Rice Disease Detection by Image
Analysis. Sudarshan S. Chawathe. In Proceedings of the
10th IEEE Annual Computing and Communication Workshop and
Conference (IEEE CCWC 2020), Las Vegas, Nevada, January6-8 2020.
This paper provides a method for automatically classifying
diseases in rice plants by analyzing photographs of rice
leaves. The method uses image processing algorithms to detect
leaves and likely disease-induced lesions in the leaves. Next,
several attributes are computed based on the dimensions of
leaves and lesions, the numbers and shapes of lesions, as well
the color characteristics of lesions and intact portions of
leaves. These attributes are used to build classification models
using well established algorithms. The method is evaluated
using a publicly available database of rice leaf images.
Rice Disease; Rice Leaf; Image Processing; Classification;
Machine Learning.
-
Topic Analysis of Climate-Change
News. Sudarshan S. Chawathe. In Proceedings of the 10th
IEEE Annual Computing and Communication Workshop and Conference
(IEEE CCWC 2020), Las Vegas, Nevada, January 2020.
This paper explores the application of computational methods to
the analysis of the large and growing corpus of news articles
and related data on climate change. Topics are analyzed using
Latent Dirichlet Allocation and methods customized to specific
news sources that take advantage of keywords and other metadata
that may be present. Results of this method on news articles
drawn over several months are presented.
Climate Change; News; Topic Modeling; Machine Learning
-
Cost-effective data-collection systems for
citizen science. Sudarshan S. Chawathe. In Proceedings of
the Acadia National Park Science Symposium, Schoodic Education
and Research Center, Acadia National Park, Maine, October24 2019.
Citizen science efforts often include data collection by
volunteers. Computerizing such data collection provides several
benefits, including improved data consistency, shorter time from
collection to use, and immediate feedback to the data
collectors. Implementing such a computerized data collection
system is often challenging because it is difficult to accurately
estimate the level of participation and, therefore, the
required load-handling capacity. Overestimating the capacity
results in unnecessary infrastructure costs while
underestimating it leads to sluggish or failed systems. The
so-called serverless or cloud based systems are attractive in
this context because they permit the apparent (paid)
infrastructure to scale with load. Determining cost profiles of
different designs in this environment and, therefore, selecting
a suitable one are challenging tasks that are addressed by this
work.
-
Data Modeling for a NoSQL Database
Service. Sudarshan S. Chawathe. In Proceedings of the 10th
IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2019), New York, NY,
October10-12 2019. Columbia University.
Cloud-hosted NoSQL database services, such as AWS DynamoDB,
offer significant advantages, including low start-up costs, high
performance and availability, wide scalability, and ease of
deployment and management. These advantages have led to their
rapid adoption and growth. However, the data storage, querying,
and modification features supported by such NoSQL services are
very rudimentary in comparison with those of relational and
object database systems. Further, data modeling decisions made
to map application requirements to the supported NoSQL model
have very significant impact on not only performance but also
financial cost incurred in using the services. Unlike the well
developed body of work for relational- and object- database
design, there is a great dearth of systematic procedures for
NoSQL database design. This paper addresses this design problem
by providing methods that map standard data models to the
typically idiosyncratic and rudimentary models supported by
NoSQL database services, using AWS DynamoDB as a specific
instance.
Cloud Computing; NoSQL; Databases; Data Modeling.
-
Using Historical Data to Predict Parking
Occupancy. Sudarshan S. Chawathe. In Proceedings of the
10th IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2019), New York, NY,
October10-12 2019. Columbia University.
This paper describes methods that use historical data on the
rates of occupancy of car parking facilities to predict future
occupancy rates. The methods are evaluated using a publicly
available dataset of car park occupancy rates. The results
suggest that a usable level of prediction accuracy may be
achieved using only a modest amount of data that is easy to
gather using current technologies.
Intelligent Transportation Systems; Smart Cities; Car Parking;
Regression; Machine Learning.
-
Trusted Remote Function Interface.
Mark E. Royer and Sudarshan S. Chawathe. In Proceedings of the
10th IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2019), New York, NY,
October10-12 2019. Columbia University.
The Trusted Remote Function Interface (TRFI) is a small library
that exposes services via a REST API to allow function
execution with scientific programming languages. Functional
units are uploaded to a remote server using the provided REST
API. The API stores registered functions for later execution.
Maintaining code using this technique allows clients to
repeatedly execute functions without having the native
language, typically Octave or Python, installed on the client
machine. A common problem in scientific applications is the
requirement for a program to interface with scientific scripting
languages. Typically, this is not a straightforward approach
for accomplishing the data exchange and subsequent function
execution on that data from popular languages such as Java or
JavaScript. This task is extraordinarily cumbersome if the
interpreter, used by the scientific programming language, is not
installed locally. By separating the function signature from
the underlying implementation, and providing a uniform REST
API, the TRFI library allows function interfacing in two ways.
First, direct interfacing by using the equipped Java library.
Second, the more common scenario is interfacing remotely by
deploying the library using a JAX-RS compatible web server. The
result of the TRFI library's design and the provided REST API
is the facilitation of code interoperability and reuse for
scientific applications.
Java; REST; Octave; Python; function interoperability; data
exchange
-
Unformatted, Certified Scientific
Objects. Mark E. Royer and Sudarshan S. Chawathe. In
Proceedings of the 10th IEEE Annual Ubiquitous Computing,
Electronics, and Mobile Communication Conference (IEEE UEMCON
2019), New York, NY, October10-12 2019. Columbia University.
We present an approach for scientific data management systems to
apply certificates to scientific objects, which are typically
unformatted datasets, to facilitate analysis by climate
scientists. For a program to process data, the program requires
cleaned data in a form that supports automatic manipulation.
Most systems require that data must adhere to a specific format
to achieve that goal. The technique described in this paper
takes the opposite approach; instead, any dataset may be
imported and manipulated in the system. But upon initial
import, however, only a subset of system functions may work
with any given dataset. As the data is refined and transformed
by system functions, more functions may become compatible.
Certificates are associated with objects that pass constraint
validation within the system to ensure that they conform to
function requirements. The attached object constraints
represent invariant properties of the object, which may be used
by functions in the system as function preconditions.
Furthermore, the functions defined in the system may associate
certificates with the newly generated results. Certificates
related to function results are effectively function
postconditions, which in turn are used to associate certificates
with the objects generated in the system. Additionally,
attached object certificates reflect the refinement of data into a
more pristine version. This paper describes the technique for
modeling and enforcing the constraints for data scientists that
have similar requirements.
Data analysis; Constraints; Data processing
-
Indoor-location classification using RF
signatures. Sudarshan S. Chawathe. In Proceedings of the
17th IEEE International Symposium on Network Computing and
Applications (IEEE NCA 2019), Cambridge, MA, September26-28 2019.
Indoor localization using radio-frequency signals in the 2.4
GHz band is attractive due to availability of low-cost
commodity WiFi hardware. However, using such signals for
localization is challenging due to signal-propagation
complexities such as multipath, fading, and shadowing. This
paper describes a method for classifying indoor locations using
frequency-domain signatures of RF signals. The method is
evaluated using a publicly available dataset of detailed signal
measurements in a real environment.
indoor localization; radio-frequency signals; classification;
machine learning
-
Cost-Based Query-Rewriting for DynamoDB
(Work in Progress). Sudarshan S. Chawathe. In Proceedings
of the 17th IEEE International Symposium on Network Computing and
Applications (IEEE NCA 2019), Cambridge, MA, September26-28 2019.
DynamoDB is a popular NoSQL database service that permits
queries in a restrictive but useful query language. The metered
costs (which translate to financial costs) of executing such
queries are measured in units of provisioned capacity or number
of requests. Costs of equivalent queries may differ by orders of
magnitude but the onus of choosing a low-cost equivalent query
is on the service's client and must be performed by query
rewriting. This paper formulates this query-rewriting problem
for DynamoDB and outlines methods for choosing low-cost
equivalent queries.
DynamoDB; query evaluation; cost estimation; databases; NoSQL;
cloud computing.
-
Hand Gestures from Low-Cost
Surface-Electromyographs. Sudarshan S. Chawathe. In 2019
IEEE National Aerospace and Electronics Conference (NAECON),
July15-19 2019. 71st annual conference.
Low-cost and commodity off-the-shelf surface electromyographs
(sEMGs) may be used for unobtrusive detection of human hand
gestures. Although these EMG signals are not as detailed as
conventional ones, an experimental investigation of feature
engineering and classification reveals that they can yield
accurate hand gesture information.
hand gesture detection; surface electromyograph; sEMG; EMG;
COTS; sensors; classification; machine learning
-
Ultrasonic Flowmeter Diagnosis by
Classification. Sudarshan S. Chawathe. In 2019 IEEE
National Aerospace and Electronics Conference (NAECON), July15-19
2019. 71st annual conference.
Modern ultrasonic flowmeters provide routine diagnostic
information that may be used to infer their health. This
inference task is modeled as a classification problem and
studied experimentally using a publicly available dataset. A
few classifiers, such as Bayesian Networks, provide good
accuracy and also suggest relationships among the diagnostic
variables.
ultrasonic flowmeter; diagnostics; classification; machine
learning
-
Computational Analysis of Climate-Change
Discourse in News and Social Media. Sudarshan S. Chawathe.
In Proceedings of the 27th Annual Harold W. Borns Symposium, May
2019.
The study of topics that frame the discourse of climate change
in news and social media is useful for understanding media and
public perceptions of the field and its recent developments.
Computational methods for topic modeling, syntactic analysis,
and guided data exploration may be applied to readily available
big-data streams to extract topics and related information in
near-real time.
-
Ice core dating integration in the
Climate Data Workbench. Mark E. Royer, Sudarshan S.
Chawathe, Andrei V. Kurbatov, and Paul A. Maywewski. In
Proceedings of the 27th Annual Harold W. Borns Symposium, May
2019.
We present the software integration of ice core dating tools to
the Climate Data Workbench (P301 system). The implementation
allows researchers to use different annual indicators in ice
core time series in order to develop and apply time scales.
During the creation of the time scale, an interpolated, dated
version of the actively investigated core is presented to the
researcher in real-time.
-
Condition Monitoring of Hydraulic Systems
by Classifying Sensor Data Streams. Sudarshan S. Chawathe.
In Proceedings of the 9th IEEE Annual Computing and Communication
Workshop and Conference (IEEE CCWC 2019), Las Vegas, Nevada,
January 2019.
Condition-based maintenance (CBM) of hydraulic systems requires
methods for condition monitoring: Sensors installed in a
hydraulic system for this purpose generate streams of real-time
data that must be analyzed to accurately characterize the
health of the system. Prior work has developed an experimental
hydraulic system with such an installation and yielded a public
dataset of sensor readings with associated values of condition
variables that quantify the system's health. This paper
presents classification-based methods for inferring these
condition variables from the sensor data streams. These methods
significantly improve on the classification accuracy reported in
prior work on this data. Further, this accuracy is maintained
even when the number of sensor-based attributes used as input
is substantially reduced.
condition monitoring, condition-based maintenance, hydraulic
systems, sensors, classification
-
Recognizing Human Falls and Routine
Activities Using Accelerometers. Sudarshan S. Chawathe. In
Proceedings of the 9th IEEE Annual Computing and Communication
Workshop and Conference (IEEE CCWC 2019), Las Vegas, Nevada,
January 2019.
Detecting falls and other mishaps using data from sensors worn
by individuals is an important task with applications in
healthcare. A related task is using such sensor data to detect
routine activities of daily living. This paper models such
detection of falls and routine activities as a classification
problem. Using a publicly available dataset of real
accelerometer traces generated by participants performing
intentional falls and other activities, the efficacy and
performance of several classifiers are studied experimentally.
fall detection, activities of daily living, accelerometers,
sensors, classification
-
Clustering Blockchain Data.
Sudarshan S. Chawathe. In Olfa Nasraoui and Chiheb-Eddine Ben
N'cir, editors, Clustering methods for Big Data Analytics:
techniques, toolboxes and applications, chapter 3. Springer,
2019.
Blockchain datasets, such as those generated by popular
cryptocurrencies Bitcoin, Ethereum, and others, are intriguing
examples of big data. Analysis of these datasets has diverse
applications, such as detecting fraud and illegal transactions,
characterizing major services, identifying financial hotspots,
and characterizing usage and performance characteristics of
large peer-to-peer consensus-based systems. Unsupervised
learning methods in general, and clustering methods in
particular, hold the potential to discover unanticipated
patterns leading to valuable insights. However, the volume,
velocity, and variety of blockchain data, as well as the
difficulties in evaluating results, pose significant challenges to
the efficient and effective application of clustering methods to
blockchain data. Nevertheless, recent and ongoing work has
adapted classic methods, as well as developed new methods
tailored to the characteristics of such data. This chapter
motivates the study of clustering methods for blockchain data,
and introduces the key blockchain concepts from a data-centric
perspective. It presents different models and methods used for
clustering blockchain data, and describes the challenges and
some solutions to the problem of evaluating such methods.
-
The Tiny Java Library for Maintaining
Model Provenance. Mark E. Royer and Sudarshan S. Chawathe.
In Proceedings of the 9th IEEE Annual Ubiquitous Computing,
Electronics, and Mobile Communication Conference (IEEE UEMCON
2018), New York, NY, November 2018. Columbia University.
We present a small library for maintaining the provenance of
objects in a software model called The Tiny Java Library for
Maintaining Model Provenance (TJLP). A unique characteristic of
the library is that it may be applied to existing software
models with minimal modification. The library allows the
software developer to introduce the ability to move back (undo)
and forward (redo) through an object's instance history with
minimal code modification. The requirement is that the model
implements the Model interface. Finally, methods that are
considered critical in the object's provenance are adorned with
an Undoable annotation. The code necessary to maintain the
object's history is automatically inserted into the critical,
undoable-method bytecode when the class definition is loaded by
an extended class loader. The states of the model objects are
preserved both in memory and on disk to accommodate various
computer system configurations. The library performs well for
small to medium size models using the default settings, but it
may be customized in order to perform better with larger
models, especially if the model size approaches the RAM of the
underlying computer system.
Java annotations, data provenance, bytecode injection
-
A software workbench for studying past
climate. Mark E. Royer, Sudarshan S. Chawathe, and Andrei
V. Kurbatov. In Proceedings of the Acadia National Park Science
Symposium, Bar Harbor, Maine, October20 2018.
The study of past climate enables a better understanding of
present and future climate conditions. However, directly
measured data for temperature and other climate variables is
available for only the recent past (a few hundred years). Study
of climate in the more distant past, from centuries to
millennia before present, requires the use of indirect methods
which use other variables as proxies. Chief among such methods
is the use of data derived from ice cores. Analyzing such
ice-core data in order to gain insights into past climate is a
complex task that requires data from diverse sources to be
combined, transformed, and visualized in multiple and often
novel ways. In the past, such analysis was often performed
using an ad hoc collection of software tools, such as
spreadsheets and plotting programs. There are two primary
reasons why this past approach to analyzing data is no longer
effective: First, recent technological advances in the physical
and chemical processing of ice cores to extract measurements
have resulted in orders-of-magnitude increase in the volume of
data. Not only does this volume of data render some software
tools inoperable but also it makes it difficult for a human to
interpret data visually. Second, and more important, ad hoc
application of multiple tools to analyze data, even when it
produces usable results, typically leaves no systematic record
of the precise sequence of transformations that yield a data
product, such as a chart of temperature over time, from the
original data sources. The P301 project addresses these
shortcomings of prior data analysis methods by providing an
interactive, graphical software workbench with a few notable
features in this context: First, it can analyze even the
largest ice-core datasets available today, and more, in
interactive times (a few seconds at most). Second, it permits a
scientist to interactively use, define, and compose software
tools for analyzing data in diverse and powerful ways. Third,
all transformations of both tools and data are automatically
recorded by the system in a manner that permits examination,
study, transformation, and workflow management.
-
Monitoring IoT networks for botnet
activity. Sudarshan S. Chawathe. In Proceedings of the
17th IEEE International Symposium on Network Computing and
Applications (IEEE NCA 2018), Cambridge, MA, November 2018.
The Internet of Things (IoT) has rapidly transitioned from a
novelty to a common, and often critical, part of residential,
business, and industrial environments. Security vulnerabilities
and exploits in the IoT realm have been well documented. In
many cases, improving the security of an IoT device by
hardening its software is not a realistic option, especially in
the cost-sensitive consumer market or in legacy-bound
industrial settings. As part of a multifaceted defense against
botnet activity on the IoT, this paper explores a method based
on monitoring the network activity of IoT devices. A notable
benefit of this approach is that it does not require any special
access to the devices and adapts well to the addition of new
devices. The method is evaluated on a publicly available
dataset drawn from a real IoT network.
Internet of Things (IoT), botnets, network monitoring, machine
learning
-
Analysis of Burst Header Packets in
Optical Burst Switching networks. Sudarshan S. Chawathe.
In Proceedings of the 17th IEEE International Symposium on
Network Computing and Applications (IEEE NCA 2018), Cambridge,
MA, November 2018.
Optical Burst Switching (OBS) networks provide a practical
alternative to optical packet switching and optical circuit
switching by separating control information from the primary
data, sending the former on a separate control channel.
However, this separation also renders OBS networks susceptible
to a denial- or degradation-of-service attack (intentional or
otherwise) when the data provisioned by a header packet on the
control channel does not materialize. This paper addresses the
problem of detecting and characterizing such problems and
describes a method based on monitoring network traffic on the
control and data channels. The method is evaluated on a
publicly available dataset.
optical burst switching, quality of service, machine learning.
classification
-
Indoor Localization Using Bluetooth-LE
Beacons. Sudarshan S. Chawathe. In Proceedings of the 9th
IEEE Annual Ubiquitous Computing, Electronics, and Mobile
Communication Conference (IEEE UEMCON 2018), New York, NY,
November 2018. Columbia University.
Persons and devices in indoor environments, such as office
buildings, may determine their location using Bluetooth LE
beacons, such as iBeacons. Some number of these beacons are
distributed over the environment of interest and their
identifiers and locations are broadcast widely. The vector of
received signal strengths from all these beacons may be
intuitively expected to correlate well with location in the
physical environment. However, the complexities of Bluetooth
signal propagation in environments with obstructions and
channels (walls, furniture, ducts, etc.) make it difficult to
compute locations in this manner from only the signal values
and known locations of beacons. Instead, a data-driven approach
that uses a training set composed of observed signal strength
vectors at known locations is more effective. This paper studies
such methods using a publicly available dataset obtained by
collecting training data in an academic building.
indoor localization, beacons, Bluetooth-LE, iBeacons, machine
learning, data-driven methods
-
Classifying Self-Care Activities of
Children and Youths with Disabilities. Sudarshan S.
Chawathe. In Proceedings of the 9th IEEE Annual Ubiquitous
Computing, Electronics, and Mobile Communication Conference (IEEE
UEMCON 2018), New York, NY, November 2018. Columbia University.
The classification of functioning and disabilities in children
and youths is an important task that informs healthcare. The
ICF-CY (International Classification of Functioning, Disability,
and Health in Children and Youth) provides a standard framework
for such classification. Occupational therapists use the ICF-CY
in conjunction with observations of the routine activities
performed by a child (such as eating, toileting, washing) to
determine a suitable diagnostic group. This paper presents a
method for assisting occupational therapists and others in this
task using machine learning. The method is studied
experimentally using a publicly available dataset of self-care
activities.
ICF-CY, self-care activities, physical and motor disability,
classification, medical informatics
-
HDFJavaIO: A Java library for reading and
writing Octave HDF files. Mark E. Royer and Sudarshan S.
Chawathe. In Proceedings of the 9th IEEE Annual Ubiquitous
Computing, Electronics, and Mobile Communication Conference (IEEE
UEMCON 2018), New York, NY, November 2018. Columbia University.
Scientific Java programs often need to interact with specialized
programming environments, such as Octave and Matlab, that focus
on numerical computations. This paper presents the HDFJavaIO
library that allows Java programs to interact with Octave using
Hierarchical Data Format 5 (HDF5) files, which are commonly used
in the scientific community for working with large data sets.
Because features of HDF5 files include almost all of the
features of NetCDF, this library and method may also be used to
create data files that can be used with NCL scripts and other
applications that use these large-data formats without the need
for further modifications by Java application developers. This
paper presents the relevant details of the Octave HDF5 file
format and the Java techniques used to build the data
interchange library. It also presents the results of an
experimental analysis of the library's performance and its
comparison to existing approaches.
Java, Octave, HDF5, NetCDF, Hierarchical Data Format, data
interchange
-
Recognizing Activities of Daily Living
Using Binary Sensors. Sudarshan S. Chawathe. In
Proceedings of the IEEE International Conference on Universal
Village (IEEE UV 2018), Cambridge, MA, October 2018. MIT.
Activities of Daily Living (ADLs), or a person's routine
activities of self-care, are important factors influencing the
feasibility of home health care or aging in place for many
individuals. Automated, sensor-based recognition of such
activities affords home stay, greater independence and privacy,
and improved quality of life to individuals who would require
stay in a supervised or medical facility. This paper describes
a data-driven framework for the design and deployment of such
an automated system for activity recognition using simple,
unobtrusive, and privacy-friendly binary sensors. It presents
the results of an experimental study, with both numerical and
qualitative observations, of this framework on a publicly
available real dataset.
-
Analysis of Sparse Roadway
Trajectories. Sudarshan S. Chawathe. In Proceedings of the
IEEE International Conference on Universal Village (IEEE UV
2018), Cambridge, MA, October 2018. MIT.
Recent technological advances enable the gathering of extensive
data on vehicular trajectories of large numbers of travelers at
an unprecedented level of detail. Such trajectory datasets
provide a wealth of information for purposes such as urban
planning, carpool formation, and public-transportation design.
This paper describes methods for analyzing and visualizing such
data with an emphasis on sparse-traffic environments. It outlines
the needs of applications in this domain and presents methods
for clustering trajectories and for visualizing the results.
The methods are evaluated by an experimental study on a
publicly available dataset from real travelers.
-
Monitoring Blockchains with
Self-Organizing Maps. Sudarshan S. Chawathe. In
Proceedings of the 2018 International Workshop on Privacy,
Security and Trust in Computational Intelligence (PSTCI 2018),
New York, NY, August 2018. IEEE TrustCom-2018.
Blockchains such as those used by the Bitcoin and Ethereum
cryptocurrencies provide a global, observable record of all
transactions and associated data. Analyzing blockchain data is
useful for tasks such as detecting fraudulent activities,
studying the use and growth of the system, and understanding
its levels of anonymity and traceability. Such analysis is
challenging due to the high volume and rapidly changing
characteristics of popular blockchains. In particular, online
(soft real-time) analysis of blockchains requires methods that
adapt organically to changes in the data. This paper describes
such a method based on self-organizing maps and reports on
experiments using the Bitcoin blockchain data.
-
Improving Email Security with Fuzzy
Rules. Sudarshan S. Chawathe. In Proceedings of the 2018
International Workshop on Privacy, Security and Trust in
Computational Intelligence (PSTCI 2018), New York, NY, August
2018. IEEE TrustCom-2018.
Phishing and other malicious email messages are increasingly
serious security threats. An important tool for countering such
email threats is the automated or semiautomated detection of
malicious email. This paper reports work on using fuzzy rules
to classify email for such purposes. The effectiveness of a
fuzzy rule-based classifier is studied experimentally on a real
dataset and compared with results for other classifiers,
including those based on crisp rules and decision trees. The
human-readability and editability of the classifiers produced by
these methods is also studied.
-
A Low-Overhead Scalable Data-Collection
Service. Sudarshan S. Chawathe. In Proceedings of the
Borns Symposium, May 2018.
We study the large-scale soft-realtime distributed collection,
analysis, and reporting of data, emphasizing low-cost,
low-overhead solutions that scale gracefully as usage varies
over several orders of magnitude.
-
A Tiny Java Library for Maintaining Model
Provenance. Mark E. Royer and Sudarshan S. Chawathe. In
Proceedings of the Borns Symposium, May 2018.
We present a lightweight Java library that simplifies
maintenance of the provenance of software object models. The
implementation is based on annotations that are interpreted by
an extended class loader to inject the Java bytecode to enable
model maintenance.
-
A New Approach for Ultra-High-Resolution
Ice Core Data Processing. Heather Clifford, Nicole
Spaulding, Mark Royer, Sharon Sneed, Elena Korotkikh, Michael
Handley, Andrei Kurbatov, Sudarshan Chawathe, Pascal Bohleber,
Michael McCormick, Alexander More, Christohper Loveluck, and Paul
Mayewski. In Geophysical Research Abstracts. EGU General
Assembly, volume 20, Vienna, Austria, April 2018.
Ice core archives provide the most direct and detailed evidence
of past climate and atmospheric conditions. How- ever, the
resolution of traditional ice core sampling methods limits the
scope of information that can be extracted from the ice
regarding meteorological events (e.g., dust storms, volcanic
eruptions, anthropogenic emissions) that are captured at
inter-annual to sub-annual scales. Using laser ablation
inductively coupled mass spectrometry (LA- ICP-MS), a novel
ultra-high-resolution multi-element sampling method for ice
cores, we recovered the highest- resolution continuous
glacio-chemical record yet from an ice core, measuring close to
5 million samples from 40 meters of core. This unique record
was compiled using samples from the 2013 Colle Gnifetti ice
core, located in the Swiss-Italian Alps. Here we present the
first results from a new approach to high-resolution ice core
data analysis through a new array of statistical tools, data
processing algorithms and statistical machine learning tools
adapted for ice core data sets. Our new data processing
framework is designed to detect, extract and synthesize
environmental signals from ultra-high-resolution
glacio-chemical time series in concert with more traditional
ice core sampling data to further refine paleoenvironmental
signals. The authors gratefully acknowledge the Climate Change
Institute at the University of Maine, funding from grant AC3862
of the Arcadia Fund and NSF grant PLR-1443306.
-
Java unit annotations for
units-of-measurement error prevention,. Mark E. Royer and
Sudarshan S. Chawathe. In Proceedings of the 8th IEEE Annual
Computing and Communication Workshop and Conference (IEEE CCWC
2018), pages 858-864, Las Vegas, Nevada, January 2018.
This project is a Java library for representing measurement
units that provides easier avoidance and detection of a
significant source of errors in scientific code. The technique
uses the Java virtual-machine's class-loading extensions and
annotations with run-time retention policies to enforce units
conformance and conversion at run time. Analysis of the Java
bytecode is performed at run time (or possibly compile time) to
check conformance and conversion of unit-annotated types.
-
Lexical Text Segmentation Using
Dictionaries. Sudarshan S. Chawathe. In Proceedings of the
8th IEEE Annual Computing and Communication Workshop and
Conference (IEEE CCWC 2018), pages 56-62, Las Vegas, Nevada,
January 2018.
Text segmentation refers to the task of partitioning text into
disjoint segments based on some matching and optimization
criteria. Examples include partitioning text into words,
graphemes, and phonemes. The problem is especially challenging
when the language does not require spaces, punctuation, or
other simple separators; when segments may be combined in
nontrivial ways; and in the presence of errors in transcription
or recognition. This paper focuses on a purely lexical method
of segmentation: Text is segmented using only a dictionary of
known words along with a compatible cost function. No
grammatical or other higher-level knowledge is used. The method
uses efficient algorithms for multiple-string matching, such as
the classic Aho-Corasick algorithm, to yield significant
improvements in running time when compared with a simpler
dynamic programming algorithm. An experimental study compares
the running times of the dictionary-based and dynamic
programming algorithms.
-
Compact Representations of
Character-Sets. Sudarshan S. Chawathe. In Proceedings of
the 8th IEEE Annual Computing and Communication Workshop and
Conference (IEEE CCWC 2018), pages 49-55, Las Vegas, Nevada,
January 2018.
Text segmentation refers to the task of partitioning text into
disjoint segments based on some matching and optimization
criteria. Examples include partitioning text into words,
graphemes, and phonemes. The problem is especially challenging
when the language does not require spaces, punctuation, or
other simple separators; when segments may be combined in
nontrivial ways; and in the presence of errors in transcription
or recognition. This paper focuses on a purely lexical method
of segmentation: Text is segmented using only a dictionary of
known words along with a compatible cost function. No
grammatical or other higher-level knowledge is used. The method
uses efficient algorithms for multiple-string matching, such as
the classic Aho-Corasick algorithm, to yield significant
improvements in running time when compared with a simpler
dynamic programming algorithm. An experimental study compares
the running times of the dictionary-based and dynamic
programming algorithms.
-
Annotating Unit Functions in the Climate
Data Workbench. Mark Royer, Sudarshan S. Chawathe, Andrei
V. Kurbatov, and Paul A. Mayewski. In Proceedings of the Borns
Symposium, April 2017.
We describe a method for representing measurement units for the
Climate Data Workbench, providing easier avoidance and
detection of a significant source of errors in scientific code.
Our method uses the Java virtual-machine's class-loading
extensions, and annotations with runtime retention policies, to
enforce units conformance and conversion at runtime.
-
Functional-programming with Generic
Mapping Tools (fGMT). Sudarshan S. Chawathe. In
Proceedings of the Borns Symposium, April 2017.
We describe fGMT, a functional interface to the very popular
GMT collection of mapping and plotting tools. Our
implementation uses scsh Scheme and is designed to permit
incremental building of higher-level interfaces that
incorporate domain-specific knowledge.
-
The P301 Web API. Mark Royer,
Sudarshan S. Chawathe, Andrei V. Kurbatov, and Paul A. Mayewski.
In Proceedings of the Borns Symposium, April 2016.
The P301 Web API is a RESTful interface that allows P301 users
to share data that have been uploaded to the P301 system. The
system supports accessing data in JavaScript Object Notation
(JSON) and Extensible Markup Language (XML) formats, which
helps to facilitate the development of Web-based applications.
A variety of queries for accessing the data in the system
allows for flexibility in client system designs.
-
Toward a Domain-Specific Language for
Patterns in Ice-Core Data. Sudarshan S. Chawathe. In
Proceedings of the Borns Symposium, April 2016.
We describe a language for expressing simple patterns in time
series data derived from ice-cores and similar sources. Such
patterns use simpler features mapped to tokens by an earlier
phase of analysis. In turn, they allow more complex features to
be expressed and analyzed.
-
Interactive Exploration of Time Lines
from Ice Core Data Sets. Sudarshan S. Chawathe. In
Proceedings of the Borns Symposium, April 2015.
Time lines are derived from ice core data typically by counting
layers or peaks in sequences of measured values. This work (in
progress) explores the extent to which automation and
interactive exploration may assist this task.
-
Deploying a Multi-Interface RESTful
Application in the Cloud. Erik Albert and Sudarshan S.
Chawathe. In Proceedings of the 6th International Conference on
Data Management in Cloud, Grid and P2P Systems (Globe-13),
Prague, Czech Republic, August 2013.
This paper describes the design, implementation, and deployment
of an application server whose primary infrastructure is an
elastic cloud of servers. The design is based on the
Representational State Transfer (REST) style, which provides
significant benefits in a cloud environment. The paper also
addresses implementation issues within a specific cloud service
and highlights key decisions and their effect on scalability and
cost. Finally, it describes our experiences in deploying a
widely used platform with both Web and mobile client interfaces
and its ability to cope with load spikes while maintaining a
low quiescent cost.
-
Fast Fingerprinting for File-System
Forensics. Sudarshan S. Chawathe. In In Proceedings of the
12th annual IEEE Conference on Technologies for Homeland Security
(HST), pages 591-596, Waltham, Massachusetts, November 2012.
An important method used to speed up forensic file-system
analysis is white-listing of files: Well-known files are detected
using signatures (message digests) or similar methods, and
omitted from further analysis initially, in order to better
focus the initial analysis on files likely to be more important.
Typical examples of such well-known files include files used by
operating systems, popular applications, and software
libraries. This paper presents methods for improving the
effectiveness and efficiency of such signature-based white-listing
during file-system forensics. One concern for effectiveness is
the resilience of the white-listing method to an adversary who
has complete knowledge of the method and who may make small,
inconsequential changes to a large number of well-known files on
a target file-system in order to overload the analysis and
thereby practically defeat it. Another concern is the ability
to detect near-matches in addition to exact matches. Efficiency
refers to primarily the rate at which a target file system may
be processed during analysis; preparation-time, or indexing,
efficiency is a lesser concern as that computation may be
performed during non-critical times. Our work builds on
techniques such as locality-sensitive hashing to yield an
effective filter for further analysis tools.
-
Managing Diverse Data Sets Using
P301. Mark Royer, Sudarshan S. Chawathe, Andrei V.
Kurbatov, and Paul A. Mayewski. In Proceedings of the 20th annual
Harold W. Borns, Jr. Symposium, Orono, Maine, April 2012.
The integration and analysis of data sets from diverse sources
provides scientists with an opportunity to gain insights that
are not apparent from the individual data sets or sources. For
many sources, improving technology and other factors have
resulted in a very rapid growth in both the volume and the
diversity of data. This wealth of data has the potential for
significant scientific breakthroughs. However, this potential is
difficult to realize unless there is a systematic and effective
method for managing this data. The methods used by researchers
in the past typically do not scale up to current and
anticipated levels of data volume and diversity. The P301
project addresses this problem with the goal of accelerating
the data flow from data sources to research results. Below, we
outline one aspect of this work: Managing the syntactic and
semantic consistency of data using an interactive framework
that eases the task of importing, cleaning, analyzing, and
visualizing data, and of recording such data transformations
and results using histories and certificates.
-
Deploying a Highly Scalable Web
Application in the Cloud. Erik Albert and Sudarshan S.
Chawathe. In Proceedings of the 20th annual Harold W. Borns, Jr.
Symposium, Orono, Maine, April 2012.
The 10Green Web application integrates air quality data from
diverse sources and provides an intuitive interface that
summarizes this information in a manner accessible to
scientists and non-scientists alike. From a Computer Science
perspective, this application presents interesting challenges
in both the back end (e.g., data integration and analysis,
maintainability) and the front end (e.g., Web-based
visualization, interactive response times, and portability
across very diverse client architectures). Here, we focus on
scalability and outline the implementation aspects that allow
the application to scale from a few hundred users to hundreds
of thousands of concurrent users at low cost.
-
A REST Framework for Dynamic Client
Environments. Erik Albert and Sudarshan S. Chawathe. In
Erik Wilde and Cesare Pautasso, editors, REST: From Research to
Practice, chapter 10. Springer, 1st edition, August 2011. ISBN
978-1-4419-8302-2.
The REST Framework for Dynamic Client Environments (RFDE) is a
method for building RESTful Web applications that fully exploit
the diverse and rich feature-sets of modern client environments
while retaining functionality in the absence of these features.
For instance, we describe how an application may use a modern
JavaScript library to enhance interactivity and end-user
experience while also maintaining usability when the library is
unavailable to the client (perhaps due to incompatible
software). These methods form a framework that we have
developed as part of our work on a Web application for
presenting large volumes of scientific datasets to
nonspecialists.
-
A low-cost scalable Web mapping service
for climate data. Erik Albert and Sudarshan S. Chawathe.
In Proceedings of the 19th annual Harold W. Borns, Jr. Symposium,
Orono, Maine, May 2011.
We describe the design and implementation of a method to serve
hundreds of terabytes of image data (tiles) for a Web-based
mapping service. The method allows the service to scale
gracefully from a few dozen to thousands of concurrent
connections. Map tiles are stored in implicit form in a
database and the corresponding bit-mapped images are computed
as needed using an efficient stored-procedure implementation. The
implementation is also particularly well suited to deployment
in the cloud computing environment.
-
P301dx: Interactive data analysis.
Mark Royer, Sudarshan S. Chawathe, Andrei V. Kurbatov, and Paul
A. Mayewski. In Proceedings of the 19th annual Harold W. Borns,
Jr. Symposium, Orono, Maine, May 2011.
The Project 301 Data Explorer, P301dx, is a software workbench
for climate-change data. It aids scientists with the tasks of
storing, integrating, sharing, analyzing, and visualizing such
data. The primary goal of Project 301 is improving the efficiency
and effectiveness of the process of transforming raw data into
easily interpretable scientific results.
- Information Systems for
Passenger Guidance in Transit Systems, Sudarshan S.
Chawathe. Invited presentation at the Symposium on Engineering and
Technologies for the Metro Bogota (Metrosimposio), May 2010
- Low-Latency Indoor
Localization Using Bluetooth Beacons. Sudarshan S. Chawathe.
In Proceedings of the 12th International IEEE Conference on
Intelligent Transportation Systems (ITSC), St. Louis, Missouri,
October 2009
- Effective Whitelisting
for Filesystem Forensics. Sudarshan S. Chawathe. In
Proceedings of the 7th IEEE Intelligence and Security Informatics
Conference (ISI), Richardson, Texas, June 2009
- Beacon Placement for
Indoor Localization using Bluetooth. Sudarshan S. Chawathe.
In Proceedings of the 11th International IEEE Conference on
Intelligent Transportation Systems (ITSC), pages 980-985, Beijing,
China, October 2008
- Using Dead Drops to
Improve Data Dissemination in Very Sparse Equipped Traffic.
Sudarshan S. Chawathe. In Proceedings of the IEEE Intelligent
Vehicles Symposium (IV), pages 962-967, Eindhoven, Netherlands,
June 2008
- Protecting
Transportation Infrastructure. Daniel Zeng, Sudarshan S.
Chawathe, Hua Huang, and Fei-Yue Wang. IEEE
Intelligent Systems, 22(5):8-11, September/October 2007
- Marker-Based Localizing
for Indoor Navigation. Sudarshan S. Chawathe. In Proceedings
of the 10th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 885-890, Seattle, Washington,
October 2007
- Interimistic Data
Dissemination. Sudarshan S. Chawathe and Abheek Anand.
Information Systems and e-Business Management (ISeB),
5(3):229-253, June 2007
- Segment-Based Map
Matching. Sudarshan S. Chawathe. In Proceedings of the IEEE
Intelligent Vehicles Symposium (IV), pages 1190-1197, Istanbul,
Turkey, June 2007
- Organizing Hot-Spot
Police Patrol Routes. Sudarshan S. Chawathe. In Proceedings
of the 5th IEEE Intelligence and Security Informatics Conference
(ISI), pages 78-85, New Brunswick, New Jersey, May 2007
- Protecting
Transportation Infrastructure. Daniel Zeng, Sudarshan S.
Chawathe, and Fei-Yue Wang. IEEE Intelligent
Transportation Systems Society Newsletter, 2007. Republished
as a selected paper from the IEEE Intelligent Systems
- Inter-Vehicle Data
Dissemination in Sparse Equipped Traffic. Sudarshan S.
Chawathe. In Proceedings of the 9th International IEEE Conference
on Intelligent Transportation Systems (ITSC), pages 273-280,
Toronto, Canada, September 2006
- Strategic Web-Service
Agreements. Sudarshan S. Chawathe. In Proceedings of the 4th
IEEE International Conference on Web Services (ICWS), pages
119-126, Chicago, Illinois, September 2006
- Tracking Changes in
Healthcare Documents. Sudarshan S. Chawathe. In Proceedings
of the 19th IEEE International Symposium on Computer-Based Medical
Systems (CBMS), pages 137-142, Salt Lake City, Utah, June 2006
- Distributing the Cost
of Securing a Transportation Infrastructure. Sudarshan S.
Chawathe. IEEE Intelligent Transportation
Systems Society Newsletter, 8(2):17-21, June 2006.
Republished as one of two selected papers from the ISI-2006
conference
- Distributing the Cost
of Securing a Transportation Infrastructure. Sudarshan S.
Chawathe. In Proceedings of the 4th IEEE Intelligence and Security
Informatics Conference (ISI), pages 596-601, San Diego, California,
May 2006
- Fair Policies for
Travel on Neighborhood Streets. Sudarshan S. Chawathe. In
Proceedings of the 8th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1027-1032, Vienna, Austria,
September 2005
- Book Review:
Perspectives on Intelligent Transportation Systems.
Sudarshan S. Chawathe. IEEE Intelligent
Transportation Systems Society Newsletter, 7(3):14-15,
September 2005
- Differencing Data
Streams. Sudarshan S. Chawathe. In Proceedings of the 9th
International Database Engineering and Applications Symposium
(IDEAS), pages 273-284, Montreal, Canada, July 2005
- XSQ: A Streaming XPath
Engine. Feng Peng and Sudarshan S. Chawathe. ACM Transactions on Database
Systems (TODS), 30(2):577-623, June 2005
- Data Management in
Interimistic Environments. Abheek Anand and Sudarshan S.
Chawathe. In Proceedings of the Third Workshop on E-Business (WeB),
Washington, D.C., December 2004
- Real-Time Traffic-Data
Analysis. Sudarshan S. Chawathe. In Proceedings of the 7th
IEEE International Conference on Intelligent Transportation Systems
(ITSC), pages 112-117, Washington, D.C., October 2004
- Control of Personal
Location Data. Sudarshan S. Chawathe. In Proceedings of the
Location Privacy Workshop, Schoodic Peninsula, Acadia National
Park, Maine, August 2004
- Managing RFID
Data. Sudarshan S. Chawathe, Venkat Krishnamurthy, Sridhar
Ramachandran, and Sanjay Sarma. In Proceedings of the 30th
International Conference on Very Large Data Bases (VLDB), pages
1189-1195, Toronto, Canada, August 2004
- Privacy-Preserving
Inter-Database Operations. Gang Liang and Sudarshan S.
Chawathe. In Proceedings of the Symposium on Intelligence and
Security Informatics (ISI), volume 3073 of Lecture Notes in Computer
Science (LNCS), pages 66-82, Tucson, Arizona, June 2004
- Skipping Streams with
XHints. Akhil Gupta and Sudarshan S. Chawathe. Technical
Report CS-TR-4566, Computer Science Department, University of
Maryland, College Park, Maryland, February 2004
- Privacy-Preserving
Inter-Database Operations. Gang Liang and Sudarshan S.
Chawathe. Technical Report CS-TR-4564 (UMIACS-TR-2004-09),
University of Maryland, College Park, February 2004
- Efficient Peer-to-Peer
Namespace Searches. Vijay Gopalakrishnan, Bobby
Bhattacharjee, Sudarshan S. Chawathe, and Pete Keleher. Technical
Report CS-TR-4568, University of Maryland, College Park, Maryland,
February 2004
- Cooperative Data
Dissemination in a Serverless Environment. Abheek Anand and
Sudarshan S. Chawathe. Technical Report CS-TR-4562, Computer
Science Department, University of Maryland, College Park, Maryland,
January 2004
- XPaSS: A Multiple-Query
Streaming XPath Query Engine. Feng Peng and Sudarshan S.
Chawathe. Technical Report CS-TR-4565, Computer Science Department,
University of Maryland, College Park, Maryland, January 2004
- Streaming XPath
Subquery Evaluation. Feng Peng and Sudarshan S. Chawathe.
Technical Report CS-TR-4560, Computer Science Department,
University of Maryland, College Park, Maryland, January 2004
- Optimal Buffering for
Streaming XPath Evaluation. Feng Peng and Sudarshan S.
Chawathe. Technical Report CS-TR-4561, Computer Science Department,
University of Maryland, College Park, Maryland, January 2004
- Semistructured Data in
Relational Databases. Sudarshan S. Chawathe, chapter 25,
pages 1-19. Practical Handbook of Internet Computing. CRC Press,
2004
- XSQ: A Streaming XPath
Engine. Feng Peng and Sudarshan S. Chawathe. Technical
Report CS-TR-4493, Department of Computer Science, University of
Maryland, May 2003
- XPath Queries on
Streaming Data. Feng Peng and Sudarshan S. Chawathe. In
Proceedings of the ACM SIGMOD International Conference on
Management of Data (SIGMOD), pages 431-442, San Diego, California,
June 2003
- Tracking Hidden Groups
Using Communications. Sudarshan S. Chawathe. In Proceedings
of the NSF/NIJ Symposium on Intelligence and Security Informatics
(ISI), volume 2665 of Lecture Notes
in Computer Science (LNCS), pages
195-208, Tucson, Arizona, June 2003
- XSQ: Streaming XPath
Queries. Feng Peng and Sudarshan S. Chawathe. In Proceedings
of the 19th International Conference on Data Engineering (ICDE),
pages 780-782, Bangalore, India, March 2003. Demonstration
description
- Efficient Peer-to-Peer
Searches Using Result-Caching. Bobby Bhattacharjee,
Sudarshan S. Chawathe, Vijay Gopalkrishnan, Pete Keleher, and Bujor
Silaghi. In Proceedings of the International Workshop on
Peer-to-Peer Systems (IPTPS), pages 225-236, Berkeley, California,
February 2003
- Managing Historical XML
Data. Sudarshan S. Chawathe, volume 57 of Advances in Computers, chapter 3, pages 109-169.
Elsevier Science, 2003
- SEuS: Structure
Extraction using Summaries. Shayan Ghazizadeh and Sudarshan
S. Chawathe. In Steffen Lange, Ken Satoh, and Carl H. Smith,
editors, Proceedings of the 5th International Conference on
Discovery Science, volume 2534 of Lecture
Notes in Computer Science (LNCS), pages 71-85, Lubeck,
Germany, November 2002
- Tracking Moving
Clutches in Streaming Graphs. Sudarshan S. Chawathe.
Technical Report CS-TR-4376, Computer Science Department,
University of Maryland, College Park, Maryland, October 2002
- XSQ: Streaming XPath
Queries. Feng Peng and Sudarshan S. Chawathe. Technical
Report CS-TR-4401 (UMIACS-TR-2002-81), Computer Science Department,
University of Maryland, College Park, Maryland, September 2002
- Discovering Freuqent
Structures using Summaries. Shayan Ghazizadeh and Sudarshan
Chawathe. Technical report, University of Maryland, Computer
Science Department, 2002
- Discovering Frequent
Structures using Summaries. Shayan Ghazizadeh and Sudarshan
S. Chawathe. Technical Report CS-TR-4364, Computer Science
Department, University of Maryland, College Park, Maryland,
November 2001
- VQBD: Exploring
Semistructured Data. Sudarshan S. Chawathe, Thomas Baby, and
Jihwang Yeo. In Proceedings of the ACM SIGMOD International
Conference on Management of Data (SIGMOD), page 603, Santa Barbara,
California, May 2001. Demonstration description
- VQBD: Visualizing,
Querying, and Browsing Semistructured Data, Sudarshan S.
Chawathe, Thomas Baby, and Jihwang Yeo, November 2000. Extended
version of demonstration description. http://cs.umaine.edu/~chaw/
- Comparing Hierarchical
Data in External Memory. Sudarshan S. Chawathe. In
Proceedings of the International Conference on Very Large Data
Bases (VLDB), pages 90-101, Edinburgh, Scotland, September
1999
- Describing and
Manipulating XML Data. Sudarshan S. Chawathe. Bulletin of the IEEE Technical Committee on Data Engineering,
22(3):3-9, September 1999
- Managing Historical
Semistructured Data. Sudarshan S. Chawathe, Serge Abiteboul,
and Jennifer Widom. Theory and Practice of
Object Systems, 5(3):143-162, August 1999
- Managing Change in
Heterogeneous Autonomous Databases. Sudarshan S. Chawathe.
PhD thesis, Stanford University, 1999
- Representing and
Querying Changes in Semistructured Data. Sudarshan S.
Chawathe, Serge Abiteboul, and Jennifer Widom. In Proceedings of
the International Conference on Data Engineering (ICDE), pages
4-13, Orlando, Florida, February 1998
- An Expressive Model for
Comparing Tree-Structured Data. Sudarshan S. Chawathe and
Hector Garcia-Molina. Technical report, Stanford University
Database Group, November 1997
- Representing and
Querying Changes in Heterogeneous Semistructured Databases
(Demonstration Description). S.
Chawathe, V. Gossain, X. Liu, J Widom, and S. Abiteboul. Technical
report, Stanford University Database Group, November 1997.
Available at http://www-db.stanford.edu
- Meaningful Change
Detection in Structured Data. Sudarshan S. Chawathe and
Hector Garcia-Molina. In Proceedings of the ACM SIGMOD
International Conference on Management of Data (SIGMOD), pages
26-37, Tuscon, Arizona, May 1997
- Representing and
Querying Changes in Semistructured Data (Extended Version),
S. Chawathe, S. Abiteboul, and J. Widom. Available at http://www-db.stanford.edu, 1997
- Meaningful Change
Detection in Structured Data, S. Chawathe and H.
Garcia-Molina. Available at http://www-db.stanford.edu/, 1997. Extended
version
- Representative Objects:
Concise Representations of Semistructured, Hierarchial Data.
Svetlozar Nestorov, Jeffrey D. Ullman, Janet Wiener, and Sudarshan
S. Chawathe. In Proceedings of the International Conference on Data
Engineering (ICDE), pages 79-90, Birmingham, U.K., 1997
- Change Detection in
Hierarchically Structured Information. Sudarshan S.
Chawathe, Anand Rajaraman, Hector Garcia-Molina, and Jennifer
Widom. In Proceedings of the ACM SIGMOD International Conference on
Management of Data (SIGMOD), pages 493-504, Montréal, Québec, June
1996
- A standard textual
interchange format for the Object Exchange Model (OEM). Roy
Goldman, Sudarshan S. Chawathe, Arturo Crespo, and Jason McHugh.
Technical report, Stanford University Database Group, June
1996
- A Toolkit for
Constraint Management in Heterogeneous Information Systems.
Sudarshan S. Chawathe, Hector Garcia-Molina, and Jennifer Widom. In
Proceedings of the International Conference on Data Engineering
(ICDE), pages 56-65, New Orleans, Louisiana, 1996
- Change Detection in
Hierarchically Structured Information. S. Chawathe, A.
Rajaraman, H. Garcia-Molina, and J. Widom. Technical report, Dept.
of Computer Science, Stanford University, 1995. Available at
http://www-.stanford.edu
- Change Detection in
Hierarchically Structured Information. S. Chawathe, A.
Rajaraman, H. Garcia-Molina, and J. Widom. Technical report,
Stanford University Database Group, 1995. Available at http://www-db.stanford.edu
-
The Tsimmis Project: Integration of
Heterogeneous Information Sources. S. Chawathe, H.
Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J.
Ullman, and J. Widom. In Proceedings of 100th Anniversary Meeting
of the Information Processing Society of Japan, pages 7-18,
Tokyo, Japan, October 1994.
The goal of the Tsimmis Project is to develop tools that
facilitate the rapid integration of heterogeneous information
sources that may include both structured and unstructured data.
This paper gives an overview of the project, describing
components that extract properties from unstructured objects,
that translate information into a common object model, that
combine information from several sources, that allow browsing
of information, and that manage constraints across
heterogeneous sites. Tsimmis is a joint project between
Stanford and the IBM Almaden Research Center.
- The Tsimmis Project:
Integration of Heterogeneous Information Sources. Sudarshan
S. Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland,
Yannis Papakonstantinou, Jeffrey D. Ullman, and Jennifer Widom. In
Proceedings of 100th Anniversary Meeting of the Information
Processing Society of Japan, pages 7-18, Tokyo, Japan, October
1994
- Flexible Constraint
Management for Autonomous Distributed Databases. Sudarshan
S. Chawathe, Hector Garcia-Molina, and Jennifer Widom. Data Engineering Bulletin, 17(2):23-27, June
1994
- Constraint Management
in Loosely Coupled Distributed Databases. Sudarshan S.
Chawathe, Hector Garcia-Molina, and Jennifer Widom. Technical
report, Computer Science Department, Stanford University, 1994.
Available at http://www-db.stanford.edu
- On Index Selection
Schemes for Nested Object Hierarchies. Sudarshan S.
Chawathe, Ming-Syan Chen, and Philip S. Yu. In Proceedings of the
International Conference on Very Large Data Bases (VLDB), pages
331-341, Santiago de Chile, 1994
- Constraint Management
in Loosely Coupled Distributed Databases. Sudarshan S.
Chawathe, Hector Garcia-Molina, and Jennifer Widom. Technical
report, Computer Science Department, Stanford University, 1993.
Available at http://www-db.stanford.edu