|
Speaker: Daisy Zhe Wang
University of California, Berkeley
Title: Querying Probabilistic Information Extraction
Abstract: Recently there has been significant interest in extending database
systems to deal with probabilistic information. Typical approaches
attach some notion of uncertainty to data at the record and/or field
level. Such approaches are limited in their ability to represent
probabilistic correlations and thus often require statistical
inference to be performed outside of the database, leading to
poor performance and inaccurate results. Instead, we advocate for
a closer integration of Statistical Machine Learning models into the
database system itself.
In this talk, I will first describe the BayesStore project, which is
developing such an integrated probabilistic database system. I will
also discuss a number of applications where such an integrated system
would be particularly useful, including sensor data analytics,
information extraction, and intrusion detection. For the rest
of the talk, I will focus on one particular application: using
Information Extraction (IE) so that relational query processing can
include data obtained from unstructured sources. Compared to approaches
that use IE techniques outside of the database, we show that an
integrated approach, in which the statistical model and inference are
supported natively in the database and optimized together with the
relational operators, can improve both answer quality and efficiency. I
will describe an implementation of these ideas that provides a
query-oriented language for specifying, optimizing, and executing IE
tasks, and supports a principled probabilistic framework for querying
the outputs of those tasks.
Bio:
Daisy Zhe Wang is a Ph.D. student in the Computer Science department at
UC Berkeley. She is a member of the database research group and
RAD-Lab. She received her B.A.Sc. degree with honors from the University of
Toronto in 2005. Her research focuses on data management systems that
support scalable, declarative, on-line data analytics based on
Statistical Machine Learning models. She has collaborated with Yahoo!
Research, IBM Research at Almaden, and Intel Research on probabilistic
information management. She also has industry experience at Google
and IBM Toronto Lab.
Hosted by: Renee Miller
|