Data quality is a serious concern in any organization that relies on data. In practice, data quality is often poor for many reasons, including spelling mistakes, abbreviations, lack of standards, and inconsistent notation.
SPIDER is a declarative data cleaning tool. It provides a set of algorithms for improving data quality in any relational data source. Because SPIDER is based purely on declarative methods, it works with any relational data source and does not rely on standardized dictionaries or libraries of “clean” data. Its main features include a large collection of similarity measures expressed entirely in SQL, heavy use of sampling for improved performance, statistical schema matching, and statistical methods for analyzing data and identifying quality problems.
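SPIDER's own similarity measures are not reproduced here, but the idea of expressing a similarity measure entirely in SQL can be sketched with a common textbook example: q-gram Jaccard similarity, where strings are decomposed into short substrings and the overlap between any two strings is computed declaratively by a self-join. The table name, column names, and sample strings below are illustrative assumptions, not part of SPIDER.

```python
import sqlite3

def qgrams(s, q=2):
    """Decompose a string into its set of q-grams, padded with '#'
    so that string boundaries also contribute grams."""
    s = f"#{s}#"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: one row per (string, q-gram) pair.
cur.execute("CREATE TABLE grams (name TEXT, gram TEXT)")
names = ["Nick Koudas", "Nick Kudas", "Mohammad Sadoghi"]
for n in names:
    cur.executemany("INSERT INTO grams VALUES (?, ?)",
                    [(n, g) for g in qgrams(n)])

# Jaccard similarity of every pair of distinct strings, purely in SQL:
# the self-join on shared grams yields |A ∩ B|, and the correlated
# subqueries supply |A| and |B|, so |A ∪ B| = |A| + |B| - |A ∩ B|.
cur.execute("""
    SELECT a.name, b.name,
           CAST(COUNT(*) AS REAL) /
           ((SELECT COUNT(*) FROM grams WHERE name = a.name) +
            (SELECT COUNT(*) FROM grams WHERE name = b.name) -
            COUNT(*)) AS jaccard
    FROM grams a JOIN grams b
      ON a.gram = b.gram AND a.name < b.name
    GROUP BY a.name, b.name
    ORDER BY jaccard DESC
""")
rows = cur.fetchall()
for row in rows:
    print(row)
```

The misspelled pair ("Nick Koudas" / "Nick Kudas") ranks first, which is the kind of signal a declarative cleaning tool can exploit without any external dictionary of clean values.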
For more information on SPIDER, please visit the project site.
Nick Koudas (Faculty)
Mohammad Sadoghi (Graduate Student)