Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous upstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality…mehr
Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous upstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality of data in the data warehouse remains high. The process of maintaining high data quality is commonly referred to as data cleaning. In this book, we first discuss the goals of data cleaning. Often, the goals of data cleaning are not well defined and could mean different solutions in different scenarios. Toward clarifying these goals, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common data cleaning tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that could be used as building blocks for developing common solutions. This is similar to the approach of relational algebra for query processing. The basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts which leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks.
Venky Ganti is the co-founder and CTO of Alation Inc, where he is developing technology to effectively search, understand, and analyze structured and semi-structured data. Prior to Alation, he was a member of the Google Adwords engineering team for a few years. He helped develop the Dynamic Search Ads (DSA) product, whose goal is to completely automate the configuration and maintenance of AdWords campaigns based on an advertiser's website and a few configuration parameters. ¿e main technical challenge is to mine for appropriate keywords and automatically create high quality ads which match the accuracy and quality of manually configured campaigns. Prior to Google, Venky was a senior researcher at Microsoft Research (MSR). While at MSR, he worked extensively on data cleaning and integration technologies. Some of the technologies he helped develop in this context are now part of Microsoft SQL Server Integration Services, the ETL platform of Microsoft SQL Server. He also worked on leveraging rich structured databases on products, movies, people, etc., to enrich user experience for web search. Some of the tech nologies he helped develop are now part of the Bing product search. He has a Ph.D. in database systems and data mining from the University of Wisconsin-Madison. Anish Das Sarma is currently a Senior Research Scientist at Google (since May 2010), before which he was a Research Scientist at Yahoo (August 2009-April 2010). Prior to joining Yahoo research, Anish did his Ph.D. in Computer Science at Stanford University, advised by Prof. Jen nifer Widom. Anish received a B.Tech. in Computer Science and Engineering from the Indian Institute of Technology (IIT) Bombay in 2004, and an M.S. in Computer Science from Stan ford University in 2006. Anish is a recipient of the Microsoft Graduate Fellowship, a Stanford University School of Engineering fellowship, and the IIT-Bombay Dr. Shankar Dayal Sharma Gold Medal. Anish has written over 40 technical papers, filed over 10 patents, is associate edi tor of Sigmod Record, has served on the thesis committee of a Stanford Ph.D. student, and has served on numerous program committees. Two SIGMOD and one VLDB paper co-authored by Anish were selected among the best papers of the conference, with invitations to journals. While at Stanford, Anish co-founded Shout Velocity, a social tweet ranking system that was named a top-50 fbFund Finalist for most promising upcoming start-up ideas