
Data cleaning

Data quality might suffer because
For the data to be useful, it needs to be purged of its problems.

Steps

Data is cleaned in three steps:
Discrepancies to look out for are

Iteration

Data usually needs to be cleaned iteratively: resolving one data problem often reveals problems that lie deeper.
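A minimal sketch of such an iteration in Python. The column, its values, and the plausible range of 0 to 120 are made up for illustration. The out-of-range age can only be detected after the formatting problems have been resolved and the values are numbers.

```python
# Hypothetical raw "age" column, e.g. as read from a CSV file (values are made up).
raw_ages = ["34 ", " 29", "340", "n/a", "41"]

def parse_age(text):
    """Strip whitespace and convert to int; return None if the text is not a number."""
    text = text.strip()
    return int(text) if text.isdigit() else None

# Iteration 1: resolve formatting problems (stray whitespace, non-numeric markers).
parsed = [parse_age(a) for a in raw_ages]

# Iteration 2: now that the values are numbers, a range check becomes possible
# and reveals a deeper problem. The bounds 0..120 are an assumption.
implausible = [a for a in parsed if a is not None and not 0 <= a <= 120]

# Treat implausible values as missing, just like the unparseable ones.
cleaned = [a if a is not None and 0 <= a <= 120 else None for a in parsed]

print(parsed)       # [34, 29, 340, None, 41]
print(implausible)  # [340]
print(cleaned)      # [34, 29, None, None, 41]
```

In practice, each pass would be followed by another look at the data, because the next problem is not known in advance.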

Misc

In 1996, Lakshmanan, Sadri, and Subramanian proposed an extension to SQL (SchemaSQL) that allows operating on messy datasets.
Raman and Hellerstein provide a framework for cleaning datasets («Potter's Wheel») (2001).
Kandel, Paepcke, Hellerstein, and Heer developed an interactive tool with a friendly user interface that automatically creates code to clean data (2011).

See also

Data preparation
