Data preparation

Challenges

Typical challenges include:

Finding all relevant data sources
Connecting to the data sources once they are found
Reshaping the data once the sources are connected
Automating the reshaping process, because reshaping data is typically not a one-off task
Large data sets whose processing time delays quality testing

One of the major challenges that data preparation addresses is the heterogeneity of data in all but very small organizations: data is stored in various formats in different data stores (databases, Excel, SAS, etc.) and needs to be merged into a format that is useful for further processing.
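As a minimal sketch of such a merge, the following Python snippet reads customer records from a CSV source and a JSON source and maps both onto one common record format. The sample data, the field names and the helper functions load_csv and load_json are assumptions made for illustration only.

```python
import csv
import io
import json

# Hypothetical sample data: the same kind of entity in two different formats.
CSV_TEXT = "customer_id,name\n1,Alice\n2,Bob\n"
JSON_TEXT = '[{"id": 3, "full_name": "Carol"}]'


def load_csv(text):
    """Read CSV rows and map them onto the common format {id, name}."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(row["customer_id"]), "name": row["name"]} for row in reader]


def load_json(text):
    """Read JSON objects and map them onto the common format {id, name}."""
    return [{"id": int(item["id"]), "name": item["full_name"]} for item in json.loads(text)]


# Both sources now share one structure and can be processed together.
customers = load_csv(CSV_TEXT) + load_json(JSON_TEXT)
for customer in customers:
    print(customer)
```

Keeping one small loader per source makes it easy to add another data store later without touching the downstream processing.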

In order to merge the data, it must be possible to recognize the same entity (for example a customer) when it is stored in different databases. In an ideal world, there would be a shared primary key. However, because data is stored in different formats for different purposes, there is usually no common primary key. Instead, the same entities are identified differently in different data sources, especially when surrogate keys are used.
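One common workaround is to match records on a normalized natural attribute instead of the surrogate keys. The sketch below assumes that both sources store an email address and that a trimmed, lower-cased email identifies a customer. The source names (crm, billing), the field names and that matching rule are hypothetical, and real data often needs fuzzier matching.

```python
def normalize_email(email):
    """Trim whitespace and lower-case so trivially different spellings compare equal."""
    return email.strip().lower()


# Hypothetical records: each source uses its own surrogate key.
crm = [
    {"crm_id": 17, "email": "Alice@Example.com "},
    {"crm_id": 18, "email": "bob@example.com"},
]
billing = [
    {"account_no": "A-900", "email": "alice@example.com"},
    {"account_no": "A-901", "email": "carol@example.com"},
]

# Index one source by the normalized natural key.
billing_by_email = {normalize_email(r["email"]): r for r in billing}

matches = []    # pairs of (crm_id, account_no) that refer to the same customer
unmatched = []  # crm_ids without a counterpart in billing
for record in crm:
    other = billing_by_email.get(normalize_email(record["email"]))
    if other is None:
        unmatched.append(record["crm_id"])
    else:
        matches.append((record["crm_id"], other["account_no"]))

print(matches)
print(unmatched)
```

The list of unmatched records is worth keeping: it shows how well the chosen matching rule works on the actual data.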

Another challenge I often observe is outliers, that is, data with unusual or (seemingly) unrealistic values. It is often not evident whether a value is in fact an outlier or whether an unrealistic-looking value could be explained with more domain knowledge.
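A simple way to surface candidate outliers for review is the interquartile range (IQR) rule, sketched below. The function name flag_outliers, the sample values and the conventional factor of 1.5 are assumptions. The rule only flags values for a closer look and cannot decide whether a value is actually wrong.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values that lie more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 240 is probably a typo, but only domain knowledge can confirm that.
ages = [23, 25, 27, 29, 31, 33, 35, 240]
print(flag_outliers(ages))
```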

Yet another challenge is null values, because they can be interpreted in at least two ways: unknown (a value may exist but was not recorded) and none (it is known that no value exists).
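One way to keep the two meanings apart during preparation is to replace a bare null with an explicit marker. The following sketch uses a hypothetical Missing enumeration and a hypothetical fax field. How the original meaning of each null is determined depends on the data source and is not shown here.

```python
from enum import Enum


class Missing(Enum):
    UNKNOWN = "unknown"  # a value may exist but was not recorded
    NONE = "none"        # it is known that no value exists


# Hypothetical customer records with an explicit marker instead of a bare null.
customers = [
    {"name": "Alice", "fax": "555-0100"},
    {"name": "Bob", "fax": Missing.NONE},
    {"name": "Carol", "fax": Missing.UNKNOWN},
]

# Only unknown values are worth a follow-up; a known absence is a valid answer.
needs_follow_up = [c["name"] for c in customers if c["fax"] is Missing.UNKNOWN]
has_no_fax = [c["name"] for c in customers if c["fax"] is Missing.NONE]

print(needs_follow_up)
print(has_no_fax)
```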

Privacy violation

Data preparation might uncover confidential or private information, or expose data that identifies a specific person.

Therefore, data should be anonymized (aka de-identified) before or during data preparation.
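As a rough illustration, a direct identifier can be replaced with a keyed hash before the data is passed on. Strictly speaking this is pseudonymization, which is weaker than full anonymization because whoever holds the key can recompute the mapping. The key value, the field names and the function name pseudonymize are assumptions for this sketch.

```python
import hashlib
import hmac

# Hypothetical key: in practice it must be kept secret and out of source control.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


record = {"email": "alice@example.com", "purchase_total": 42.5}

# The same input always yields the same digest, so records can still be joined.
safe_record = {**record, "email": pseudonymize(record["email"])}
print(safe_record)
```

Because the digest is deterministic, the pseudonymized column can still serve as a join key across data sources that were prepared with the same key.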
