Often, data preparation is more time-consuming than working with the prepared data.
Challenges
Typical challanges include:
- Finding all relevant data sources
- When data sources are found: connecting to them
- Data needs to be reshaped when data sources are finally connected to
- Automate data reshaping process because typically the reshaping of data is not a one-off.
- Large time sets whose processing time delays quality testing
One of the major challenges that data preparation addresses is the
heterogeneity of
data in most but very small organizations: Data is stored in various
data formats in different data stores (
databases,
Excel,
SAS, etc.) and needs to be merged into a format that is useful for further processing.
In order to merge the data, the same entities (such as for example a customer) that are stored across different databases must be able to be identified as such. In an ideal world, there would be a
primary key. However, because data is stored in different formats for different purposes, it turns out that usually there is not
one primary key, rather, the same entities are identitifed differently in different data sources, especially when
surrogate keys are used.
Antother challenge I often observe are outliers, that is: data with unusual or (seemingly) unrealistic values. It is often not evident when a value is in fact an outlier or if unrealist values can be explained with more domain knowledge.
Privacy violation
data preparation might uncover confidential or private information or to expose data that identifies a specific person.
Therefore, data should be anonymized (aka de-identified) before or during data preparation.