It is a topic you cannot avoid when working with data: data quality. Call it the law of bad data - there is always more of it, and you have to deal with it.
Even Andrew Ng (co-founder of Coursera, Google Brain, deeplearning.ai, and landing.ai) recently started a data-centric AI competition. His motivation: if you want to improve the performance of your machine learning model, focus on engineering your data set, because “high-performance model architectures are widely available”.
Data quality 101
The essence of data quality is simple: you have assumptions about your data that you want to validate. If you do not validate those assumptions, your data product is likely to show unexpected behavior.
A common starting point is to add data tests to your data processing pipeline:
assert df["percentage_column"].between(0, 100).all()
This is a good start! You will see that even the most basic checks fail - in the beginning.
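To make this concrete, here is a minimal, self-contained sketch of such a check on a hypothetical batch (the column name and values are illustrative). Note that `Series.between` is used instead of a chained comparison like `0 <= series <= 100`, which raises an error on pandas Series:

```python
import pandas as pd

# Hypothetical batch with one bad value (101.0) that slipped in.
df = pd.DataFrame({"percentage_column": [12.5, 99.0, 101.0]})

# Validate the assumption: every percentage lies in [0, 100].
valid = df["percentage_column"].between(0, 100)

# Collect the offending rows instead of failing blindly.
bad_rows = df[~valid]
assert len(bad_rows) == 1
```

Inspecting the failing rows, rather than asserting on the whole frame at once, makes it much easier to debug why a check fails.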
In a professional setting you work within a team that operates within an organization. It is not just you and your pipelines. In such a setting we need a tool that makes the data tests visible. This makes it easier to:
- update the data tests as your data models change;
- document and share knowledge about the data models;
- inform downstream products that rely on your data models when tests do not pass.
Get ahead of silent data issues
The problem with the example above is that we cannot formulate all data tests this way. For example, consider a data test for the number of rows that are processed:
- a negative row count is impossible by definition (something would be very off!);
- most likely you want it to be non-zero;
- and maybe it should not be too large.
assert 0 < len(df) <= 1_000_000_000
Ok, this is a rudimentary test, but what if the row count of your daily processing batch suddenly drops by 90% after being mostly stable before? I would like to receive a warning about that!
These kinds of issues are silent data issues. They cannot be seen by looking at one batch of data at a specific point in time; they only become visible when you look at multiple data batches over time.
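A minimal sketch of such a cross-batch check: compare the current batch's row count against the trailing mean of previous batches and warn on a large drop. The function name and the 50% threshold are illustrative assumptions, not from any specific tool:

```python
from statistics import mean

def row_count_warning(history: list, current: int, max_drop: float = 0.5) -> bool:
    """Warn when the current batch is far below the recent average.

    `max_drop` is an illustrative threshold: 0.5 means "warn if the
    batch shrank by more than 50% versus the trailing mean".
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    return current < (1 - max_drop) * baseline

# A stable daily pipeline whose row count suddenly drops by ~90%:
history = [1_000_000, 990_000, 1_010_000, 1_005_000]
assert not row_count_warning(history, 995_000)   # normal batch, no warning
assert row_count_warning(history, 100_000)       # silent data issue caught
```

In practice you would persist the history of batch metrics (row counts, null rates, distributions) and evaluate checks like this after every run.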
Data quality tools offer various features that support this workflow. I will talk more about the data quality workflow in another blog post.