Soda - get ahead of silent data issues

Posted by Cor on September 27, 2021

Data quality is a topic you cannot avoid when working with data. Blame the law of bad data: there is always more of it. Sooner or later, you have to deal with bad data.

Take Andrew Ng (co-founder of Coursera, Google Brain, deeplearning.ai, landing.ai), who recently started a data-centric AI competition. The motivation for this competition: if you want to improve the performance of your machine learning model, then focus on engineering your data set, because “high-performance model architectures are widely available”.

At GoDataDriven we partnered with Soda to add a data quality tool to our data platform proposition.

Full disclosure: I develop the soda-spark package. See my bio to read more about what I do.

Data quality 101

The essence of data quality is simple: you have assumptions about the data which you want to validate. If you do not validate your assumptions, your data product is likely to show unexpected behavior.

A common starting point is to add data tests to your data processing pipeline:

# assumes df is a pandas DataFrame; a chained 0 <= column <= 100 raises a ValueError there
assert df["percentage_column"].between(0, 100).all()

This is a good start! You will see that even the most basic checks fail, at least in the beginning.
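When such a check fails, it helps to see which rows broke the assumption instead of only getting an AssertionError. Below is a minimal sketch of that idea, assuming pandas; the helper name check_percentage and the example data are made up for illustration.

```python
import pandas as pd


def check_percentage(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows whose value in `column` is missing or outside [0, 100]."""
    # between() is inclusive on both ends and evaluates to False for missing values
    valid = df[column].between(0, 100)
    return df[~valid]


# Hypothetical example data: one value above 100 and one missing value
df = pd.DataFrame({"percentage_column": [0.0, 42.5, 100.0, 120.0, None]})
bad_rows = check_percentage(df, "percentage_column")
print(bad_rows)
```

An empty result means the assumption holds; anything else tells you exactly where to start looking.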

Data observability

In a professional setting you work within a team that operates within an organization. It is not just you and your pipelines. In such a setting you need a tool that makes the data tests visible. This makes it easier to:

  1. update the data tests as your data models change;
  2. document and share knowledge about the data models;
  3. inform downstream products that rely on your data models when tests do not pass.

There are various data quality tools that help you with this, but I want to introduce you to one in particular: soda-sql.

Get ahead of silent data issues

The problem with the example above is that we cannot formulate all data tests in such a manner. For example, let’s think of a data test for the number of rows that are processed:

  1. a negative row count is impossible by definition (something would be very off!);
  2. most likely you want it to be non-zero;
  3. and maybe it should not be too large.

# assumes a Spark DataFrame, where count() returns the row count; with pandas, use len(df)
assert 0 < df.count() <= 1_000_000_000

Ok, this is a rudimentary test, but what if the row count of your daily processing batch suddenly drops by 90% while it was mostly stable before?

I would like to receive a warning about this!

These kinds of issues are silent data issues. You do not see them by looking at one batch of data at a specific point in time. They only show up when you look at multiple data batches over time.
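To make the idea concrete, here is a minimal sketch that compares the row count of the current batch with the row counts of earlier batches. It is a simple z-score check written for illustration, not how Soda detects anomalies; the function name, the threshold, and the example counts are all assumptions.

```python
from statistics import mean, stdev


def row_count_is_anomalous(history, current, threshold=3.0):
    """Flag `current` when it lies more than `threshold` standard
    deviations away from the mean row count of earlier batches."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # perfectly stable history: any deviation is suspicious
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical daily row counts of a mostly stable pipeline
history = [10_120, 9_980, 10_050, 10_210, 9_940]

normal_batch = row_count_is_anomalous(history, 10_030)
dropped_batch = row_count_is_anomalous(history, 1_000)  # roughly a 90% drop
print(normal_batch, dropped_batch)
```

Both batches pass the static assert on the row count, but only the history-aware check flags the sudden drop.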

Introducing Soda

This is where Soda comes into play. Their (paid) Soda Cloud allows you to detect these silent data issues by looking at the differences between data scans.

[Figure: anomaly detection]

Soda offers various features that support your data quality workflow. I will talk more about the data quality workflow in another blog post.