# Soda - get ahead of silent data issues

Posted by Cor on September 27, 2021

It is a topic you cannot avoid when working with data: data quality. The law of bad data holds: there is always more of it. Sooner or later, you have to deal with bad data.

Take Andrew Ng (co-founder of Coursera, Google Brain, deeplearning.ai, and landing.ai), who recently started a data-centric AI competition. The motivation for this competition: if you want to improve the performance of your machine learning model, focus on engineering your data set, because “high-performance model architectures are widely available”.

At GoDataDriven we partnered with Soda to add a data quality tool to our data platform proposition.

Full disclosure: I develop the soda-spark package. See my bio to read more about what I do.

# Data quality 101

The essence of data quality is simple: you have assumptions about the data which you want to validate. If you do not validate your assumptions, your data product is likely to show unexpected behavior.

A common starting point is to add data tests to your data processing pipeline:

```python
# Validate an assumption: every value lies between 0 and 100.
# (A chained comparison like 0 <= series <= 100 raises a ValueError
# on a pandas Series, so we use .between instead.)
assert df["percentage_column"].between(0, 100).all()
```

Adding such checks is a good start. You will find that, early on, even the most basic checks fail.

# Data observability

In a professional setting, you work in a team that operates within an organization. It is not just you and your pipelines. In such a setting, we need a tool that makes the data tests visible. This makes it easier to:

1. update the data tests as your data models change;
2. document and share knowledge about the data models;
3. inform downstream products that rely on your data models when tests do not pass.

There are various data quality tools that help you with this, but I want to introduce you to one in particular: soda-sql.
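With soda-sql, such assumptions move out of pipeline code and into declarative scan files. As a rough sketch (the table and column names are made up here, and the exact keys may differ between soda-sql versions):

```yaml
# scan.yml - a sketch of a soda-sql scan file
table_name: orders
metrics:
  - row_count
tests:
  - row_count > 0
columns:
  percentage_column:
    valid_min: 0
    valid_max: 100
    tests:
      - invalid_count == 0
```

Because the tests live in a file rather than inside a pipeline, they are easy to review, document, and share with the rest of the team.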

# Get ahead of silent data issues

The problem with the example above is that we cannot formulate all data tests in such a manner. For example, consider a data test for the number of rows that are processed:

1. a negative row count is impossible by definition (something would be very off!);
2. most likely you want it to be non-zero;
3. and maybe it should not be too large.
```python
# Row count must be positive and within a plausible upper bound.
assert 0 < df.count() <= 1_000_000_000
```


Ok, this is a rudimentary test, but… what if the row count of your daily processing batch suddenly drops by 90%, while it was mostly stable before?
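A fixed bound will never catch that. To get ahead of such a silent drop, you can compare each run's row count against a recent baseline. A minimal sketch in plain Python (the function name, the history bookkeeping, and the threshold are hypothetical; a tool like soda-sql can record these measurements over time for you):

```python
def row_count_alert(current_count, history, drop_threshold=0.5):
    """Flag a suspicious drop: the current row count is far below
    the average of recent runs.

    `history` is a list of row counts from previous runs.
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return current_count < (1 - drop_threshold) * baseline

# A 90% drop against a stable baseline of ~1,000,000 rows:
row_count_alert(100_000, [1_000_000, 990_000, 1_010_000])  # True
```

The fixed assertion above still passes in this scenario; only a check that looks at the history of the metric catches the anomaly.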