What is a data validity check?

To make data quality analytics effective and profitable, it is critical to check whether the data meets the conditions required for further use.

If the data meets those conditions, it is considered valid, and the process of verifying this is called a data validity check.

It’s easy to make a mistake while entering data by hand: a wrong digit in the ID card number, an extra figure in the age field, or a number in a field meant for text. Data provided by offices or administrative units is frequently typed by hand into Excel-type applications, where this kind of issue is common.

As a result, data gets inserted improperly and is often illogical. Take a person’s age: values exceeding 110 are doubtful, and people under the age of 18 are subject to various legal constraints that must be considered. To uncover problems of this nature, tests aimed at determining the data’s validity must be performed, such as the range validity test in this example. The values in the age column are presumed valid if they fall in the range between 18 and 110.
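As an illustration (not part of the original article), the age-range rule above can be sketched as a small Python check; the bounds 18 and 110 come from the text, while the function name and the sample ages are made up:

```python
def is_valid_age(age, lower=18, upper=110):
    """Range validity check: an age is valid if it falls within [lower, upper]."""
    return lower <= age <= upper

# A few manually entered ages; two of them are clearly implausible.
ages = [25, 17, 111, 64, 110]
valid_count = sum(is_valid_age(a) for a in ages)
validity_pct = valid_count / len(ages) * 100  # share of valid records, in %
```

The same idea, expressed in SQL, is what the rest of the article builds up for the productivity dataset.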

In this article, we will describe what validity checks are, when they are used, and we will create a validity check on numerical data.

Types of validity checks

There are many types of validity checks, depending on the type of data we are dealing with. The most obvious validity check verifies the type of the data and its format.

Apart from that, when dealing with a specific type of data, e.g. numeric data, there is a wide range of possible validity tests: checking whether the data falls in a certain range, or whether its values are non-negative. Performing this type of check plays a key role in data quality management.

The validity dimension corresponds to the rules the data must conform to.

Of course, the configuration of a table in a database enforces a valid type in most cases. In other words, a column defined as an integer cannot store any other type of data.

The situation is different with VARCHAR-type columns, where letters, digits, and special characters can all occur, which opens the door to a variety of validity problems.
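For instance, a format validity check on such a text column can be sketched in Python with a regular expression; the nine-digit ID rule below is a hypothetical example chosen for illustration:

```python
import re

# Hypothetical rule: an ID stored in a VARCHAR column must be exactly 9 digits.
ID_PATTERN = re.compile(r"\d{9}")

def is_valid_id(value: str) -> bool:
    """Format validity check: True only if the whole value matches the pattern."""
    return ID_PATTERN.fullmatch(value) is not None
```

In a database, the same rule would typically be expressed with a regular-expression function of the SQL dialect in use.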

The dataset

In the example, we will use Productivity Prediction of Garment Employees Data Set, which can be found here.

Below are the top 10 rows from the table.

Example of a validity check - problem statement

In this example, we will perform a validity check that inspects the range of numeric values in the “actual_productivity” column. The data type of this column is FLOAT, and the values should be between 0 and 1. The values within this range are considered valid, and the rest invalid.

We will write a query to calculate the number of valid records per day and find out the percentage of valid rows.

Technologies we use

In this case study, we will use BigQuery, a serverless data warehouse hosted on the Google Cloud Platform and nowadays one of the most frequently used tools for working with big datasets in the cloud. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform. In BigQuery, a table reference uses the following syntax (keep this in mind):

`[project_name].[dataset_name].[table_name]`.

Example of a validity check - solution

First, create a query that checks the value of “actual_productivity”, setting the lower and upper bounds of the range to 0 and 1, respectively. Notice that we compare against 0.0 and 1.0 rather than 0 and 1, to match the column’s FLOAT data type.

				
SELECT
    CASE
        WHEN actual_productivity >= 0.0
            AND actual_productivity <= 1.0 THEN 1
        ELSE 0
    END AS value,
    date AS day_of_tested_data
FROM `dqo-learning.data_quality.garments_worker_productivity`
ORDER BY date

This result is not very readable: it returns a flag for every single row. It would be better to group the results by day, so let’s do that and sum all of the valid rows per day.

				
SELECT
    SUM(
        CASE
            WHEN actual_productivity >= 0.0
                AND actual_productivity <= 1.0 THEN 1
            ELSE 0
        END
    ) AS value,
    date AS day_of_tested_data
FROM `dqo-learning.data_quality.garments_worker_productivity`
GROUP BY day_of_tested_data
ORDER BY date

Now it is much clearer.

Let’s find the percentage of valid records. To do that, we divide the calculated sum by the total number of records and multiply by 100 to express the result as a percentage.

				
SELECT
    SUM(
        CASE
            WHEN actual_productivity >= 0.0
                AND actual_productivity <= 1.0 THEN 1
            ELSE 0
        END
    ) / COUNT(*) * 100.0 AS value,
    date AS day_of_tested_data
FROM `dqo-learning.data_quality.garments_worker_productivity`
GROUP BY day_of_tested_data

The first 19 dates contain only values within the expected range, so their validity is 100%.

In the subsequent rows, validity is 100% or lower.
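The same SUM(CASE …) / COUNT(*) pattern can be tried locally; below is a minimal sketch using Python’s built-in SQLite, where the table name and sample values are made up for illustration and only the validity logic mirrors the BigQuery query:

```python
import sqlite3

# In-memory SQLite table standing in for the BigQuery table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE productivity (date TEXT, actual_productivity REAL)")
conn.executemany(
    "INSERT INTO productivity VALUES (?, ?)",
    [
        ("2015-01-26", 0.80),
        ("2015-01-26", 1.20),  # invalid: above the upper bound of 1.0
        ("2015-01-26", 0.95),
        ("2015-01-26", 0.50),
    ],
)
valid_pct, day = conn.execute(
    """
    SELECT SUM(
               CASE
                   WHEN actual_productivity >= 0.0
                       AND actual_productivity <= 1.0 THEN 1
                   ELSE 0
               END
           ) * 100.0 / COUNT(*) AS value,
           date AS day_of_tested_data
    FROM productivity
    GROUP BY date
    """
).fetchone()
# 3 of the 4 rows fall inside the [0.0, 1.0] range, i.e. 75% validity.
```

Note that the multiplication by 100.0 is done before the division here, so SQLite does not fall into integer division on the SUM result.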

The data from 2015-01-26 has a validity level below 100%, so let’s take a closer look at it as an example.

To inspect that, just run a simple query that displays each value of “actual_productivity” on a given day.
				
SELECT
    actual_productivity,
    date
FROM `dqo-learning.data_quality.garments_worker_productivity`
WHERE date = '2015-01-26'

There are more than 10 rows, but let’s focus on the 9th record. As you can see, its value is greater than 1.0, so it is invalid.

Conclusion

Data validity is one of the fundamental aspects of data observability. You need to choose the validity rules for a given problem carefully, so that the checks return truly meaningful results.

The case covered here shows how to validate the range of numeric data. The final query returns the percentage of valid data per day.

As listed at the beginning of the article, there are many more data types upon which various tests can be performed.

For more tutorials, check out our blog.
