Types of data timeliness checks

One of the most important aspects of data quality is data timeliness.

It refers to the expected time of availability and accessibility of the data. Timeliness is a metric of major importance since out-of-date information can lead individuals to make poor decisions. As a result, companies lose time, money, and their reputation.

Of course, data timeliness has many interpretations, so it can answer many questions, e.g.

  • Is the data up-to-date and is it available?
  • How fresh is the data?
  • What is the lag between adjacent records?
  • What is an average lag between records?
  • What is the delay in data upload?

The problem source

To fully understand timeliness, we have to think about why data might not be timely in the first place.

Well, it depends on the architecture of your system.

Assume you have an ETL pipeline that loads data into your database. The two most common types of errors are underestimating the volume of data and using inappropriate hardware. When the pipeline fails, it stops loading new data incrementally, which causes trouble if we need the latest records.

For another example, suppose you are using a scheduler like Airflow to upload data to your database at the end of each day. If we do not define a timeout for a specific task in the DAG, or if we choose a timeout that is too short, the process may fail. It’s possible that one task will crash and the next won’t start. As a result, we don’t have data for today, which puts us in a bind.
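
As a minimal sketch, assuming an Airflow 2.x deployment and a hypothetical upload_daily_data callable, a per-task timeout can be set with execution_timeout so a hung upload fails fast and can be retried instead of silently blocking the rest of the DAG:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def upload_daily_data():
        # Hypothetical callable that loads the day's records into the database.
        ...


    with DAG(
        dag_id="daily_upload",              # illustrative DAG name
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        upload = PythonOperator(
            task_id="upload_daily_data",
            python_callable=upload_daily_data,
            # A realistic time budget: if the upload hangs, the task fails
            # and can be retried instead of blocking downstream tasks.
            execution_timeout=timedelta(hours=2),
            retries=2,
        )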

There are many more examples of what could go wrong, but usually, the solution is the same in each situation: perform regular timeliness checks.

Data timeliness categories

A data delay is the latency between certain events in a data pipeline. We check the time gap between such events in a variety of situations. In the following paragraphs, we’ll go through a few of them and explain why they should be tested.

Data delays can be predictable, e.g. when a source preprocesses or batches the data before uploading it, or unpredictable, e.g. when data flow is disrupted by sudden network slowness or a shutdown.

Timeliness checks aim at detecting anomalies in data transport. A certain threshold value is set, depending on the situation (e.g. 1 hour or 1 day of delay is expected and acceptable). Any delay above this threshold is considered an anomaly and is reported by a timeliness check. 

In other words, timeliness checks aim at finding unpredictable delays and anomalies in predictable ones.
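
As a minimal sketch of that idea (the threshold and delay values below are purely illustrative):

    from datetime import timedelta


    def is_on_time(observed_delay: timedelta, threshold: timedelta) -> bool:
        """Return True when the observed delay stays within the accepted threshold."""
        return observed_delay <= threshold


    threshold = timedelta(hours=1)                       # 1 hour of delay is acceptable
    print(is_on_time(timedelta(minutes=40), threshold))  # True  -> on time
    print(is_on_time(timedelta(hours=3), threshold))     # False -> anomaly to report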

Data delays can occur anywhere in the data pipeline, so we can distinguish a couple of basic types of delays. 

Upload delay is the time difference between events (start or end of processing) at the ingestion stage and the moment the files corresponding to those events land in the data warehouse. In this case, we might be interested in the maximum, minimum, or average delay.

Ingestion time is the time difference between the start of the ingestion stage and the moment when all the files are present in the data warehouse.

Upload delay and ingestion time checks are irreplaceable when some data is missing or dashboard metrics begin to drop.
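
As an illustration of how these two metrics can be computed, assuming we already have per-file event timestamps from the ingestion stage and the corresponding landing timestamps from the warehouse (the field names and values below are made up):

    from datetime import datetime, timedelta

    # Hypothetical per-file records: when the event happened at the ingestion
    # stage and when the corresponding file landed in the data warehouse.
    files = [
        {"event_time": datetime(2022, 5, 1, 0, 5),  "landed_at": datetime(2022, 5, 1, 0, 20)},
        {"event_time": datetime(2022, 5, 1, 0, 10), "landed_at": datetime(2022, 5, 1, 1, 0)},
        {"event_time": datetime(2022, 5, 1, 0, 15), "landed_at": datetime(2022, 5, 1, 0, 45)},
    ]

    # Upload delay per file: landing time minus event time.
    upload_delays = [f["landed_at"] - f["event_time"] for f in files]
    print("max upload delay:", max(upload_delays))
    print("min upload delay:", min(upload_delays))
    print("avg upload delay:", sum(upload_delays, timedelta()) / len(upload_delays))

    # Ingestion time: from the start of the ingestion stage until the last file is present.
    ingestion_start = datetime(2022, 5, 1, 0, 0)
    print("ingestion time:", max(f["landed_at"] for f in files) - ingestion_start)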

The current delay is the time difference between the current timestamp and the timestamp when the last file appeared in the database.

Data freshness is the time difference between the moment the data we are interested in appears in the data warehouse and the moment it becomes accessible and usable.

Current delay and data freshness checks provide information on whether the data that went through the pipeline is accessible and can be used.
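
A sketch of a current delay check against a warehouse table, assuming a DB-API connection and a table named events with an ingested_at column stored as ISO-8601 text in UTC (all of these names are made up); a freshness check follows the same pattern, only with a threshold describing how old the data may be before it stops being usable:

    from datetime import datetime, timedelta
    import sqlite3  # stand-in for whatever DB-API driver your warehouse uses

    MAX_CURRENT_DELAY = timedelta(hours=6)  # illustrative threshold

    conn = sqlite3.connect("warehouse.db")  # hypothetical database
    (last_ingested,) = conn.execute("SELECT MAX(ingested_at) FROM events").fetchone()

    # Timestamps are assumed to be naive UTC strings, so they can be compared with utcnow().
    current_delay = datetime.utcnow() - datetime.fromisoformat(last_ingested)

    if current_delay > MAX_CURRENT_DELAY:
        # The latest record is older than we tolerate: report a timeliness anomaly.
        print(f"Timeliness alert: last record is {current_delay} old")
    else:
        print(f"Data is fresh enough (current delay: {current_delay})")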

Conclusion

Data timeliness is the cornerstone of data observability. It tells us when data is collected, when particular processes are performed, and when data is available to use.

When dealing with a whole operational data management system in the real world, timeliness is the starting point for troubleshooting when an error arises.

Check out our blog for more articles about data quality.
