How to make a data quality dashboard?

The development and implementation of programs or queries to check the data quality are not the sole aspects of the data quality analysis process.

To begin, we need to design, build, and integrate data from a variety of sources, frequently in incompatible environments

Then we’ll be able to use tools to perform data quality checks and save the results. The next stage is to create and administer dashboards that display data quality findings so that we can study them.

Assume you're gathering information from many sources.

Each of those sources is derived from data quality checks carried out on a large number of tables. During an hour, the database may be updated numerous times with new records. This amount of data is comparable to gigabytes or terabytes. Because of the magnitude of the data, it is absurd to display it in the form of tables.

To adequately analyze the results, we can use quality checks and commonly available visualization tools. Dashboards are extremely useful for managing the data quality analysis process. We can use them to examine the results and ensure that the entire process is running successfully (including errors in the application and so on). In this article, we’ll look at how to visualize data quality results.

Table of Contents

To adequately analyze the results, we can use quality checks and commonly available visualization tools.

Dashboards are extremely useful for managing the data quality analysis process. We can use them to examine the results and ensure that the entire process is running successfully (including errors in the application and so on).

In this article, we’ll look at how to visualize data quality results.

There are several steps to the data quality analysis process. Assuming that the data collection process has already been completed, the processes are as follows:

  • design, build and integrate data from various sources,
  • create and implement programs or queries to check data quality,
  • store the results of data quality checks,
  • prepare and manage dashboards showing data quality results,
  • analysis of the results.

In the previous article we described why and how to store the data quality results. Now we go to the next step and we will use stored data to make a data quality dashboard.

Technologies we use​

The results of the analysis are stored in Google Cloud BigQuery. The natural choice of visualization tool is Data Studio.

Data Studio is a widely used, free visualisation tool. It enables user to create dashboards. The data can be obtained from many sources, thanks to supporting many services such as BigQuery, Redshift, Google Analytics, PostreSQL, Google Sheets, and many more.

The dataset

In the previous article, we performed a simple quality check – null count. Let’s continue this example by visualizing some results.

The dq_check_reults table contains the following columns:
  • uuid – unique id for each result,
  • tested_resource – the name of the analyzed table,
  • test_name – the name of the data quality check,
  • value – result, in this case, a number of null values per day,
  • created_time – time the result was saved,
  • day_of_tested_data – the day when the data was tested, in this case it is a date for which NULL values were counted
Currently, in this table, there are results for three data quality checks:
  • simple_row_count,
  • max_delay_in_minutes,
  • null_count.
So we need to filter results in the first place. A simple query is used to do this. We will focus on describing the results of the NULL count for the online_retail_II table. This can be done in a few ways, each has its own advantages and disadvantages.

Ways to connect data to Data Studio

In order to prepare data for a certain dashboard, we can first create views in BigQuery. This allows us to visualize selected data quickly and easily; however, depending on the size of the table and the complexity of the query, dashboard performance may suffer.

Another option is to use Data Studio to extract the information and create additional data sources. It works by connecting Data Studio to the table and then extracting only the data we want to utilize in the dashboard in Data Studio.

There is also an option to connect Data Studio with an original table from BigQuery and create dashboards directly from it. Creating dashboards in this way is more complicated – we should filter the data for every table or plot in Data Studio, so for a data engineer who is fluent in SQL it may take more time to be done.

Such approach runs the risk of omitting the required filter, resulting in incorrect or misleading data being displayed. The main disadvantage is that the Data Studio has to process all the data from the table. This has a negative impact on the time needed to load dashboards. We do not want to distribute this solution, so it will not be covered in this article.

Preparing the data

As mentioned before, data may be prepared in two ways: creating views in BigQuery or extracting data in Data Studio. 

It’s crucial to figure out what data we will need before moving on to the next step. In most cases, it is determined by the recipient. Dashboards are typically built to assist business people in making decisions. Technical teams, on the other hand, use dashboards from time to time for retrieving data or debugging. 

Data in dashboards should be tailored to the needs of both parties, allowing them to improve their work.

For the online_retail_II table, we absolutely require information on the number of null values per day in this situation. “Value” and “day_of_tested_data” are what we need to display in our dashboard. We should utilize “tested_resource” and “test_name” to filter data rather than display them in the table because they will be the same for every row. If engineers will be using the dashboard, it is vital to include “uuid”.

Let’s assume that we want to create a dashboard for both a business client and programmers. It will present results of the null count check applied on the online_retail_II table. Summing up, we need to see following data:

  • uuid,
  • day_of_tested_data,
  • value,
  • created_time.

How to create view in Big Query

We will start in the BigQuery SQL workspace. We know that null_count check was applied only at the online_retail_II table, but let’s make sure with the query below:
				
					SELECT DISTINCT tested_resource
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
				
			

The query returned only one record, so everything is correct.

Now we know how to filter the data. The query that creates the view is shown below.

				
					SELECT uuid AS id,
    day_of_tested_data,
    value AS number_of_null_values,
    created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
				
			

In order to create a view, we need to expand SAVE and select a Save View option and complete the information necessary to create the view: project name is set automatically, you have to specify the dataset and name the view.

How to extract required data from the table in Data Studio

First, we need to include the original dq_check_results table as a data source in Data Studio. To do this, click the Create button in the upper left corner of the Data Studio and select Data source.

In the next step, select the technology we will connect to. In this example we decided to use BigQuery.

Now, we need to specify the data which we want to add to the report.

In our recent data sources we can see the dq_check_results table.
Of course, following the same steps, we can connect Data Studio with the null_count_results view created before in BigQuery.

We will create a data source that, as our view in BigQuery, contains only required data. We need to create a new data source again, but this time in Google Connectors we select “Extract Data”.

Now we should complete the necessary fields.

Dimensions and date range are simple to fill.

We would like to see the sum of null values per day, so we have to create a metric. Add metric, then select the column on which the metric will be applied, in our case – value.

The sum is selected as default, when clicking on the square with the inscription sum on the left side, we can change the aggregation function by clicking on the square:

The last part of this step is applying filters. Find the section “Filter” and click ADD A FILTER, then complete the formula and save it. We want to see results only “null_count” check, so our filter contains the following information:
Finally, the page for extracting data source looks like this.

As you can see there is also an option to set auto-update. There are now two tables in recent data sources.

How to create a dashboard

Data studio offers a wide variety of plots and charts we can put on our dashboard. Depending on the type of data we want to present, there are many options starting with tables, column charts, and ending with geo charts. As an example we will create a table, and a column chart. 

Now we can start creating our first report. Click the Create button in the upper left corner of the Data Studio main website and this time select Report. In the Add data to report page select ‘My data sources’, then choose a data source. In this case, we choose null_count_results and we approve it using Add bottom on the right down corner.

In the case of a NULL check, we want to know which of the tested data sources had the most NULL values and when it happened. The easiest way to find detailed information is to get a data quality result id. We will create a pivot table with heatmap – it shows outlier values (number of null values) using color gradient.

The table is automatically completed, but it is not exactly what we expected.

On the right side, there are two panels: data and style. Let’s focus on the first one:

We should make changes here to create a table that meets our assumptions. What we need to do is to remove “day_of_tested_data” from the column dimension and add it in the row dimension. The result is successful – we have a table that shows us days with the highest number of null values and id for programmers to deal with the issue.

However, we want to make this table more friendly for business users – we need to rename columns. We can do this using the square with the inscription ‘ABC’ on the left side of the column or metric name.

The table is almost ready. Since NULL values are associated with unwanted or incorrect data, we can change the color to red for example.

In some cases, table is not the most readable way to present the data. Let’s go ahead and create a column chart that shows the daily count of NULLS during a week (2 December 2011 and 9 December 2011). Click Add a chart and select column chart. Dimension is “day_of_tested_data” and metric is “value”.

Add a filter to select the proper date. Click save.

The chart is sorted by dimension “value” in descending order. We would like to sort the data by date in ascending order. On the right-hand side corner of the chart click three dots, find Sort by, and sort by date. We will change the color to match the table

The whole dashboards looks like this

Conclusion

We’ve demonstrated optimal ways to present data from BigQuery in Data Studio, including how to use metrics dimensions and filters, as well as how to create charts.

Of course, a table and column chart are insufficient to extract the most value from data observability.

Finally, dashboards are the most commonly utilized tools for evaluating data analysis. They are simple to grasp and enable both engineers and business personnel to keep track of data quality if they are appropriately constructed. Furthermore, when an error arises, they allow IT staff to quickly resolve the issue.

For more tutorials check out our blog.

Do you want to detect data integrity issues?

Subscribe to our newsletter and learn the best data quality practices.

Share this post on your social media

Related Articles

Why is tracking Data Quality KPIs Important to Your Company?

Development and ultimate success in monitoring data quality for your business are not possible without tracking the progression of results In the business world, it …

Read More →

Data timeliness: column datetime difference percent

One of the fundamental dimensions of data quality is timeliness This metric can be measured in several ways, depending on the metric to check. One …

Read More →

Data timeliness: average delay check with dashboard

In this article, we check the data timeliness of the public BigQuery dataset using an average delay check. After checking configuration and running check, we would …

Read More →

No one can understand your data like we do!