Getting started with DQO.ai

In this article, we would like to show you how to start working with a DQO.ai on the BigQuery example.

After the initial setup, we will show how to run examples.

Then we will walk you through the process of adding connections, importing and editing tables, and defining and running checks.

Table of Contents

Examples prerequisites

The sample uses BigQuery public datasets hosted in bigquery-public-data public GCP project. You have to download and install Google Cloud CLI .
After installing Google Cloud CLI, log in to your GCP account (you can start one for free), by running:

				
					gcloud auth application-default login
				
			

After setting up your GCP account, create your GCP project. That will be your GCP billing project used to run SQL sensors on the public datasets provided by Google.

The examples are using the name of your GCP billing project, received as an environment variable GCP_PROJECT.
You have to set and export this variable before starting the DQO shell.

Running examples

We will show how to run a date_type_percent check from the examples (the rest of them is done analogously). Open up a terminal, and go to the directory where dqo.ai is installed.

Define an environmental variable that will correspond to your GCP project ID and billing project ID.

Switch the directory to the one with example and run the app. You can just copy and paste the following command to your terminal, and the application will start.

				
					cd examples/bigquery-column-date-type-percent
..\..\dqo.cmd
				
			

Now you can go ahead and run the check by executing

				
					check run
				
			

The result is

				
					dqo.ai> check run
Check evaluation summary per table:
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection     |Table                         |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|publicdata_bq_1|labeled_patents.extracted_data|1     |1             |0            |0           |0              |1            |
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
				
			

You can run this check in a debug mode:

				
					check run -m=debug
				
			

and the result will show step by step execution

				
					dqo.ai> check run -m=debug
**************************************************
Executing quality checks on table labeled_patents.extracted_data from connection publicdata_bq_1
**************************************************

**************************************************
Executing a sensor for a check date_type_percent on the table labeled_patents.extracted_data using a sensor definition column/validity/date_type_percent
**************************************************

**************************************************
Calling Jinja2 rendering template column/validity/date_type_percent/bigquery.sql.jinja2
**************************************************

**************************************************
Jinja2 engine has rendered the following template:
column/validity/date_type_percent/bigquery.sql.jinja2
**************************************************

**************************************************
Executing SQL on connection publicdata_bq_1 (bigquery)
SQL to be executed on the connection:
SELECT
    100.0 * SUM(
        CASE
            WHEN SAFE_CAST(analyzed_table.`publication_date` AS FLOAT64) IS NOT NULL
            OR SAFE_CAST(analyzed_table.`publication_date` AS DATE) IS NOT NULL
            OR SAFE.PARSE_DATE('%d.%m.%Y', analyzed_table.`publication_date`) IS NOT NULL THEN 1
            ELSE 0
        END
    ) / COUNT(*) AS actual_value, CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `bigquery-public-data`.`labeled_patents`.`extracted_data` AS analyzed_table
GROUP BY time_period
ORDER BY time_period
**************************************************

**************************************************
Finished executing a sensor for a check date_type_percent on the table labeled_patents.extracted_data using a sensor definition column/validity/date_type_percent, sensor result count: 1

Results returned by the sensor:
+----------------+-----------+
|actual_value    |time_period|
+----------------+-----------+
|71.7201166180758|2022-05-04 |
+----------------+-----------+
**************************************************

**************************************************
Finished executing rules (thresholds) for a check date_type_percent on the table labeled_patents.extracted_data, verified rules count: 1

Rule evaluation results:
+----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+---------------+--------------+-------------------+----------------+-------------------+-----------------+-----------------+---------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|actual_value    |expected_value|time_period     |time_gradient|dimension_id|connection_hash    |connection_name|provider|table_hash         |schema_name    |table_name    |column_hash        |column_name     |check_hash         |check_name       |quality_dimension|sensor_name                      |executed_at             |duration_ms|severity|rule_hash          |rule_name|high_lower_bound|medium_lower_bound|low_lower_bound|
+----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+---------------+--------------+-------------------+----------------+-------------------+-----------------+-----------------+---------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|71.7201166180758|90.0          |2022-05-04T00:00|day          |0           |3506421218842057901|publicdata_bq_1|bigquery|7226004687294253978|labeled_patents|extracted_data|3815616989618007012|publication_date|3360128111520269927|date_type_percent|validity         |column/validity/date_type_percent|2022-05-04T11:49:24.040Z|2908       |3       |8286102381750941021|min_count|90.0            |95.0              |98.0           |
+----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+---------------+--------------+-------------------+----------------+-------------------+-----------------+-----------------+---------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
**************************************************

Check evaluation summary per table:
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection     |Table                         |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|publicdata_bq_1|labeled_patents.extracted_data|1     |1             |0            |0           |0              |1            |
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
				
			

Adding connections manually

Let’s define the same check as in the example but manually.

Let’s begin with adding a connection, provide your information in the following command

				
					connection add -n={name} -t=bigquery -P=bigquery-source-project-id={project_ID} -P=bigquery-billing-project-id={billing_ID} -P=bigquery-quota-project-id={quota_ID} -P=bigquery-authentication-mode=google_application_credentials -hl
				
			

with all the necessary information, the connection will be added in a headless mode. Run the following command to import tables.

				
					table import -c={connection_name}
				
			

and choose number 112 from the list. Then open table configuration with 

				
					table edit -c={connection_name} -t=labeled_patents.extracted_data
				
			

A code editor should open, feel free to copy and paste (or add the highlighted “check” section in the right place) in your YAML file.

				
					# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: labeled_patents
    table_name: extracted_data
  time_series:
    mode: current_time
    time_gradient: day
  checks: {}
  columns:
    gcs_path:
      type_snapshot:
        column_type: STRING
        nullable: true
    issuer:
      type_snapshot:
        column_type: STRING
        nullable: true
    language:
      type_snapshot:
        column_type: STRING
        nullable: true
    publication_date:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:
        validity:
          date_type_percent:
            parameters:
              custom_date_format: "%d.%m.%Y"
            rules:
              min_count:
                low:
                  min_value: 98.0
                medium:
                  min_value: 95.0
                high:
                  min_value: 90.0
    class_international:
      type_snapshot:
        column_type: STRING
        nullable: true
    class_us:
      type_snapshot:
        column_type: STRING
        nullable: true
    application_number:
      type_snapshot:
        column_type: STRING
        nullable: true
    filing_date:
      type_snapshot:
        column_type: STRING
        nullable: true
    priority_date_eu:
      type_snapshot:
        column_type: STRING
        nullable: true
    representative_line_1_eu:
      type_snapshot:
        column_type: STRING
        nullable: true
    applicant_line_1:
      type_snapshot:
        column_type: STRING
        nullable: true
    inventor_line_1:
      type_snapshot:
        column_type: STRING
        nullable: true
    title_line_1:
      type_snapshot:
        column_type: STRING
        nullable: true
    number:
      type_snapshot:
        column_type: STRING
        nullable: true

				
			

Save the configuration and run the command to execute checks just as in the example.

Conclusion

There are many more checks and examples to explore. This article shows the most basic commands that enable you to successfully run the application.

Check out our blog where we describe how to use the application in details:

Do you want to detect data integrity issues?

Subscribe to our newsletter and learn the best data quality practices.

Share this post on your social media

No one can understand your data like we do!