Data Observability for Data Lake

Bring Data Governance to the Data Lake

How many tables in your Data Lake have inconsistent data formats?

Define Data Quality rules for the data in the Data Lake. Continuously monitor Data Quality metrics to detect discrepancies.

Data Quality Monitoring

Define Data Quality rules for external tables based on flat files (CSV, Parquet) to detect when files that violate those rules are loaded.


Define an external table partitioned on a date column. Ingest files into a folder named for the current date. DQO.ai Data Quality rules can then be executed for each date partition to detect days with invalid source files.

  • Detect days with invalid files
  • Detect days with files that fail data format, uniqueness, nullability, or range checks
  • Detect missing days with completeness tests
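The per-day checks above can be illustrated with a minimal Python sketch. It is not DQO.ai code: the function names, the in-memory `rows_by_day` dictionary, and the `amount` rule in the usage note are assumptions made for illustration, standing in for queries against a date-partitioned external table.

```python
from datetime import date, timedelta


def missing_days(partition_dates, start, end):
    """Completeness test: return every date in [start, end] that has no partition."""
    present = set(partition_dates)
    day_count = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(day_count))
    return [day for day in all_days if day not in present]


def invalid_days(rows_by_day, is_valid_row):
    """Return the days whose partition holds at least one row that fails the rule."""
    return sorted(
        day
        for day, rows in rows_by_day.items()
        if not all(is_valid_row(row) for row in rows)
    )
```

For example, given partitions for 1 and 3 January only, `missing_days` reports 2 January, and a rule such as `lambda row: row["amount"] is not None and row["amount"] >= 0` flags any day that contains a null or negative amount.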

Unhealthy partitions

Detect partitions in a Data Lake that are corrupted by invalid files or unavailable HDFS nodes.

DQO.ai runs full table scan queries on all partitions to detect unreadable files. Availability Data Quality checks executed for each partition detect unavailable partitions that must be repaired.

  • Detect partitions that are unavailable due to corrupted Parquet files
  • Detect tables and partitions whose files are stored on offline or corrupted HDFS nodes
  • Make sure that key tables are always usable in the Data Lake
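A per-partition availability check can be sketched in Python as follows. This is only an illustration of the idea, not how DQO.ai implements it: `read_partition` is a hypothetical callable supplied by the caller (for example, a function that runs a full scan query on one partition) and is expected to raise an exception when the partition cannot be read.

```python
def find_unavailable_partitions(partitions, read_partition):
    """Try to read every partition; return {partition: error} for those that fail."""
    unavailable = {}
    for partition in partitions:
        try:
            # Hypothetical reader: must scan the whole partition so that
            # corrupted files or offline storage nodes surface as errors.
            read_partition(partition)
        except Exception as exc:
            unavailable[partition] = f"{type(exc).__name__}: {exc}"
    return unavailable
```

The function keeps scanning after a failure so that one corrupted partition does not hide problems in the others; the returned dictionary lists each partition that needs repair together with the error that was raised.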

Trusted Data Lake Tables

Identify tables in the Data Lake that are trusted and usable for analytics and data science by defining and checking Data Quality rules for those tables.

DQO.ai Data Quality rules are defined only for those external tables that are considered a source of truth. The rules document the quality requirements, and running them every day verifies that the tables continue to meet those requirements.

  • Document the Data Quality checks that trusted tables must pass
  • Verify the Data Quality rules for important tables
  • Let the data scientists and analysts use only verified tables

File format checks

Detect when files with the wrong format or missing columns are loaded into the Data Lake.


Define consistency checks that analyze the behavior and averages of key columns. An unexpected distinct count in a column or an increase in the percentage of null values indicates that columns were reordered or that a new file is missing some columns.

  • Detect missing columns in new files
  • Detect when columns are reordered or missing in CSV files, causing new data to load into the wrong columns
  • Ensure that the external table always passes the data format and data range checks
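The consistency checks described above can be sketched in Python by comparing a column profile from newly loaded data against a baseline. The function names and the default thresholds (`max_null_increase`, `max_distinct_ratio`) are assumptions chosen for illustration, not DQO.ai settings.

```python
def column_profile(values):
    """Compute the null percentage and distinct count of one column."""
    total = len(values)
    nulls = sum(1 for value in values if value is None)
    return {
        "null_percent": 100.0 * nulls / total if total else 0.0,
        "distinct_count": len({value for value in values if value is not None}),
    }


def detect_anomalies(baseline, current, max_null_increase=5.0, max_distinct_ratio=2.0):
    """Compare two profiles and name the metrics that moved past the thresholds."""
    issues = []
    # A jump in nulls suggests that a column is missing from the new file.
    if current["null_percent"] - baseline["null_percent"] > max_null_increase:
        issues.append("null_percent_increase")
    # A large shift in distinct values suggests data landed in the wrong column.
    low, high = sorted((baseline["distinct_count"], current["distinct_count"]))
    if (low == 0 and high > 0) or (low > 0 and high / low > max_distinct_ratio):
        issues.append("distinct_count_shift")
    return issues
```

A country column that normally holds a handful of distinct codes and no nulls would trigger both issues if a new CSV file placed customer identifiers and empty fields in that position.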

Data Observability at Petabyte Scale

Analyze Petabyte Scale tables by checking only new or modified data.

DQO.ai was built with partitioning in mind. Analyze data for time partitions or build custom Data Quality checks that will analyze only partitions with new data. Identify new data by reading data processing logs.

  • Observe Data Quality at Petabyte Scale
  • Analyze only new or modified data to avoid putting load on the Data Lake and incurring high query processing costs
  • Use your custom logs as a source to identify modified partitions that should be analyzed
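Selecting partitions from processing logs can be sketched in Python as below. The log entry layout (a dictionary with `partition` and `loaded_at` keys) is a hypothetical format assumed for this example; real data processing logs will need their own parsing step.

```python
from datetime import datetime


def partitions_to_analyze(log_entries, last_analyzed_at):
    """Return the partitions that received data after the last quality run."""
    return sorted(
        {
            entry["partition"]
            for entry in log_entries
            if entry["loaded_at"] > last_analyzed_at
        }
    )
```

Only the returned partitions need to be passed to the Data Quality checks, so the cost of each run depends on the volume of new data rather than on the total size of the table.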

No one can understand your data like we do!