Data Integrity Testing: 7 Examples From Simple to Advanced

Lindsay MacDonald

Lindsay is a Content Marketing Manager at Monte Carlo.

You know those little disclaimers you always see about past performance not being indicative of future results? 

Well, the same goes for your data. Just because it looked good yesterday doesn’t mean it’ll hold up tomorrow – and that’s why we’re talking about data integrity testing today.

Data integrity testing is the process of ensuring data is fit for the task at hand and available to only those who should have access.

Some examples include validating that values are in the right format, checking for duplicate records, making sure relationships between tables still make sense, and comparing current data to source systems.

SQL checks can help, so we’re sharing 7 examples, from simple to advanced, to get you started with data integrity testing. Plus, we share next steps for how data observability tools can help your team maintain and automate data integrity in larger, more complex data stacks.

Simple data integrity testing examples

Let’s assume you have a table called [Orders] which stores order information:

CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    ProductID INT NOT NULL,
    Quantity INT NOT NULL,
    OrderDate DATE NOT NULL
);

Now, let’s look at a few simple data integrity tests you could do.

Check for missing values

You want to ensure that there are no missing values in the [ProductID], [Quantity], and [OrderDate] columns:

SELECT * FROM Orders
WHERE ProductID IS NULL OR Quantity IS NULL OR OrderDate IS NULL;

This SQL query will return any rows where any of those columns contain missing values. If the query returns no rows, the no-missing-values constraint on those columns is satisfied.

Check for duplicate orders

Each [OrderID] should be unique:

SELECT OrderID, COUNT(*) as DuplicateCount 
FROM Orders
GROUP BY OrderID
HAVING COUNT(*) > 1;

If this query returns any rows, it indicates there are duplicate [OrderID]s in the table.

Check for negative quantities

An order quantity should not be negative:

SELECT * FROM Orders
WHERE Quantity < 0;

If this query returns any rows, it indicates there are orders with negative quantities.

Check for future orders

Let’s say you don’t accept orders with future dates:

SELECT * FROM Orders
WHERE OrderDate > CURRENT_DATE;

If this query returns any rows, it indicates there are orders with dates in the future.

These are just simple examples. In real-world scenarios, data integrity tests can get quite complex depending on the business rules and data model.
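To make these checks concrete, here’s a minimal, runnable sketch that executes them against the [Orders] table in an in-memory SQLite database. SQLite, the Python wrapper, and the sample rows are our assumptions for illustration, not part of the examples above:

```python
import sqlite3

# In-memory database with the Orders table from the article.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        ProductID INTEGER NOT NULL,
        Quantity  INTEGER NOT NULL,
        OrderDate DATE NOT NULL
    )
""")

# Hypothetical sample data: one clean row, one negative quantity,
# and one order dated far in the future.
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, 2, "2024-01-15"),
        (2, 102, -1, "2024-01-16"),
        (3, 103, 5, "2999-12-31"),
    ],
)

# Each check is a query that should return zero rows on healthy data.
checks = {
    "missing_values": (
        "SELECT * FROM Orders WHERE ProductID IS NULL "
        "OR Quantity IS NULL OR OrderDate IS NULL"
    ),
    "duplicate_orders": (
        "SELECT OrderID FROM Orders GROUP BY OrderID HAVING COUNT(*) > 1"
    ),
    "negative_quantity": "SELECT * FROM Orders WHERE Quantity < 0",
    "future_orders": "SELECT * FROM Orders WHERE OrderDate > DATE('now')",
}

results = {name: len(conn.execute(sql).fetchall()) for name, sql in checks.items()}
for name, bad_rows in results.items():
    status = "PASS" if bad_rows == 0 else f"FAIL ({bad_rows} rows)"
    print(f"{name}: {status}")
```

The same pattern extends to any check in this post: a test passes when its query returns zero rows, which makes the checks easy to wire into a scheduled job or CI step.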

More advanced data integrity testing examples

Now, let’s take a hypothetical scenario of a retail eCommerce platform with two related tables: [Customers] and [Orders].

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Email VARCHAR(255) NOT NULL,
    SignupDate DATE NOT NULL
);

CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    ProductID INT NOT NULL,
    Quantity INT NOT NULL,
    OrderDate DATE NOT NULL,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

Here are a few more advanced integrity checks:

Referential integrity

You want to ensure that every order has a valid customer associated with it.

SELECT * FROM Orders
WHERE CustomerID IS NOT NULL AND CustomerID NOT IN (SELECT CustomerID FROM Customers);

This would fetch orders that have a CustomerID not found in the Customers table, potentially highlighting orphaned records.

Check for duplicate emails

Each customer’s email should be unique:

SELECT Email, COUNT(*) as DuplicateCount 
FROM Customers
GROUP BY Email
HAVING COUNT(*) > 1;

If this query returns any rows, it indicates there are duplicate email addresses in the [Customers] table.

Temporal integrity

Customers shouldn’t have orders before they actually signed up:

SELECT o.OrderID, o.CustomerID, o.OrderDate, c.SignupDate
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
WHERE o.OrderDate < c.SignupDate;

While these SQL queries are great for getting a quick look at the integrity of your data, you can’t write a data integrity test for every way data can break. And even if you could, it would be impossible to scale those tests consistently across all of your pipelines.

How to automate data integrity testing

Managing and maintaining data integrity, especially in larger and more complex data stacks, can be a challenging task when done manually. 

That’s why data observability solutions like Monte Carlo are essential. By using machine learning to automatically monitor for broken data pipelines and data anomalies, data engineering teams can resolve these issues rapidly and prevent data downtime.

For instance, Monte Carlo automatically deploys across your most critical data pipelines, catching freshness, volume, and schema issues out-of-the-box, like stale fields that impact the integrity of an executive dashboard or breaking schema changes accidentally pushed by upstream software engineers. 

These issues (and many others!) can be easily and scalably caught with ML-enabled data observability solutions. Additionally, data observability solutions like Monte Carlo provide the necessary triaging and resolution tools to quickly resolve issues before they wreak havoc on downstream consumers. 

Interested in how data observability can improve your data integrity? Schedule time to talk to us in the form below!

Our promise: we will show you the product.