
What is Data Validity?

AUTHOR | Michael Segner

Your data pipeline runs smoothly. Reports generate on schedule. Dashboards update in real time. Everything looks professional and polished. Then someone notices that a customer apparently lived for 247 years, another placed an order in 1823, and somehow your company shipped negative three laptops to an address that doesn’t exist.

These aren’t just amusing anecdotes to share at team meetings. They’re symptoms of a validity problem that’s quietly undermining every decision your organization makes. Invalid data doesn’t announce itself with system crashes or error messages. It slips through unnoticed, corrupting analyses, breaking processes, and leading to conclusions that seem reasonable but are completely wrong.

Data validity is the foundation of trustworthy analytics. It’s the difference between knowing your customer base is aging and discovering that someone’s been entering birth years instead of birth dates. It’s what separates insights you can act on from expensive mistakes waiting to happen.

This article breaks down everything you need to know about data validity. You’ll learn what it actually means for data to be valid, the different types of validity that matter for your systems, and practical methods to catch and prevent validity issues before they spread. We’ll cover the specific rules and checks that work, show you how to implement them at scale, and explain how validity fits into your broader data quality strategy.

Whether you’re a data engineer building pipelines, an analyst questioning suspicious metrics, or a leader who needs to trust the numbers in front of you, this guide gives you the tools to ensure your data reflects reality rather than fiction.


What is data validity?

Data validity is the degree to which your data accurately represents reality and conforms to the rules and constraints you’ve defined for it. Valid data does what you expect it to do and behaves how you expect it to behave.

Think of data validity as your information’s trustworthiness score. When your customer database shows someone living in “New Yrok” or lists their age as 250, you’ve got a validity problem. These aren’t just typos or quirks. They’re signals that your data can’t be trusted for the decisions you need to make.

Valid data correctly captures real-world entities and events while meeting your specific business criteria. This means phone numbers follow the right format. Dates fall within reasonable ranges. Product codes match your inventory system. Transaction amounts make logical sense.

The stakes are straightforward. Invalid data leads to wrong conclusions, failed processes, and poor decisions. You ship products to addresses that don’t exist. Your predictive models fail because they’re trained on garbage. Your compliance reports get rejected because the numbers don’t add up.

Data validity isn’t about perfection. It’s about fitness for purpose. Your sales data needs to be valid enough to forecast revenue. Your customer data needs to be valid enough to deliver packages. Different use cases demand different validity thresholds, and understanding this distinction saves you from both over-engineering and under-delivering.

Best practices for ensuring and maintaining data validity

Getting valid data isn’t luck. It’s the result of deliberate processes and smart safeguards at every stage of your data lifecycle. Here’s how to build a system that catches problems before they corrupt your decisions.

Validate data at the point of entry

Stop bad data before it enters your system. This means building validation directly into your applications, forms, and ingestion points. Replace free text fields with dropdown menus where possible. Set up format checks that reject phone numbers with letters or dates from the year 3000.

For data engineers, this translates to schema enforcement and input constraints. Configure your database to reject null values in required fields. Build your APIs to validate incoming data against expected patterns. Every invalid record you block at entry is one less problem multiplying through your downstream systems.
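
As a rough sketch of what that looks like at the database layer, the hypothetical table below uses NOT NULL and CHECK constraints to reject records that should never be written in the first place:

-- Hypothetical orders table: constraints reject invalid records at write time
CREATE TABLE orders (
    order_id     BIGINT PRIMARY KEY,
    customer_id  BIGINT NOT NULL,                       -- required field: nulls are rejected
    order_date   DATE NOT NULL
        CHECK (order_date >= DATE '2000-01-01' AND order_date < DATE '2100-01-01'),  -- no orders from 1823 or 3000
    quantity     INT NOT NULL CHECK (quantity > 0),     -- no negative-three laptops
    ship_country CHAR(2) NOT NULL                       -- two-letter country codes only
);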

The economics are simple. Fixing a data validity issue at entry costs pennies. Fixing it after it’s propagated through five systems and influenced three executive decisions costs thousands.

Use automated data quality checks

Manual validation doesn’t scale. You need automated rules running continuously through your pipelines, checking every record against your validity criteria. Set up scripts that scan for format mismatches, unexpected nulls, and values outside acceptable ranges.

Modern data observability platforms make this easier than ever. Configure alerts that fire when sensor readings spike beyond physical possibilities or when transaction amounts go negative. Run these data quality checks in real time for critical data or as nightly batch jobs for less urgent datasets.
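
As a minimal sketch, a scheduled query like the one below (against a hypothetical customers table) counts violations of several rules at once, so an alert can fire whenever any count is non-zero:

-- Nightly validity scan: each column counts records that violate one rule
SELECT
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)              AS null_emails,
    SUM(CASE WHEN age < 0 OR age > 120 THEN 1 ELSE 0 END)       AS out_of_range_ages,
    SUM(CASE WHEN signup_date > CURRENT_DATE THEN 1 ELSE 0 END) AS future_signup_dates
FROM customers;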

The key is making these checks systematic, not sporadic. Invalid data doesn’t take weekends off, and neither should your validation processes.

Enforce predefined rules and constraints

Define what valid means for each field and enforce it ruthlessly. Create reference tables for acceptable values like country codes, product categories, or status indicators. Configure your systems to reject or quarantine any record that doesn’t match these predefined lists.

Database constraints are your friends here. Foreign keys ensure referential integrity. Unique constraints prevent duplicates. Check constraints enforce business rules directly at the data layer. These aren’t just nice to have. They’re your last line of defense against invalid data.

Data contracts take this further by formalizing expectations between data producers and consumers. When everyone agrees that customer age must be between 0 and 120, you can build that rule into your systems and trust it will be enforced.
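
Here is a minimal sketch of those mechanisms working together, using hypothetical table names: a reference table holds the acceptable values, a foreign key and a unique constraint guard integrity, and a check constraint encodes the age rule from the data contract:

-- Reference table of acceptable values
CREATE TABLE country_codes (
    code CHAR(2) PRIMARY KEY
);

CREATE TABLE customers (
    customer_id  BIGINT PRIMARY KEY,
    email        VARCHAR(255) NOT NULL UNIQUE,            -- unique constraint prevents duplicates
    age          INT CHECK (age BETWEEN 0 AND 120),       -- data contract rule enforced at the data layer
    country_code CHAR(2) REFERENCES country_codes (code)  -- foreign key ensures referential integrity
);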

Regular data auditing and cleansing

Even the best preventive measures need backup. Schedule regular audits where you profile your data looking for patterns that suggest validity issues. Run summary statistics. Check for sudden spikes in null values. Look for outliers that shouldn’t exist.
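
A lightweight profile can be as simple as a query like this sketch (hypothetical customers table and age column), run on a schedule and compared against previous results:

-- Profile one column: volume, null count, range, and cardinality
SELECT
    COUNT(*)                                     AS total_rows,
    SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) AS null_count,
    MIN(age)                                     AS min_age,
    MAX(age)                                     AS max_age,
    COUNT(DISTINCT age)                          AS distinct_values
FROM customers;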

Data profiling tools can automate much of this work, but don’t ignore manual spot checks. Sometimes a human eye catches what algorithms miss. That customer with an order date in 1970 might technically pass your validation rules, but it’s still wrong.

When you find invalid data, fix it fast. Have clear processes for correction or removal. Document what you changed and why. Invalid data is like rust. Left unchecked, it spreads.

Implement feedback loops

Your end users are validity detectors whether you realize it or not. When an analyst notices something weird in a report or a business user questions a metric, that’s a valuable signal. Build processes to capture and act on this feedback.

Create simple ways for data consumers to flag issues. Maybe it’s a Slack channel, a ticketing system, or a feedback button in your dashboards. When someone reports that customer ages look wrong, trace it back to the source and fix the root cause, not just the symptom.

These feedback loops do double duty. They catch validity issues your automated checks miss, and they build trust with your stakeholders. People trust data more when they know their concerns are heard and addressed.

Training and documentation

Tools and processes fail without people who understand them. Train everyone who touches data on what validity means for your organization. Show data entry staff why formatting matters. Help analysts understand the business rules behind the constraints.

Clear documentation is equally critical. When you document that “order_date cannot be in the future,” you’re not stating the obvious. You’re creating a shared understanding that prevents future confusion and errors. List your validity rules. Explain your constraints. Define acceptable ranges and formats.

Good documentation serves as both reference guide and training material. It ensures that your validity standards survive personnel changes and scaling challenges. The best validation system in the world fails if no one knows how to use it properly.

What are the types of data validity?

Data validity isn’t a single thing. It breaks down into distinct types, each addressing a different aspect of whether your data can be trusted. Knowing these types helps you build complete validation strategies instead of hoping one approach catches everything.

Format validity

This is your first line of defense. Format validity means data follows the required pattern or structure. Email addresses contain an @ symbol. Phone numbers have ten digits. Dates follow YYYY-MM-DD format.

When format validity fails, you get chaos. Systems crash trying to parse dates written as “January 1st” when they expect “2024-01-01”. APIs reject perfectly good customer data because someone included parentheses in a phone number. These aren’t edge cases. They’re daily realities that format validation prevents.

Range validity

Your data needs boundaries. Age can’t be negative. Product ratings stay between 1 and 5. Temperature readings from your warehouse sensors shouldn’t exceed 150 degrees Fahrenheit unless your warehouse is on fire.

Range validity catches the outliers that signal deeper problems. That 999 in your age field? Probably a placeholder someone forgot to update. The negative revenue number? Either a data entry error or you need to have a serious conversation with accounting.

Content or domain validity

Valid data tells the complete story. If your customer satisfaction survey asks about service and product quality but ignores pricing, you’re missing critical information. Your data might be perfectly formatted and within range, but it still fails content validity because it doesn’t cover what it claims to represent.

This type of validity matters most when you’re making strategic decisions. Incomplete data leads to incomplete conclusions. You can’t assess customer experience if you’re only measuring half of it.

Face validity

Sometimes you just need a sanity check. Face validity is your gut reaction when you look at data. Does this make sense? Could this be real?

When your sales report shows a customer buying negative three laptops or your employee database lists someone as 247 years old, you’ve got a face validity problem. It’s not sophisticated, but it’s effective. If data looks wrong at first glance, it probably is.

Construct validity

Here’s where things get subtle. Construct validity asks whether your data actually measures what you think it measures. You create an “innovation score” for companies, but does it really capture innovation or just company size and R&D spending?

This matters enormously for analytics and modeling. That customer engagement metric you’re using to predict churn might actually be measuring account age. Your employee satisfaction survey might be capturing office location effects rather than true satisfaction. Construct validity forces you to question your assumptions.

Criterion-related validity

Your data should predict what it claims to predict. If high lead scores don’t correlate with actual conversions, those scores lack criterion validity. If your risk model flags low-risk customers as high-risk, you’ve got a validity problem that goes beyond bad data to bad conclusions.

Test this by checking outcomes. Do your predictions match reality? When you score something as high quality, does it actually perform better? Criterion validity is where theory meets practice.

Internal and external validity

These concepts come from research but apply directly to business analytics. Internal validity means your analysis correctly identifies cause and effect. When your A/B test shows a 10% conversion lift, can you be sure the new feature caused it, not some other factor?

External validity asks whether your findings generalize. That conversion lift you measured on power users in California during summer might not apply to casual users in Maine during winter. Valid data from one context doesn’t automatically stay valid in another.

Statistical conclusion validity

Even perfect data can lead to wrong conclusions if you analyze it incorrectly. Statistical conclusion validity ensures your methods match your data and your questions. Using the wrong test, inadequate sample sizes, or ignoring statistical assumptions invalidates your conclusions regardless of data quality.

This reminds us that data validity extends beyond the data itself. A t-test on non-normal data or correlation analysis on non-linear relationships produces invalid results even with valid inputs. Your statistical approach needs to be as rigorous as your data collection.

Data validity quality rules

These rules are often developed after profiling the data and seeing where it breaks. 

Valid values in a column

This is useful when you only want known values, such as two-letter country codes. 

One way to check if a value is valid is to create a lookup table that contains a list of all valid country codes. You can then compare the country code in your column to the list of valid codes in the lookup table. Here’s an example:

SELECT COUNT(*) as num_invalid_codes
FROM table_name
WHERE column_name IS NOT NULL
  AND column_name NOT IN (
    SELECT code
    FROM iso_3166_1_alpha_2
  )

If the count is 0, then all of the country codes in the column are valid according to ISO 3166-1 alpha-2. If the count is greater than 0, then there are one or more invalid codes in the column.

Column conforms to a specific format

When you have specific format requirements like telephone numbers or social security numbers, write a rule to check whether the data in that column conforms to or violates that format. 

For example, to make sure all the data conforms to the phone number format (XXX) XXX-XXXX you can use this rule:

SELECT COUNT(*) as num_invalid_numbers
FROM table_name
WHERE column_name IS NOT NULL
  AND NOT REGEXP_LIKE(column_name, '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$')

Note that regular expression syntax varies by database. This example uses REGEXP_LIKE (available in Snowflake, Oracle, and MySQL 8+); PostgreSQL uses the ~ operator instead. You may also need to modify the pattern to match the format of phone numbers in your data.

Primary key integrity

You can also write a validity rule to test the integrity of the primary key column. Integrity just means the primary key has no duplicate data values.

SELECT primary_key_column, COUNT(*) AS num_duplicates
FROM table_name
GROUP BY primary_key_column
HAVING COUNT(*) > 1;

If this query returns no rows, then there are no duplicate values in the primary key column and primary key integrity is preserved.

Nulls in a column

You can use the COUNT function in combination with the IS NULL operator.

SELECT COUNT(*) as num_nulls
FROM table_name
WHERE column_name IS NULL;

Besides nullability, other column property values that would be good to check are data type, length, precision, and scale.
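
Most databases and warehouses expose these properties through an information schema, so the check can be as simple as comparing the catalog against your expectations. For example:

-- Inspect a column's declared type, length, precision, and scale
SELECT column_name, data_type, character_maximum_length, numeric_precision, numeric_scale
FROM information_schema.columns
WHERE table_name = 'table_name'
  AND column_name = 'column_name';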

Data validity business rules

Business rules are another important aspect to consider when checking data validity. Sometimes, there might already be rules in place that apply, or new ones might be developed as you analyze prior data. Examples of validity rules created from business rules include:

Valid value combination

Valid value combinations are rules that specify which combinations of values are allowed or disallowed. For example, there could be a business rule that surgery is always performed in a hospital and if the data shows otherwise, the data is invalid. 

Type of Service | Place of Service | Validity
Surgery | Hospital | Valid
Surgery | Ambulance | Invalid
Surgery | Hospital | Valid
Surgery | Morgue | Invalid
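
One way to implement this rule is to keep the allowed combinations in a reference table and flag any record that doesn’t match. The table and column names below are hypothetical:

SELECT COUNT(*) AS num_invalid_combinations
FROM claims c
WHERE NOT EXISTS (
  SELECT 1
  FROM valid_service_place_combos v
  WHERE v.type_of_service = c.type_of_service
    AND v.place_of_service = c.place_of_service
);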

Computational

This rule uses math to check whether multiple numerical columns are related to each other in the right way. This can be expressed as an equation (e.g., hourly pay rate multiplied by the number of hours worked must equal the gross pay amount) or a set (e.g., the total of all individual order amounts must equal the total order amount).

Hourly Rate | Hours Worked | Gross Pay | Validity
$35.00 | 152 | $5,320 | Valid
$40.00 | 144 | $6,336 | Invalid
$42.00 | 150 | $6,300 | Valid
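
A rule like this can be checked with a query that recomputes the equation and flags rows that don’t balance (a small tolerance allows for rounding to the cent). The payroll table and column names are hypothetical:

-- Flag rows where hourly rate x hours worked does not equal gross pay
SELECT COUNT(*) AS num_invalid_rows
FROM payroll
WHERE ABS(hourly_rate * hours_worked - gross_pay) > 0.01;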

Chronological

These rules validate time and duration relationships. A quick example: a flight seat change request can only be made prior to the flight departure time.

Here’s another example: the date/time of a newborn hearing screening should only be after the date/time of birth.

Date of Birth | Date of Hearing Test | Validity
May 5 | Jun 2 | Valid
Jun 5 | Jul 1 | Valid
Aug 5 | Sep 4 | Valid
Sep 5 | Sep 1 | Invalid
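
A chronological rule like this one translates directly into a comparison between the two timestamps. Using hypothetical table and column names:

-- Flag hearing screenings recorded before the date/time of birth
SELECT COUNT(*) AS num_invalid_rows
FROM newborn_screenings
WHERE hearing_test_datetime < birth_datetime;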

Conditional

These rules contain complex if…then…else logic that might combine valid value combinations with computational and chronological conditions.

For example, a customer should not be charged a flight seat change fee if today’s date is after the flight departure date, if the customer’s fare class states seat changes are free, or if the reward status of the customer permits free seat changes.  
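
Here is a sketch of that rule as a query, using hypothetical table and column names and treating the request date as “today’s date” at the time of the charge. It flags any record where a fee was charged even though one of the free-change conditions applied:

SELECT COUNT(*) AS num_invalid_fees
FROM seat_change_requests
WHERE change_fee > 0
  AND (
        request_date > flight_departure_date   -- change requested after departure
     OR fare_class_free_changes = TRUE         -- fare class includes free seat changes
     OR reward_status_free_changes = TRUE      -- reward status permits free changes
  );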

Data validity is just one piece of a larger data quality picture

Validity rules test whether or not data meets known criteria, but what about testing for validity issues you can’t even anticipate?

To automatically monitor for issues without manual thresholding or configuration, use a data + AI observability platform like Monte Carlo, which uses machine learning to determine what to monitor and what thresholds to set. It covers not just data validity but many other data quality dimensions as well.

Learn more in our blog post Data Observability Tools: Data Engineering’s Next Frontier.


Interested in learning more about data observability? Set up a time to talk to us using the form below.

Our promise: we will show you the product.