Data Reliability | Updated October 27, 2025

Data Munging That Actually Works

By Lindsay MacDonald

Data never shows up ready for the party. It comes in late, messy, and mislabeled. It brings extra columns no one asked for and leaves the important stuff blank. That’s where data munging comes in. Data munging is the process of cleaning, transforming, and structuring raw data so it’s usable.

Think of munging as helping your data get its act together. Sometimes that mess lives in a spreadsheet, but more often, it’s tucked away in logs, APIs, inconsistent tables, or mystery CSVs with columns no one remembers creating.

So what does good data munging actually look like?

What is Data Munging?

[Figure: the 6-step data munging workflow]

At its core, data munging (also called data wrangling) means taking raw, messy data and turning it into something tidy, structured, and ready for analysis.

This is done in 6 steps:

  1. Profiling – figuring out what’s in your dataset (and what’s wrong with it)
  2. Parsing – converting strings into real types like numbers or dates
  3. Standardizing – making sure values use consistent names, units, or formats
  4. Validating – checking that the data meets your expectations
  5. Imputing or reducing – filling in blanks or dropping unusable data
  6. Documenting – keeping track of changes so others (or future you) understand what you did

A good rule of thumb: don’t guess, codify. Everything you “fix” should be reproducible. And don’t wait until modeling to clean; do it upfront so downstream work is painless.
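For example, rather than hand-editing a handful of country values in a spreadsheet (a fix that vanishes on the next refresh), capture the fix as a tiny function you can re-run on every load. The clean_country helper below is purely illustrative, not part of the dataset we'll use later:

import pandas as pd

# A hypothetical example of "codify, don't guess": the fix lives in code,
# so re-running it on a fresh extract reproduces the same result.
def clean_country(series: pd.Series) -> pd.Series:
    """Lowercase and strip country labels so 'US ', 'us', and 'Us' all agree."""
    return series.astype("string").str.strip().str.lower()

print(clean_country(pd.Series(["US ", "us", "Us"])).tolist())  # ['us', 'us', 'us']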

Alright, let’s put this into practice.

The 6-Step Data Munging Workflow

Imagine we’re cleaning up an orders dataframe in Python that looks a little rough:

| order_id | order_date | country | currency | amount | weight | email |
|---|---|---|---|---|---|---|
| 1001 | 2025/01/02 | us | usd | 1200.50 | 2 lb | a@x.com |
| 1002 | 02-01-2025 | United States | USD | 300 | 0.9kg | bad-email |
| 1002 | | USA | eur | 950,00 | 1.5 lbs | b@y.com |
| 1003 | 2025-01-04 13:00 -0800 | | USD | | 500 g | c@z.com |
| 1004 | 2024-12-31 | de | EUR | 120.00 | | d@x.com |

Yikes, there’s a lot going on here. There are duplicate order_ids, inconsistent country labels (like “us”, “USA”, and “United States”), and probably more issues hiding under the surface. This dataset definitely needs some cleaning up, so the first step is taking a closer look at exactly what’s off.
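To follow along, here's a minimal sketch of that raw data as a pandas DataFrame, along with the imports the later snippets rely on (missing cells are left as None):

import numpy as np   # np.nan is used in the cleaning steps below
import pandas as pd
from datetime import datetime, timezone  # used later when versioning the export

# The raw orders data, warts and all; missing cells are left as None
df = pd.DataFrame({
    "order_id":   [1001, 1002, 1002, 1003, 1004],
    "order_date": ["2025/01/02", "02-01-2025", None, "2025-01-04 13:00 -0800", "2024-12-31"],
    "country":    ["us", "United States", "USA", None, "de"],
    "currency":   ["usd", "USD", "eur", "USD", "EUR"],
    "amount":     ["1200.50", "300", "950,00", None, "120.00"],
    "weight":     ["2 lb", "0.9kg", "1.5 lbs", "500 g", None],
    "email":      ["a@x.com", "bad-email", "b@y.com", "c@z.com", "d@x.com"],
})

One caveat: depending on your pandas version, the mixed date formats in order_date may require passing format="mixed" to pd.to_datetime in step 2.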

1) Profile

First, take a peek under the hood. What’s missing? What’s weird? Are there duplicates? Broken formats?

df.info()
df.isna().sum()
df.nunique()
df.describe(include="all")

Let’s also do a quick check for obvious issues, like bad emails or repeated IDs.

email_ok = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

profile = {
    "invalid_email_rows": df.loc[~email_ok].index.tolist(),
    "duplicate_order_ids": df["order_id"][df["order_id"].duplicated()].unique().tolist(),
    "currency_values": df["currency"].str.upper().value_counts().to_dict()
}

That will show us these issues:

{
  'invalid_email_rows': [1],
  'duplicate_order_ids': [1002],
  'currency_values': {'USD': 3, 'EUR': 2}
}

2) Parse & Type-Cast

Now that we know what’s wrong, let’s fix the data types so everything works smoothly.

# Parse order dates to proper UTC timestamps
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce", utc=True)

# Clean up the amount column
df["amount_num"] = (
    df["amount"].str.replace(",", ".", regex=False)  # handles things like "950,00"
                  .str.replace(r"[^\d\.]", "", regex=True)
                  .astype(float)
)

# Normalize enums like currency and country
df["currency"] = df["currency"].str.upper().astype("category")
df["country"]  = df["country"].replace("", np.nan).str.strip().str.lower()

After parsing and type-casting, the data looks like this:

| | order_id | order_date | country | currency | amount_num | weight | email |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 2025-01-02 00:00:00+00:00 | us | USD | 1200.5 | 2 lb | a@x.com |
| 1 | 1002 | 2025-02-01 00:00:00+00:00 | united states | USD | 300.0 | 0.9kg | bad-email |
| 2 | 1002 | NaT | usa | EUR | 950.0 | 1.5 lbs | b@y.com |
| 3 | 1003 | 2025-01-04 21:00:00+00:00 | None | USD | NaN | 500 g | c@z.com |
| 4 | 1004 | 2024-12-31 00:00:00+00:00 | de | EUR | 120.0 | None | d@x.com |

3) Standardize

Let’s also make sure values like country names or units of weight are consistent.

# Map countries to ISO2 codes
country_map = {
    "us": "US", "usa": "US", "united states": "US",
    "de": "DE", "germany": "DE"
}
df["country_iso2"] = df["country"].map(country_map).fillna(df["country"].str.upper())

# Convert amounts to USD
fx = {"USD": 1.0, "EUR": 1.2}
df["amount_usd"] = df.apply(lambda r: r["amount_num"] * fx.get(r["currency"], np.nan), axis=1)

# Convert weight to kilograms
def to_kg(x):
    if pd.isna(x): return np.nan
    s = str(x).replace(" ", "").lower()
    if s.endswith("kg"): return float(s.replace("kg", ""))
    if s.endswith("g"):  return float(s.replace("g", "")) / 1000
    if "lb" in s:        return float(s.replace("lbs", "").replace("lb", "")) * 0.453592
    return np.nan

df["weight_kg"] = df["weight"].apply(to_kg)

This gives us a tidy, comparable set of values:

| | order_id | order_date | country_iso2 | currency | amount_usd | weight_kg | email |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 2025-01-02 00:00:00+00:00 | US | USD | 1200.50 | 0.907184 | a@x.com |
| 1 | 1002 | 2025-02-01 00:00:00+00:00 | US | USD | 300.00 | 0.900000 | bad-email |
| 2 | 1002 | NaT | US | EUR | 1140.00 | 0.680388 | b@y.com |
| 3 | 1003 | 2025-01-04 21:00:00+00:00 | NaN | USD | NaN | 0.500000 | c@z.com |
| 4 | 1004 | 2024-12-31 00:00:00+00:00 | DE | EUR | 144.00 | NaN | d@x.com |

4) Validate

Now it’s time to turn our assumptions into code. That way, if something breaks again, we’ll catch it fast.

# Basic rules
assert df["order_id"].notna().all()

# Custom checks
rules = {
    "order_date_not_null": df["order_date"].notna().all(),
    "amount_usd_nonneg":  (df["amount_usd"].dropna() >= 0).all(),
    "currency_enum_ok":   set(df["currency"].cat.categories).issubset({"USD","EUR"}),
    "email_pattern_ok":   df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).all()
}

Result:

{
  'order_date_not_null': False,
  'amount_usd_nonneg': True,
  'currency_enum_ok': True,
  'email_pattern_ok': False
}

Now if we enforce the email rule, we’ll get an error:

assert rules["email_pattern_ok"], f"Invalid emails found at rows: {df.loc[~df['email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$', na=False)].index.tolist()}"
AssertionError: Invalid emails found at rows: [1]
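If you'd rather enforce every rule at once instead of asserting them one by one, a small loop over the rules dictionary works. This is just a sketch building on the checks above:

# Fail once, reporting every rule that didn't pass
failed = [name for name, passed in rules.items() if not passed]
if failed:
    raise ValueError(f"Validation failed: {failed}")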

You can also pull out any duplicate orders to investigate further.

dupes = df[df.duplicated(subset=["order_id"], keep=False)]
dupes[["order_id","order_date","email"]]

5) Impute or Drop

The last data manipulation step is to fill in some values and drop others. That's fine, as long as every change is rule-based and documented so there's a paper trail.

audit_log = []

# Drop rows with invalid emails
invalid_email_idx = df.index[~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)]
if len(invalid_email_idx):
    audit_log.append({"action":"drop_invalid_email", "rows": invalid_email_idx.tolist()})
    df = df.drop(index=invalid_email_idx)

# Deduplicate based on most recent order_date
dupe_mask = df.duplicated(subset=["order_id"], keep=False)
keepers = (df[dupe_mask]
           .sort_values(["order_id","order_date"])
           .groupby("order_id", as_index=False)
           .tail(1))
df = pd.concat([df[~dupe_mask], keepers]).sort_values("order_id").reset_index(drop=True)
audit_log.append({"action":"dedupe_order_id_latest", "detail":"keep max order_date"})

# Impute missing country from email domain
domain_cc = {"x.com":"US","z.com":"US","y.com":"DE"}
missing_country = df["country_iso2"].isna()
df.loc[missing_country, "country_iso2"] = (
    df.loc[missing_country, "email"].str.split("@").str[-1].map(domain_cc)
)
audit_log.append({"action":"impute_country_from_email_domain", "rows": int(missing_country.sum())})

This gives us our final dataset:

| | order_id | order_date | country_iso2 | currency | amount_usd | weight_kg | email |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | 2025-01-02 00:00:00+00:00 | US | USD | 1200.50 | 0.907184 | a@x.com |
| 1 | 1002 | NaT | US | EUR | 1140.00 | 0.680388 | b@y.com |
| 2 | 1003 | 2025-01-04 21:00:00+00:00 | US | USD | NaN | 0.500000 | c@z.com |
| 3 | 1004 | 2024-12-31 00:00:00+00:00 | DE | EUR | 144.00 | NaN | d@x.com |

And a minimal audit log:

[
  {'action': 'drop_invalid_email', 'rows': [1]},
  {'action': 'dedupe_order_id_latest', 'detail': 'keep max order_date'},
  {'action': 'impute_country_from_email_domain', 'rows': 1}
]

6) Document & Export

Finally, wrap it all up with a summary of what you changed and a little data dictionary. Bonus points for versioning your exports.

changelog = [
    "2025-10-06: Parsed dates to UTC; normalized currency to USD.",
    "2025-10-06: Standardized country to ISO2; converted weight to kg.",
    "2025-10-06: Dropped invalid emails; deduped on order_id by latest order_date.",
    "2025-10-06: Added validation checks (types, enums, ranges)."
]

data_dict = pd.DataFrame([
    {"field":"order_id", "type":"int", "desc":"Business key"},
    {"field":"order_date", "type":"datetime(UTC)", "desc":"Order timestamp"},
    {"field":"country_iso2", "type":"string[ISO2]", "desc":"Standardized country"},
    {"field":"currency", "type":"category{USD,EUR}", "desc":"Original currency"},
    {"field":"amount_usd", "type":"float", "desc":"Amount in USD"},
    {"field":"weight_kg", "type":"float", "desc":"Item weight in kilograms"},
    {"field":"email", "type":"string", "desc":"Customer contact (validated)"},
])
data_dict.to_csv("data_dictionary.csv", index=False)

run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
out_path = f"orders_clean_{run_id}.parquet"
df.to_parquet(out_path, index=False)

At this point you’ve got reproducible code, clear rules, an audit trail, and versioned outputs, aka data munging that future you will thank you for.

From Data Munging to Data + AI Observability

Cleaning your data once is good. Keeping it clean is better. The same rules you just wrote can be turned into monitors that run continuously in production.

This is where data + AI observability tools like Monte Carlo come in. The munging rules we created earlier become always-on checks that monitor data as it flows through your pipelines. In practice, that means real monitors (sketched after this list) like:

  • order_id should always be unique.
  • email should match a valid pattern at least 99.9% of the time.
  • amount_usd should never be negative, and distribution should stay in check.
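To make that concrete, here's a rough, plain-Python sketch of what those rules look like as recurring checks. This is only an illustration of the idea, not Monte Carlo's API; the run_monitors name and the 99.9% threshold are assumptions for the example:

# Illustrative only: munging rules reframed as monitors you'd run on every fresh batch
def run_monitors(df):
    return {
        "order_id_unique": not df["order_id"].duplicated().any(),
        "email_valid_ratio_ok": df["email"].str.match(
            r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False
        ).mean() >= 0.999,
        "amount_usd_nonneg": (df["amount_usd"].dropna() >= 0).all(),
    }

# Run against each new batch and alert on any False value
# print(run_monitors(df))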

You can also get alerted when your model or pipeline drifts, even if you didn't write a specific rule for it, and trace full data lineage so that when a source breaks, you can pinpoint what's been affected downstream.

Want to see this whole thing, from munging to monitoring, in action? Drop your email and we’ll show you how to automate and optimize your data + AI quality management.

Our promise: we will show you the product.

Frequently Asked Questions

How does munging improve data quality?

Data munging improves data quality by cleaning, transforming, and structuring raw data so it becomes accurate, consistent, and usable. Through profiling, parsing, standardizing, validating, imputing, and documenting, data munging fixes issues like missing values, inconsistent formats, duplicates, and errors. This process ensures the data is reliable and ready for analysis or further processing.

What is the difference between data munging and wrangling?

There is no significant difference between data munging and data wrangling; the two terms are often used interchangeably. Both refer to the process of cleaning and preparing raw data for analysis by correcting errors, standardizing formats, and handling missing values.

What is an example of data munging?

An example of data munging is cleaning up a messy orders dataset by correcting inconsistent country codes, parsing dates into a standard format, converting currencies, validating email addresses, removing duplicates, and documenting the changes. This transforms the raw, inconsistent data into a clean, structured dataset ready for analysis.

What is the difference between data munging and ETL?

While data munging focuses on cleaning and structuring data for usability, ETL (Extract, Transform, Load) is a broader process that extracts data from various sources, transforms it (which can include munging), and loads it into a target system or database. ETL often deals with larger-scale, automated data movement and integration, whereas data munging is typically more hands-on and focused on making data analysis-ready.