Entry № 226·data

Build a Data-Cleaning Checklist for a New Dataset

Generates a prioritized, column-specific data-cleaning checklist tailored to your dataset and intended analysis.

§ When to use this

Jumping into analysis before cleaning is how a units mismatch or a pile of duplicate rows turns into a wrong conclusion that survives all the way to a slide. This prompt produces a prioritized, column-specific checklist tailored to your actual dataset and, crucially, to what you plan to do with it, because the cleaning that matters depends on the analysis. Outliers barely matter for a count but wreck an average; duplicate rows are fatal for a sum and harmless for a distinct list. By grouping checks into structural, value, and consistency tiers and ordering them by how badly each could distort your specific analysis, you fix the dangerous things first instead of polishing columns you will never use. The decision rule for each item, fix versus flag, keeps you from silently 'correcting' data you should have escalated. Use it the moment a new dataset lands, before you trust a single aggregate from it, and re-run it whenever the source or intended analysis changes.
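The claim that the same flaw is harmless for one aggregate and damaging for another can be checked with a few lines of arithmetic. The sketch below uses made-up numbers (the ticket ids, amounts, and resolution times are all hypothetical) and only the Python standard library.

```python
from statistics import mean, median

# Hypothetical resolution times in minutes; the appended value is an outlier.
clean = [30, 45, 60, 75, 90]
with_outlier = clean + [10_000]

# The count moves by one row, the median shifts a little, the mean is wrecked.
count_shift = len(with_outlier) - len(clean)
median_shift = median(with_outlier) - median(clean)
mean_shift = mean(with_outlier) - mean(clean)
print(count_shift, median_shift, round(mean_shift, 1))

# Hypothetical (ticket_id, amount) rows; T2 was exported twice.
rows = [("T1", 100), ("T2", 250), ("T3", 50)]
rows_with_dup = rows + [("T2", 250)]

# The duplicate inflates the sum but leaves the distinct list untouched.
total_clean = sum(amount for _, amount in rows)
total_dup = sum(amount for _, amount in rows_with_dup)
distinct_clean = len({ticket_id for ticket_id, _ in rows})
distinct_dup = len({ticket_id for ticket_id, _ in rows_with_dup})
print(total_clean, total_dup, distinct_clean, distinct_dup)
```

This is why the prompt asks for the intended analysis up front: the same two flaws would be ranked differently for a count, a median, a mean, or a sum.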

§ The Prompt

§ Customize

Dataset description [DATASET DESCRIPTION]: What the data is and where it came from, e.g. 'a CSV export of 90 days of support tickets from Zendesk'.

Columns and types [COLUMNS AND TYPES]: The columns with types, e.g. 'ticket_id (str), created_at (datetime), priority (categorical), resolution_minutes (int), agent (str)'.

Intended analysis [INTENDED ANALYSIS]: What you'll compute, e.g. 'median resolution time by priority and by agent'.

Known issues [KNOWN ISSUES]: Anything you already suspect, e.g. 'some tickets reopened and exported twice; priority sometimes blank'.

Your prompt

You are a meticulous data analyst. Build a prioritized data-cleaning checklist for the dataset below before I trust it for analysis. Dataset: [DATASET DESCRIPTION]. Columns and types: [COLUMNS AND TYPES]. Intended use: [INTENDED ANALYSIS]. Known issues: [KNOWN ISSUES].

Output a checklist grouped into: structural checks (schema, types, duplicates, keys), value checks (NULLs, outliers, ranges, units, categorical typos), and consistency checks (cross-field logic, time ordering, referential integrity). For each item give the specific column(s) to inspect, a quick test to run, and the decision rule for fixing versus flagging. Order items by how badly each could distort [INTENDED ANALYSIS]. Do not invent column names beyond those I listed. End with three questions whose answers change the cleaning plan.
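If you fill the prompt from a script rather than by hand, a small helper can refuse to send it while any bracketed field is still empty. This is a minimal sketch, not part of the entry itself: `fill_prompt` is a hypothetical helper name, and the template is an abbreviated stand-in for the full prompt text above. Note that [INTENDED ANALYSIS] appears twice, so it must be replaced everywhere.

```python
import re

# Abbreviated stand-in for the prompt above; paste the full text in practice.
PROMPT_TEMPLATE = (
    "Dataset: [DATASET DESCRIPTION]. Columns and types: [COLUMNS AND TYPES]. "
    "Intended use: [INTENDED ANALYSIS]. Known issues: [KNOWN ISSUES]. "
    "Order items by how badly each could distort [INTENDED ANALYSIS]."
)

PLACEHOLDER = re.compile(r"\[([A-Z ]+)\]")


def fill_prompt(template: str, fields: dict[str, str]) -> str:
    """Replace every [FIELD] in the template; raise if any field is missing or blank."""
    needed = set(PLACEHOLDER.findall(template))
    missing = sorted(name for name in needed if not fields.get(name, "").strip())
    if missing:
        raise ValueError(f"Unfilled fields: {missing}")
    # A single substitution pass, so text inside a value is never re-expanded.
    return PLACEHOLDER.sub(lambda m: fields[m.group(1)].strip(), template)


fields = {
    "DATASET DESCRIPTION": "a CSV export of 90 days of support tickets from Zendesk",
    "COLUMNS AND TYPES": "ticket_id (str), created_at (datetime), priority (categorical), "
                         "resolution_minutes (int), agent (str)",
    "INTENDED ANALYSIS": "median resolution time by priority and by agent",
    "KNOWN ISSUES": "some tickets reopened and exported twice; priority sometimes blank",
}
filled = fill_prompt(PROMPT_TEMPLATE, fields)
print(filled)
```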

§ Example Output

What you can expect back

Structural (highest impact):
- Duplicate tickets: check ticket_id for repeats (df.ticket_id.duplicated().sum()). Reopened exports double-count rows and skew medians. Rule: dedupe to one row per ticket_id by latest created_at; flag if duplicates disagree on resolution_minutes.

Value:
- Blank priority: count NULLs in priority. Median 'by priority' silently drops or misbuckets these. Rule: flag, do not guess; report an 'Unknown' bucket.
- resolution_minutes outliers and ranges: check for negatives and absurd highs (e.g. > 30 days). Negatives are impossible and distort even medians slightly. Rule: fix negatives to NULL, flag extreme highs for review.
- priority typos: list distinct values; 'High' vs 'high' split your groups. Rule: standardize casing.

Consistency:
- agent NULLs: count blanks; a 'by agent' breakdown silently drops or misbuckets tickets with no agent. Rule: flag.

Questions that change the plan: (1) For reopened tickets, should resolution_minutes be the first or final resolution? (2) Are blank priorities truly unknown or a default? (3) Is resolution_minutes wall-clock or business hours?

Illustrative example — your results will vary by tool and inputs.
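The quick tests named in the example can be run as a short pandas script. The sketch below assumes pandas is installed and uses a small hypothetical DataFrame with the example's column names; the 30-day ceiling is the example's own threshold, expressed in minutes.

```python
import pandas as pd

# Hypothetical sample rows; only the columns listed in the example are used.
df = pd.DataFrame({
    "ticket_id": ["T1", "T2", "T2", "T3", "T4"],
    "created_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-03", "2024-01-04"]
    ),
    "priority": ["High", "high", "high", None, "Low"],
    "resolution_minutes": [30, 45, 50, -5, 60_000],
    "agent": ["ana", "ben", "ben", None, "ana"],
})

THIRTY_DAYS_IN_MINUTES = 30 * 24 * 60

# One number (or list) per checklist item; nothing is modified yet.
report = {
    "duplicate_ticket_ids": int(df["ticket_id"].duplicated().sum()),
    "blank_priority": int(df["priority"].isna().sum()),
    "blank_agent": int(df["agent"].isna().sum()),
    "negative_resolution": int((df["resolution_minutes"] < 0).sum()),
    "over_30_days": int((df["resolution_minutes"] > THIRTY_DAYS_IN_MINUTES).sum()),
    "priority_values": sorted(df["priority"].dropna().unique()),
}
print(report)

# Flag: ticket_ids whose duplicate rows disagree on resolution_minutes.
distinct_minutes = df.groupby("ticket_id")["resolution_minutes"].nunique()
conflicting_ids = distinct_minutes[distinct_minutes > 1].index.tolist()

# Fix: keep one row per ticket_id, the one with the latest created_at.
deduped = df.sort_values("created_at").drop_duplicates(subset="ticket_id", keep="last")
print(conflicting_ids, len(deduped))
```

The report is computed before any row is changed, so the flag decisions are made against the data as exported.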

§ Pro Tips

Get sharper results

01 Tell it the intended analysis precisely; 'median by group' and 'total sum' demand different cleaning priorities.
02 Always run the distinct-values check on categoricals first; casing and whitespace silently split your groups.
03 Prefer flag over fix for anything judgment-based; never overwrite source data without recording what you changed.
04 Re-run the checklist after every fresh export; upstream changes can reintroduce the same issues.
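Tips 02 and 03 combine naturally: inspect the raw distinct values, apply only the mechanical fix, and keep a record of every change. The sketch below is one way to do that with the standard library; the sample values and the `normalize` helper are hypothetical.

```python
from collections import Counter

# Hypothetical raw values for a categorical column such as priority.
raw_priority = ["High", "high", " high", "Low", "LOW ", None]

# Tip 02: count distinct raw values first; repr() makes stray whitespace visible.
raw_counts = Counter(raw_priority)
for value, count in raw_counts.items():
    print(repr(value), count)


def normalize(value):
    """Mechanical fix only: trim and lowercase. Missing values are left as-is."""
    return value.strip().lower() if isinstance(value, str) else value


# Tip 03: write to a new list and log each change; the source list is untouched.
cleaned = [normalize(v) for v in raw_priority]
change_log = [
    {"row": i, "before": before, "after": after}
    for i, (before, after) in enumerate(zip(raw_priority, cleaned))
    if before != after
]
cleaned_counts = Counter(cleaned)
print(cleaned_counts)
print(change_log)
```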

§ Variations

Adapt it for your case

Generate the checks as code

Add 'write each test as a pandas or SQL snippet' to get a runnable validation script.

Profiling first

Ask what summary stats and distributions to pull before deciding on fixes, to ground the plan in reality.

Pipeline guardrails

Ask which checks belong as automated assertions in the ETL so bad data is caught on ingest.
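As a rough illustration of what such guardrails might look like, the sketch below rejects a batch at ingest if it violates two structural rules. The function names, the row format (a list of dicts), and the choice of rules are assumptions for the example, not something the prompt prescribes.

```python
def validate_batch(rows):
    """Return a list of violations; an empty list means the batch passes."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        ticket_id = row.get("ticket_id")
        if not ticket_id:
            problems.append(f"row {i}: missing ticket_id")
        elif ticket_id in seen_ids:
            problems.append(f"row {i}: duplicate ticket_id {ticket_id}")
        else:
            seen_ids.add(ticket_id)

        minutes = row.get("resolution_minutes")
        if minutes is not None and minutes < 0:
            problems.append(f"row {i}: negative resolution_minutes {minutes}")
    return problems


def ingest(rows):
    """Hypothetical ingest step: stop before loading if any check fails."""
    problems = validate_batch(rows)
    if problems:
        raise ValueError("batch rejected:\n" + "\n".join(problems))
    return rows


good = [{"ticket_id": "T1", "resolution_minutes": 30}]
bad = good + [
    {"ticket_id": "T1", "resolution_minutes": 45},   # duplicate key
    {"ticket_id": "T2", "resolution_minutes": -5},   # impossible value
]
print(validate_batch(good))
print(validate_batch(bad))
```

Only unambiguous rules belong here; judgment calls such as extreme-but-possible values are better reported than used to reject a batch.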

Use for (tasks): Analyzing Data

Tags: #data-cleaning #data-quality #etl

§ FAQ

Common questions

Why does the order depend on my analysis?

A duplicate row leaves a distinct count unchanged but inflates a sum and can shift a median; ordering by impact on your specific calculation means you fix the dangerous issues first.

Should I fix everything it finds?

No. Mechanical issues like type or casing are safe fixes; anything involving judgment, like dropping outliers, should be flagged and decided with context, not silently changed.

It listed a column I don't have. What happened?

It should not, because the prompt forbids inventing columns. If it does, your pasted column list was probably incomplete, so paste the exact header row.

§ Related Entries

You may also need

№ 222·data

Write a Pandas Snippet for a Specific Data Transformation

Generates a vectorized, commented pandas snippet for a precise transformation on your actual DataFrame.

For: chatgpt·claude

№ 225·data

Write a Precise, Unambiguous Metric Definition

Produces a precise metric definition with formula, grain, inclusion rules, and resolved edge cases.

For: chatgpt·claude

№ 223·data

Choose the Right Chart Type for Your Data

Recommends the best and runner-up chart type for your message, variables, and audience with encoding guidance.

For: chatgpt·claude

№ 224·data

Interpret an A/B Test Result Without Overclaiming

Interprets A/B test numbers honestly, flags validity risks, and gives a ship or keep-running recommendation.

For: chatgpt·claude