Build a Data-Cleaning Checklist for a New Dataset
Generates a prioritized, column-specific data-cleaning checklist tailored to your dataset and intended analysis.
Jumping into analysis before cleaning is how a units mismatch or a pile of duplicate rows turns into a wrong conclusion that survives all the way to a slide. This prompt produces a prioritized, column-specific checklist tailored to your actual dataset and, crucially, to what you plan to do with it, because the cleaning that matters depends on the analysis. Outliers barely matter for a count but wreck an average; duplicate rows are fatal for a sum and harmless for a distinct list. By grouping checks into structural, value, and consistency tiers and ordering them by how badly each could distort your specific analysis, you fix the dangerous things first instead of polishing columns you will never use. The decision rule for each item, fix versus flag, keeps you from silently 'correcting' data you should have escalated. Use it the moment a new dataset lands, before you trust a single aggregate from it, and re-run it whenever the source or intended analysis changes.
You are a meticulous data analyst. Build a prioritized data-cleaning checklist for the dataset below before I trust it for analysis. Dataset: [DATASET DESCRIPTION]. Columns and types: [COLUMNS AND TYPES]. Intended use: [INTENDED ANALYSIS]. Known issues: [KNOWN ISSUES]. Output a checklist grouped into: structural checks (schema, types, duplicates, keys), value checks (NULLs, outliers, ranges, units, categorical typos), and consistency checks (cross-field logic, time ordering, referential integrity). For each item give the specific column(s) to inspect, a quick test to run, and the decision rule for fixing versus flagging. Order items by how badly each could distort [INTENDED ANALYSIS]. Do not invent column names beyond those I listed. End with three questions whose answers change the cleaning plan.
What you can expect back
Structural (highest impact): - Duplicate tickets: check ticket_id for repeats (df.ticket_id.duplicated().sum()). Reopened exports double-count rows and skew medians. Rule: dedupe to one row per ticket_id by latest created_at; flag if duplicates disagree on resolution_minutes. Value: - Blank priority: count NULLs in priority. Median 'by priority' silently drops or misbuckets these. Rule: flag, do not guess; report an 'Unknown' bucket. - resolution_minutes outliers and ranges: check for negatives and absurd highs (e.g. > 30 days). Negatives are impossible and distort even medians slightly. Rule: fix negatives to NULL, flag extreme highs for review. - priority typos: list distinct values; 'High' vs 'high' split your groups. Rule: standardize casing. Consistency: - agent NULLs: count blanks; 'by agent' breaks without them. Rule: flag. Questions that change the plan: (1) For reopened tickets, should resolution_minutes be the first or final resolution? (2) Are blank priorities truly unknown or a default? (3) Is resolution_minutes wall-clock or business hours?
Illustrative example — your results will vary by tool and inputs.
Get sharper results
- 01Tell it the intended analysis precisely; 'median by group' and 'total sum' demand opposite cleaning priorities.
- 02Always run the distinct-values check on categoricals first; casing and whitespace silently split your groups.
- 03Prefer flag over fix for anything judgment-based; never overwrite source data without recording what you changed.
- 04Re-run the checklist after a fresh export, upstream changes reintroduce the same issues.
Adapt it for your case
Add 'write each test as a pandas or SQL snippet' to get a runnable validation script.
Ask what summary stats and distributions to pull before deciding on fixes, to ground the plan in reality.
Ask which checks belong as automated assertions in the ETL so bad data is caught on ingest.
Common questions
Why does the order depend on my analysis?
A duplicate row barely changes a distinct count but doubles a sum and can shift a median; ordering by impact on your specific calculation fixes the dangerous issues first.
Should I fix everything it finds?
No. Mechanical issues like type or casing are safe fixes; anything involving judgment, like dropping outliers, should be flagged and decided with context, not silently changed.
It listed a column I don't have, what happened?
It should not, the prompt forbids inventing columns; if it does, your pasted column list was probably incomplete, so paste the exact header row.
You may also need
Write a Pandas Snippet for a Specific Data Transformation
Generates a vectorized, commented pandas snippet for a precise transformation on your actual DataFrame.
Write a Precise, Unambiguous Metric Definition
Produces a precise metric definition with formula, grain, inclusion rules, and resolved edge cases.
Choose the Right Chart Type for Your Data
Recommends the best and runner-up chart type for your message, variables, and audience with encoding guidance.
Interpret an A/B Test Result Without Overclaiming
Interprets A/B test numbers honestly, flags validity risks, and gives a ship or keep-running recommendation.