← IndexEntry № 224·data

Interpret an A/B Test Result Without Overclaiming

Interprets A/B test numbers honestly, flags validity risks, and gives a ship or keep-running recommendation.

Optimized for

§ When to use this

The fastest way to lose trust in an experimentation program is to ship a result that was noise, then watch the metric snap back next quarter. This prompt is built to keep an LLM honest: it explicitly forbids inventing a p-value or confidence interval from raw counts, because a confident fake statistic is worse than none. Instead it asks the model to reason about whether your sample and duration are adequate, name the test it would run, and flag the validity threats that kill more experiments than small effect sizes do, peeking at results early, novelty wearing off, seasonality, and running ten variants then celebrating the lucky one. The output ends with an actual decision, ship, iterate, or keep running, because an analysis that does not lead to a call is just trivia. Use it as a structured second opinion before a launch meeting, and treat its statistical claims as a checklist of what to compute properly in your stats tool, not as the final verdict.

§ The Prompt— fill in the fields, then copy or open in a tool

§ Customize0/7 fields filled

Primary metric[PRIMARY METRIC]The one metric the test was designed to move, e.g. 'checkout conversion rate'.

Control n[CONTROL N]Number of users in control, e.g. '8,200'.

Control rate[CONTROL RATE]Control's result on the metric, e.g. '3.1% (254 conversions)'.

Variant n[VARIANT N]Number of users in the variant, e.g. '8,050'.

Variant rate[VARIANT RATE]Variant's result, e.g. '3.6% (290 conversions)'.

Duration[DURATION]How long it ran, e.g. '14 days, two full weekly cycles'.

Hypothesis[HYPOTHESIS]What you predicted before starting, e.g. 'a shorter checkout form increases conversion'.

your prompt — fill the fields above

You are a rigorous experimentation analyst. Interpret this A/B test honestly. Inputs:
Metric: [PRIMARY METRIC]
Control: [CONTROL N] users, [CONTROL RATE]
Variant: [VARIANT N] users, [VARIANT RATE]
Duration: [DURATION]
Pre-registered hypothesis: [HYPOTHESIS]

Do the following: (1) state the absolute and relative lift, (2) judge whether the sample size and duration look sufficient and whether the result is plausibly significant, naming any test you would run, (3) flag risks like peeking, novelty effects, seasonality, or multiple comparisons, and (4) give a clear recommendation: ship, iterate, or keep running. Do not fabricate a p-value or confidence interval from these numbers; instead explain what you would compute and what would change your call. Be plain-spoken, not falsely precise.

Open with your prompt →ChatGPT ↗Claude ↗Sends your filled-in prompt straight into a new chat.

§ Example Output

What you can expect back

Lift: variant is 3.6% vs 3.1%, an absolute gain of 0.5 points and a relative lift of about 16%. That is a meaningful effect size if real.

Adequacy: with ~8,000 per arm and ~250-290 conversions each, this is a reasonable but not huge sample for a ~3% base rate. I would run a two-proportion z-test (or a chi-square) on the raw counts to get a proper p-value and a confidence interval on the difference; do not treat the 16% as final until that is computed. Two full weeks covers weekly seasonality, which is good.

Risks: confirm results were not checked daily and called early (peeking inflates false positives); a shorter form may show a novelty bump that fades; verify this was the only metric tested, otherwise correct for multiple comparisons.

Recommendation: iterate toward ship. Compute the test; if it clears your threshold with the CI excluding zero, ship. If the interval is wide and crosses zero, keep running one more week.

Illustrative example — your results will vary by tool and inputs.

§ Pro Tips

Get sharper results

01Give raw conversion counts, not just percentages; a proper significance test needs the numerators and denominators.
02State your hypothesis honestly up front, post-hoc stories about why a result 'makes sense' are how teams fool themselves.
03Always compute the actual test in a stats tool; use this output to decide which test and to catch validity threats.
04Mention if you ran multiple variants or metrics so it can warn you about multiple-comparison inflation.

§ Variations

Adapt it for your case

Sample-size pre-check

Before launching, ask for the minimum detectable effect and runtime given your baseline rate and traffic.

Guardrail review

Add secondary metrics like refunds or load time and ask whether the win came at a hidden cost.

Plain summary for execs

Ask for a three-sentence version with the decision and the one risk to mention in the meeting.

Best For — Roles

Use For — Tasks

Analyzing Data

Tags#ab-testing#experimentation#statistics

§ FAQ

Common questions

Why won't it just tell me if it's significant?

Significance depends on a test you must run on the raw counts; the prompt deliberately stops the model from fabricating a p-value, which would be confidently wrong.

Is two weeks long enough?

Two full weekly cycles cover day-of-week seasonality, which is usually the minimum; longer helps if your buying cycle is long or traffic is low.

What is peeking and why does it matter?

Checking results repeatedly and stopping when they look good inflates false positives; decide your sample size and end date in advance and judge only then.

§ Related Entries

You may also need

№ 223data

Choose the Right Chart Type for Your Data

Recommends the best and runner-up chart type for your message, variables, and audience with encoding guidance.

For

chatgpt·claude

№ 225data

Write a Precise, Unambiguous Metric Definition

Produces a precise metric definition with formula, grain, inclusion rules, and resolved edge cases.

For

chatgpt·claude

№ 227data

Write a Dashboard Spec Before You Build It

Drafts a build-ready dashboard spec with KPIs, layout, filters, drill-downs, and out-of-scope guardrails.

For

chatgpt·claude

№ 228data

Explain a Statistical Concept to Non-Technical Stakeholders

Translates a statistical concept into a plain-language explanation tied to a real decision for non-technical stakeholders.

For

chatgpt·claude