Entry № 224 · data

Interpret an A/B Test Result Without Overclaiming

Interprets A/B test numbers honestly, flags validity risks, and gives a ship or keep-running recommendation.

Optimized for
ChatGPT, Claude
§ When to use this

The fastest way to lose trust in an experimentation program is to ship a result that was noise, then watch the metric snap back next quarter. This prompt is built to keep an LLM honest: it explicitly forbids inventing a p-value or confidence interval from raw counts, because a confident fake statistic is worse than none. Instead it asks the model to reason about whether your sample and duration are adequate, name the test it would run, and flag the validity threats that kill more experiments than small effect sizes do: peeking at results early, novelty wearing off, seasonality, and running ten variants then celebrating the lucky one. The output ends with an actual decision (ship, iterate, or keep running), because an analysis that does not lead to a call is just trivia. Use it as a structured second opinion before a launch meeting, and treat its statistical claims as a checklist of what to compute properly in your stats tool, not as the final verdict.

§ The Prompt
You are a rigorous experimentation analyst. Interpret this A/B test honestly. Inputs:
Metric: [PRIMARY METRIC]
Control: [CONTROL N] users, [CONTROL RATE]
Variant: [VARIANT N] users, [VARIANT RATE]
Duration: [DURATION]
Pre-registered hypothesis: [HYPOTHESIS]

Do the following: (1) state the absolute and relative lift, (2) judge whether the sample size and duration look sufficient and whether the result is plausibly significant, naming any test you would run, (3) flag risks like peeking, novelty effects, seasonality, or multiple comparisons, and (4) give a clear recommendation: ship, iterate, or keep running. Do not fabricate a p-value or confidence interval from these numbers; instead explain what you would compute and what would change your call. Be plain-spoken, not falsely precise.
§ Example Output

What you can expect back

Lift: variant is 3.6% vs 3.1%, an absolute gain of 0.5 percentage points and a relative lift of about 16%. That is a meaningful effect size if real.

Adequacy: with ~8,000 per arm and ~250-290 conversions each, this is a reasonable but not huge sample for a ~3% base rate. I would run a two-proportion z-test (or a chi-square) on the raw counts to get a proper p-value and a confidence interval on the difference; do not treat the 16% as final until that is computed. Two full weeks covers weekly seasonality, which is good.

Risks: confirm results were not checked daily and called early (peeking inflates false positives); a shorter form may show a novelty bump that fades; verify this was the only metric tested, otherwise correct for multiple comparisons.

Recommendation: iterate toward ship. Compute the test; if it clears your threshold with the CI excluding zero, ship. If the interval is wide and crosses zero, keep running one more week.

Illustrative example — your results will vary by tool and inputs.
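The lift arithmetic in the example is easy to check by hand, and worth checking, since absolute and relative lift are often confused. A minimal sketch, assuming the two rates from the example; the `lift` function name is made up for illustration:

```python
def lift(control_rate: float, variant_rate: float) -> tuple[float, float]:
    """Return (absolute lift in percentage points, relative lift in percent)."""
    difference = variant_rate - control_rate
    absolute_points = difference * 100               # 0.031 -> 0.036 is 0.5 points
    relative_percent = difference / control_rate * 100
    return absolute_points, relative_percent


absolute_points, relative_percent = lift(control_rate=0.031, variant_rate=0.036)
print(f"absolute: {absolute_points:.2f} points, relative: {relative_percent:.1f}%")
# prints "absolute: 0.50 points, relative: 16.1%"
```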

§ Pro Tips

Get sharper results

  • 01 Give raw conversion counts, not just percentages; a proper significance test needs the numerators and denominators.
  • 02 State your hypothesis honestly up front; post-hoc stories about why a result 'makes sense' are how teams fool themselves.
  • 03 Always compute the actual test in a stats tool; use this output to decide which test and to catch validity threats.
  • 04 Mention if you ran multiple variants or metrics so it can warn you about multiple-comparison inflation.
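For the tip about computing the actual test, here is one way it might look using only the Python standard library. This is a sketch of a two-proportion z-test with a Wald interval on the difference, not a replacement for your stats tool. The counts (248 and 288 conversions out of 8,000 per arm) are hypothetical, chosen only to match the 3.1% and 3.6% rates in the example output, and the function name is made up.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided z-test for a difference in rates, plus a Wald CI on the difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the test, which assumes no true difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error: used for the interval around the observed difference.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


# Hypothetical counts: 248/8000 is 3.1%, 288/8000 is 3.6%.
z, p_value, ci = two_proportion_ztest(conv_a=248, n_a=8000, conv_b=288, n_b=8000)
print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:+.4f}, {ci[1]:+.4f})")
```

On these made-up counts the interval on the difference spans zero, so a 16% relative lift that looks meaningful would still land in the "keep running" branch. That is the gap between reading percentages and running the test.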
§ Variations

Adapt it for your case

Sample-size pre-check

Before launching, ask for the minimum detectable effect and runtime given your baseline rate and traffic.
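A rough version of that pre-check can be computed directly. The sketch below uses one common normal-approximation formula for comparing two proportions; dedicated power calculators may give slightly different numbers. The 3.1% baseline and 16% relative effect are borrowed from the example output, while the 1,200 users per day, the default alpha and power, and the function names are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift (two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def runtime_days(n_per_arm, daily_users, arms=2):
    """Days needed to fill every arm, assuming traffic is split evenly."""
    return ceil(n_per_arm * arms / daily_users)


n = sample_size_per_arm(baseline_rate=0.031, relative_mde=0.16)
days = runtime_days(n, daily_users=1200)  # hypothetical traffic level
full_weeks = ceil(days / 7)               # round up so weekly cycles stay complete
print(f"{n} users per arm, about {days} days, so plan for {full_weeks} full weeks")
```

Under these assumptions the required sample comes out well above 8,000 per arm, which is why a pre-check before launch is cheaper than discovering an underpowered test afterwards.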

Guardrail review

Add secondary metrics like refunds or load time and ask whether the win came at a hidden cost.

Plain summary for execs

Ask for a three-sentence version with the decision and the one risk to mention in the meeting.

Tags: #ab-testing #experimentation #statistics
§ FAQ

Common questions

Why won't it just tell me if it's significant?

Significance depends on a test you must run on the raw counts; the prompt deliberately stops the model from fabricating a p-value, which would be confidently wrong.

Is two weeks long enough?

Two full weekly cycles cover day-of-week seasonality and are usually the minimum; run longer if your buying cycle is long or traffic is low.

What is peeking and why does it matter?

Checking results repeatedly and stopping when they look good inflates false positives; decide your sample size and end date in advance and judge only then.
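A small simulation makes the inflation concrete. The sketch below runs A/A tests, where both arms share the same true rate, so every "significant" result is a false positive. It compares an analyst who checks daily and stops at the first significant look against one who tests once at the end. All parameters (300 tests, 14 days, 100 users per arm per day, a 5% rate, the seed) are arbitrary assumptions, and the function names are made up.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two arms with n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0.0 or pooled >= 1.0:
        return False  # no variance yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate(num_tests=300, days=14, users_per_day=100, rate=0.05, seed=7):
    """Return (false-positive rate with daily peeking, rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(num_tests):
        conv_a = conv_b = 0
        any_look = final_look = False
        for day in range(1, days + 1):
            # Both arms draw from the same rate: there is no real effect to find.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            final_look = is_significant(conv_a, conv_b, day * users_per_day)
            any_look = any_look or final_look  # the peeker stops at the first "win"
        peeking_hits += any_look
        fixed_hits += final_look
    return peeking_hits / num_tests, fixed_hits / num_tests


peeking_rate, fixed_rate = simulate()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives with one final test: {fixed_rate:.1%}")
```

The peeking rate cannot come out below the fixed-horizon rate here, because the final look is itself one of the peeks. How much higher it lands depends on the number of looks and on the seed.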
