Interpret an A/B Test Result Without Overclaiming
Interprets A/B test numbers honestly, flags validity risks, and gives a ship or keep-running recommendation.
The fastest way to lose trust in an experimentation program is to ship a result that was noise, then watch the metric snap back next quarter. This prompt is built to keep an LLM honest: it explicitly forbids inventing a p-value or confidence interval from raw counts, because a confident fake statistic is worse than none. Instead it asks the model to reason about whether your sample and duration are adequate, name the test it would run, and flag the validity threats that kill more experiments than small effect sizes do, peeking at results early, novelty wearing off, seasonality, and running ten variants then celebrating the lucky one. The output ends with an actual decision, ship, iterate, or keep running, because an analysis that does not lead to a call is just trivia. Use it as a structured second opinion before a launch meeting, and treat its statistical claims as a checklist of what to compute properly in your stats tool, not as the final verdict.
You are a rigorous experimentation analyst. Interpret this A/B test honestly. Inputs: Metric: [PRIMARY METRIC] Control: [CONTROL N] users, [CONTROL RATE] Variant: [VARIANT N] users, [VARIANT RATE] Duration: [DURATION] Pre-registered hypothesis: [HYPOTHESIS] Do the following: (1) state the absolute and relative lift, (2) judge whether the sample size and duration look sufficient and whether the result is plausibly significant, naming any test you would run, (3) flag risks like peeking, novelty effects, seasonality, or multiple comparisons, and (4) give a clear recommendation: ship, iterate, or keep running. Do not fabricate a p-value or confidence interval from these numbers; instead explain what you would compute and what would change your call. Be plain-spoken, not falsely precise.
What you can expect back
Lift: variant is 3.6% vs 3.1%, an absolute gain of 0.5 points and a relative lift of about 16%. That is a meaningful effect size if real. Adequacy: with ~8,000 per arm and ~250-290 conversions each, this is a reasonable but not huge sample for a ~3% base rate. I would run a two-proportion z-test (or a chi-square) on the raw counts to get a proper p-value and a confidence interval on the difference; do not treat the 16% as final until that is computed. Two full weeks covers weekly seasonality, which is good. Risks: confirm results were not checked daily and called early (peeking inflates false positives); a shorter form may show a novelty bump that fades; verify this was the only metric tested, otherwise correct for multiple comparisons. Recommendation: iterate toward ship. Compute the test; if it clears your threshold with the CI excluding zero, ship. If the interval is wide and crosses zero, keep running one more week.
Illustrative example — your results will vary by tool and inputs.
Get sharper results
- 01Give raw conversion counts, not just percentages; a proper significance test needs the numerators and denominators.
- 02State your hypothesis honestly up front, post-hoc stories about why a result 'makes sense' are how teams fool themselves.
- 03Always compute the actual test in a stats tool; use this output to decide which test and to catch validity threats.
- 04Mention if you ran multiple variants or metrics so it can warn you about multiple-comparison inflation.
Adapt it for your case
Before launching, ask for the minimum detectable effect and runtime given your baseline rate and traffic.
Add secondary metrics like refunds or load time and ask whether the win came at a hidden cost.
Ask for a three-sentence version with the decision and the one risk to mention in the meeting.
Common questions
Why won't it just tell me if it's significant?
Significance depends on a test you must run on the raw counts; the prompt deliberately stops the model from fabricating a p-value, which would be confidently wrong.
Is two weeks long enough?
Two full weekly cycles cover day-of-week seasonality, which is usually the minimum; longer helps if your buying cycle is long or traffic is low.
What is peeking and why does it matter?
Checking results repeatedly and stopping when they look good inflates false positives; decide your sample size and end date in advance and judge only then.
You may also need
Choose the Right Chart Type for Your Data
Recommends the best and runner-up chart type for your message, variables, and audience with encoding guidance.
Write a Precise, Unambiguous Metric Definition
Produces a precise metric definition with formula, grain, inclusion rules, and resolved edge cases.
Write a Dashboard Spec Before You Build It
Drafts a build-ready dashboard spec with KPIs, layout, filters, drill-downs, and out-of-scope guardrails.
Explain a Statistical Concept to Non-Technical Stakeholders
Translates a statistical concept into a plain-language explanation tied to a real decision for non-technical stakeholders.