Data Talent Hiring,
Rebuilt for the AI era.

Watch candidates solve real business cases in a notebook with an AI assistant.
See every action and every prompt — learn how they think, and how they work with AI.

01 — The problem

Your interview process
wasn't built for AI.

Interview before AI
The old data interview is broken
  • Live SQL coding test
    Pasted into ChatGPT in a second tab.
  • Python & algorithm screen
    LeetCode is solved instantly. The job isn't for-loops anymore.
  • Verbal / whiteboard case
    Rehearsed nightly with AI. You hear a script, not reasoning.
  • Resume & experience dig
    Every bullet was rewritten by an LLM.
Interview after AI
Test the actual job
  • Frame the right question
    Open-ended brief, real dataset. Can they scope it and push back?
  • Focus on judgment & domain expertise
    Catch the leaky join, the wrong baseline — or ship it?
  • Collaborate with the AI assistant
    Same notebook and copilots they'll use on day one.
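
One reading of a "leaky join" is a join that silently duplicates rows and inflates a total. The sketch below uses two small hypothetical tables (`orders` and `customers`, neither taken from a real case) to show the problem and one way to catch it with the `validate` argument of pandas `merge`. It illustrates the kind of check a candidate might run, and is not a description of how the platform scores anything.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical tables: three orders, and a customer lookup with a duplicated key.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "revenue": [100.0, 50.0, 70.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 10, 20],  # customer 10 appears twice
    "region": ["west", "west", "east"],
})

true_revenue = orders["revenue"].sum()

# A naive join fans out: every order for customer 10 is matched twice.
naive = orders.merge(customers, on="customer_id", how="left")
inflated_revenue = naive["revenue"].sum()

# Asking pandas to validate the join shape turns the silent error into a loud one.
try:
    orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
    join_is_safe = True
except MergeError:
    join_is_safe = False

print(f"rows before join: {len(orders)}, after: {len(naive)}")
print(f"revenue before join: {true_revenue}, after: {inflated_revenue}")
print(f"join passes many_to_one check: {join_is_safe}")
```

Comparing row counts before and after the join is the cheapest version of this check; `validate` makes the expectation explicit in the code itself.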
02 — How it works

The same notebook
they'll use on the job.

We capture every action, every line of code, every prompt.

REC · 00:00
Python 3.11 · idle 1 cells

Getting Started

Your data is in a database. Use SQL or Python to explore it:

SQL: SELECT * FROM case_data LIMIT 10

Python: df = pd.read_csv('case_data.csv')

See data_dictionary.md for column descriptions.

03 — What we measure

We measure what other
interviews can't see.

What they bring beyond AI

Judgment, framing, and taste.

How they frame an ambiguous problem, what tradeoffs they prioritize, and whether they can tell when an answer is actually useful — not just technically correct.

How they work with AI

Verification, pushback, restraint.

Did they verify the AI's output? Did they catch its mistakes? Did they over-trust it? We watch the full session — every prompt, every accepted suggestion, every override.

A complete evaluation report.

Research-backed evaluation rubrics. Every score traces back to the exact evidence.

Sample Report Template

Review: Sample Candidate

Sample interview · Assessment submitted Apr 30, 2026, 6:08 PM

Duration: 25 min
AI prompts: 1
Status: Submitted
Overall fit score: 7.8 / 10 (Exceptional)
Summary
Role fit: Exceptional

Strong fit for the analyst role's core ask — the candidate framed the business question before reaching for tooling and used AI as a focused executor. Worth probing in interview: recommendations stayed at the diagnostic level and didn't push into the operational cutoffs this role owns.

Final Deliverable: Above bar

Reaches a defensible, evidence-backed recommendation but stops short of an operational cutoff.

AI Collaboration: Exceptional

Delegated one well-scoped task to AI and read the output critically instead of paraphrasing it.

Role-level Expertise: Above bar

Candidate-authored work shows solid domain framing; independent statistical depth is good, not standout.

Your decision: Good Fit · Hold · No Fit
Dimensions & evidence
Final Deliverable
Above bar

The notebook reaches a defensible recommendation — a basic lift check kills the initial hypothesis and a controlled model isolates the real driver. The work stops at the diagnostic level; an operational cutoff is left for a follow-up.

Evidence · 3
  • + Recommendation is tied to specific model output: it names the significant variable and its direction, not a vague summary (Section 2 Q1)
  • + Killed the original hypothesis with a basic lift check before modeling, showing disciplined sequencing (Cell #1)
  • No operational threshold proposed: the recommendation stays diagnostic and doesn't reach the cutoff this role owns (Section 2 Q1)
AI Collaboration
Exceptional

Used AI as a precise executor: one well-scoped prompt with named variables and the exact output wanted, then read the result table critically rather than paraphrasing it.

Evidence · 3
  • + A single prompt specified the model, the four named variables, and the output columns wanted, so no back-and-forth was needed (msg #1)
  • + Cited specific coefficient values from the AI-generated cell as the basis for the conclusion (Section 2 Q1)
  • + Cross-checked the AI output against an earlier hand-written cell before trusting it, giving convergent evidence (Cell #3)
Role-level Expertise
Above bar

Candidate-authored work shows solid domain framing — quiz answers name industry factors not visible in the data, and the written findings reflect stakeholder constraints. Independent statistical reasoning is good, not standout.

Evidence · 3
  • + Quiz Q1 invokes domain-specific risk factors the dataset alone wouldn't surface (Quiz Q1)
  • + Section 2 frames the finding around operational load and downstream impact, a stakeholder-aware read (Section 2 Q1)
  • Did not independently interrogate a borderline non-significant control; accepted the model spec as generated (Cell #4)
Action Timeline
Every AI prompt, cell run, and edit — in order.
00:00
Session started
Candidate opened the IDE and read the prompt.
— · 0/30 min
02:14
Cell #1 run · explored data shape + lift
~3,000 rows × 7 cols. First-pass lift of the focal segment: ~1.04x — barely meaningful.
Cell #1
04:02
Cell #2 run · segmented metric by quartile
Looked at how the target rate varied across four buckets of the candidate variable.
Cell #2
04:47
Prompted AI · msg #1
Hypothesis-driven request for a regression with explicit controls.
"Fit a model of the target on the candidate variable plus three controls. Show coefficients with 95% CIs and p-values…"
msg #1 · 1 / ∞
05:18
Cell #3 added · AI-generated cell
Candidate kept the AI-suggested cell after reviewing the diff.
[AI-generated]
06:11
Cell #4 run · reviewed model output
Read the coefficient table; flagged the dominant variable and dismissed two non-significant ones.
Cell #4
06:42
Moved to Section 2 · Findings
Investigate phase complete. Notebook locked for reference.
phase 2 → 3
07:53
Submitted answer
~1,500-character recommendation with confidence note and proposed next-step validation.
1 prompt used

Case

[Case title goes here]

A short business framing for the case appears here — who the team is, what they're trying to decide, and the constraints they're operating under. Two or three sentences set the stage without prescribing the answer.

Pre-assessment quiz · 3 framing questions
Quiz Q1

[Framing question 1 goes here — sets up the business hypothesis the candidate needs to push back on.]

[Sample candidate response — typically 2–4 sentences naming the candidate's mental model, the variables they'd reach for, and the assumption they want to test. Long-form free text, no character cap.]

~600 chars
Quiz Q2

[Framing question 2 goes here — asks the candidate where they'd start the analysis and why.]

[Sample candidate response — describes the first cut of the data they'd run, what they'd compare it against, and which secondary check would either confirm or kill the hypothesis.]

~570 chars
Quiz Q3

[Framing question 3 goes here — drops a partial statistic on the candidate and asks what they'd interpret and check next.]

[Sample candidate response — names the missing baseline, describes the lift calculation they'd want to do, and adds a sanity-check on a likely confounder.]

~590 chars
Notebook · final state
Markdown Cell 1

Getting Started

Your data is in a database. Use SQL or Python to explore it:

SQL: SELECT * FROM case_data LIMIT 10
Python: df = pd.read_csv('case_data.csv')

See data_dictionary.md for column descriptions.
Python In [1]:
import pandas as pd
df = pd.read_csv('case_data.csv')

# Headline numbers
outcome_rate = df['target'].mean()
segment_share = (df['segment_flag'] == 1).mean()
print(f'overall outcome rate: {outcome_rate:.4f}')
print(f'segment share overall: {segment_share:.4f}')

# Lift: P(segment | event) / P(segment)
events = df[df['target'] == 1]
seg_among = (events['segment_flag'] == 1).mean()
print(f'lift: {seg_among / segment_share:.3f}x')
overall outcome rate: 0.0xxx
segment share overall: 0.xxxx
lift: 1.0xx   (>1 means over-represented)
Python In [2]:
# Outcome rate stratified by quartile of candidate variable
df['var_q'] = pd.qcut(df['candidate_var'], 4,
                     labels=['Q1 low','Q2','Q3','Q4 high'])
print(df.groupby('var_q', observed=True)['target'].mean().round(4))
var_q
Q1 low     0.0xxx
Q2         0.0xxx
Q3         0.0xxx
Q4 high    0.xxxx
Name: target, dtype: float64
Python In [3]: [AI-GENERATED]
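import numpy as np
import statsmodels.api as sm

features = ['segment_flag', 'candidate_var', 'control_a',
            'control_b']
X = sm.add_constant(df[features])
y = df['target']
model = sm.Logit(y, X).fit(disp=False)
# Odds-ratio table: exponentiate coefficients and 95% CI bounds
ci = np.exp(model.conf_int())
print(pd.DataFrame({'coef': model.params, 'OR': np.exp(model.params),
                    'ci_low': ci[0], 'ci_high': ci[1],
                    'p_value': model.pvalues}).round(4))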
                  coef     OR      ci_low   ci_high   p_value
const            -x.xxxx  0.0xx   0.0xx    0.0xx     0.000
segment_flag      0.0xxx  1.0xx   0.8xx    1.4xx     0.6xx
candidate_var     x.xxxx  xx.xx   xx.xx    xxx.xx    0.000
control_a        -0.0000  1.000   1.000    1.000     0.3xx
control_b         0.0000  1.000   1.000    1.000     0.2xx
Section 2 · Share your findings
Q1

[Final-answer prompt goes here — asks the candidate to summarize findings and make a recommendation to a named stakeholder.]

[Sample candidate final answer: leads with the recommendation, then 2–3 supporting bullets, then a confidence note and a proposed validation step.]

What I found:
  • First supporting bullet: the original hypothesis didn't survive the basic lift check.
  • Second bullet: the real signal sits on a different variable, with a clean monotonic gradient across quartiles.
  • Third bullet: a multivariate model confirms the direction. The focal variable is non-significant once controls are added; the alternative is strongly significant.

Recommendation: ship the alternative routing rule; hold off on the original.

Confidence: high on direction, medium on cutoff. Next step is a holdout validation on a recent vintage.

~1,500 chars
AI Chat Timeline
Every prompt the candidate sent and the AI's response, in order.
04:47 · msg #1
You
[Sample user prompt — candidate names the hypothesis they're testing, the early signal they've already seen, and asks the assistant for a specific multivariate model with named variables. They specify what output they want — coefficient table, p-values, confidence intervals — and what they'll be looking for to confirm or reject the hypothesis.]
AI Assistant
[Sample assistant reply — confirms the cell was added, names the columns of the output table the candidate should expect, and points out what pattern in the output would support or contradict the candidate's hypothesis. Stays neutral; doesn't volunteer a conclusion.]
Resume on file — generated 5 resume-based questions and 5 case-based questions to probe in the live interview.
Resume-based questions · 5 questions
Claim verification

[Resume-claim verification question — references a specific result the candidate cited and asks them to walk through how they validated it.]

Why: Resume cites a headline outcome — verify the analytical chops behind it.

Strong answer looks like: Specific lift number, holdout-validated, names the counterfactual (control group, propensity match, or A/B).

Resume · prior role
Process probe

[Process-probe question — references a project on the resume and asks how the candidate handled the operational constraints around it.]

Why: Candidate showed operational sensitivity in Section 2; resume implies prior experience with this kind of constraint.

Strong answer looks like: Names a specific capacity constraint, throughput target, or SLA. Bonus: how they tuned the policy threshold to honor it.

Resume · prior project
Case-based questions · 5 questions
Probe critical eval

[Critical-eval question — surfaces a specific output the candidate accepted at face value and asks them to interpret it more carefully.]

Why: Candidate accepted a headline statistic without flagging a unit or scaling caveat. Critical-eval gap.

Strong answer looks like: Names the unit on the variable, restates the result on an operationally readable scale, and notes the caveat for stakeholders.

Cell #4 at min 6
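
To make "restates the result on an operationally readable scale" concrete: a logit coefficient that rounds to zero per single unit can still matter per realistic step. The numbers below are hypothetical (a control measured in single dollars with an assumed coefficient) and are not taken from the sample report; the sketch only shows the arithmetic of rescaling an odds ratio.

```python
import math

# Hypothetical values: a logit coefficient for a control measured in single dollars.
coef_per_dollar = -0.00002
readable_step = 10_000  # restate the effect per $10,000 instead of per $1

# An odds ratio for a k-unit change is exp(k * coefficient).
or_per_dollar = math.exp(coef_per_dollar)
or_per_step = math.exp(coef_per_dollar * readable_step)

print(f"odds ratio per $1:      {or_per_dollar:.3f}")
print(f"odds ratio per $10,000: {or_per_step:.3f}")
```

Under these assumed values the per-dollar odds ratio prints as 1.000, while the per-$10,000 odds ratio is about 0.82. The model is identical in both cases, but a stakeholder would read the two numbers very differently.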
Depth check

[Depth-check question — acknowledges the candidate stopped at a coarse cut and asks how they'd land a specific operational threshold with more time.]

Why: Strong recommendation but no operational threshold — probe whether they know how to land it.

Strong answer looks like: Names a method (precision-recall curve at varying cutoffs, target-risk inversion, capacity-constrained selection).

Section 2 Q1
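
As a minimal sketch of capacity-constrained selection, one of the methods named above: if a team can only act on a fixed number of cases, the cutoff can be read off the ranked scores, and precision at that cutoff shows what the team would get. The scores, labels, capacity, and both helper names are hypothetical illustrations.

```python
def cutoff_for_capacity(scores, capacity):
    """Return the lowest score that still fits within `capacity` cases."""
    if capacity <= 0 or not scores:
        raise ValueError("need a positive capacity and at least one score")
    ranked = sorted(scores, reverse=True)
    return ranked[min(capacity, len(ranked)) - 1]


def precision_at_cutoff(scores, labels, cutoff):
    """Share of true events among cases scored at or above the cutoff."""
    flagged = [y for s, y in zip(scores, labels) if s >= cutoff]
    return sum(flagged) / len(flagged)


# Hypothetical model scores and observed outcomes (1 = event).
scores = [0.91, 0.85, 0.72, 0.64, 0.40, 0.33, 0.21, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
capacity = 4  # the team can review four cases

cutoff = cutoff_for_capacity(scores, capacity)
precision = precision_at_cutoff(scores, labels, cutoff)

print(f"cutoff: {cutoff}, precision at cutoff: {precision:.2f}")
```

Tied scores at the cutoff would flag more cases than the capacity allows; a fuller version would need a tie-breaking rule.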
Hold-out concern

[Hold-out-concern question — surfaces a model-quality metric the candidate didn't flag and asks whether it would change their ship/no-ship call.]

Why: Did not flag a low model-fit metric as a concern. Probe whether they know how to weigh "directionally right but low explanatory power" findings.

Strong answer looks like: Discusses the trade-off between sign + significance (which we have) and absolute predictive power (which we don't); proposes adding interaction terms or non-linear features.

Cell #4 at min 6
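
The gap between "sign + significance" and "absolute predictive power" can be shown with plain arithmetic. The sketch below uses hypothetical counts for a binary flag, and picks AUC as the predictive-power metric, which is an assumption since the report does not say which fit metric was low. The effect is clearly distinguishable from noise, yet the flag barely separates events from non-events.

```python
import math

# Hypothetical counts: a binary flag with a real but small effect on the outcome.
n_flag, events_flag = 100_000, 6_000  # 6.0% outcome rate when flag = 1
n_base, events_base = 100_000, 5_000  # 5.0% outcome rate when flag = 0

rate_flag = events_flag / n_flag
rate_base = events_base / n_base

# Two-proportion z-test: is the difference distinguishable from noise?
pooled = (events_flag + events_base) / (n_flag + n_base)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_flag + 1 / n_base))
z_score = (rate_flag - rate_base) / std_err

# AUC of the flag used alone as a classifier. For a binary predictor this
# reduces to (1 + TPR - FPR) / 2.
tpr = events_flag / (events_flag + events_base)
non_events_flag = n_flag - events_flag
non_events_base = n_base - events_base
fpr = non_events_flag / (non_events_flag + non_events_base)
auc = (1 + tpr - fpr) / 2

print(f"z-score: {z_score:.1f}")
print(f"AUC:     {auc:.3f}")
```

With these counts the z-score is close to 10 while the AUC stays near 0.52, which is the "directionally right but low explanatory power" pattern the question probes.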
04 — Customized cases

Customized cases,
just for you.

The same business problems your team faces — modeled in synthetic data engineered for real analytical depth.

[Chart: approaches plotted by Judgment & Framing (vertical axis, starting at "Less") against Without AI vs. AI-Native (horizontal axis)]
Talk-based

Verbal case interviews

Talk through a problem. No hands on the data, no AI.

Subjective · No Data · No AI
Best For AI-Native Data Hiring
Customized case + AI notebook

Not just talk. Not another coding test. A customized data project.

Measures judgment & AI collaboration
Customized Case · AI Collaborator · Evidence Based
Algorithmic

Coding tests

LeetCode-style algorithm tasks without AI.

Code-first · No Judgment · No AI
Code first

Coding Tests + AI

AI is allowed, but the goal is still to get the code right.

AI-allowed · Output focused · Narrow

Skills tested

Metric definition · A/B testing · Segmentation · Causal inference · Forecasting · Cohort analysis · Funnel diagnostics · Attribution · Experiment design · and more

Industries

AdTech & Marketing · E-commerce · Finance · Healthcare · Marketplace · SaaS & B2B · Consumer apps · Fintech · Media & streaming · and more
05 — Compare

The only platform built for
how analysts actually work.

Case content
  • HackerRank: SQL + Python algorithm tasks
  • CodeSignal: Standardized DS task batteries
  • CoderPad: Interviewer-brought notebooks or take-homes
  • LitMetrics: Real-world DS cases with messy data and business framing
AI policy
  • HackerRank: Banned or flagged
  • CodeSignal: Limited, discouraged
  • CoderPad: Up to the interviewer
  • LitMetrics: AI required, with full notebook + assistant, same as the actual job
What's measured
  • HackerRank: Code correctness + speed
  • CodeSignal: Benchmarked task performance
  • CoderPad: Code quality + communication (interviewer-scored)
  • LitMetrics: 3-axis rubric (Final Deliverable, AI Collaboration, Domain Expertise)
Workflow realism
  • HackerRank: Single coding window
  • CodeSignal: Guided single-task workspace
  • CoderPad: Live notebook + chat
  • LitMetrics: Framing quiz → AI-native IDE → written findings (full loop)
Report evidence
  • HackerRank: Score + a few snippets
  • CodeSignal: Rubric score, percentile
  • CoderPad: Interviewer notes
  • LitMetrics: Every scoring result cites a specific cell, prompt, or quiz answer
Hiring-fit read
  • HackerRank: Generic percentile
  • CodeSignal: Standardized benchmark
  • CoderPad: Interviewer judgment
  • LitMetrics: Summary anchored against your JD + hiring priorities
Interview prep
  • HackerRank: None
  • CodeSignal: None
  • CoderPad: Live-session notes
  • LitMetrics: 5 case + 5 resume follow-up questions, evidence-tagged

The data job has been quietly rebuilt around AI. Writing code isn't the work anymore — it's framing the right question, judging what the model gives back, and knowing when to push back. That's not on a résumé. You only see it in the work.
Jules Malin · Co-founder & CEO, LitMetrics
Ex-Director, Data Science & ML/AI, GoPro
Adjunct Professor, University of San Diego

Try it free on a real hire. Apply for early access.