Why keystroke biometrics are the next step after CAPTCHAs
If you run a typing test site in 2026, you already know: off‑the‑shelf CAPTCHAs are a speed bump, not a stop sign. Communities regularly spot absurd results (think 400–600 WPM sprints) that are almost certainly scripts or auto‑typers rather than humans. (reddit.com)
Keystroke dynamics—how people time key presses and releases—adds a passive, behind‑the‑scenes signal that’s incredibly hard for bots to fake consistently. Recent large‑scale evaluations show verification systems trained on hundreds of thousands of users can reach equal‑error rates (EER) around 3–4% with a single global threshold, and even sub‑1% per‑user with a few short enrollment samples. (arxiv.org)
What the latest benchmarks say (and why that matters)
The KVC‑onGoing keystroke verification challenge aggregates public Aalto keystroke datasets—tweet‑length, free‑text sequences from 185k+ people, captured on both desktop and mobile keyboards. On the evaluation set, state‑of‑the‑art systems achieved about 3.33% EER on desktop and 3.61% on mobile; at a fixed 1% false‑match rate (FMR), the false‑non‑match rate (FNMR) was ~11.96% desktop and ~17.44% mobile. The study also noted age/gender effects that aren’t negligible, underscoring the need to monitor fairness. (arxiv.org)
Type2Branch, a top performer on this benchmark, reports mean per‑subject EERs as low as 0.77% (desktop) and 1.03% (mobile) with five ~50‑character enrollment samples; with a single global threshold, EERs were 3.25% (desktop) and 3.61% (mobile). That’s the kind of accuracy that can quietly filter out most bots without putting honest users through hoops. (arxiv.org)
Earlier, TypeNet demonstrated that deep models scale to internet‑size populations—100k+ users—with only moderate degradation, using the same Aalto data family (over 136 million keystrokes). (arxiv.org)
A practical, privacy‑first anti‑cheat blueprint
Here’s a layered design you can ship today.
1) Passive timing layer (always on, zero friction)
- Collect only event timings—not the text: hold times (key down to key up) and inter‑key intervals (down‑down and up‑down, i.e., DD/UD), plus simple dispersion metrics (variance, burstiness). Store no raw characters. This is privacy‑by‑design data minimization. (gdpr.org)
- Run a compact verification model (e.g., a distilled TypeNet/Type2Branch‑style embedding with a global threshold) per session to produce a “human‑likeness” score. Maintain separate models or thresholds for desktop vs mobile, because mobile error rates and variability are higher in benchmarks. (arxiv.org)
- Flag patterns that are highly implausible for humans: near‑constant inter‑key spacing, repeated sub‑10 ms hold times, or whole‑word bursts (often from swipe/IME input or scripts). Community reports show such signatures trigger anti‑cheat on popular sites. (reddit.com)
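As a rough sketch, the passive features and red‑flag heuristics above might look like the following — the 10 ms hold floor and 8 ms spacing‑deviation floor are illustrative assumptions, not benchmark‑derived values:

```python
from statistics import mean, pstdev

def timing_features(events):
    """Per-session timing features from (key_down_ms, key_up_ms) pairs,
    sorted by key-down time. Only timings are used; no characters stored."""
    holds = [up - down for down, up in events]                 # hold times
    dd = [b[0] - a[0] for a, b in zip(events, events[1:])]     # down-down intervals
    return {
        "mean_hold": mean(holds),
        "min_hold": min(holds),
        "mean_dd": mean(dd) if dd else 0.0,
        "std_dd": pstdev(dd) if dd else 0.0,
    }

def implausibility_flags(feats, min_hold_ms=10.0, min_dd_std_ms=8.0):
    """Heuristic red flags: near-constant spacing or impossibly short holds.
    Thresholds are illustrative and should be tuned on real traffic."""
    return {
        "constant_spacing": feats["std_dd"] < min_dd_std_ms,
        "subhuman_holds": feats["min_hold"] < min_hold_ms,
    }
```

A scripted burst with perfectly even 80 ms spacing and 5 ms holds trips both flags, while ordinary human jitter trips neither.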
2) Lightweight human check (only on anomalies)
- If the passive layer flags risk, ask for a 10–20 second micro‑verification: a short phrase with punctuation and numbers, randomized capitalization, and copy/paste blocked. Score WPM, accuracy, and the new timing sample—users who pass go straight back to racing.
- Keep checks device‑aware. On mobile, offer an option to temporarily disable swipe input (since swipe inserts words in one chunk) or route swipe entries to a separate mobile‑only leaderboard. (reddit.com)
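A minimal scorer for that micro‑check might look like this — `max_wpm=250` and `min_accuracy=0.95` are illustrative knobs, not recommendations:

```python
def score_micro_check(prompt, typed, elapsed_s, max_wpm=250.0, min_accuracy=0.95):
    """Score a short verification phrase: character accuracy plus a sanity
    cap on speed (standard 5-chars-per-word WPM). Illustrative thresholds."""
    correct = sum(1 for a, b in zip(prompt, typed) if a == b)
    accuracy = correct / max(len(prompt), 1)
    wpm = (len(typed) / 5) / (elapsed_s / 60) if elapsed_s > 0 else float("inf")
    return {"accuracy": accuracy, "wpm": wpm,
            "passed": accuracy >= min_accuracy and wpm <= max_wpm}
```

The timing sample gathered during the check would additionally feed the passive verification model.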
3) Anti‑spoofing (liveness for behavior)
- Synthetic timing forgeries are real: researchers showed that mimicry generated from screen‑recorded typing can evade naïve systems up to 64% of the time. Train a small classifier to distinguish natural human “micro‑jitter” and burst shapes from synthetic sequences, using public liveness/synthesis datasets and tools (e.g., KSDSLD). (journalofbigdata.springeropen.com)
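In place of a trained classifier, a rule‑based stand‑in illustrates the kinds of features such a liveness model might use; the coefficient‑of‑variation and autocorrelation cutoffs below are assumptions for the sketch:

```python
from statistics import mean, pstdev

def jitter_features(intervals):
    """Dispersion and serial correlation of inter-key intervals: synthetic
    replays tend to be too smooth (low cv) or too regular (high |autocorr|)."""
    m, s = mean(intervals), pstdev(intervals)
    cv = s / m if m else 0.0
    if s == 0:
        lag1 = 1.0                              # perfectly regular sequence
    else:
        lag1 = mean((a - m) * (b - m)
                    for a, b in zip(intervals, intervals[1:])) / (s * s)
    return {"cv": cv, "lag1_autocorr": lag1}

def looks_synthetic(feats, min_cv=0.05, max_abs_autocorr=0.9):
    """Illustrative decision rule; a production system would train on labeled
    human vs synthetic sequences (e.g., generated with KSDSLD-style tools)."""
    return feats["cv"] < min_cv or abs(feats["lag1_autocorr"]) > max_abs_autocorr
```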
4) Privacy‑by‑design from day one
- Minimize: log only timing vectors or anonymized histograms; never store the text. Pseudonymize user IDs; set tight retention windows (e.g., 30–90 days). Communicate this clearly. (gdpr.org)
- Consider training via federated learning with differential privacy to avoid centralizing raw behavioral data. Recent work shows FL variants with domain adaptation or heterogeneous DP can preserve accuracy while protecting users. (sciencedirect.com)
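For the aggregate side of this, a differentially private histogram release is easy to sketch; the Laplace mechanism below (sensitivity 1 for counting queries, `epsilon=1.0` as an illustrative budget) is one standard choice:

```python
import math
import random

def dp_histogram(values, bin_edges, epsilon=1.0, seed=None):
    """Release a timing histogram under epsilon-differential privacy using the
    Laplace mechanism. Clamping noisy counts at zero is a presentation choice."""
    rng = random.Random(seed)
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    scale = 1.0 / epsilon                       # b = sensitivity / epsilon

    def laplace():
        u = rng.random() - 0.5                  # inverse-CDF sampling
        return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

    return [max(0.0, c + laplace()) for c in counts]
```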
Desktop vs mobile: tune for the device
- Event quality differs. Desktop browsers deliver consistent keydown/keyup events; mobile IMEs may buffer or emit “whole word” updates, which look bot‑like unless you detect and handle them. Community experiences confirm swipe/glide inputs often violate per‑keystroke assumptions. (reddit.com)
- Benchmarks back this up: at 1% FMR, mobile FNMR is notably higher than desktop (≈17.44% vs 11.96% in KVC). Use per‑device thresholds and avoid penalizing mobile users for OS/IME quirks. (arxiv.org)
- Consider separate leaderboards or caps for certain mobile modes, and make the rules explicit.
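Detecting whole‑word commits so they are routed rather than punished can be as simple as comparing text length across input events; `max_chars_per_event` and `min_ms_per_char` below are illustrative thresholds:

```python
def classify_input_event(prev_len, new_len, dt_ms,
                         max_chars_per_event=2, min_ms_per_char=15):
    """Distinguish per-keystroke input from whole-word commits (swipe/IME)
    and from paste/scripted bulk inserts, using only lengths and timing."""
    added = new_len - prev_len
    if added <= 0:
        return "deletion_or_noop"
    if added <= max_chars_per_event:
        return "keystroke"
    # Many characters in one event: plausible typing time suggests an IME
    # word commit; otherwise it looks like paste or a scripted bulk insert.
    return "ime_commit" if dt_ms >= added * min_ms_per_char else "bulk_insert"
```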
Thresholds that feel fair (and how to set them)
Start with ROC/DET curves from your validation set and pick a conservative global threshold that targets very low false positives (FMR). KVC’s public numbers at FMR=1% give a realistic starting point for internet‑scale traffic; in production, you might go tighter (e.g., 0.5%) and accept a higher FNMR knowing that flagged users get a fast human check before any penalty. (arxiv.org)
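Picking that operating point from impostor and genuine score distributions is straightforward; the sketch below assumes higher scores mean "more similar" and ignores heavy score ties:

```python
def threshold_at_fmr(impostor_scores, target_fmr=0.01):
    """Accept threshold such that at most target_fmr of impostor comparisons
    would be falsely matched. One point on the DET curve, not the whole curve."""
    scores = sorted(impostor_scores, reverse=True)
    allowed = int(len(scores) * target_fmr)     # tolerated impostor accepts
    if allowed == 0:
        return scores[0] + 1e-9                 # stricter than the best impostor
    return scores[allowed - 1]

def fnmr_at(genuine_scores, threshold):
    """Fraction of genuine comparisons rejected at this threshold."""
    return sum(1 for s in genuine_scores if s < threshold) / len(genuine_scores)
```

Reporting `fnmr_at` alongside the chosen threshold tells you how often honest users would see the quick human check.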
For fairness, segment performance by device, language, and demographics where available. NIST’s digital identity guidance (draft SP 800‑63‑4) emphasizes measuring FMR across demographic groups and using a fixed threshold; it also urges PAD/liveness in biometric systems—principles that map well to keystroke verification used for anti‑cheat. Treat these as design guardrails, not legal mandates. (pages.nist.gov)
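Segmenting the same fixed‑threshold measurement by group is a small step on top of labeled impostor trials:

```python
from collections import defaultdict

def fmr_by_group(impostor_trials, threshold):
    """False-match rate per group (device, language, demographic) at one
    fixed threshold; each trial is a (group_label, score) pair."""
    totals, matches = defaultdict(int), defaultdict(int)
    for group, score in impostor_trials:
        totals[group] += 1
        if score >= threshold:
            matches[group] += 1
    return {g: matches[g] / totals[g] for g in totals}
```

Large gaps between groups at the same threshold are the fairness signal to monitor.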
Communicating fairness (and staying within TOS)
- Publish a short “Fair Play & Privacy” page: what you collect (timings only), how long you keep it, and how the flag‑and‑verify ladder works. Add an appeals process for disputed flags.
- Make your rules consistent with your Terms of Service—explicitly ban auto‑typers, macros, and firmware that sends impossible timings—just as established sites do. (data.typeracer.com)
- Share site‑wide stats periodically (e.g., the average is ~42 WPM on a popular site with 14M+ tests) to set expectations and explain why a 600 WPM “run” triggers checks. (10fastfingers.com)
Implementation checklist you can copy
- Engineering
- Instrument per‑event timings (hold, DD/UD) client‑side; hash session IDs; never log raw text.
- Ship a small embedding model and global threshold; maintain device‑specific calibrations.
- Add a liveness classifier trained with public synthetic/human datasets; block paste and detect bulk insert.
- Policy
- Separate mobile/desktop leaderboards or policy for swipe/IME.
- Document the two‑step flow: passive scoring → quick human check → moderator review (rare).
- Data retention, opt‑out, and privacy‑by‑default controls aligned to GDPR Article 25 principles. (gdpr.org)
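The pseudonymization item can be sketched as a rotating keyed hash — the helper name, the weekly rotation period, and the 16‑character truncation are all illustrative choices, not a compliance recipe:

```python
import hashlib
import hmac

def pseudonymous_id(user_id, server_secret, period):
    """Rotating pseudonymous ID for timing logs: HMAC of the user ID keyed by
    a server-side secret plus a rotation period (e.g., an ISO week), so logs
    join within a window but not across retention periods."""
    msg = f"{user_id}:{period}".encode()
    return hmac.new(server_secret, msg, hashlib.sha256).hexdigest()[:16]
```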
The bottom line
You don’t need to replace CAPTCHAs everywhere—just stop leaning on them as your only defense. A privacy‑first keystroke layer, tuned with modern benchmarks and paired with fast, respectful human checks, keeps leaderboards competitive and trustworthy without slowing honest typists down. (arxiv.org)