Why typing tests break for RTL users
If your site includes Arabic or Hebrew passages, you’re already in bidirectional (bidi) territory. Modern Arabic/Hebrew lines often mix right‑to‑left letters with left‑to‑right digits and Latin fragments, and neutral punctuation shifts its allegiance based on context. That’s exactly what the Unicode Bidirectional Algorithm (UAX #9) is designed to handle—but its correct behavior can look counterintuitive in a typing test UI. (unicode.org)
Two quick realities set the stage:
- Many punctuation symbols are neutral and resolve their direction from surrounding text.
- Digits are laid out left‑to‑right even inside Arabic/Hebrew runs; the overall paragraph direction also matters.
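To make those two points concrete, here is a minimal sketch that prints the Bidi_Class of each character in a short mixed string using Python's standard `unicodedata` module. The helper name `bidi_classes` and the sample string are illustrative assumptions, not part of any library.

```python
import unicodedata

def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Pair each character with its Unicode Bidi_Class (e.g. 'AL', 'EN', 'ON')."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical sample: three Arabic letters, a short number, bracketed punctuation.
sample = "\u0633\u0639\u0631 12/05 (!)"

for ch, cls in bidi_classes(sample):
    # Arabic letters report AL (strong), digits EN (weak),
    # the slash CS (weak separator), brackets and '!' ON (neutral).
    print(f"U+{ord(ch):04X} {cls}")
```

Only the classes L, R, and AL are strong; everything else in the sample takes its direction from context.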
The hidden culprits: invisible marks and mirrored punctuation
Unicode defines invisible “direction marks” that behave like zero‑width letters. The classics are LRM (U+200E) and RLM (U+200F); for Arabic contexts there’s also ALM (U+061C). They don’t render, but they act as strong characters, so they pull adjacent neutrals and digits toward the intended side. The Unicode FAQ literally calls them “invisible letters,” and UAX #9 classifies them as implicit directional marks, distinct from the explicit embedding, override, and isolate controls. In a typing test, accidentally including or omitting one can silently add an “error.” (unicode.org)
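A small sketch of how such a silent mismatch arises, and how to surface it. The function names and the bracketed placeholder format are assumptions made for illustration.

```python
# The three implicit directional marks named above.
DIRECTION_MARKS = {"\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM"}

def find_direction_marks(text: str) -> list[tuple[int, str]]:
    """Return (index, name) for every invisible direction mark in text."""
    return [(i, DIRECTION_MARKS[ch]) for i, ch in enumerate(text) if ch in DIRECTION_MARKS]

def reveal_marks(text: str) -> str:
    """Replace each invisible mark with a visible placeholder such as [RLM]."""
    return "".join(f"[{DIRECTION_MARKS[ch]}]" if ch in DIRECTION_MARKS else ch for ch in text)

target = "\u0633\u0639\u0631 (12)"   # prompt as authored
typed = "\u200f" + target            # same text, but the keyboard prepended an RLM

print(target == typed)               # the strings differ although they render alike
print(find_direction_marks(typed))
print(reveal_marks(typed))
```

A naive string comparison would flag `typed` as wrong even though the user sees exactly the prompt on screen.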
Mirrored punctuation is another source of confusion. Parentheses, brackets, and some math symbols carry a normative Bidi_Mirrored property. Renderers flip their glyph orientation in RTL runs—there usually isn’t a separate “mirrored character” to type. If your test compares the visual glyph a user sees to a hardcoded character image, you’ll miscount errors; you must compare code points, not pictures. ICU’s own notes emphasize that real mirroring is a renderer/glyph‑selection job. (unicode.org)
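As a quick check, Python's `unicodedata.mirrored()` reports the Bidi_Mirrored property directly. The helper name below is an assumption; the point of the sketch is that the property belongs to one code point, which stays the same whichever way the glyph faces on screen.

```python
import unicodedata

def is_mirrored(ch: str) -> bool:
    """True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1

for ch in "()[]<>+/":
    # Brackets and angle signs are mirrored by the renderer; '+' and '/' are not.
    print(f"U+{ord(ch):04X} mirrored={is_mirrored(ch)}")

# The user presses the same key in LTR and RTL text, so the stored code point
# is U+0028 in both cases; only the displayed glyph is flipped.
expected, typed = "(", "\u0028"
print(expected == typed)
```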
Numbers and neutrals inside RTL runs: what actually happens
UAX #9 resolves direction in phases. After setting a paragraph base level, it applies rules W1–W7 for “weak types” (like digits and their adjacent separators) and N0–N2 for neutrals (like most punctuation). A few highlights that routinely bite typing apps:
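- European digits (Bidi class EN) are ordered left‑to‑right. A single separator between two digits of the same kind is absorbed into the number (W4): ‘/’, ‘:’, ‘.’ and ‘,’ are class CS, while ‘+’ and ‘-’ are class ES and only join European digits. Terminators such as ‘%’ or ‘$’ next to European digits join them too (W5); any separators or terminators left over become ordinary neutrals (W6). After Arabic letters, European digits are treated as Arabic numbers (class AN) per W2, so ‘+’ and ‘-’ no longer join them.
- A run of neutrals between strong RTL on both sides resolves to RTL; between strong LTR on both sides, to LTR (N1). For this rule, numbers count as RTL. Any neutrals still unresolved take the embedding direction (N2).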
- Bracket pairs are handled together so that both ends resolve consistently (N0/BD16).
Practical example (simplified): suppose your paragraph is RTL and the target text is “سعر 12/05/2026 (تقديري)”. The date’s digits run LTR, and each slash sits between two digits, so it joins the number per W4; the parentheses mirror visually. If your caret logic assumes “everything goes right‑to‑left,” it will look like the caret jumps “the wrong way” when users type digits or slashes. That’s the bidi rules doing their job. (unicode.org)
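The weak-type step can be sketched in a few lines. This is a deliberately reduced illustration covering only W2 (European digits after an Arabic letter become Arabic numbers) and W4 (a single separator between two numbers joins them); it skips embeddings, isolates, and the remaining W and N rules, so it is not a replacement for a real bidi engine. The function name and the `sos` parameter are assumptions.

```python
import unicodedata

STRONG = {"L", "R", "AL"}

def resolve_w2_w4(text: str, sos: str = "R") -> list[str]:
    """Apply simplified W2 and W4 to the Bidi_Class sequence of text."""
    classes = [unicodedata.bidirectional(ch) for ch in text]

    # W2: an EN whose nearest preceding strong type is AL becomes AN.
    last_strong = sos
    for i, cls in enumerate(classes):
        if cls in STRONG:
            last_strong = cls
        elif cls == "EN" and last_strong == "AL":
            classes[i] = "AN"

    # W4: a single ES between two EN becomes EN; a single CS between
    # two numbers of the same type (EN or AN) becomes that type.
    for i in range(1, len(classes) - 1):
        prev, cur, nxt = classes[i - 1], classes[i], classes[i + 1]
        if cur == "ES" and prev == nxt == "EN":
            classes[i] = "EN"
        elif cur == "CS" and prev == nxt and prev in ("EN", "AN"):
            classes[i] = prev
    return classes

print(resolve_w2_w4("\u0633 12/05"))    # slash (CS) joins the Arabic-context digits
print(resolve_w2_w4("\u05d0 123-45"))   # hyphen (ES) joins digits after a Hebrew letter
print(resolve_w2_w4("\u0633 123-45"))   # hyphen stays ES after an Arabic letter
```

The third case is the one that surprises people: after Arabic letters the hyphen is not absorbed into the number, so the two digit groups are ordered as separate runs.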
Where typing tests go wrong
- Inflated error counts: Comparing literal strings that include hidden LRM/RLM/ALM marks, or comparing glyph appearance rather than code points.
- Caret chaos: Moving a caret visually through an RTL line with LTR digit runs without mapping visual↔logical indices leads to off‑by‑one highlights and wrong “current character” displays.
- Bad prompts: Unbalanced brackets or punctuation‑heavy fragments with no strong characters leave neutrals to resolve unpredictably.
All three are preventable with bidi‑aware design.
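For the "bad prompts" case, a prompt linter can catch the two problems named above before a passage ever reaches users. This is a minimal sketch: the function name, the issue strings, and the choice of ASCII bracket pairs are assumptions.

```python
import unicodedata

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def prompt_issues(prompt: str) -> list[str]:
    """Return a list of bidi-related problems found in a candidate prompt."""
    issues = []

    # Without any strong character (L, R, AL), neutrals and digits have
    # nothing but the paragraph direction to resolve against.
    if not any(unicodedata.bidirectional(ch) in ("L", "R", "AL") for ch in prompt):
        issues.append("no strong characters")

    # Classic stack check for balanced, properly nested brackets.
    stack = []
    balanced = True
    for ch in prompt:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                balanced = False
                break
    if stack:
        balanced = False
    if not balanced:
        issues.append("unbalanced brackets")
    return issues

print(prompt_issues("(\u2026) \u2013 2026/05/12"))
print(prompt_issues("\u0633\u0639\u0631 (12"))
print(prompt_issues("\u0633\u0639\u0631 (12)"))
```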
Bidi‑aware prompts that feel natural
- Prefer “isolation” for inserted spans. When you inject a user name, link, or English snippet into Arabic/Hebrew prompts, wrap it in bidi isolates (LRI/RLI/FSI … PDI) or, in HTML, the semantic <bdi> element; this prevents the snippet from influencing surrounding text order. Directional isolates were introduced in Unicode 6.3 and are the modern alternative to embeddings/overrides. (unicode.org)
- Set an explicit base direction for the prompt. On the web, use dir="rtl"/"ltr" (or dir="auto" on inputs). W3C guidance favors markup (dir, bdi) over CSS hacks like unicode-bidi when authoring HTML. (w3c.github.io)
- De‑ambiguate with strong characters. If a prompt begins with only neutrals (e.g., “(…) – 2026/05/12”), seed the line with an appropriate strong character or an LRM/RLM/ALM to lock the intended behavior. The FAQ explicitly notes these marks as tools for overriding defaults. (unicode.org)
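A sketch of the isolation advice for plain-text and HTML prompts. The code points are the standard isolate controls; the helper names and the template string are hypothetical.

```python
import html

LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(snippet: str, direction: str = "auto") -> str:
    """Wrap snippet in a directional isolate so it cannot reorder its surroundings."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return f"{opener}{snippet}{PDI}"

def isolate_html(snippet: str) -> str:
    """HTML equivalent: escape the snippet and wrap it in a <bdi> element."""
    return f"<bdi>{html.escape(snippet)}</bdi>"

user_name = "alice_92"                       # hypothetical injected value
prompt = "\u0645\u0631\u062d\u0628\u0627 " + isolate(user_name) + "!"
print([f"U+{ord(c):04X}" for c in isolate(user_name)][:1])   # opener is FSI
print(isolate_html("a<b"))
```

With `direction="auto"` the snippet is wrapped in FSI, which picks the direction from the snippet's first strong character. Remember that these controls are part of the string, so strip them before scoring.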
Sanitizers and comparators that don’t punish RTL users
- Strip bidi formatting controls before scoring. Many test‑taker keyboards or copy/paste paths will inject LRM/RLM/ALM. Use your Unicode library to remove them during comparison so an invisible mark isn’t counted as a typo. In ICU, ubidi_writeReordered() supports removing controls through its UBIDI_REMOVE_BIDI_CONTROLS option; you can also filter on the Bidi_Control property. Avoid stripping the whole Format (Cf) general category, which also contains ZWNJ and ZWJ, characters that carry meaning in Persian and Arabic text. (unicode-org.github.io)
- Normalize numbers, don’t re‑order them. Arabic locales may use Arabic‑Indic digits (U+0660–U+0669, Bidi class AN) instead of European digits (class EN). Use locale‑aware number formatting/shaping rather than assuming ASCII digits only. ICU’s ArabicShaping discusses digit‑shaping options so you can display consistently; either kind of digit still runs left‑to‑right within the number, as UAX #9 expects. (unicode-org.github.io)
- Compare by code points, not glyphs. Mirrored punctuation is a rendering effect; the underlying characters are the same. Your diff should treat ‘(’ typed in an RTL run as the same code point, even though it displays like ‘)’. The Unicode spec makes mirroring normative, and ICU reiterates it’s done by glyph selection. (unicode.org)
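Put together, a scoring comparator along these lines might look like the sketch below. The control set is the list of Bidi_Control code points. The positional error count and the optional digit folding are simplifying assumptions, since a real test would likely use an edit-distance diff and its own policy on digit equivalence.

```python
import unicodedata

# Code points with the Bidi_Control property: ALM, LRM, RLM,
# the embedding/override controls, and the isolate controls.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)

def strip_bidi_controls(text: str) -> str:
    """Drop invisible bidi controls; ZWNJ/ZWJ are deliberately kept."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

def fold_digits(text: str) -> str:
    """Map every decimal digit (any script) to its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

def count_errors(target: str, typed: str, fold: bool = False) -> int:
    """Positional code point comparison after sanitizing both strings."""
    a, b = strip_bidi_controls(target), strip_bidi_controls(typed)
    if fold:
        a, b = fold_digits(a), fold_digits(b)
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))

print(count_errors("\u0633 (12)", "\u200f\u0633 (12)"))                 # stray RLM ignored
print(count_errors("\u0633 12", "\u0633 \u0661\u0662", fold=True))      # digits folded
```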
Result viewers and caret behavior that make sense
- Use visual↔logical index maps. When highlighting the “next character” or placing the caret, derive positions from the bidi reordering results—don’t guess. ICU exposes ubidi_getLogicalMap(), ubidi_getVisualMap(), and related APIs precisely for caret/selection mapping. Tie your selections and diffs to logical indices; render highlights at visual indices. (unicode-org.github.io)
- Respect paragraph direction and isolates in the UI. In HTML, wrap dynamic snippets with <bdi> or use dir="auto" for free‑text fields; this prevents a pasted English URL from flipping your Arabic line. W3C’s inline bidi guidance and MDN’s bdi docs cover practical patterns. (w3c.github.io)
- Show a “reveal invisibles” toggle. For debugging and fairness, let advanced users reveal U+200E/LRM, U+200F/RLM, and U+061C/ALM as visible placeholders during review, while still ignoring them for scoring. The FAQ endorses these as invisible direction hints, so surfacing them in review prevents confusion. (unicode.org)
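To illustrate the index maps, the sketch below derives a visual order from resolved embedding levels using the reversal step of UAX #9 (rule L2), then inverts it. It assumes the levels have already been computed for a single line by a bidi engine; the function names and the sample levels are illustrative.

```python
def visual_order(levels: list[int]) -> list[int]:
    """Visual-to-logical map: entry v is the logical index shown at visual slot v."""
    order = list(range(len(levels)))
    if not levels:
        return order
    highest = max(levels)
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    # From the highest level down to the lowest odd level, reverse every
    # contiguous run of characters at that level or higher.
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order

def invert_map(index_map: list[int]) -> list[int]:
    """Turn a visual-to-logical map into logical-to-visual, or vice versa."""
    inverse = [0] * len(index_map)
    for src, dst in enumerate(index_map):
        inverse[dst] = src
    return inverse

# Two RTL letters, a space, then two digits in an RTL paragraph.
levels = [1, 1, 1, 2, 2]
visual_to_logical = visual_order(levels)
logical_to_visual = invert_map(visual_to_logical)

next_logical = 3                            # the user must type the first digit next
print(visual_to_logical)
print(logical_to_visual[next_logical])      # visual slot to highlight
```

Scoring and diffs stay on logical indices; only the final highlight lookup goes through `logical_to_visual`.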
Tiny lab: try these and watch the caret
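- Neutral clutch: Type “(test)” at the start of a field with dir="auto". The first strong character is Latin, so the paragraph resolves LTR. Then add an RLM before the opening parenthesis: the RLM is now the first strong character, so the paragraph flips to RTL and the parentheses resolve with it.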
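- Digit run: In an RTL paragraph, type “123-45”. The hyphen sits between two European digits and joins them (W4), and the caret moves left‑to‑right through the number even though the line is RTL. Type the same thing after Arabic letters and the digits resolve as Arabic numbers (W2), the hyphen no longer joins them, and the two groups display in swapped order.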
- Brackets: Mix Arabic text with “[ab]”. The bracket pair’s direction is resolved together (N0/BD16), keeping the pair visually consistent.
Quick checklist for implementers
- Choose RTL prompts with balanced brackets and an explicit base direction.
- Use isolates (FSI/LRI/RLI … PDI) or <bdi> for injected spans; avoid CSS‑only direction hacks when authoring HTML. (w3c.github.io)
- Strip bidi controls for scoring; don’t penalize invisible marks. (unicode-org.github.io)
- Compare code points, not glyph appearance; let the renderer mirror punctuation. (unicode.org)
- Drive caret, selection, and diffs from ICU bidi maps (logical↔visual). (unicode-org.github.io)
With a small dose of bidi awareness, your Arabic/Hebrew users get a fair test—and you get cleaner data.