Benchmarks — What We Measure, How, and What “Good” Means

Status: spec only; harness pending. This doc defines the perf bar weaver commits to clearing on day one and the methodology for measuring it across competitor editors. The numbers themselves are not in this file — they belong in CI artifacts and the /perf Playground route. This is what gets measured, how, and what counts as passing.

1. Why benchmarks now (pre-implementation)

Committing to numbers in a pre-implementation spec sounds premature. It isn't; it is exactly the right time, for two reasons:

  1. Perf is a design input, not a tuning step. Choosing Loro over Y.js, the imperative DOM surface, the SQLite read-model — each is a perf bet. Stating the bar now constrains the design to deliver on it. Discovering at v0.9 that we missed the bar by 4× is too late.
  2. A perf bar in spec form is a contract reviewers can hold us to. “It feels fast” is not reviewable; “p95 input-to-paint < 16 ms at 10 k blocks” is.

2. Reference documents

Every benchmark runs against a fixed corpus. The corpus is committed to the repo under benchmarks/fixtures/ (when the harness lands) and versioned. Each fixture is described here by shape and target conditions; the actual content is reproducible from a documented generator.

| Fixture | Shape | Why |
| --- | --- | --- |
| tiny-50 | 50 blocks, ~5 kB serialized | Cold-start floor. |
| medium-2k | 2 000 blocks, mixed kinds (paragraphs, headings, lists, callouts, one code block, one table, one image), nesting depth ≤ 3, ~120 kB serialized | Realistic working doc; the modal expected size. |
| large-10k | 10 000 blocks, similar mix, depth ≤ 4, ~600 kB serialized | "Long doc" stress test. |
| huge-100k | 100 000 blocks, mostly paragraphs, depth ≤ 4, ~6 MB serialized | Out-past-the-bar test; we are not expected to be interactive here, only to load and scroll. |
| op-replay-100k | 100 000 CRDT ops applied sequentially (insert / delete / mark) to an empty doc, ending in a medium-2k-shaped result | CRDT apply path; this is the Loro vs. Y.js ceiling test. |
| concurrent-edits-2p | Two peers each issuing 10 000 ops over a shared doc, ~50% overlapping ranges | Merge cost and concurrent rebase. |
| agent-stream-5k | One agent peer streams 5 000 character inserts (~50 chars/sec simulated; LLM-rate) while a human peer types in parallel | Agent-collab realistic workload. |
| marks-dense | 1 000 blocks with an average of 5 overlapping marks per text-bearing block | Mark serialization + rendering. |

Each fixture has a deterministic generator (benchmarks/fixtures/generate.ts); the binary fixture files are byte-stable. The generator is in scope of the benchmarks harness, not the editor.
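To make "byte-stable" concrete, here is a minimal sketch of the generator idea (illustrative only; the real benchmarks/fixtures/generate.ts is pending, and all names here are ours): every random choice flows through a seeded PRNG, so the same seed always produces identical bytes.

```typescript
// Sketch of a seed-deterministic fixture generator (hypothetical, not the
// real benchmarks/fixtures/generate.ts). mulberry32 is a tiny seeded PRNG;
// routing every random choice through it makes output byte-stable per seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type Block = { id: string; kind: "paragraph" | "heading"; text: string };

function generateFixture(seed: number, blockCount: number): string {
  const rand = mulberry32(seed);
  const blocks: Block[] = [];
  for (let i = 0; i < blockCount; i++) {
    blocks.push({
      id: `b${i}`,
      kind: rand() < 0.1 ? "heading" : "paragraph",
      // A fixed alphabet keeps the bytes reproducible across platforms.
      text: Array.from({ length: 1 + Math.floor(rand() * 40) }, () =>
        "abcdefghij"[Math.floor(rand() * 10)],
      ).join(""),
    });
  }
  // JSON key order is insertion order, so serialization is byte-stable too.
  return JSON.stringify(blocks);
}
```

Running the generator twice with the same seed must yield identical files; CI can assert exactly that.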

3. Metrics

Each metric is defined by (a) what is measured, (b) where it is measured from, and (c) the pass threshold for v1.

3.1 Init / load

Cold-cache and warm-cache numbers are stated separately because their dominant costs differ. The cold-cache path includes WASM instantiation (Loro + wa-sqlite, ~1.5–3 MB) plus first paint; the warm-cache path assumes both WASM modules are in the HTTP cache and the JIT is primed.

| Metric | Definition | Fixture | Pass bar (v1) |
| --- | --- | --- | --- |
| Cold init (cold cache) | Time from <WeaverEditor> mount to first paint of all visible blocks; HTTP cache empty, fresh Chromium context | medium-2k | ≤ 500 ms p50, ≤ 800 ms p95 |
| Cold init (warm cache) | Same, with WASM modules already in disk cache | medium-2k | ≤ 250 ms p50, ≤ 400 ms p95 |
| Snapshot import | Time from the Loro.import(bytes) call to return | large-10k | ≤ 80 ms p50 |
| Snapshot import (huge) | Same | huge-100k | ≤ 500 ms p50 |
| Snapshot → first paint (huge) | Time from the Loro.import(bytes) call to first paint of the first visible block (assumes virtualization on; see §4.6) | huge-100k | ≤ 1 500 ms p50 |
| Op replay | Time to apply N ops sequentially from an empty doc; see §4.2 "Y.js baseline definition" for the comparison-baseline contract | op-replay-100k | ≤ 1 200 ms p50; ≥ 3× faster than the pinned Y.js baseline on the same machine (ADR 0001) |

3.2 Input latency

| Metric | Definition | Fixture | Pass bar (v1) |
| --- | --- | --- | --- |
| Keystroke-to-paint | Time from keypress event in the contenteditable surface to the paint frame containing the new character (measured via PerformanceObserver({ type: 'event' }) for compatibility with INP) | medium-2k | ≤ 16 ms p95 (single 60 Hz frame) |
| Keystroke-to-paint (large) | Same; virtualization is required to make this achievable (see §4.6) | large-10k | ≤ 33 ms p95 (two frames) |
| INP (Interaction-to-Next-Paint) | Chrome's responsiveness Core Web Vital, computed per the web.dev definition over a sustained typing trace: effectively the worst interaction latency, ignoring one high outlier per 50 interactions (captured via PerformanceObserver({ type: 'event', durationThreshold: 16 })) | medium-2k, 60 s of synthesized typing | ≤ 200 ms ("Good" per web.dev INP thresholds), preferably ≤ 100 ms |
| IME composition latency | Time from compositionupdate event to paint of the composed character (CJK / dead-key paths take a different route through the contenteditable surface than keypress) | medium-2k, simulated kana / hanzi composition via CDP Input.imeSetComposition | ≤ 33 ms p95 |
| Mark toggle | Time from dispatchCommand("toggleBold", selection) to paint of the formatted run | medium-2k, selection covers 20 chars | ≤ 16 ms p95 |
| Block transform | Time from block.transform(id, "heading") to paint | medium-2k | ≤ 16 ms p95 |
| Large paste | Time from paste event (10 000 chars, marked HTML) to first paint of the pasted content | medium-2k | ≤ 250 ms p95 |
| Undo on long history | Time from editor.undo() to paint, after a 10 000-op history accumulated in one tab session | medium-2k + 10 000 prior ops | ≤ 50 ms p95 |
| Scroll-while-typing dropped frames | Count of dropped frames (frames > 16.67 ms) during a 5 s typing + smooth-scroll trace | large-10k | ≤ 5 dropped frames per 5 s |
| Long Tasks | Sum of PerformanceLongTaskTiming entries during a 60 s typing trace | medium-2k | ≤ 200 ms total |
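Reducing a lab trace to a single INP value is mechanical once the interaction latencies are collected. A sketch of web.dev's per-page computation (the helper name is ours): take the worst interaction, ignoring one high outlier for every 50 interactions observed; the 200 ms "Good" threshold is applied at p75 across page loads in field data.

```typescript
// Sketch: reduce a trace of interaction latencies (ms) to the page's INP,
// per the web.dev definition. One high outlier is excluded for every 50
// interactions; with fewer than 50 interactions, INP is the worst one.
function computeInp(latencies: number[]): number {
  if (latencies.length === 0) return 0;
  const sorted = [...latencies].sort((x, y) => y - x); // descending
  const outliers = Math.min(
    Math.floor(latencies.length / 50),
    sorted.length - 1,
  );
  return sorted[outliers];
}
```

For a 60 s typing trace with a few hundred interactions, this means a handful of slowest keystrokes are forgiven, but a consistently slow tail is not.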

3.3 Collaboration

| Metric | Definition | Fixture | Pass bar (v1) |
| --- | --- | --- | --- |
| Concurrent merge | Time from receiving a 10-op remote update to it being painted, while the local peer is typing | concurrent-edits-2p mid-stream | ≤ 33 ms p95 |
| Op encode + send | Time to encode a single local op for the wire | live typing on medium-2k | ≤ 1 ms p99 |
| Op decode + apply | Time to decode a 100-op batch from the wire and apply it | live typing on medium-2k | ≤ 10 ms p95 |
| Awareness fan-out | Time from a peer's EphemeralStore update to it appearing on this peer's UI | 5 connected peers | ≤ 100 ms p95 |

3.4 Memory

Heap numbers assume the UndoManager history-trim policy documented in architecture.md §2 (default-bounded, peer-scoped retention; not unbounded). Without that policy the after-init numbers would be meaningless as a guide to real sessions, because the idle tests never edit the doc and so never accumulate history.

| Metric | Definition | Fixture | Pass bar (v1) |
| --- | --- | --- | --- |
| Heap idle | performance.measureUserAgentSpecificMemory() ten seconds after init, no edits, after forced GC | medium-2k | ≤ 80 MB |
| Heap idle (large) | Same | large-10k | ≤ 200 MB |
| Heap peak during typing | Maximum of performance.measureUserAgentSpecificMemory() sampled every 1 s during a 60 s typing trace, before GC | medium-2k, 60 s typing | ≤ 1.5× the idle figure |
| Heap growth per op | Linear-regression slope of heap vs. ops applied during op-replay-100k, post-GC | op-replay-100k | ≤ 64 B/op |
| OPFS tab-restore | Time from new-tab navigation (cold) to first paint of a doc previously OPFS-persisted (no network) | medium-2k persisted | ≤ 600 ms p95 |
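The "heap growth per op" row is a regression slope, not a single measurement. A sketch of the reduction (the helper name is ours, not the harness's):

```typescript
// Sketch: least-squares slope of heap samples (bytes) against ops applied,
// taken post-GC during op replay. The pass bar is a slope ≤ 64 B/op.
function slopeBytesPerOp(ops: number[], heapBytes: number[]): number {
  const n = ops.length;
  const meanX = ops.reduce((s, x) => s + x, 0) / n;
  const meanY = heapBytes.reduce((s, y) => s + y, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (ops[i] - meanX) * (heapBytes[i] - meanY);
    den += (ops[i] - meanX) ** 2;
  }
  return num / den; // bytes per op
}
```

A slope-based bar is robust to a constant baseline (WASM heap, caches) that a raw "heap after N ops" number would conflate with per-op cost.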

3.5 Bundle size

Targets match prd.md §8 “Success criteria / Quantitative”.

| Metric | Definition | Pass bar (v1) |
| --- | --- | --- |
| Client bundle (core) | @weaver/core + @weaver/dom + @weaver/react + Loro WASM, gzipped | ≤ 600 KB |
| Server bundle | @weaver/server (DO worker), gzipped | ≤ 1 MB |
| Playground bundle | apps/playground/dist total, gzipped | ≤ 1.5 MB |
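A size gate like these can be enforced with nothing beyond node:zlib; a sketch (the level-9 setting is an assumption about what a CDN would serve, and the helper names are ours):

```typescript
import { gzipSync } from "node:zlib";

// Sketch: compare the gzipped byte length of a built artifact against its
// budget. Level 9 is assumed here as a stand-in for CDN compression.
function gzippedKB(artifact: Buffer): number {
  return gzipSync(artifact, { level: 9 }).byteLength / 1024;
}

function withinBudget(artifact: Buffer, budgetKB: number): boolean {
  return gzippedKB(artifact) <= budgetKB;
}
```

CI would read the built files (e.g. the @weaver/core output) into Buffers and fail the check when withinBudget returns false.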

3.6 Web vitals & Lighthouse — for the docs site and the Playground

A separate concern from editor perf: how the shipped sites (the docs site and the future Playground) score on web vitals. We run Lighthouse + a Web-Vitals capture pass on every deployment so the public-facing surfaces don’t silently regress.

| Metric | Definition | Surface | Pass bar (v1) |
| --- | --- | --- | --- |
| LCP (Largest Contentful Paint) | web.dev definition | docs site /, /docs/<each> | ≤ 2 500 ms p75 ("Good") |
| LCP — Playground / | same | Playground / | ≤ 2 500 ms p75 (Playground budget acknowledges WASM init) |
| CLS (Cumulative Layout Shift) | web.dev definition | docs site, Playground | ≤ 0.1 |
| INP (Interaction-to-Next-Paint) | web.dev definition | docs site (navigation only), Playground / (editor typing) | ≤ 200 ms; ≤ 100 ms preferred |
| TBT (Total Blocking Time) | Lighthouse lab metric | docs site, Playground | ≤ 200 ms |
| Lighthouse — Performance score | Lighthouse 12+ on mobile emulation (Slow 4G, 4× CPU throttling) | docs site /, /docs/architecture | ≥ 90 |
| Lighthouse — Performance score | same | Playground / | ≥ 70 (lower bar; WASM init dominates) |
| Lighthouse — Accessibility | Lighthouse | docs site, Playground | ≥ 95 |
| Lighthouse — Best Practices | Lighthouse | docs site, Playground | ≥ 95 |
| Lighthouse — SEO | Lighthouse | docs site | ≥ 95 (Playground is exempt; not indexed) |
| No new errors in console | Lighthouse "no-vulnerable-libraries" + "errors-in-console" audits | docs site, Playground | both pass |

How Lighthouse runs

  • CI workflow .github/workflows/lighthouse.yml runs on every push to main and on PRs that touch apps/docs/** or apps/playground/**. It uses treosh/lighthouse-ci-action@v12 against the deployed Pages preview URL (PRs) or the production URL (main).
  • Config lives at apps/docs/lighthouserc.json and apps/playground/lighthouserc.json — each lists the URLs to audit, the assertions (above), and the LHCI server token (a public-server token for the OSS LHCI dashboard, no secrets needed).
  • PR comments — the action posts a comment with the score deltas; a red ✗ on any assertion fails the check.
  • Field data — once the sites have visitors, a web-vitals JS shim captures real-user LCP/INP/CLS and POSTs to a CF Worker that writes to D1. The dashboard at weaver-docs.pages.dev/perf/field shows p75 over the last 7 days. (Field data is reporting-only; the pass bar is on the CI Lighthouse number.)

This is intentionally separate from §3.1–3.4 (editor-perf benchmarks). The editor-perf suite measures the library; Lighthouse + web vitals measure the deployed product. A passing editor with a failing Lighthouse is still a failing release.

3.7 Server-side (Cloudflare Durable Object)

DO numbers assume same-colo for client and DO (i.e., the WebSocket lands on the colo nearest the client; CF routes that way by default). Cross-region jitter is not what this section measures — it’s a network concern, not a server-perf one.

| Metric | Definition | Fixture | Pass bar (v1) |
| --- | --- | --- | --- |
| DO accept op | Time from WS frame arrival at the DO to op accepted, persisted, and broadcast | live medium-2k, single op, same-colo | ≤ 20 ms p95 |
| DO snapshot serve | Time to read the R2 snapshot, decode, and send | large-10k | ≤ 400 ms p95 |
| DO write-gate check | Time for the tier-write per-frame check (per ADR 0005 — not per-op validation; the cooperative-org trust model replaces the per-op pipeline with a frame-granularity gate) | live medium-2k | ≤ 1 ms p99 |

4. Methodology

4.1 Where benchmarks run

| Suite | Where | When |
| --- | --- | --- |
| In-browser micro (init, input latency, mark toggle, block transform) | Headless Chromium via Playwright (pinned); inputs use CDP Input.dispatchKeyEvent (rawKeyDown + char), not Playwright's synthesized keypress; latency captured via PerformanceObserver({ type: 'event' }) for INP-compatible numbers | Every PR + nightly main |
| Heap / memory | Headless Chromium with performance.measureUserAgentSpecificMemory(); forced GC via CDP HeapProfiler.collectGarbage | Nightly main only (variance) |
| Op-replay / merge | Pure Node.js (Loro WASM, no DOM) | Every PR |
| DO server | workerd test runner with a stubbed R2 | Nightly main + on-demand |
| Lighthouse / Web Vitals | Headless Chromium driven by LHCI; mobile emulation profile (Slow 4G, 4× CPU) | Every push to main and PRs touching apps/docs/** or apps/playground/** |
| Playground /perf | The visitor's browser | On-demand; surfaces "perf health" in the deployed Playground for anyone to see |

The same suite definitions back all six runners; the harness is benchmarks/run.ts (a typed @effect/cli CLI per CLAUDE.md’s “Effect-TS CLI over inline shell” guidance).

4.2 Comparison editors

For each metric where a meaningful comparison exists, we run the same workload against:

  • Lexical (no collab; @lexical/react v0.21+, vanilla)
  • Lexical + @lexical/yjs (collab; Y.js v13 — see baseline pin below)
  • ProseMirror + y-prosemirror
  • Tiptap + Hocuspocus
  • CodeMirror 6 + y-codemirror.next — the closest editor to weaver’s “imperative DOM patched from CRDT diff” thesis; included specifically so the comparison isn’t biased toward React-managed surfaces.
  • BlockSuite (page mode) once its Loro migration is stable enough to run the harness against
  • Plain contenteditable as a floor

The comparison harness is in benchmarks/competitors/; each competitor lives in its own subdirectory with its own minimal app that does only what’s needed to run the suite. We do not embed competitor editors into the Playground.

The comparison matrix in CI artifacts marks each (metric × competitor) cell as measured, n/a-feature-absent, or n/a-licensing. No blanks. A reviewer can read the matrix and see exactly what was run.

Comparisons are reported as ratios alongside absolute numbers. Ratios travel across hardware; the absolute number is what catches "the ratio looks great but the editor is unusable" cases. Both must appear in the artifact.

Y.js baseline definition (the contract behind "≥ 3× faster than Y.js")

The op-replay-100k comparison against Y.js is pinned to this exact configuration so the ratio is falsifiable. Changing any of these is a benchmark-result-invalidating event and bumps benchmarks/results/<date>/schema_version.

| Knob | Pinned value |
| --- | --- |
| Y.js version | yjs@13.6.x (latest 13.6 minor at the time of the comparison run; lockfile-pinned per run) |
| Encoding | v2 (Y.encodeStateAsUpdateV2) |
| Transact strategy | One Y.transact(doc, () => { … }) wrapping the entire 100k-op replay |
| GC | enabled (Y.js default; new Y.Doc({ gc: true })) |
| Framing | y-protocols/sync is not included; the comparison measures Y.applyUpdateV2 cost, not network framing |
| Memory snapshot point | after replay completes, before Y.encodeStateAsUpdateV2, after one forced GC |

The Loro side runs against loro-crdt@<pinned> (matching @weaver/wasm’s pinned version) with identical batching semantics (one outer transaction wrapping all 100k ops).

4.3 Hardware baseline

The CI runner is pinned to runs-on: ubuntu-24.04 (not ubuntu-latest, which silently shifts under us). Each result JSON includes the runner OS, kernel, /proc/cpuinfo CPU model, total RAM, Chromium version (pinned via npx playwright install chromium@<sha>), Node version, and pnpm version.

For numbers reported in marketing / READMEs, the canonical machine is an Apple M2 (8c/8GPU) MacBook Air, 16 GB RAM, macOS 14+, Chromium 138+. On macOS, the machine must be on AC power and not in Low Power Mode; the pmset -g batt output is captured in the artifact for the run that produced the headline number. We do not quote benchmarks from local developer machines without that capture.
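Most of the required environment capture comes straight from Node's standard library; a sketch (Chromium and pnpm versions would come from their own CLIs and are omitted here, and the helper name is ours):

```typescript
import * as os from "node:os";

// Sketch: capture the runner metadata §4.3 requires in every result JSON,
// so a number is never quoted without its machine context.
function captureRunner(): {
  os: string;
  cpu: string;
  ramMB: number;
  node: string;
} {
  return {
    os: `${os.platform()} ${os.release()}`,
    cpu: os.cpus()[0]?.model ?? "unknown",
    ramMB: Math.round(os.totalmem() / (1024 * 1024)),
    node: process.version,
  };
}
```

On Linux runners the CPU model can additionally be cross-checked against /proc/cpuinfo, as the section specifies.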

4.4 Statistical method

  • Each metric runs N=50 iterations after 5 warmup iterations. (Up from N=20 in earlier drafts of this spec — N=50 gives ~80% power to detect a 20% mean shift at α=0.05 under typical p95 noise.)
  • Cold-cache metrics each get their own browser context with HTTP cache cleared.
  • p50 / p95 / p99 are reported; the pass bar is what the §3 table says (mostly p95).
  • Regression detection uses a Mann-Whitney U test between the PR’s N=50 samples and the baseline’s N=50 samples; p < 0.01 with a median shift ≥ 20% counts as a regression. A simple “two-consecutive misses” rule applies on top, to catch slow drifts that don’t trip MWU.
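The regression test is small enough to sketch in full. This version uses the normal approximation (adequate at N=50 per side) and omits the tie correction for brevity, which slightly overestimates p under heavy ties; helper names are ours.

```typescript
// Sketch: two-sided Mann-Whitney U p-value via the normal approximation.
// A regression is flagged when p < 0.01 AND the median shift is >= 20%.
function mannWhitneyP(a: number[], b: number[]): number {
  const all = [...a.map((v) => ({ v, g: 0 })), ...b.map((v) => ({ v, g: 1 }))]
    .sort((x, y) => x.v - y.v);
  // Assign midranks so tied samples share a rank.
  const ranks: number[] = new Array(all.length).fill(0);
  for (let i = 0; i < all.length; ) {
    let j = i;
    while (j + 1 < all.length && all[j + 1].v === all[i].v) j++;
    const mid = (i + j + 2) / 2; // 1-based midrank of the tied run
    for (let k = i; k <= j; k++) ranks[k] = mid;
    i = j + 1;
  }
  let r1 = 0;
  all.forEach((e, i) => {
    if (e.g === 0) r1 += ranks[i];
  });
  const n1 = a.length;
  const n2 = b.length;
  const u1 = r1 - (n1 * (n1 + 1)) / 2;
  const mu = (n1 * n2) / 2;
  const sigma = Math.sqrt((n1 * n2 * (n1 + n2 + 1)) / 12); // no tie correction
  const z = Math.abs(u1 - mu) / sigma;
  return 2 * (1 - phi(z)); // two-sided p-value
}

// Standard normal CDF via the Abramowitz–Stegun polynomial approximation.
function phi(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const tail =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z >= 0 ? 1 - tail : tail;
}
```

A rank test is the right fit here because latency samples are heavy-tailed; a t-test on the same data would be dominated by a handful of GC-pause outliers.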

4.5 Reproducibility

A contributor with the repo, the pinned Node version (declared in package.json#engines), the pinned pnpm version (declared in packageManager), and the Playwright-pinned Chromium can run:

pnpm --filter @weaver/benchmarks bench --fixture=medium-2k --suite=input

and produce results where, for every metric in §3, abs(local_p95 - ci_p95) / ci_p95 ≤ 0.15. The harness writes results to benchmarks/results/<date>/<commit>.json and to a markdown summary suitable for pasting into a PR comment.

The result JSON includes:

  • runner.os, runner.kernel, runner.cpu (from /proc/cpuinfo), runner.ramMB
  • chromium.version (from playwright --version)
  • node.version, pnpm.version
  • fixtureSeed (the integer seed used by benchmarks/fixtures/generate.ts)
  • schemaVersion (an integer; the JSON schema is published at benchmarks/results-schema/v<n>.json and changes bump the integer)
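A guard like the following (a hypothetical helper, not part of the harness yet) can verify those keys before a result file is ever written, so CI artifacts never ship half-filled:

```typescript
// Sketch: check that a result object carries every key §4.5 lists,
// using dotted paths for the nested runner/chromium/node/pnpm fields.
const REQUIRED_KEYS = [
  "runner.os", "runner.kernel", "runner.cpu", "runner.ramMB",
  "chromium.version", "node.version", "pnpm.version",
  "fixtureSeed", "schemaVersion",
] as const;

function missingKeys(result: Record<string, unknown>): string[] {
  return REQUIRED_KEYS.filter((path) => {
    let node: unknown = result;
    for (const part of path.split(".")) {
      if (typeof node !== "object" || node === null || !(part in node)) {
        return true; // this path is missing
      }
      node = (node as Record<string, unknown>)[part];
    }
    return false;
  });
}
```

The harness would refuse to write (and CI would fail) whenever missingKeys returns a non-empty list.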

4.6 Virtualization

Pass bars at large-10k and huge-100k assume block-level virtualization is on (only blocks within ±1 viewport of the visible area are mounted; off-screen blocks render as fixed-height placeholders). Without virtualization, the React reconciler dominates and the bars are unreachable on any current hardware. The virtualization design lives in architecture.md §3; the benchmark harness’s “large” fixtures verify it stays on.
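The ±1-viewport rule reduces to a small window computation. A sketch assuming fixed-height placeholders (the real design in architecture.md §3 handles variable heights; the function name is ours):

```typescript
// Sketch: which block indices should be mounted, given the scroll position.
// Blocks within one viewport above and below the visible range are mounted;
// everything else renders as a fixed-height placeholder.
function mountWindow(
  scrollTop: number,
  viewportPx: number,
  blockPx: number,
  blockCount: number,
): { start: number; end: number } {
  const first = Math.floor(scrollTop / blockPx);
  const last = Math.ceil((scrollTop + viewportPx) / blockPx);
  const pad = Math.ceil(viewportPx / blockPx); // ±1 viewport of slack
  return {
    start: Math.max(0, first - pad),
    end: Math.min(blockCount, last + pad), // exclusive
  };
}
```

At large-10k with ~32 px blocks and an 800 px viewport this mounts roughly 75 blocks instead of 10 000, which is what keeps the §3.2 bars reachable.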

5. What is not a benchmark

To keep the list honest about what we don’t measure:

  • Subjective “feel.” No <some abstract editor> “feels snappier” claim ever appears in benchmark output; only measured numbers.
  • Feature parity. Belongs in lexical-parity.md, not here.
  • Coverage. “We test on Chrome and Firefox” is a release-engineering concern, not a perf bar.
  • AI inference latency. The LLM round-trip dominates anything else; not a property of the editor.
  • Network latency on real networks. We synthesize the WebSocket transport in CI; real-network numbers go in field reports, not the benchmark suite.

6. Failure modes and escalation

| Symptom | Likely cause | First action |
| --- | --- | --- |
| Cold init regresses ≥ 20% | New synchronous work in mount path | Bisect the last 10 commits to the affected suite |
| Keystroke-to-paint regresses ≥ 20% | DOM patch slowdown or extra subscriber on hot path | Profile with Chrome DevTools Performance tab, attach trace |
| Heap-per-op regresses | New allocation per op in the editor core | Diff heap-snapshot between baseline and PR |
| Op-replay regresses while concurrent-merge stays flat | Loro WASM build issue | Check @weaver/wasm version pin |
| Concurrent-merge regresses while op-replay stays flat | Editor-layer subscriber thrash | Check useBlock selector equality |

Outcome rubric

The benchmarks suite is delivered when an independent grader, seeing only the deployed Playground /perf route, the CI nightly artifact for main, and this spec, can mark each criterion below as binary pass/fail.

Coverage

  • A harness exists under benchmarks/ that implements every metric in §3.
  • Every fixture in §2 is reproducibly generated by a script under benchmarks/fixtures/; running the script twice yields byte-identical fixture files.
  • A nightly CI run executes the full suite on main and uploads the JSON results as an artifact retained ≥ 30 days.
  • A per-PR CI run executes the fast subset (§4.1 column “Every PR”) in ≤ 5 minutes wall-clock.

Pass thresholds

  • Every “Pass bar (v1)” cell in §3 has a corresponding green check on the most recent nightly run.
  • The comparison matrix in §4.2 marks each (metric × competitor) cell as exactly one of measured, n/a-feature-absent, n/a-licensing. No blank cells. (Grader test: scan the artifact; reject if any cell is empty.)
  • For op-replay-100k, weaver is at least 3× faster than the Y.js baseline pinned in §4.2 “Y.js baseline definition” on the same hardware. The artifact reports both the absolute weaver number and the absolute Y.js number alongside the ratio.
  • For keystroke-to-paint at medium-2k, p95 ≤ 16 ms on the canonical machine (CI runner numbers are annotated with their machine generation per §4.3).
  • For Lighthouse (§3.6), each pass bar is enforced as an LHCI assertion configured at apps/docs/lighthouserc.json / apps/playground/lighthouserc.json; a missing assertion is itself a rubric failure.

Reporting

  • The Playground /perf route renders a table of the latest results for at least the §3.1, §3.2, §3.4 metrics, runnable on visit.
  • A regression-detector script exists at benchmarks/scripts/detect-regression.ts; on a PR, the CI workflow invokes it and posts a PR comment whenever any §3 metric crosses its pass bar or regresses by ≥ 20% on the Mann-Whitney U test (§4.4). Grader test: induce a synthetic 25% slowdown in a test PR; verify the comment appears.
  • Each metric’s table row in the artifact shows: pass-bar target, latest p50/p95, previous-run p95, percent delta, ratio vs. best non-weaver competitor.
  • The result JSON schema is published at benchmarks/results-schema/v<n>.json (per §4.5); breaking changes bump schemaVersion.

Reproducibility

  • A contributor following §4.5 produces, on the canonical machine, results where for every §3 metric abs(local_p95 - ci_p95) / ci_p95 ≤ 0.15.
  • The result JSON file path matches the schema documented in §4.5 (the listed keys are all present).
  • A flakiness budget: across the most recent five nightly runs, the p95 of the per-run p95 varies by ≤ 10% for every §3 metric; a metric outside this band is flagged as “unstable” in the artifact and excluded from regression detection until restabilized.
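One reading of that "varies by ≤ 10%" band, sketched as code (the spec does not fix the exact statistic, so this deviation-from-median form is an assumption, and the helper name is ours):

```typescript
// Sketch: a metric is "stable" when every per-run p95 across the five most
// recent nightly runs stays within ±10% of the median per-run p95.
function isStable(p95s: number[], band = 0.1): boolean {
  const sorted = [...p95s].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  return sorted.every((v) => Math.abs(v - median) / median <= band);
}
```

Metrics failing this check would be tagged "unstable" in the artifact and excluded from regression detection, as the bullet above requires.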

Output quality

  • Each §3 metric row specifies all four of: input fixture id, sample count, warmup count, and the unit of the pass-bar number.
  • The harness emits exactly one JSON file per run, matching the §4.5 schema; results are not stored as screenshots.
  • The pass-bar table in §3 is the single source of truth for “is this passing” — no parallel pass/fail tracking elsewhere in the corpus or in CI.

See also

  • playground.md — the /perf route surfaces these numbers to visitors.
  • lexical-parity.md — the feature surface the perf bar must hold for.
  • wasm-strategy.md — the Loro / wa-sqlite perf assumptions this suite verifies.
  • ADR 0001 — the Loro-over-Y.js perf claims this suite validates.
  • hard-problems.md — the implementation problems whose perf consequences this suite catches.