Experiment 199: Row-level capacity reservation in write_json_to_buf

Date: 2026-06-25

Status: In Review

Direction: result-transfer-shape

Benchmark Run: Focused A/B (benchmark/experiments/select_bytes_int_heavy.dart, benchmark/experiments/select_bytes_real_int_fastpath.dart, benchmark/experiments/select_bytes_wide_cols.dart), order-flipped pair on a quiet box; release-suite single-pass A/B captured as a no-regression smoke.

Problem

After exp 195 cached the column-name tokens on the prepared-statement entry and exp 198 collapsed the per-cell integer/float formatter through to a direct write, the inner per-cell loop of write_json_to_buf still pays one buf_ensure per write:

    JSON_CHECK(buf_write(b, tokens_data + token_offsets[i], token_lens[i])); // ensure(N)
    int type = sqlite3_column_type(stmt, i);
    switch (type) {
      case SQLITE_INTEGER:
        JSON_CHECK(buf_write_int_json(b, sqlite3_column_int64(stmt, i)));     // ensure(24)
        break;
      case SQLITE_FLOAT:
        JSON_CHECK(buf_write_double_json(b, sqlite3_column_double(stmt, i))); // ensure(33)
        break;
      ...
    }

Plus buf_write_char(b, '{') / buf_write_char(b, '}') / `buf_write_char(b, ',')` for braces and the row separator. On a 10k row × 20 column INTEGER SELECT that is roughly 2N + 2 = 42 buf_ensure calls per row (N = 20), all checking the same b->len + n <= b->cap invariant against a json_buf that has long since grown well past the row's working set. Each check is small (one predicted branch), but the cumulative cost shows up at the bottom of the encoder profile: every other cell-formatter cost above it (digit loop, SWAR escape scan, column-name token expansion) has been amortized in prior experiments.
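The call count above can be reproduced with a small counting sketch. Everything here is hypothetical scaffolding (demo_buf, demo_ensure, demo_write and demo_row_per_cell are illustration names, not the functions in native/resqlite.c); it only mirrors the shape described: one capacity check per write, two writes per column, plus the two braces.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for json_buf, with a call counter added for the demo. */
typedef struct {
    char *data;
    size_t len, cap;
    unsigned long ensure_calls;
} demo_buf;

/* Make room for n more bytes; the common case is the single comparison. */
static int demo_ensure(demo_buf *b, size_t n) {
    b->ensure_calls++;
    if (b->len + n <= b->cap) return 0;
    size_t cap = b->cap ? b->cap : 64;
    while (cap < b->len + n) cap *= 2;
    char *p = realloc(b->data, cap);
    if (p == NULL) return -1;
    b->data = p;
    b->cap = cap;
    return 0;
}

/* Every write re-checks capacity, as in the per-cell baseline. */
static int demo_write(demo_buf *b, const char *src, size_t n) {
    if (demo_ensure(b, n) != 0) return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* One row of n_cols integer cells: '{', then key token + value per
   column, then '}'. The repeated key "c" is a placeholder; only the
   write pattern matters here. Returns 0 on success. */
static int demo_row_per_cell(demo_buf *b, size_t n_cols) {
    if (demo_write(b, "{", 1) != 0) return -1;
    for (size_t i = 0; i < n_cols; i++) {
        const char *tok = (i == 0) ? "\"c\":" : ",\"c\":";
        if (demo_write(b, tok, strlen(tok)) != 0) return -1; /* token */
        if (demo_write(b, "0", 1) != 0) return -1;           /* value */
    }
    return demo_write(b, "}", 1);
}
```

Calling demo_row_per_cell on a zero-initialised demo_buf with n_cols = 20 leaves ensure_calls at 2 × 20 + 2 = 42, and only the first of those calls actually grows the buffer.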

Hypothesis

For rows whose cells are all NULL / INTEGER / FLOAT, total row size has a

fixed upper bound (tokens_total + N × 33 + 2). A single row-level

buf_ensure covers the entire row, after which every NULL / INTEGER / FLOAT

cell can write directly into b->data + b->len without a per-cell ensure.

TEXT / BLOB cells stay on their existing helpers (json_write_string,

json_write_base64) — both internally do their own buf_ensure, so we

just re-ensure remaining headroom after they return.
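As a worked instance of the bound, the sketch below evaluates tokens_total + N × 33 + 2 for a given row shape. The function name and the SKETCH_FIXED_CELL_MAX constant are hypothetical, and the tokens_total value used in the example is made up for illustration.

```c
#include <stddef.h>

/* Assumed per-cell worst case, taken from the 33-byte figure above. */
#define SKETCH_FIXED_CELL_MAX 33

/* Upper bound, in bytes, for a row whose cells are all
   NULL / INTEGER / FLOAT: tokens_total + N * 33 + 2. */
static size_t sketch_row_fixed_bound(size_t tokens_total, size_t col_count) {
    return tokens_total + col_count * SKETCH_FIXED_CELL_MAX + 2;
}
```

For a 20-column row whose column-name tokens total 100 bytes, this gives 100 + 20 × 33 + 2 = 762 bytes, so a single reservation of that size covers every fixed-size write in the row.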

The signal should reproduce most strongly on the wider int-heavy lanes

exp 192/198 already use as their gate, and stay flat on mixed and

fractional-real lanes where the saved buf_ensure is a small fraction of

per-cell work.

Approach

In native/resqlite.c write_json_to_buf:

tokens, the per-cell fixed-size upper bound (col_count × (RESQLITE_JSON_FLOAT_MAX + 1),

which dominates the INT and "null" sizes), and the trailing '}' into a

single buf_ensure at row start. The 33-byte FLOAT bound is used for

every fixed cell so the calc stays a single multiply.

b->data + b->len directly. NULL / INTEGER / FLOAT branches write

the formatted bytes through fast_i64_to_str /

fast_double_to_json_num / memcpy("null", 4) straight to

b->data + b->len, advancing b->len directly. No per-cell

buf_ensure.

own ensure), then re-ensure remaining_tokens + remaining_cells + 1

(+ '}') so subsequent direct writes stay in-bounds even if the

helper grew the buffer past our row-start reservation.

buf_write_char — they fire twice per query regardless of row shape,

so the row-level reservation cannot subsume them.
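The shape of the change can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in native/resqlite.c: sketch_buf, sketch_ensure, sketch_cell and sketch_write_row are hypothetical names, integers are formatted with snprintf rather than fast_i64_to_str, and the TEXT branch copies bytes without JSON escaping.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_CELL_MAX 33 /* assumed bound for one fixed-size cell */

typedef struct { char *data; size_t len, cap; int ensures; } sketch_buf;

/* Grow so that at least n more bytes fit; returns 0 on success. */
static int sketch_ensure(sketch_buf *b, size_t n) {
    b->ensures++;
    if (b->len + n <= b->cap) return 0;
    size_t cap = b->cap ? b->cap : 64;
    while (cap < b->len + n) cap *= 2;
    char *p = realloc(b->data, cap);
    if (p == NULL) return -1;
    b->data = p;
    b->cap = cap;
    return 0;
}

typedef struct { int is_text; int64_t i; const char *s; } sketch_cell;

/* tokens[i] is the pre-rendered key prefix for column i,
   e.g. "{\"a\":" for the first column and ",\"b\":" afterwards. */
static int sketch_write_row(sketch_buf *b, const char *const *tokens,
                            const sketch_cell *cells, size_t n) {
    size_t tok_left = 0;
    for (size_t i = 0; i < n; i++) tok_left += strlen(tokens[i]);

    /* One reservation: all tokens + fixed bound per cell + '}'. */
    if (sketch_ensure(b, tok_left + n * SKETCH_CELL_MAX + 1) != 0) return -1;

    for (size_t i = 0; i < n; i++) {
        size_t tl = strlen(tokens[i]);
        memcpy(b->data + b->len, tokens[i], tl); /* direct write, no ensure */
        b->len += tl;
        tok_left -= tl;

        if (!cells[i].is_text) {
            /* Fixed-size cell: format straight into the reserved space. */
            b->len += (size_t)snprintf(b->data + b->len, SKETCH_CELL_MAX,
                                       "%lld", (long long)cells[i].i);
        } else {
            /* Variable-size cell: its own ensure, then re-ensure the
               headroom the rest of the row still needs. */
            size_t sl = strlen(cells[i].s);
            if (sketch_ensure(b, sl + 2) != 0) return -1;
            b->data[b->len++] = '"';
            memcpy(b->data + b->len, cells[i].s, sl);
            b->len += sl;
            b->data[b->len++] = '"';
            if (sketch_ensure(b, tok_left + (n - i - 1) * SKETCH_CELL_MAX + 1) != 0)
                return -1;
        }
    }
    b->data[b->len++] = '}'; /* covered by the trailing + 1 */
    return 0;
}
```

With three columns (int, text, int) the sketch performs three capacity checks in total: one at row start and two around the TEXT cell. An all-integer row of any width performs exactly one.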

The result is bit-identical JSON: the same formatters, the same token bytes, the same TEXT / BLOB helpers. No new const data. Removed per row: the JSON_CHECK(buf_write_char(b, ...)) calls for '{', '}' and ','. Removed per column-name token: one JSON_CHECK(buf_write(...)). Removed per fixed (NULL / INTEGER / FLOAT) cell: one buf_write_str / buf_write_int_json / buf_write_double_json indirection. No public API change.

Results

Two order-flipped passes on each focused harness, median of 6 rounds

per lane. Same-machine quiet box (Apple Silicon, Dart 3.x AOT).

select_bytes_int_heavy.dart (exp 192's harness)

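| Lane | Pass 1 base | Pass 1 cand | Δ P1 | Pass 2 base | Pass 2 cand | Δ P2 |
|---|---|---|---|---|---|---|
| 10k × 8 small ints | 2796 | 2751 | −1.6 % | 2800 | 2689 | −4.0 % |
| 10k × 20 small ints | 6674 | 6218 | −6.8 % | 6411 | 6233 | −2.8 % |
| 10k × 20 big ints (~18 digits) | 7359 | 7233 | −1.7 % | 7459 | 7236 | −3.0 % |
| 10k × 8 mixed (4 int + 2 text + 2 real) | 8833 | 8782 | −0.6 % | 8846 | 8794 | −0.6 % |
| 1k × 2 ints | 105 | 101 | −3.8 % | 104 | 102 | −1.9 % |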

All values µs/query median. Every int-heavy lane reproduces the same

direction (candidate faster) across the order flip; magnitudes spread

from −1.6 % to −6.8 % across the per-lane noise. The mixed-row guard

(4 int + 2 text + 2 real) stays flat at −0.6 % both passes — the TEXT

re-ensure inside the candidate's row loop balances the saved per-cell

ensures on the same row, exactly as expected.
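For reference, the Δ columns are consistent with a plain relative difference against the base median. A minimal sketch of that arithmetic and of the order-flip check follows; the function names are hypothetical and the harness's own rounding may differ.

```c
/* Relative change of the candidate against the base, in percent.
   Negative means the candidate is faster. */
static double sketch_pct_delta(double base_us, double cand_us) {
    return (cand_us - base_us) / base_us * 100.0;
}

/* A lane "reproduces" when both order-flipped passes move the same way. */
static int sketch_same_direction(double delta_p1, double delta_p2) {
    return (delta_p1 < 0.0 && delta_p2 < 0.0) ||
           (delta_p1 > 0.0 && delta_p2 > 0.0);
}
```

For the 10k × 8 small-ints lane, sketch_pct_delta(2796, 2751) is about −1.61 and sketch_pct_delta(2800, 2689) is about −3.96, matching the tabulated −1.6 % and −4.0 %, and sketch_same_direction reports the pair as reproduced.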

select_bytes_real_int_fastpath.dart (exp 194's harness)

| Lane | Pass 1 base | Pass 1 cand | Δ P1 | Pass 2 base | Pass 2 cand | Δ P2 |
|---|---|---|---|---|---|---|
| 10k × 8 integral reals | 2988 | 2994 | +0.2 % | 3012 | 3005 | −0.2 % |
| 10k × 20 integral reals | 6627 | 6613 | −0.2 % | 6618 | 6659 | +0.6 % |
| 10k × 20 fractional reals | 68147 | 68124 | 0.0 % | 68174 | 68181 | 0.0 % |
| 10k × 8 mixed (4 int-real + 2 frac-real + 2 text) | 9301 | 9353 | +0.6 % | 9308 | 9379 | +0.8 % |
| 1k × 2 integral reals | 110 | 109 | −0.9 % | 109 | 109 | 0.0 % |

All real lanes sit inside ±1 %. The fractional REAL lane stays exactly

flat: fast_double_to_json_num's snprintf("%.17g") dwarfs the saved

per-cell ensures by 2–3 orders of magnitude, so the saving is

arithmetically invisible. The mixed real lane shows a reproduced

+0.6 % / +0.8 % — the per-cell ensures saved by the candidate are

spent (and slightly exceeded) by the TEXT re-ensure on the same row —

which is below the per-benchmark decision threshold.
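To make the fractional-REAL cost concrete, the sketch below shows the "%.17g" formatting step on its own, assuming the slow path of fast_double_to_json_num reduces to this snprintf call as described above. The wrapper name is hypothetical and the 33-byte buffer mirrors the FLOAT bound used for the row reservation.

```c
#include <stdio.h>

/* Hypothetical wrapper around the "%.17g" step. Returns the number of
   characters written, excluding the terminating NUL. */
static int sketch_format_double(double v, char out[33]) {
    return snprintf(out, 33, "%.17g", v);
}
```

Formatting 0.1 this way yields the 19-character string 0.10000000000000001: the call has to produce 17 significant digits through the general-purpose printf machinery, which is far more work than the single comparison a per-cell ensure costs.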

select_bytes_wide_cols.dart (exp 190 / 195's harness)

| Shape | Base | Cand | Δ |
|---|---|---|---|
| 10k × 8 int cols | 2.208 | 2.125 | −3.8 % |
| 10k × 20 int cols | 5.184 | 5.059 | −2.4 % |
| 10k × 8 mixed cols | 2.412 | 2.435 | +1.0 % |
| 10k × 20 mixed cols | 5.739 | 5.667 | −1.3 % |
| 10k × 2 int cols | 0.642 | 0.634 | −1.2 % |
| 1 row × 5 mixed cols | 0.014 | 0.017 | (µs-scale guard) |
| 100 rows × 5 mixed cols | 0.030 | 0.032 | (µs-scale guard) |

All values ms/call median. The 8- and 20-column int-only lanes reproduce the focused win (−3.8 % / −2.4 %); the 2-column int lane (−1.2 %) and the mixed lanes (+1.0 % / −1.3 %) sit at the noise floor, and the 1-row and 100-row shapes are µs-scale and below the harness's resolution. No mixed lane moves beyond noise.

Release-suite single-pass A/B

Baseline: benchmark/results/2026-06-25T07-44-46-baseline-for-exp199.md.

Candidate: benchmark/results/2026-06-25T07-33-46-exp199-row-level-buf-ensure.md.

resqlite wall-median deltas vs the baseline: 14 wins (< −5 %),

13 regressions (> +5 %), 128 neutral, single-pass.
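The win / regression / neutral split follows directly from the ±5 % thresholds quoted above. A minimal sketch of that bucketing, with hypothetical names and no claim about how the release-suite tooling implements it:

```c
#include <stddef.h>

typedef enum { LANE_WIN = 0, LANE_NEUTRAL = 1, LANE_REGRESSION = 2 } lane_class;

/* Bucket one wall-median delta (in percent) by the ±5 % thresholds. */
static lane_class sketch_classify_delta(double delta_pct) {
    if (delta_pct < -5.0) return LANE_WIN;
    if (delta_pct > 5.0) return LANE_REGRESSION;
    return LANE_NEUTRAL;
}

/* Count how many deltas fall into each bucket; counts[] is indexed
   by lane_class and is reset before counting. */
static void sketch_tally(const double *deltas, size_t n, size_t counts[3]) {
    counts[0] = counts[1] = counts[2] = 0;
    for (size_t i = 0; i < n; i++) counts[sketch_classify_delta(deltas[i])]++;
}
```

Under these thresholds a −12.7 % lane counts as a win, a +22 % lane as a regression, and a −0.6 % lane as neutral. A single pass cannot say which flagged rows are drift, which is why the flagged rows are read against the focused order-flipped pair.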

Most flagged rows live on writer / streaming paths that this candidate

cannot mechanically touch:

Transaction (100 rows) (−34 %), Nested Transactions (savepoints)

(+12 %), Batched Write Inside Transaction (1000 rows) (+9 %),

Initial Emission (+11 %) — write paths and stream-dispatch lanes.

The candidate touches only write_json_to_buf, which is the read /

selectBytes JSON encoder; these lanes do not call it.

TEXT + 32 KB BLOB) +22 %: the unchanged-fanout path calls

resqlite_query_hash (FNV hash over raw cell bytes), not

write_json_to_buf. This is the canonical exp 159 / exp 177 single-pass

drift signature on a sub-ms metric.

the 1000-row and 100-row sibling lanes — the JOURNAL's "sign reversal

across sibling tests" drift signature exp 177's classifier flags.

The release-suite lane that does mechanically touch this change —

Select → JSON Bytes / Large payload (~650 KB) / resqlite selectBytes()

— moves −12.7 %, in the predicted direction. `Streaming /

Unchanged Fanout Throughput (1 canary + 10 unchanged streams)` also

moves −5.5 %, consistent with the encoder-side saving on the initial

emission path (the unchanged short-circuit then hits the hash path).

The focused order-flipped pair on select_bytes_int_heavy.dart is the

load-bearing evidence here; the release-suite single-pass A/B is a

no-regression smoke against the canonical baseline, with the flagged

rows interpreted under the standard JOURNAL drift discriminator.

Decision

In Review (candidate-accepted at the local level). The focused order-flipped pair on select_bytes_int_heavy.dart reproduces a same-direction candidate-faster effect on all five lanes of that harness (−1.6 % / −4.0 %, −6.8 % / −2.8 %, −1.7 % / −3.0 %, −0.6 % / −0.6 %, −3.8 % / −1.9 %), including the mixed-row guard. The integral-real and fractional-real lanes stay inside ±1 %; the wide-cols int lanes carry the same direction; no release-suite lane that calls write_json_to_buf regresses in the single-pass A/B. Bit-identical JSON; `dart test test/database_test.dart test/select_bytes_transfer_test.dart test/query_decoder_test.dart test/stream_test.dart` all pass against the candidate.

The magnitude is smaller than exp 192 /

exp 198 (−7 % to −9 % on the

same lanes) because those experiments removed a multi-byte

formatter / a non-trivial memcpy; what's left for this hoist is the

per-cell buf_ensure branch itself, already compiled down to one

predicted comparison. The win is real but small — the natural last

collapse of the per-cell ensure stack the prior experiments left in

place.

Why kept

The savings are structurally what the diff predicts: zero per-cell

ensures on the fixed-size fast path, identical formatter output, and

the TEXT / BLOB recovery path is a single re-ensure call per

variable-size cell. The change is ~50 net lines of additive C, no

new const data, no new public API surface, and no new allocation.

The fixed-size fast path is also the entry point any future

encoder-side change (per-cell type prediction, schema-driven row

prelude, prepared-statement type cache across rows) would need: it

exposes "this row's fixed-size cells" as a single hoisted reservation

instead of a per-cell scatter.

What this leaves on the table

The remaining per-row cost is dominated by:

one cross-boundary call per cell, the floor any future fixed-cell

encoder will hit. A decltype-driven type cache on the cached

statement would skip the column_type call, but SQLite type affinity

can vary per row in the general case and the audit cost is real.

memcpys, both already on direct-write paths.

see exp 198's "what this leaves on the table").

Operational notes

slow path is now the variable-cell path, gated by per-cell type.

fractional-REAL, embedded-NUL text, and base64 BLOB selectBytes

tests all pass unchanged against the candidate; no test changes

were necessary.

dependence (uses only standard C library memcpy).