Experiment 195: Stmt-cache pre-encoded JSON column-name tokens

Date: 2026-06-23

Status: In Review

Direction: result-transfer-shape

Benchmark Run: none — focused

benchmark/experiments/select_bytes_repeated_calls.dart

order-flipped pair plus the existing

benchmark/experiments/select_bytes_wide_cols.dart

(exp 190's harness) as a no-regression guard. No release-suite run because

the changed path is the per-query setup inside write_json_to_buf reached

only via selectBytes() and the existing release lanes (10K rows / 1K rows

/ Large payload) are dominated by per-row stepping rather than per-query

setup work.

Problem

Exp 190 pre-built each column's

"name": / ,"name": token once at first-row time inside write_json_to_buf

and stored the result in a per-query scratch buffer

(tokens_buf). Subsequent rows pay one buf_write

per column instead of comma + json_write_string + colon.

That amortizes per-row work within a single query. The per-query setup is

still paid in full on every call:

a buf_init malloc/free pair for tokens_buf (worst-case name_len * 6 + 4 per column, but always ≥ 64);

per column: strlen, json_write_string (SWAR scan + escape walk for the name), and two buf_write_char calls (, and :).

For the same prepared SQL re-executed N times, that work runs N times even

though sqlite3_column_name(stmt, i) returns the same pointer for the

lifetime of the prepared statement and the encoded tokens are byte-identical

across re-executions. The C-side statement cache already keeps the prepared

statement, its bind-parameter count (exp 077), its read-tables / column

dependencies (exp 106), and the writer's pre-prepared TX statements (exp 101)

across calls. The JSON tokens are the same shape of "depends on the prepared

SQL only" state and have nowhere to live between calls.
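The token shape described above can be sketched as a small encoder. This is a minimal illustration, not the code in native/resqlite.c: encode_name_token is a hypothetical name, and it escapes control bytes with the generic \u00XX form only, whereas the real json_write_string may use shorter escapes. The sizing comment reflects one reading of the name_len * 6 + 4 bound: six output bytes per input byte in the worst case, plus two quotes, a comma, and a colon.

```c
#include <stddef.h>

/* Hypothetical helper (not from the source): writes one pre-encoded
 * column-name token into out and returns the number of bytes written.
 * Column 0 gets "name": and later columns get ,"name": (the exp 190 shape).
 * out must hold at least strlen(name) * 6 + 4 bytes; no NUL is appended. */
static size_t encode_name_token(char *out, const char *name, int col_index) {
    static const char hex[] = "0123456789abcdef";
    size_t n = 0;

    if (col_index > 0) out[n++] = ',';   /* separator for columns 1+ */
    out[n++] = '"';
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        unsigned char c = *p;
        if (c == '"' || c == '\\') {     /* two-byte escape */
            out[n++] = '\\';
            out[n++] = (char)c;
        } else if (c < 0x20) {           /* six-byte \u00XX escape */
            out[n++] = '\\'; out[n++] = 'u';
            out[n++] = '0';  out[n++] = '0';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0f];
        } else {
            out[n++] = (char)c;          /* copied through unchanged */
        }
    }
    out[n++] = '"';
    out[n++] = ':';
    return n;
}
```

Because the output depends only on the column name and its index, the bytes are the same on every re-execution of the same prepared statement, which is what makes them cacheable.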

Hypothesis

If the pre-encoded JSON column-name tokens are cached on the resqlite_cached_stmt

entry, every selectBytes() call after the first reuses them — eliminating

the per-query buf_init malloc/free pair and the first-row pre-encode walk.

The per-row loop is bit-identical to exp 190 (one buf_write per column per

row), so per-row work stays exactly where it is.

The expected signal:

small-rowset multi-column lanes, where setup is a visible share of wall (~0.5 µs setup vs ~5–10 µs total per call), should move 5–10%;

large-rowset guard lanes should stay neutral: the 0.5 µs savings is ~0.02% of wall.

Acceptance criterion: two order-flipped focused passes must improve the

small-rowset multi-column primary lanes by more than 5%, with the

large-rowset guard neutral, and the existing exp 190 wide_cols.dart

shapes either neutral or improved.
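The expected signal is just the removed setup time as a share of per-call wall time. A minimal sketch of that arithmetic, where setup_share_pct is a hypothetical helper introduced only for illustration:

```c
/* Hypothetical helper: percentage of per-call wall time removed when
 * setup_saved_us microseconds of setup disappear from a call that takes
 * wall_us microseconds in total. */
static double setup_share_pct(double setup_saved_us, double wall_us) {
    return setup_saved_us / wall_us * 100.0;
}
```

With 0.5 µs saved, a 5 µs call gives 10% and a 10 µs call gives 5%, which is the predicted 5–10% band. A ~0.02% share corresponds to a call of roughly 2500 µs; that wall time is inferred from the stated percentage, not measured here.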

Approach

In native/resqlite.c:

added five fields to the resqlite_cached_stmt entry, including json_name_token_offsets, json_name_token_lens, json_name_tokens_len, and json_name_tokens_col_count (the build sentinel);

added one helper that builds the tokens on first call and returns immediately on subsequent calls. The token shape is identical to exp 190 (column 0 emits "name":, columns 1+ emit ,"name":), so the per-row inner loop is unchanged;

changed write_json_to_buf to take the cache entry as one extra argument; the caller (resqlite_query_bytes) already has entry in scope from the preceding get_or_prepare_reader call;

removed the _token_lens_stack[64] + _col_names_stack[64] + _col_name_lens_stack[64] locals and the matching free() paths from write_json_to_buf's cleanup;

extended stmt_cache_entry_dispose so cached buffers are freed on entry eviction and connection close.

A degenerate col_count <= 0 case (no columns to encode) flags the entry

as built via json_name_tokens_col_count = -1 so the build path is not

re-entered.

The change does not alter any user-visible JSON output — the encoded

tokens are byte-identical to exp 190.
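The build-once behaviour described above can be sketched as follows. Everything here is illustrative: cached_stmt_sketch stands in for the real cache entry, json_name_tokens is an assumed name for the byte buffer, ensure_name_tokens and dispose_name_tokens are hypothetical helpers, and the sentinel convention (0 = not built, -1 = built with no columns) is an inference from the text. Name escaping is omitted, and the column names are passed in as an array rather than read through sqlite3_column_name, so the sketch compiles without SQLite.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the cache entry; only token fields are shown. */
typedef struct {
    char   *json_name_tokens;           /* assumed name: tokens back to back */
    size_t *json_name_token_offsets;    /* start of token i in the buffer */
    size_t *json_name_token_lens;       /* byte length of token i */
    size_t  json_name_tokens_len;       /* total bytes used */
    int     json_name_tokens_col_count; /* 0 = not built, -1 = no columns */
} cached_stmt_sketch;

/* Builds the tokens on the first call and returns immediately afterwards.
 * Returns 0 on success, -1 on allocation failure. */
static int ensure_name_tokens(cached_stmt_sketch *e,
                              const char *const *names, int col_count) {
    if (e->json_name_tokens_col_count != 0) return 0;  /* already built */
    if (col_count <= 0) {                               /* degenerate case */
        e->json_name_tokens_col_count = -1;
        return 0;
    }

    size_t total = 0;                   /* comma + two quotes + colon each */
    for (int i = 0; i < col_count; i++) total += strlen(names[i]) + 4;

    char   *buf  = malloc(total);
    size_t *offs = malloc((size_t)col_count * sizeof *offs);
    size_t *lens = malloc((size_t)col_count * sizeof *lens);
    if (!buf || !offs || !lens) {
        free(buf); free(offs); free(lens);
        return -1;
    }

    size_t n = 0;
    for (int i = 0; i < col_count; i++) {
        size_t name_len = strlen(names[i]);
        offs[i] = n;
        if (i > 0) buf[n++] = ',';
        buf[n++] = '"';
        memcpy(buf + n, names[i], name_len);  /* escaping omitted here */
        n += name_len;
        buf[n++] = '"';
        buf[n++] = ':';
        lens[i] = n - offs[i];
    }

    e->json_name_tokens = buf;
    e->json_name_token_offsets = offs;
    e->json_name_token_lens = lens;
    e->json_name_tokens_len = n;
    e->json_name_tokens_col_count = col_count;
    return 0;
}

/* Frees the cached buffers, as would happen on eviction or close. */
static void dispose_name_tokens(cached_stmt_sketch *e) {
    free(e->json_name_tokens);
    free(e->json_name_token_offsets);
    free(e->json_name_token_lens);
    memset(e, 0, sizeof *e);
}
```

The per-row loop would then copy json_name_token_lens[i] bytes from json_name_tokens + json_name_token_offsets[i] before each value, which is the same single write per column that exp 190 performs.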

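Added the focused harness

benchmark/experiments/select_bytes_repeated_calls.dart

with primary lanes at 1 row (8 int, 20 int, and 8 mixed cols) and 10 rows (8 and 20 int cols), plus guard lanes at 100 and 1000 rows × 8 int cols (where per-row work should dominate any per-query setup savings).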

Each lane reports median microseconds per call over 1000 calls/sample ×

11 samples, after a 16-call warm-up, so per-call savings around 0.5 µs are

visible above the harness floor.

Results

Focused select_bytes_repeated_calls.dart, two order-flipped passes

(medians in µs/call):

| Lane | Pass 1 baseline | Pass 1 candidate | Δ | Pass 2 baseline | Pass 2 candidate | Δ |
|---|---|---|---|---|---|---|
| 1 row × 8 int cols | 10.415 | 10.159 | −2.5% | 10.019 | 9.936 | −0.8% |
| 1 row × 20 int cols | 5.730 | 5.204 | −9.2% | 5.355 | 4.967 | −7.2% |
| 1 row × 8 mixed cols | 5.084 | 5.045 | −0.8% | 4.751 | 4.752 | +0.0% |
| 10 rows × 8 int cols | 7.332 | 7.128 | −2.8% | 6.936 | 7.001 | +0.9% |
| 10 rows × 20 int cols | 11.027 | 10.460 | −5.1% | 10.304 | 10.027 | −2.7% |
| 100 rows × 8 int cols | 29.750 | 30.296 | +1.8% | 29.936 | 29.120 | −2.7% |
| 1000 rows × 8 int cols | 251.270 | 253.841 | +1.0% | 251.156 | 246.973 | −1.7% |

Pass 1 ran baseline first then candidate; Pass 2 ran candidate first then

baseline. The 1-row × 20 col and 10-row × 20 col primary lanes reproduce

the predicted shape — same-direction movement across the flip, magnitude

inside the same band. 1 × 8 mixed barely moves: the per-query setup is

already < 1 µs at this shape and the harness sees it inside variance. The

100 / 1000 row guards show sign reversal across the flip (+1.8% / −2.7%

and +1.0% / −1.7%), which under exp 177's drift discriminator rule

classifies as drift-suspected rather than reproduced — consistent with

"no real movement" at sizes where per-row work dominates.

Exp 190's existing select_bytes_wide_cols.dart harness, two order-flipped

passes (medians in ms/call):

| Lane | Pass 1 baseline | Pass 1 candidate | Δ | Pass 2 baseline | Pass 2 candidate | Δ |
|---|---|---|---|---|---|---|
| 10k rows × 8 int cols | 2.479 | 2.426 | −2.1% | 2.467 | 2.427 | −1.6% |
| 10k rows × 20 int cols | 5.868 | 5.789 | −1.3% | 6.178 | 5.770 | −6.6% |
| 10k rows × 8 mixed cols | 2.660 | 2.534 | −4.7% | 2.618 | 2.586 | −1.2% |
| 10k rows × 20 mixed cols | 6.257 | 6.077 | −2.9% | 6.216 | 6.143 | −1.2% |
| 10k rows × 2 int cols | 0.740 | 0.714 | −3.5% | 0.737 | 0.711 | −3.5% |
| 1 row × 5 mixed cols | 0.018 | 0.018 | n/a | 0.017 | 0.017 | n/a |
| 100 rows × 5 mixed cols | 0.033 | 0.033 | n/a | 0.034 | 0.033 | n/a |

Same-direction movement on every 10k-row lane across the flip. The 1-row

and 100-row lanes on this harness sit at the harness's millisecond-reporting

resolution floor — the focused select_bytes_repeated_calls.dart harness

above is the correct lane for those shapes.

Decision

Accepted / In Review. The per-query setup amortization is small but

reproducible on the predicted shape: the 1-row × 20-col lane moves 7–9%

and the 10-row × 20-col lane moves 3–5%, same-direction across two

order-flipped passes, where the ~0.5 µs of removed per-query work is a

measurable fraction of total wall.

The 10k-row lanes also trend candidate-faster across both passes on every

shape, consistent with eliminating the per-query malloc(64) + free

pair from the hot path. The 100 / 1000-row guards show drift-suspected

sign reversal — consistent with "no real movement" where per-row work

dominates per-query setup by 100× or more.
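The same-direction versus sign-reversal reading used above can be sketched as a tiny classifier. This is an assumption-laden simplification: exp 177's actual drift discriminator is not reproduced in this write-up, so the sketch only checks whether the two order-flipped deltas share a sign. pct_delta, classify_lane, and the enum names are hypothetical.

```c
/* Hypothetical verdicts for one lane measured in two order-flipped passes. */
typedef enum { LANE_REPRODUCED, LANE_DRIFT_SUSPECTED } lane_verdict;

/* Relative change of candidate against baseline, in percent.
 * Negative means the candidate was faster. */
static double pct_delta(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

/* Sign-agreement check only (an assumed reduction of the exp 177 rule):
 * both passes must move in the same direction to count as reproduced. */
static lane_verdict classify_lane(double delta_pass1, double delta_pass2) {
    int same_sign = (delta_pass1 < 0.0 && delta_pass2 < 0.0) ||
                    (delta_pass1 > 0.0 && delta_pass2 > 0.0);
    return same_sign ? LANE_REPRODUCED : LANE_DRIFT_SUSPECTED;
}
```

Fed the medians from the focused table, the 1 row × 20 int cols lane (5.730 to 5.204, then 5.355 to 4.967) comes out as reproduced, while the 100 rows × 8 int cols lane (29.750 to 30.296, then 29.936 to 29.120) comes out as drift-suspected.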

The implementation is bounded — five new fields on the cache entry, one

new helper, a one-arg signature change on write_json_to_buf. Memory cost

per cached statement is O(col_count * 8 + name_byte_count) bytes, capped

by STMT_CACHE_MAX = 32 per reader/writer connection (~tens of KB worst

case for a fully populated cache of wide schemas). Public JSON output is

byte-identical to exp 190.
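The memory bound above works out as a short calculation. The sketch below takes the stated O(col_count * 8 + name_byte_count) cost literally as a byte count and multiplies by the STMT_CACHE_MAX = 32 cap; token_cache_bytes and full_cache_bytes are hypothetical helpers, and the 20-column, 240-token-byte schema in the note that follows is an invented example, not a measured one.

```c
#include <stddef.h>

#define STMT_CACHE_MAX 32  /* per-connection cap quoted in the text */

/* Approximate token-cache bytes for one cached statement:
 * 8 bytes of per-column bookkeeping plus the encoded token bytes. */
static size_t token_cache_bytes(size_t col_count, size_t name_byte_count) {
    return col_count * 8 + name_byte_count;
}

/* Upper bound for one connection whose cache is full of such statements. */
static size_t full_cache_bytes(size_t col_count, size_t name_byte_count) {
    return (size_t)STMT_CACHE_MAX * token_cache_bytes(col_count, name_byte_count);
}
```

For a 20-column statement with 240 bytes of encoded tokens, this gives 400 bytes per entry and 12,800 bytes for a full cache, which lines up with the "hundreds of bytes per entry" and "tens of KB worst case" figures.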

The release suite is not the right denominator for this change: the

public Select Bytes lanes either return 1K+ rows (per-row work dominates)

or are large-payload-bounded (per-row work amplified). The durable gate

is select_bytes_repeated_calls.dart for small repeated queries plus

select_bytes_wide_cols.dart for the broader compatibility band.

Future Notes

Future per-cell value-emission work (e.g. a fractional-REAL fast path following exp 194) should not

need to touch the token cache. The cache only stores invariant token

bytes; per-cell value emission is the same code path as today.

The stack-array / heap-fallback split that exp 190 used for wide schemas is no longer needed: the cache entry's

per-column arrays are heap-allocated unconditionally, paid once per

statement instead of once per query.

For workloads with many distinct wide prepared SQLs, the per-statement cache cost (~hundreds of bytes per

entry) becomes more visible at the existing STMT_CACHE_MAX = 32 cap.

Reopen if release-suite memory probes flag the cache size.

Prefer a statement-cache entry over exp 190-style token pre-encoding for any other future per-query

invariant work (e.g. per-column type lookups if a future workload

shows type dispatch is hot). Pattern-match on "per-query, depends only

on the prepared stmt" before reaching for a per-query scratch buffer.

Validation