Experiment 195: Stmt-cache pre-encoded JSON column-name tokens
Date: 2026-06-23
Status: In Review
Direction: result-transfer-shape
Benchmark Run: none — focused
benchmark/experiments/select_bytes_repeated_calls.dart
order-flipped pair plus the existing
benchmark/experiments/select_bytes_wide_cols.dart
(exp 190's harness) as a no-regression guard. No release-suite run because
the changed path is the per-query setup inside write_json_to_buf reached
only via selectBytes() and the existing release lanes (10K rows / 1K rows
/ Large payload) are dominated by per-row stepping rather than per-query
setup work.
Problem
Exp 190 pre-built each column's
"name": / ,"name": token once at first-row time inside write_json_to_buf
and stored the result in a per-query scratch buffer
(tokens_buf). Subsequent rows pay one buf_write
per column instead of comma + json_write_string + colon.
That amortizes per-row work within a single query. The per-query setup is
still paid in full on every call:
- one buf_init(&tokens_buf, ...) → 64-byte malloc per selectBytes() call
(worst-case name_len * 6 + 4 per column, but always ≥ 64);
- one walk over every column at first-row time, calling sqlite3_column_name,
strlen, json_write_string (SWAR scan + escape walk for the name), and
two buf_write_char calls (, and :);
- a matching free(tokens_buf.data) at the end.
For the same prepared SQL re-executed N times, that work runs N times even
though sqlite3_column_name(stmt, i) returns the same pointer for the
lifetime of the prepared statement and the encoded tokens are byte-identical
across re-executions. The C-side statement cache already keeps the prepared
statement, its bind-parameter count (exp 077), its read-tables / column
dependencies (exp 106), and the writer's pre-prepared TX statements (exp 101)
across calls. The JSON tokens are the same kind of state, depending only on
the prepared SQL, yet they have nowhere to live between calls.
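As a rough illustration of the token shape and the size bound described above, the sketch below encodes one column-name token. The helper names token_worst_case and encode_name_token are hypothetical and are not taken from resqlite.c. The escaping covers only the quote and backslash characters, which is a simplification of what json_write_string presumably does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Worst-case size of one token: every name byte may expand to a six-byte
 * \u00XX escape, plus two quotes, one colon, and one leading comma. */
static size_t token_worst_case(size_t name_len) {
    return name_len * 6 + 4;
}

/* Hypothetical helper: write the token for column `index` into `out` and
 * return its length. Column 0 emits "name": and later columns emit ,"name":
 * so the per-row loop never writes a separator itself. Only '"' and '\\'
 * are escaped here; a full encoder would also escape control characters. */
static size_t encode_name_token(int index, const char *name, char *out) {
    size_t n = 0;
    if (index > 0) out[n++] = ',';
    out[n++] = '"';
    for (const char *p = name; *p != '\0'; p++) {
        if (*p == '"' || *p == '\\') out[n++] = '\\';
        out[n++] = *p;
    }
    out[n++] = '"';
    out[n++] = ':';
    assert(n <= token_worst_case(strlen(name)));
    return n;
}
```

For columns named id and name, this yields a 5-byte token for column 0 and an 8-byte token for column 1. Those bytes depend only on the column names, which is why rebuilding them on every call is redundant.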
Hypothesis
If the pre-encoded JSON column-name tokens are cached on the resqlite_cached_stmt
entry, every selectBytes() call after the first reuses them — eliminating
the per-query buf_init malloc/free pair and the first-row pre-encode walk.
The per-row loop is bit-identical to exp 190 (one buf_write per column per
row), so per-row work stays exactly where it is.
The expected signal:
- small rowsets where the per-query setup is a measurable fraction of total
wall (~0.5 µs setup vs ~5–10 µs total per call) should move 5–10%;
- large rowsets where per-row work dominates (~3 ms total per call) should
stay neutral — the 0.5 µs savings is ~0.02% of wall;
- mixed-column shapes should track the same direction as int-only shapes.
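The bands above follow from dividing the estimated setup cost by the per-call wall time. Taking the ~0.5 µs, ~5–10 µs, and ~3 ms figures at face value, and assuming the whole setup cost is removed, the arithmetic works out as:

```latex
\frac{0.5\,\mu\mathrm{s}}{5\,\mu\mathrm{s}} = 10\%, \qquad
\frac{0.5\,\mu\mathrm{s}}{10\,\mu\mathrm{s}} = 5\%, \qquad
\frac{0.5\,\mu\mathrm{s}}{3000\,\mu\mathrm{s}} \approx 0.017\% \approx 0.02\%
```

These are upper bounds on the achievable improvement, since any setup work that survives the change reduces the numerator.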
Acceptance criterion: two order-flipped focused passes must improve the
small-rowset multi-column primary lanes by more than 5%, with the
large-rowset guard neutral, and the existing exp 190 wide_cols.dart
shapes either neutral or improved.
Approach
- extend resqlite_cached_stmt with json_name_tokens_buf,
json_name_token_offsets, json_name_token_lens, json_name_tokens_len,
and json_name_tokens_col_count (the build sentinel);
- add ensure_json_name_tokens(entry, stmt, col_count), which builds the
tokens on first call and returns immediately on subsequent calls. The
token shape is identical to exp 190 — column 0 emits "name":,
columns 1+ emit ,"name": — so the per-row inner loop is unchanged;
- change write_json_to_buf(stmt, b) to write_json_to_buf(stmt, entry, b);
the caller (resqlite_query_bytes) already has entry in scope from the
preceding get_or_prepare_reader call;
- remove the per-query tokens_buf + _token_offsets_stack[64] +
_token_lens_stack[64] + _col_names_stack[64] + _col_name_lens_stack[64]
locals and the matching free() paths from write_json_to_buf's cleanup;
- add stmt_cache_entry_clear_json_name_tokens(entry) and call it from
stmt_cache_entry_dispose so cached buffers are freed on entry eviction
and connection close.
A degenerate col_count <= 0 case (no columns to encode) flags the entry
as built via json_name_tokens_col_count = -1 so the build path is not
re-entered.
The change does not alter any user-visible JSON output — the encoded
tokens are byte-identical to exp 190.
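A minimal sketch of the build-once pattern described above, under several assumptions. The cached_entry struct is a hypothetical stand-in that carries only the five token fields. The names array replaces sqlite3_column_name(stmt, i) so the example compiles without SQLite. A sentinel value of 0 is assumed to mean "not built". Name escaping is omitted. The real helper in resqlite.c may differ on all of these points.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the token fields on resqlite_cached_stmt. */
typedef struct {
    char *json_name_tokens_buf;       /* all tokens, stored back to back */
    int  *json_name_token_offsets;    /* start of each column's token */
    int  *json_name_token_lens;       /* length of each column's token */
    int   json_name_tokens_len;       /* bytes used in the token buffer */
    int   json_name_tokens_col_count; /* sentinel: 0 = not built (assumed) */
} cached_entry;

/* Free cached token state; safe on an entry that was never built. */
static void clear_json_name_tokens(cached_entry *e) {
    free(e->json_name_tokens_buf);
    free(e->json_name_token_offsets);
    free(e->json_name_token_lens);
    e->json_name_tokens_buf = NULL;
    e->json_name_token_offsets = NULL;
    e->json_name_token_lens = NULL;
    e->json_name_tokens_len = 0;
    e->json_name_tokens_col_count = 0;
}

/* Build the tokens on the first call and return immediately afterwards.
 * Returns 0 on success and -1 on allocation failure. */
static int ensure_json_name_tokens(cached_entry *e, const char *const *names,
                                   int col_count) {
    if (e->json_name_tokens_col_count != 0) return 0; /* already built */
    if (col_count <= 0) {                             /* nothing to encode */
        e->json_name_tokens_col_count = -1;
        return 0;
    }
    size_t cap = 0;
    for (int i = 0; i < col_count; i++) cap += strlen(names[i]) + 4;
    char *buf = malloc(cap);
    int *offs = malloc((size_t)col_count * sizeof *offs);
    int *lens = malloc((size_t)col_count * sizeof *lens);
    if (!buf || !offs || !lens) {
        free(buf); free(offs); free(lens);
        return -1;
    }
    int n = 0;
    for (int i = 0; i < col_count; i++) {
        size_t len = strlen(names[i]);
        offs[i] = n;
        if (i > 0) buf[n++] = ',';   /* columns 1+ carry their own comma */
        buf[n++] = '"';
        memcpy(buf + n, names[i], len);
        n += (int)len;
        buf[n++] = '"';
        buf[n++] = ':';
        lens[i] = n - offs[i];
    }
    assert((size_t)n <= cap);
    e->json_name_tokens_buf = buf;
    e->json_name_token_offsets = offs;
    e->json_name_token_lens = lens;
    e->json_name_tokens_len = n;
    e->json_name_tokens_col_count = col_count;
    return 0;
}
```

A second call with the same entry leaves the buffer pointer untouched, so the malloc/free pair and the name walk are paid once per cached statement. The clear helper corresponds to what the dispose path would call on eviction or connection close.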
Added the focused harness
benchmark/experiments/select_bytes_repeated_calls.dart
with:
- 1 row × 8 / 20 int cols, 1 row × 8 mixed cols (primary small-rowset lanes);
- 10 rows × 8 / 20 int cols (mid-size primary lanes);
- 100 rows × 8 int cols, 1000 rows × 8 int cols (regression guards where
per-row work should dominate any per-query setup savings).
Each lane reports median microseconds per call over 1000 calls/sample ×
11 samples, after a 16-call warm-up, so per-call savings around 0.5 µs are
visible above the harness floor.
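The actual harness is written in Dart. The C sketch below only illustrates the sampling scheme described above (16 warm-up calls, then 11 samples of 1000 calls each, reporting the median per-call time). All function names are hypothetical, and clock_gettime with CLOCK_MONOTONIC is assumed to be available (POSIX).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

enum { WARMUP_CALLS = 16, CALLS_PER_SAMPLE = 1000, SAMPLES = 11 };

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n values; sorts the array in place. */
static double median(double *v, int n) {
    assert(n > 0);
    qsort(v, (size_t)n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Monotonic clock reading in microseconds. */
static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Warm up, then time SAMPLES batches and return the median µs per call. */
static double median_us_per_call(void (*call)(void *), void *ctx) {
    double per_call[SAMPLES];
    for (int i = 0; i < WARMUP_CALLS; i++) call(ctx);
    for (int s = 0; s < SAMPLES; s++) {
        double start = now_us();
        for (int i = 0; i < CALLS_PER_SAMPLE; i++) call(ctx);
        per_call[s] = (now_us() - start) / CALLS_PER_SAMPLE;
    }
    return median(per_call, SAMPLES);
}

/* Trivial workload used to show the call shape: counts invocations. */
static void count_call(void *ctx) {
    ++*(long *)ctx;
}
```

Timing a batch of 1000 calls and dividing keeps the clock overhead well below the ~0.5 µs effect being measured, and taking the median of 11 batches discards outlier samples.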
Results
Focused select_bytes_repeated_calls.dart, two order-flipped passes
(medians in µs/call):
| Lane | Pass 1 baseline | Pass 1 candidate | Δ | Pass 2 baseline | Pass 2 candidate | Δ |
|---|---|---|---|---|---|---|
| 1 row × 8 int cols | 10.415 | 10.159 | −2.5% | 10.019 | 9.936 | −0.8% |
| 1 row × 20 int cols | 5.730 | 5.204 | −9.2% | 5.355 | 4.967 | −7.2% |
| 1 row × 8 mixed cols | 5.084 | 5.045 | −0.8% | 4.751 | 4.752 | +0.0% |
| 10 rows × 8 int cols | 7.332 | 7.128 | −2.8% | 6.936 | 7.001 | +0.9% |
| 10 rows × 20 int cols | 11.027 | 10.460 | −5.1% | 10.304 | 10.027 | −2.7% |
| 100 rows × 8 int cols | 29.750 | 30.296 | +1.8% | 29.936 | 29.120 | −2.7% |
| 1000 rows × 8 int cols | 251.270 | 253.841 | +1.0% | 251.156 | 246.973 | −1.7% |
Pass 1 ran baseline first then candidate; Pass 2 ran candidate first then
baseline. The 1-row × 20 col and 10-row × 20 col primary lanes reproduce
the predicted shape — same-direction movement across the flip, magnitude
inside the same band. 1 × 8 mixed barely moves: the per-query setup is
already < 1 µs at this shape and the harness sees it inside variance. The
100 / 1000 row guards show sign reversal across the flip (+1.8% / −2.7%
and +1.0% / −1.7%), which under exp 177's drift discriminator rule
classifies as drift-suspected rather than reproduced — consistent with
"no real movement" at sizes where per-row work dominates.
Exp 190's existing select_bytes_wide_cols.dart harness, two order-flipped
passes (medians in ms/call):
| Lane | Pass 1 baseline | Pass 1 candidate | Δ | Pass 2 baseline | Pass 2 candidate | Δ |
|---|---|---|---|---|---|---|
| 10k rows × 8 int cols | 2.479 | 2.426 | −2.1% | 2.467 | 2.427 | −1.6% |
| 10k rows × 20 int cols | 5.868 | 5.789 | −1.3% | 6.178 | 5.770 | −6.6% |
| 10k rows × 8 mixed cols | 2.660 | 2.534 | −4.7% | 2.618 | 2.586 | −1.2% |
| 10k rows × 20 mixed cols | 6.257 | 6.077 | −2.9% | 6.216 | 6.143 | −1.2% |
| 10k rows × 2 int cols | 0.740 | 0.714 | −3.5% | 0.737 | 0.711 | −3.5% |
| 1 row × 5 mixed cols | 0.018 | 0.018 | n/a | 0.017 | 0.017 | n/a |
| 100 rows × 5 mixed cols | 0.033 | 0.033 | n/a | 0.034 | 0.033 | n/a |
Same-direction movement on every 10k-row lane across the flip. The 1-row
and 100-row lanes on this harness sit at the harness's millisecond-reporting
resolution floor — the focused select_bytes_repeated_calls.dart harness
above is the correct lane for those shapes.
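The Δ columns and the order-flip reading used in this section can be expressed as a short sketch. It encodes only the sign-agreement part of the rule as this document describes it. Exp 177's full discriminator may include thresholds not shown here, and the names below are hypothetical.

```c
#include <assert.h>

typedef enum { LANE_REPRODUCED, LANE_DRIFT_SUSPECTED } lane_verdict;

/* Relative change of candidate against baseline, in percent.
 * Negative means the candidate was faster. */
static double delta_pct(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

/* Same sign in both order-flipped passes counts as reproduced; a sign
 * reversal (or a zero delta) counts as drift-suspected. */
static lane_verdict classify_lane(double pass1_pct, double pass2_pct) {
    return (pass1_pct * pass2_pct > 0.0) ? LANE_REPRODUCED
                                         : LANE_DRIFT_SUSPECTED;
}
```

Applied to the tables above, the 1 row × 20 int cols lane (5.730 → 5.204 and 5.355 → 4.967) is negative in both passes and classifies as reproduced, while the 100 rows × 8 int cols guard (29.750 → 30.296 and 29.936 → 29.120) flips sign and classifies as drift-suspected.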
Decision
Accepted / In Review. The per-query setup amortization is small but
reproducible on the predicted shape: the 1-row × 20-col lane moves 7–9% and
the 10-row × 20-col lane 3–5%, same-direction across two order-flipped
passes, where the ~0.5 µs of removed per-query work is a measurable fraction
of total wall.
The 10k-row lanes also trend candidate-faster across both passes on every
shape, consistent with eliminating the per-query malloc(64) + free
pair from the hot path. The 100 / 1000-row guards show drift-suspected
sign reversal, consistent with "no real movement" where per-row work
dominates per-query setup by roughly 60× (100 rows) to 500× (1000 rows).
The implementation is bounded — five new fields on the cache entry, one
new helper, a one-arg signature change on write_json_to_buf. Memory cost
per cached statement is O(col_count * 8 + name_byte_count) bytes, capped
by STMT_CACHE_MAX = 32 per reader/writer connection (~tens of KB worst
case for a fully populated cache of wide schemas). Public JSON output is
byte-identical to exp 190.
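To make the memory bound concrete, the sketch below evaluates it for one illustrative schema. It assumes the offsets and lengths are stored as two 4-byte integers per column, which is one reading of the col_count * 8 term. The 64-column, 16-byte-name schema is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { STMT_CACHE_MAX = 32 };

/* Bytes held by one cached entry: 8 bytes of index data per column
 * (assumed: a 4-byte offset plus a 4-byte length) plus the token bytes. */
static size_t entry_token_bytes(size_t col_count, size_t token_bytes) {
    return col_count * 8 + token_bytes;
}

/* Upper bound for one connection when every cache slot holds that shape. */
static size_t cache_token_bytes(size_t col_count, size_t token_bytes) {
    return (size_t)STMT_CACHE_MAX * entry_token_bytes(col_count, token_bytes);
}
```

For 64 columns with 16-byte names and no escaping, the tokens take 64 × 20 − 1 = 1279 bytes (column 0 has no leading comma), so one entry holds 1791 bytes and a full cache holds 57,312 bytes, about 56 KB.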
The release suite is not the right denominator for this change: the
public Select Bytes lanes either return 1K+ rows (per-row work dominates)
or are large-payload-bounded (per-row work amplified). The durable gate
is select_bytes_repeated_calls.dart for small repeated queries plus
select_bytes_wide_cols.dart for the broader compatibility band.
Future Notes
- Exp 190's per-row inner loop is unchanged. Any future per-row encoder
work (e.g. a fractional-REAL fast path following exp 194) should not
need to touch the token cache. The cache only stores invariant token
bytes; per-cell value emission is the same code path as today.
- The _token_offsets_stack[64] / _token_lens_stack[64] stack fallback
exp 190 used for wide schemas is no longer needed: the cache entry's
per-column arrays are heap-allocated unconditionally, paid once per
statement instead of once per query.
- If a future workload calls selectBytes() with hundreds of distinct
prepared SQLs, the per-statement cache cost (~hundreds of bytes per
entry) becomes more visible at the existing STMT_CACHE_MAX = 32 cap.
Reopen if release-suite memory probes flag the cache size.
- The same "promote to stmt-cache" shape used here for exp 190's token
pre-encoding could plausibly apply to any other per-query invariant work
(e.g. per-column type lookups if a future workload
shows type dispatch is hot). Pattern-match on "per-query, depends only
on the prepared stmt" before reaching for a per-query scratch buffer.
Validation
- dart pub get
- dart format benchmark/experiments/select_bytes_repeated_calls.dart
- dart analyze --fatal-infos native/resqlite.c benchmark/experiments/select_bytes_repeated_calls.dart
- dart test test/database_test.dart (52 tests pass)
- dart run benchmark/experiments/select_bytes_repeated_calls.dart on baseline and candidate in baseline-first order
- dart run benchmark/experiments/select_bytes_repeated_calls.dart on candidate and baseline in candidate-first order
- dart run benchmark/experiments/select_bytes_wide_cols.dart on baseline and candidate in both orders (no-regression guard on exp 190's shapes)