Experiment 069: SQL fingerprint in stmt cache

Date: 2026-04-16

Status: Deferred (proper normalization requires SQL rewriter)

Problem

The C statement cache keys on the raw SQL string. Applications that build

SQL with string concatenation instead of bind parameters thrash the cache:

 db.execute('SELECT * FROM users WHERE id = 5'); db.execute('SELECT * FROM users WHERE id = 6'); db.execute('SELECT * FROM users WHERE id = 7');

Each SQL string is distinct → each creates a separate cache entry → with a

32-entry cap, a tight loop like this evicts useful entries and re-prepares

constantly. The fix in principle: normalize literals to ? before caching,

so all three variants share one prepared statement.

Analysis (Why Deferred, Not Rejected)

The team's proposal: "Normalize literals → ? before stmt_cache_lookup…

SQLite's own sqlite3_normalize() is an option if you enable that build

flag."

Two problems with implementation:

sqlite3_normalized_sql() takes a prepared statement as input, not

raw SQL. It's designed to tell you what a compiled statement's

normalized form looks like — useful for logging and statement tracing,

not for pre-prepare lookup. To use it as a cache key, we'd have to

prepare first, normalize, then look up — by which point we've already

paid the prepare cost we were trying to avoid.

Even if we had a pre-prepare normalizer, reusing a prepared statement

across different literal values requires rewriting the SQL + binding

the extracted literals as parameters. A prepared stmt compiled for

WHERE id = 5 has the literal 5 baked into its bytecode; we can't

step it with id = 6 without preparing a new statement. So

normalization must be accompanied by actual SQL rewriting — replace

each literal with a ?, collect literals into a parameter list, and

bind them before step.

SQL rewriting requires a full tokenizer that handles:

String literals (single-quoted), with embedded escaped quotes
Blob literals (X'…')
Numeric literals including negative, hex, and scientific forms
Comments (-- and //)
Identifiers (which may look like literals if unquoted)
Multi-statement SQL (DDL + DML mixes)

This is ~300-500 lines of careful C with significant edge-case risk.

Not a "quick win."

Simpler Variants Considered

Hash-based lookup (replace memcmp(sql) with int compare): doesn't

improve hit rate, only lookup speed. Current 32-entry linear scan with

memcmp is already cache-resident and fast. Below noise.

Grow cache size 32→128 (team's idea 10): doesn't solve the thrashing

problem for apps that generate unbounded distinct queries; just postpones

the eviction. Measure eviction rate first to decide if this is worth it.

Decision

Deferred. The impact is real for apps that don't use bind parameters,

but a proper implementation is a 300+-line SQL rewriter with edge-case

risk that doesn't match the "quick win" framing. Revisit if:

A production app surfaces prepare-storm telemetry suggesting this is

the bottleneck.

We build out a proper sqlite3_normalize() wrapper (possibly via

SQLITE_ENABLE_NORMALIZE compile flag + a custom pre-prepare pass).

Experiment 003 (C-level connection + statement cache) — original design
Experiment 059 (row count hint) — adjacent; also about cache metadata

Experiment 069: SQL fingerprint in stmt cache

Problem

Analysis (Why Deferred, Not Rejected)

Simpler Variants Considered

Decision

Related