A prior fix stopped /tutorials from crashing, but the page now silently shows only 10 of 30 tutorials and replaces every em-dash with a question-mark-looking character — the underlying UTF-8 handling bug is still there.
Canonical message
Multibyte UTF-8 (em-dash, arrow, CJK) in db.query() results arrives at HTTP response body as U+FFFD; loops over json.get(rows.N.title) terminate early. Compiler-only 30-item repro using string literal + concat + deep json.get + loop reproduces NONE of the symptoms on cln 0.30.323 (see tests/cln/bugfixes/utf8_multibyte_json_get_roundtrip.cln in clean-language-compiler — all 30 multibyte titles render correctly). Corruption is NOT in compiler-emitted WASM. The remaining candidates are: (1) the actual db.query bridge implementation under load (node-server's utf8-roundtrip.test.ts apparently does not cover the real production code path), or (2) the host runtime's memory/buffer handling under the 16 MB pre-grow workaround introduced in 0.1.59.
node-server
0.1.59
bridge
1
1
81.0
2026-06-19 13:50:44
2026-06-19 13:50:44
Latest Report Context
regression
unknown
PARTIAL FIX of report 34a3e2c7-b712-4b4a-9515-dca9650bc9da: clean-node-server 0.1.59 stops the WASM heap exhaustion crash, but the underlying UTF-8 multibyte handling bug is still present — responses contain U+FFFD replacement characters and rendering silently truncates after partial content.
// Same minimal repro as #34a3e2c7 — but the regression is that 0.1.59 now
// returns 200 with corrupted body instead of crashing.
plugins:
frame.server
frame.data
endpoints server:
GET "/repro" :
// Any string from a db.query() result that contains a 3-byte UTF-8 char
string sql = "SELECT 'Tutorials — Learn Clean Language' as title"
string result = db.query(sql, "[]")
string title = json.get(result, "data.rows.0.title")
return http.respond(200, "text/plain", title)
// Expected response body: Tutorials — Learn Clean Language
// Actual response body: Tutorials ��� Learn Clean Language
// (each em-dash byte E2 80 94 becomes 3 separate U+FFFD chars EF BF BD)
// To reproduce the silent truncation, add a loop that walks json.get for
// many items where any value contains a 3-byte UTF-8 char — rendering stops
// after ~10 items even though the loop should iterate 30 times.
Fix the root cause in the JSON bridge: every length calculation that crosses the Node/WASM boundary must use UTF-8 byte length (Buffer.byteLength(str, 'utf8')), never str.length (UTF-16 code units).
When the fix is correct, the same WASM + same DB should produce:
- 30 tutorial cards (3 track sections, 10 cards each)
- Em-dashes (`—`), arrows (`→`), smart quotes (`""`), and ñ in "Español" rendered correctly as their actual code points
- 0 U+FFFD characters
- Response body ~50KB
A useful self-test the team can add: load any JSON containing every BMP 3-byte UTF-8 sequence (em-dash U+2014, arrow U+2192, ellipsis U+2026, smart quotes U+201C/U+201D), pass it through json.get → string concat → http.respond, and verify zero U+FFFD in the output.
After upgrading clean-node-server from 0.1.55 to 0.1.59 to pick up the marked-resolved fix for report #34a3e2c7 (heap exhaustion / WASM crash on /tutorials):
**Setup (identical to the original repro):**
- Same WASM (MD5 14b8276e28c320842e4cb32e579ba84c, compiled with cln 0.30.316 + frame.server 2.7.9)
- Same production database (clean_website, 30 tutorial rows with em-dashes and other multibyte UTF-8)
- Tested via SSH tunnel from local clean-node-server 0.1.59 to prod MySQL on the droplet
**Result:**
- ✅ HTTP 200 (no more 500)
- ✅ Response in 1.7-2.6s (no longer OOM-kills the worker)
- ✅ Worker RSS stable at ~260MB across multiple requests
- ❌ Response body is **12,798 bytes** (full /tutorials page should be ~50KB)
- ❌ Only **10 of 30** tutorial cards rendered (only the "web-app" track at sort_order 11–20)
- ❌ The "first-steps" track (1–10) and "ai-coding" track (21–30) are silently missing
- ❌ Only **1 of 3** `
` blocks present
- ❌ …
AI Analysis
src/bridge/json.ts
Discovered during: Verifying the resolution of #34a3e2c7 against the production WASM and production database, after the dashboard marked it resolved.
The 0.1.59 fix is a workaround at the heap-allocation layer (per the description of the related #76d3b529 fix, "Pre-grow WASM memory to 16 MB at startup so memory.grow() is never called mid-request"). This stops the runaway allocation from triggering DataView detachment, but it doesn't fix the underlying string-length miscalculation that was driving the runaway allocation in the first place.
What's still broken: somewhere in the JSON bridge between Node's UTF-16 String and the WASM linear memory's UTF-8 buffer, the byte length and the character length are being confused. Most likely site: in writeLengthPrefixedString or its read counterpart, the cursor advances by str.length (UTF-16 code units) but the buffer was written using Buffer.byteLength(str, 'utf8'), or vice versa. When the cursor lands inside a multibyte sequence:
- Reading: the bytes are not a valid UTF-8 sequence → decoder emits U+FFFD per orphan byte
- Writing: the length-prefix says N but the actual bytes occupy N+k → reader walks off the end of one value into the next, eventually terminating the iteration early when it lands on a NUL or runs out of data
The "10 out of 30 cards rendered" symptom is consistent with this: after enough multibyte sequences corrupt the bridge state, json.get for subsequent items returns empty strings, and the `while title != ""` loop in the renderer exits early — which is EXACTLY what you'd see if the cursor desyncs.
The fact that the only 10 cards that render are from the middle track (sort_order 11-20, the "web-app" track) is telling: those are the items that previously had the most corrupted multibyte content (we saw earlier that this track had `→` arrows + em-dashes). The desync state may have aligned in a way that those particular items happen to render past the desync point. The first-steps track (1-10) and ai-coding track (21-30) are LOST.
Recommended fix: audit every length calculation in src/bridge/*.{ts,js} and add a unit test corpus containing the full set of common multibyte chars (em-dash U+2014, en-dash U+2013, right arrow U+2192, ellipsis U+2026, smart quotes U+2018-U+201D, ñ U+00F1, é U+00E9, € U+20AC, 你 U+4F60, 🦀 U+1F980) passed through every supported bridge path.
The pre-grow-to-16MB fix shipped in #76d3b529 (probably the same fix referenced in 0.1.59) addresses the heap-detachment symptom but not the UTF-8 length mismatch root cause. To finish the fix:
1. In src/bridge/*.{ts,js}, find every site that converts between a JS String and a WASM-memory byte buffer. Verify each side uses the same length semantic.
2. Wherever a length is WRITTEN to WASM memory as a length-prefix or used to advance a cursor: use `Buffer.byteLength(str, 'utf8')` — NEVER `str.length` (which counts UTF-16 code units).
3. Wherever a length is READ from WASM memory: trust it as a byte count and slice the buffer accordingly; decode using `.toString('utf8', start, start + byteLen)`.
4. Add a regression test fixture: a JSON object whose values are every common BMP 3-byte UTF-8 char. Pass it through json.get on every supported path depth (0, 1, 2, 3). Assert the output bytes equal the input bytes for each.
The visible signature of the bug being properly fixed: 0 U+FFFD in any response body, AND the response body byte length matches what the WASM logically emitted (no silent truncation).