| feat: preserve raw bytes when anonymization is a no-op
When the anonymizer doesn't change a slice's text, the streamer used
to push Buffer.from(out, "utf8") — which loses any invalid-UTF-8 bytes
in the input (replaced by U+FFFD via StringDecoder). Files
mistakenly classified as text (binary blobs without a known extension,
text with stray non-UTF-8 bytes, BOMs) came out corrupted even though
nothing in the term list matched.
Track the raw chunk bytes alongside the decoded `pending`. On flush —
where we have every byte buffered — emit the original buffer directly
when the output equals the input, so a pure passthrough is bit-exact.
In the streaming OVERLAP path, do the same when the decode for that
slice round-trips losslessly; fall back to encoded output otherwise
(unchanged from before for that case).
Also add the "missing_content" locale entry for the
/api/anonymize-preview route.
| 25 days ago |
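The passthrough rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `roundTripsLosslessly` and `bytesToEmit` are hypothetical helper names, and the sketch assumes the raw bytes and the decoded text for a slice are both available when the output is chosen.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper: true when `raw` is valid UTF-8, i.e. decoding and
// re-encoding reproduces exactly the same bytes.
function roundTripsLosslessly(raw: Buffer): boolean {
  return Buffer.from(raw.toString("utf8"), "utf8").equals(raw);
}

// Hypothetical helper: pick the bytes to emit for one slice.
// `decoded` is the text the anonymizer saw, `out` is what it returned.
function bytesToEmit(
  raw: Buffer,
  decoded: string,
  out: string,
  isFlush: boolean
): Buffer {
  const unchanged = out === decoded;
  // On flush every byte is buffered, so an unchanged slice can be emitted
  // as-is. Mid-stream, only do so when the decode was lossless.
  if (unchanged && (isFlush || roundTripsLosslessly(raw))) {
    return raw;
  }
  // The text changed, or the decode was lossy mid-stream: emit encoded output.
  return Buffer.from(out, "utf8");
}
```

With an input such as `Buffer.from([0x61, 0xff, 0x62])`, the decoded text contains U+FFFD, so re-encoding would grow the data from 3 to 5 bytes; returning `raw` on flush avoids that.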
| Improve error handling
| 23 days ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| repo change + daily stat improvements
| 18 days ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| Standardize error responses with consistent format and human-readable messages (#667) | 1 month ago |
| fix: include file path in cache ETag
Without the path, two different files in the same repo (same sha, same
anonymization options) shared an ETag. If a browser ever sent the cached
ETag for one file while requesting another, the server would have
returned 304 against the wrong cache entry. Fold the path into the
ETag so each file has its own fingerprint.
Follow-up to b3c1030 (#439).
| 26 days ago |
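One way to fold the path into the ETag is sketched below. The function name `computeETag`, the input shape, and the choice of SHA-256 are assumptions made for illustration; the commit message does not say how the repository actually builds the value.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape: the commit sha, the file path inside the repo,
// and the anonymization options that affect the rendered output.
interface ETagInput {
  sha: string;
  path: string;
  options: Record<string, unknown>;
}

// Hypothetical helper: derive a strong ETag from all three inputs.
function computeETag({ sha, path, options }: ETagInput): string {
  // Serializing as a JSON array keeps field boundaries unambiguous.
  // Note: JSON.stringify is sensitive to the key order of `options`.
  const fingerprint = JSON.stringify([sha, path, options]);
  const digest = createHash("sha256").update(fingerprint).digest("hex");
  // ETag values are quoted strings.
  return `"${digest}"`;
}
```

Two files at the same sha with the same options now hash to different values, so a cached ETag for one file can no longer produce a 304 for another.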
| fix persistence bugs
| 23 days ago |
| fix persistence bugs
| 23 days ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| improve binary file detection: content sniffing + jsonl support
Files like .jsonl that mime-types doesn't know fell through to
application/octet-stream and rendered as "Unsupported binary file" in
the viewer. Replace istextorbinary with isbinaryfile for content-based
detection, and use mime-types for name-based classification with a
textual application/* allowlist.
The streaming transformer now defers classification when the name is
inconclusive and sniffs the first chunk before emitting "transform",
so route.ts and AnonymizedFile.ts get a content-aware Content-Type.
Whitelists .jsonl and .ndjson to short-circuit dataset files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| 23 days ago |
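The "name first, then sniff the first chunk" flow from this commit might look roughly like the sketch below. It is self-contained on purpose: the small extension sets stand in for the mime-types lookup and allowlist, and a plain NUL-byte check stands in for isbinaryfile's content detection, so none of the names or heuristics here are the repository's own.

```typescript
import { extname } from "node:path";

type NameVerdict = "text" | "binary" | "unknown";

// Illustrative stand-ins for name-based classification (assumed lists).
const TEXT_EXTENSIONS = new Set([".jsonl", ".ndjson", ".txt", ".md", ".json"]);
const BINARY_EXTENSIONS = new Set([".png", ".jpg", ".zip", ".pdf"]);

// Classify by file name alone; "unknown" means the name is inconclusive.
function classifyByName(name: string): NameVerdict {
  const ext = extname(name).toLowerCase();
  if (TEXT_EXTENSIONS.has(ext)) return "text";
  if (BINARY_EXTENSIONS.has(ext)) return "binary";
  return "unknown";
}

// Simplified content sniff: a NUL byte in the first bytes suggests binary.
function sniffIsBinary(firstChunk: Uint8Array): boolean {
  return firstChunk.subarray(0, 8000).includes(0);
}

// Defer to the content sniff only when the name is inconclusive.
function classify(name: string, firstChunk: Uint8Array): "text" | "binary" {
  const byName = classifyByName(name);
  if (byName !== "unknown") return byName;
  return sniffIsBinary(firstChunk) ? "binary" : "text";
}
```

In a streaming setting, `classify` would be called once the first chunk arrives, before any Content-Type is announced downstream.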
| fix: resolve eslint unused-var and useless-assignment warnings
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| 23 days ago |
| Replace isomorphic-dompurify with sanitize-html for Node 21 compat (#663) | 1 month ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| fix test
| 23 days ago |
| Fix streamer crash and misclassified transient GitHub errors
Add missing error handler on the anonymizer transform stream in the
streamer route — without it, an upstream error tears down the pipe and
the anonymizer emits an unhandled error that crashes the process
(surfacing as ECONNRESET to the main server).
Classify transient network errors (ReadError, ECONNRESET, ETIMEDOUT)
as upstream_error/502 instead of file_not_found/404 so they are
distinguishable in logs and don't cache-poison downstream.
Update handleError tests to match the existing sanitization behavior
that returns internal_error for non-AnonymousError instances.
| 22 days ago |
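The error classification described in this commit could be sketched as below. `classifyFetchError` and the `ClassifiedError` shape are hypothetical; only the error names and codes (ReadError, ECONNRESET, ETIMEDOUT) and the two outcomes (upstream_error/502, file_not_found/404) come from the commit message.

```typescript
// Hypothetical result shape for a classified error.
interface ClassifiedError {
  code: "upstream_error" | "file_not_found";
  status: 502 | 404;
}

// Node error codes treated as transient network failures.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

// Hypothetical helper: map a fetch failure to an error code and HTTP status.
function classifyFetchError(err: unknown): ClassifiedError {
  const e = (err ?? {}) as { name?: string; code?: string };
  const transient =
    e.name === "ReadError" ||
    (typeof e.code === "string" && TRANSIENT_CODES.has(e.code));
  if (transient) {
    // Transient upstream failure: report 502 so it is not mistaken for a
    // missing file and cached as such downstream.
    return { code: "upstream_error", status: 502 };
  }
  return { code: "file_not_found", status: 404 };
}
```

Keeping the transient set in one place makes it easy to extend if other codes turn out to be retryable.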
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |
| Fix 9 bugs and add 103 tests for core anonymization, config, and routing (#669) | 1 month ago |