pii-scrubber
PII scrubber for Muntin Ledger. Strips SSNs, US phone numbers, emails, and ACH account-shaped sequences from text before it lands in:
- Sentry payloads (exception messages, stack-frame locals) --
wired into the Sentry beforeSend hook.
- Audit log values -- applied to any string value before it is
written to audit_events.
This module replaces the pre-pivot tools/redaction-middleware/, which lived between worker memory and the Anthropic LLM API. With the zero-LLM pivot (v4 plan PR-2) there is no LLM call to scrub ahead of, but the two surfaces above remain -- error reports and the audit log persist beyond the request, and PII in those paths violates the privacy commitment _"we never log invoice content."_
Open-source (MIT). Single file, stdlib only, so anyone can audit it line-by-line. The tests are the privacy claim.
What this is NOT
- Not a content-redaction tool. Vendor names, addresses, line
items -- the actual invoice content -- are not scrubbed by this module. They are not supposed to land in Sentry or audit logs in the first place; the typed-logger conventions (no console.* in customer-data paths, enforced by scripts/privacy-ci.sh) cover that.
- Not a general-purpose DLP tool. The rule set is deliberately
narrow. Patterns NOT covered today:
  - Credit card numbers (16-digit PANs). Most small-restaurant invoices don't contain them; lands when a paying customer surfaces a need.
  - SSNs without dashes (123456789). Adds false-positive risk on contiguous-9-digit invoice numbers; deferred until measured.
  - International phone numbers (E.164, +44, +33...). Out of scope until non-US customers exist.
  - Bank account numbers without a separator between routing and account portions. Required separator avoids false positives on GTIN-13 product codes and other long numeric IDs.
Each additional pattern lands with a regression test for the false-positive case it might introduce.
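To illustrate how a pattern is paired with its false-positive regression test, here is a minimal sketch. The regex and the names `SSN_DASHED` and `contains_dashed_ssn` are hypothetical and are not taken from this module; the real rule may be written differently.

```python
import re

# Hypothetical pattern for illustration only; the module's real regex may differ.
# The lookarounds keep the match from starting or ending inside a longer digit run.
SSN_DASHED = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def contains_dashed_ssn(text: str) -> bool:
    """Return True if text contains a dash-separated SSN-shaped sequence."""
    return SSN_DASHED.search(text) is not None


def test_dashed_ssn_is_matched():
    assert contains_dashed_ssn("SSN 123-45-6789 on file")


def test_contiguous_invoice_number_is_not_matched():
    # The false-positive case: a 9-digit invoice number has no dashes,
    # so the dashed-only rule must leave it alone.
    assert not contains_dashed_ssn("Invoice 123456789 received")
```

Requiring the dashes is what keeps contiguous 9-digit invoice numbers out of the match set. The second test pins that behavior so a later, looser pattern cannot silently break it.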
Public API
```python
from redaction import redact, scrub_payload

# Single string scrub.
r = redact("Email ap@sysco.com or call 617-555-1234")
# r.text: "Email {{PII_EMAIL_1}} or call {{PII_PHONE_1}}"
# r.mapping: diagnostic only -- do not persist alongside scrubbed text.

# Walking a structured payload (Sentry beforeSend / audit log value).
scrubbed = scrub_payload({
    "exception": "ValueError: contact 617-555-1234",
    "stack_locals": {"email": "ap@sysco.com"},
})
# Returns the same shape with strings scrubbed.
```
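For readers wondering what "returns the same shape with strings scrubbed" could look like, here is a rough sketch of a recursive walker. It is not the module's actual code. `walk` and `scrub_string` are hypothetical names, and the single email rule with an unnumbered `{{PII_EMAIL}}` placeholder is a stand-in for the real rule set and its numbered placeholders.

```python
import re
from typing import Any, Callable

# Stand-in for the real per-string scrubber: one hypothetical email rule.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_string(text: str) -> str:
    """Replace email-shaped substrings with an unnumbered placeholder."""
    return _EMAIL.sub("{{PII_EMAIL}}", text)


def walk(value: Any, scrub: Callable[[str], str] = scrub_string) -> Any:
    """Return a copy of value with every string leaf passed through scrub."""
    if isinstance(value, str):
        return scrub(value)
    if isinstance(value, dict):
        # Assumption in this sketch: only values are scrubbed, keys are kept.
        return {key: walk(item, scrub) for key, item in value.items()}
    if isinstance(value, list):
        return [walk(item, scrub) for item in value]
    if isinstance(value, tuple):
        return tuple(walk(item, scrub) for item in value)
    # Numbers, booleans, None, and other types pass through unchanged.
    return value
```

Building a new structure instead of mutating the input means the caller's original payload is left untouched, which matters if the same object is used elsewhere after the hook runs.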
Install + test
```sh
cd tools/pii-scrubber
python -m venv .venv && source .venv/bin/activate
pip install -e ".[test]"
pytest tests/ -v
```
CI runs the same tests on every push (job: pii-scrubber-tests).
Two commitments backed by these tests
- No regex-class PII pattern reaches Sentry payloads. The
scrub_payload walker is wired into the Sentry beforeSend hook; tests prove every pattern is replaced with a placeholder when the walker visits a string.
- No regex-class PII pattern reaches audit log values. The
walker is also called by the audit log value sanitizer.
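One way the Sentry wiring might look is sketched below. It assumes the Python Sentry SDK, where the hook is spelled `before_send`, takes `(event, hint)`, and drops the event when it returns `None`. The factory name `make_before_send` and the fail-closed behavior are assumptions for illustration and may not match how this project wires the hook.

```python
from typing import Any, Callable, Dict, Optional

Event = Dict[str, Any]


def make_before_send(
    scrub: Callable[[Event], Event],
) -> Callable[[Event, Dict[str, Any]], Optional[Event]]:
    """Build a before_send-style hook that scrubs each event before it is sent."""

    def before_send(event: Event, hint: Dict[str, Any]) -> Optional[Event]:
        try:
            return scrub(event)
        except Exception:
            # Assumption: fail closed. Returning None asks the SDK to drop the
            # event, so an unscrubbed payload is never sent if scrubbing fails.
            return None

    return before_send


# Hypothetical wiring, assuming scrub_payload is imported from this module:
#   sentry_sdk.init(dsn=..., before_send=make_before_send(scrub_payload))
```

Dropping the event on a scrubber failure trades a lost error report for the guarantee that nothing unscrubbed is sent; whether that trade is right is a project decision.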
What changed from the pre-pivot module
- Renamed tools/redaction-middleware/ -> tools/pii-scrubber/.
The -middleware suffix implied a request-pipeline component; the actual surface is now smaller (Sentry + audit log) and "middleware" overstates what it is.
- Removed unredact_in_dict() -- it existed to unredact LLM
responses, and there is no LLM round-trip anymore.
- Added scrub_payload() walker for the Sentry / audit-log
surfaces (the new primary callers).
- Package version bumped 0.1.0 -> 0.2.0 to track the rename.
License
MIT, see LICENSE in this directory.