Should I strip personal data, or mask it?

It depends on the workload. Stripping is safer but lossy — the model has less context. Masking with deterministic placeholders preserves identity references and is the right default for most chat and summarization flows.

Can I rely on regex alone?

Regex catches structured data (emails, IBANs, card numbers) cleanly. It struggles with unstructured names, addresses, and locations. A combined approach — regex for structure, an NER model for free text — is what production systems use.

Where should the stripping happen?

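At a gateway, not inside individual services. Centralizing the policy is the most reliable way to keep it consistent across teams, because every service then inherits the same detection rules instead of maintaining its own copy.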

How to remove personal data before sending to GPT

Three strategies, ranked

Most "remove the personal data" decisions fall into one of three patterns. Each has a different tradeoff between recall, capability, and integration cost.

1. Strip

Delete every recognized identifier. The model gets the cleanest prompt and the fewest hooks back to a real person. The downside is capability loss: the model can no longer handle a request like "draft a reply to the customer" because it does not know which customer is meant.

Best for: classification, sentiment analysis, summarization of non-identifying content.
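As a rough illustration of stripping, the sketch below deletes matches outright. The two patterns and the function name `strip_identifiers` are assumptions made for this example, not a complete detector.

```python
import re

# Illustrative patterns only; a real deployment would cover many more types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def strip_identifiers(text: str) -> str:
    """Delete every match, then collapse the whitespace left behind."""
    for pattern in (EMAIL_RE, IBAN_RE):
        text = pattern.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


stripped = strip_identifiers(
    "Email from jane@example.com about IBAN DE89370400440532013000 failed."
)
print(stripped)  # Email from about IBAN failed.
```

The output shows the capability loss directly: nothing in the stripped prompt tells the model who wrote the email or which account is involved.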

2. Mask with deterministic placeholders

Replace each value with a typed token like EMAIL_1 or PERSON_2. The model can still tell which entity is which inside a single prompt, and your application can rehydrate the response with the real values.

Best for: chat, drafting, support automation, anywhere the response needs to feel personal.
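A minimal sketch of deterministic masking, assuming two regex detectors and a hypothetical `mask` helper. The point it illustrates is that a repeated value receives the same token within one prompt, and that the token-to-value map stays with the caller.

```python
import re
from collections import defaultdict

# Illustrative detectors; the entity names mirror the tokens used in this article.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected value with TYPE_N; return masked text and a token-to-value map."""
    value_to_token: dict[str, str] = {}
    counters: dict[str, int] = defaultdict(int)

    for entity_type, pattern in PATTERNS.items():
        def substitute(match: re.Match, entity_type: str = entity_type) -> str:
            value = match.group(0)
            if value not in value_to_token:
                # First sighting of this value: allocate the next numbered token.
                counters[entity_type] += 1
                value_to_token[value] = f"{entity_type}_{counters[entity_type]}"
            return value_to_token[value]

        text = pattern.sub(substitute, text)

    token_to_value = {token: value for value, token in value_to_token.items()}
    return text, token_to_value


masked, mapping = mask(
    "jane@example.com and bob@example.com wrote; jane@example.com followed up."
)
print(masked)  # EMAIL_1 and EMAIL_2 wrote; EMAIL_1 followed up.
```

Keeping the map in memory for the lifetime of a single request is one way to make rehydration possible without persisting anything.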

3. Hash or tokenize

Replace values with stable hashes that survive across requests. This is useful when you need cross-request identity (e.g. for analytics), but it introduces a persistence problem: any identifier that stays stable can eventually be linked back to a person once enough requests carry it.

Best for: aggregate analysis, not real-time chat. Privian deliberately does not do this in the gateway — see Zero retention.
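For illustration only, a stable pseudonym can be derived with a keyed hash. The choice of HMAC-SHA256, the 16-character truncation, the normalization step, and the name `pseudonymize` are all assumptions of this sketch; the text above only says "stable hashes".

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes, entity_type: str = "ID") -> str:
    """Return a stable token for a value; the same key and value always give the same token."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return f"{entity_type}_{digest[:16]}"


key = b"example-key-for-illustration"  # placeholder; a real key would come from a secret store
token_a = pseudonymize("jane@example.com", key, "EMAIL")
token_b = pseudonymize("Jane@Example.com ", key, "EMAIL")
token_c = pseudonymize("jane@example.com", b"a-different-key", "EMAIL")

print(token_a == token_b)  # True: stable across requests for the same person
print(token_a == token_c)  # False: changing the key breaks the link
```

The stability that makes this useful for analytics is the same property that makes it linkable, which is why the key's lifetime effectively becomes a retention decision.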

What good detection looks like

Detection is the hard part. A serious implementation usually combines:

Regex for structured data — emails, phone numbers, IBANs, credit cards (with Luhn checks), known secret patterns.
NER (named entity recognition) for free-text names, locations and organizations that regex cannot catch.
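Validators to reduce false positives: a 16-digit number that passes the Luhn check is much more likely to be a card number than one that does not.
A fallback path for ambiguous cases so the system fails closed, meaning it masks a value it is unsure about rather than forwarding it in the clear.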
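The regex-plus-validator layer can be sketched as follows. The candidate pattern, the function names, and the `UNVERIFIED_NUMBER` label are hypothetical. Labelling numbers that fail validation instead of dropping them is one possible reading of "fails closed", since it lets the policy decide to mask them anyway.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def detect_numbers(text: str) -> list[tuple[str, str]]:
    """Label each candidate as CARD (validated) or UNVERIFIED_NUMBER (ambiguous)."""
    findings = []
    for match in CARD_CANDIDATE_RE.finditer(text):
        value = match.group(0)
        label = "CARD" if luhn_valid(value) else "UNVERIFIED_NUMBER"
        findings.append((value, label))
    return findings


findings = detect_numbers(
    "Card 4111 1111 1111 1111 declined, order 1234 5678 9012 3456."
)
print(findings)
# [('4111 1111 1111 1111', 'CARD'), ('1234 5678 9012 3456', 'UNVERIFIED_NUMBER')]
```

Free-text names and locations would still need an NER pass on top of this; the sketch covers only the structured side.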

A working example

# Before: prompt that exposes a customer
Summarize the email from jane@example.com about IBAN
DE89370400440532013000 and the failed payment.

# After: masked by the gateway, sent to the provider
Summarize the email from EMAIL_1 about IBAN
IBAN_1 and the failed payment.

# Response from the provider (still uses tokens)
EMAIL_1 reported a failed payment on IBAN_1 and is
asking for a refund.

# Rehydrated response returned to your app
jane@example.com reported a failed payment on
DE89370400440532013000 and is asking for a refund.
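The last step of the walkthrough, rehydration, might look like the sketch below. The `rehydrate` function and the literal mapping are assumptions for illustration; in practice the mapping would be whatever the masking step produced for this request.

```python
import re

# Tokens are assumed to have the shape TYPE_N, as in the example above.
TOKEN_RE = re.compile(r"\b[A-Z]+_\d+\b")


def rehydrate(response: str, token_to_value: dict[str, str]) -> str:
    """Swap known tokens back to their original values; unknown tokens pass through unchanged."""
    return TOKEN_RE.sub(
        lambda match: token_to_value.get(match.group(0), match.group(0)),
        response,
    )


mapping = {"EMAIL_1": "jane@example.com", "IBAN_1": "DE89370400440532013000"}
provider_response = (
    "EMAIL_1 reported a failed payment on IBAN_1 and is asking for a refund."
)
restored = rehydrate(provider_response, mapping)
print(restored)
# jane@example.com reported a failed payment on DE89370400440532013000 and is asking for a refund.
```

Matching whole tokens in a single pass, rather than calling `str.replace` per token, avoids `EMAIL_1` being substituted inside a longer token such as `EMAIL_10`.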

How Privian fits

Privian implements pattern #2 by default: deterministic masking with in-memory rehydration. The gateway handles detection, substitution, forwarding, and the return path. You change a base URL; the policy travels with the request. See the PII Masking page for the supported entity types and validators.

Written under our editorial principles: implementation-grounded, honest about limitations, educational first.
