📖🤖Tome Robot
June 18, 2026

How we built real-time PII detection with DOM inspection and vision models

Protecting sensitive information during process documentation is non-negotiable. This article details our multi-layered approach to real-time PII detection, combining deterministic rules, DOM inspection, and vision models for robust redaction.

Documenting internal operational processes, especially through live walkthroughs, invariably exposes sensitive information. Credit card numbers, social security numbers, email addresses, and other Personally Identifiable Information (PII) can appear in unexpected places, even in seemingly innocuous test environments. Manual redaction is a tedious, error-prone exercise that scales poorly. An automated, robust PII detection and redaction pipeline is not merely a feature; it is a fundamental requirement for any platform capturing live operational data.

The Imperative for Layered PII Detection

The challenge with PII detection lies in its diverse presentation. A credit card number might be typed into a dedicated input field, appear in a customer service chat log, or be visible within a screenshot of an invoice. Relying on a single detection mechanism is a recipe for failure. Our approach embraces a three-layered strategy, each designed to catch different forms and contexts of PII, building resilience against evasion.

This philosophy weights recall over precision, acknowledging that in the context of sensitive data, a false positive (over-redaction) is preferable to a false negative (data leakage). The pipeline must also operate with minimal latency: fluid enough not to disrupt the user experience during recording, and fast enough for swift processing during documentation generation.

Layer One: Deterministic Pattern Matching with Validation

The foundational layer of our PII detection relies on deterministic pattern matching, primarily through regular expressions (regex). This layer targets well-defined PII formats:

  • Credit Card Numbers: We employ regex patterns that identify common credit card prefixes (e.g., Visa, MasterCard, Amex) combined with length constraints (typically 13-16 digits). Crucially, every detected sequence undergoes a Luhn algorithm checksum validation. This significantly reduces false positives from arbitrary number sequences, ensuring that only plausible credit card numbers are flagged.
  • Social Security Numbers (SSN): Standard SSN formats (NNN-NN-NNNN) are straightforward targets for regex. While there isn't a simple checksum like Luhn for SSNs, contextual clues (discussed in Layer Two) often provide additional validation.
  • Email Addresses: Widely accepted regex patterns effectively identify email addresses, balancing strictness with the diversity of valid email formats.
  • Phone Numbers: This is a more complex category due to global variations. Our system includes patterns for common international and national formats, often starting with country codes or area codes. The sheer variability means this layer is usually combined with contextual information for higher confidence.
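To make the regex-plus-checksum idea concrete, here is a minimal sketch of Layer One for credit cards. The prefix patterns and function names are illustrative stand-ins, not our production rule set, which covers more networks and lengths.

```python
import re

# Illustrative prefix/length patterns for a few major networks.
CARD_PATTERN = re.compile(
    r"\b(?:4\d{12}(?:\d{3})?"      # Visa: 13 or 16 digits
    r"|5[1-5]\d{14}"               # MasterCard: 16 digits
    r"|3[47]\d{13})\b"             # Amex: 15 digits
)

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number]
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Flag a regex match only when the Luhn checksum also passes."""
    return [m for m in CARD_PATTERN.findall(text) if luhn_valid(m)]
```

The Luhn gate is what keeps this layer precise: an arbitrary 16-digit order ID has only a 1-in-10 chance of passing the checksum, so most regex near-misses are discarded before they ever reach the redactor.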

This layer is fast and highly precise for its specific targets. However, its limitation is clear: it only detects PII that adheres to a known, rigid pattern. It cannot interpret semantic meaning or visual context.

Layer Two: DOM-Level Input Sniffing

When recording user interactions within a web browser, the Document Object Model (DOM) provides a rich source of semantic information that goes beyond raw text. This layer inspects the HTML structure and element attributes to infer the presence of PII fields, even before any data is entered or submitted.

  • Input Type Attributes: HTML5 introduced semantic input types like type="email", type="tel", and type="password". These are strong indicators. A field marked type="password" should always be redacted, regardless of its content. Similarly, type="email" provides a high-confidence signal for email addresses.
  • Name and ID Attributes: Developers frequently use descriptive name and id attributes for input fields, such as "cardNumber", "ssn", "billingAddress", "cvv", or "dateOfBirth". Our system maintains a curated dictionary of hundreds of such keywords and their variations, dynamically scanning the DOM tree for these identifiers.
  • Autocomplete Attributes: The autocomplete attribute (e.g., autocomplete="cc-number", autocomplete="email") is another explicit signal that a field is designed to hold specific PII.
  • Neighboring Labels and Placeholder Text: Context is paramount. If an input field has an associated <label> element or placeholder text containing terms like "Credit Card Number," "Social Security Number," or "Date of Birth," it triggers a PII flag. This contextual analysis extends to parent elements, searching for headers or descriptive text that might label a section of the form.
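A simplified sketch of this attribute sniffing, using Python's built-in `html.parser` as a stand-in for the real DOM walker. The keyword and attribute sets below are illustrative; as noted above, the production dictionary contains hundreds of keywords and variations.

```python
from html.parser import HTMLParser

# Illustrative subsets of the production signal dictionaries.
PII_KEYWORDS = {"cardnumber", "ssn", "cvv", "dateofbirth", "billingaddress"}
PII_INPUT_TYPES = {"password", "email", "tel"}
PII_AUTOCOMPLETE = {"cc-number", "cc-csc", "email", "tel"}

class PIIFieldSniffer(HTMLParser):
    """Walk an HTML snapshot and collect input fields that look like PII."""

    def __init__(self):
        super().__init__()
        self.flagged: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        reasons = []
        if (a.get("type") or "").lower() in PII_INPUT_TYPES:
            reasons.append(f"type={a['type']}")
        if (a.get("autocomplete") or "").lower() in PII_AUTOCOMPLETE:
            reasons.append(f"autocomplete={a['autocomplete']}")
        # Normalize name/id (strip separators) before the dictionary lookup,
        # so "card-number", "card_number", and "cardNumber" all match.
        for attr in ("name", "id"):
            norm = (a.get(attr) or "").lower().replace("-", "").replace("_", "")
            if norm in PII_KEYWORDS:
                reasons.append(f"{attr}={a[attr]}")
        if reasons:
            self.flagged.append({"attrs": a, "reasons": reasons})

def sniff(html: str) -> list[dict]:
    sniffer = PIIFieldSniffer()
    sniffer.feed(html)
    return sniffer.flagged
```

Because every flag carries its triggering reasons, downstream layers can weigh a strong signal like `type=password` differently from a weaker one like a keyword hit on `name`.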

DOM-level sniffing is exceptionally efficient, often completing its analysis in sub-100ms per recorded step. It offers high precision because it leverages the developer's explicit intent encoded in the HTML. Its primary limitation is that it only applies to interactive web elements; it cannot detect PII embedded in static text, images, or non-input fields.

Layer Three: Vision Model Screenshot Scanning

The most comprehensive, but also the most computationally intensive, layer involves analyzing the visual representation of the web page via screenshots. This addresses PII that Layer Two might miss, such as data displayed in a PDF viewer embedded in the page, text in an image, or arbitrary paragraphs of text that happen to contain PII.

  • Optical Character Recognition (OCR): Each screenshot captured during a walkthrough is subjected to OCR. We employ robust, pre-trained OCR models that perform well across various fonts, sizes, and backgrounds. The output is a collection of text strings along with their bounding box coordinates on the screenshot.
  • Post-OCR Pattern Matching: The text extracted by OCR is then fed through the same deterministic pattern matching (Layer One) engine. This allows us to identify credit card numbers, SSNs, emails, and phone numbers even when they appear as static text in an image.
  • Visual PII Detection Models: For certain types of PII, especially those with distinct visual characteristics, we utilize specialized vision models. For instance, a model can be trained to detect the visual presence of a physical credit card image or a driver's license within a screenshot. These models operate on the raw pixel data, identifying regions that correspond to known PII artifacts. This is particularly useful for documents or images that are not purely text-based.
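As a sketch, the post-OCR step reduces to running Layer One patterns over each OCR line and emitting its bounding box for redaction. The `(text, box)` tuple format and the simplified SSN/email patterns are assumptions about the OCR engine's output, not a description of a specific engine's API.

```python
import re

# Simplified stand-ins for the full Layer One rule set.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_ocr_output(ocr_lines: list[tuple[str, tuple[int, int, int, int]]]):
    """Run deterministic patterns over OCR text; emit redaction boxes."""
    redactions = []
    for text, box in ocr_lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(text):
                # Redact the whole OCR line; tighter sub-line boxes would
                # need per-character geometry from the OCR engine.
                redactions.append({"kind": kind, "box": box, "text": text})
    return redactions
```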

The primary challenge with vision models is computational cost and potential latency. Running high-quality OCR and vision models typically occurs server-side, asynchronously, to avoid impacting the recording experience. This layer also contends with the inherent inaccuracies of OCR, which can lead to missed characters or incorrect text extraction, subsequently affecting pattern matching accuracy. To mitigate this, we often run multiple OCR engines or use ensemble methods.

An example: A customer support agent navigates to an internal tool where a customer's partially redacted credit card number (e.g., **** **** **** 1234) is displayed as part of an order history. Layer One might not flag this due to the redaction. Layer Two cannot act because it's not an input field. Layer Three, however, with OCR, could extract the surrounding text "Last 4 digits of CC: 1234" and, combined with contextual analysis, flag the visible digits and redact them, even if the full number isn't present.
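The contextual rule in this example can be approximated with a proximity regex: a card-related label followed, within a short window, by a four-digit group. This pattern is a hypothetical illustration of the idea, not our production rule.

```python
import re

# Hypothetical contextual rule: a nearby label plus a short digit group
# is enough to trigger redaction even without a full card number.
LAST4_CONTEXT = re.compile(
    r"\b(?:last\s*4|card\s*ending|cc)\b\D{0,20}(\d{4})\b",
    re.IGNORECASE,
)

def find_contextual_last4(text: str) -> list[str]:
    """Return 4-digit groups that appear near a card-related label."""
    return LAST4_CONTEXT.findall(text)
```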

The Integrated Redaction Pipeline

These three layers operate in concert. During a recording session, as each user action is captured, a DOM snapshot and a screenshot are simultaneously taken. The DOM snapshot is processed immediately by Layer Two (DOM sniffing) and Layer One (regex on DOM text content). This provides initial, high-confidence redaction targets.

Concurrently, the screenshot is sent to our processing backend, where Layer Three (OCR and vision models) performs its analysis. The results from all layers are then aggregated. Overlapping detections are consolidated, and a final set of bounding boxes and text ranges for redaction is generated. This ensures comprehensive coverage, where a piece of PII missed by one layer due to its limitations might be caught by another.
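Consolidating overlapping detections amounts to merging intersecting rectangles into their unions. A minimal single-pass sketch, assuming `(x1, y1, x2, y2)` boxes; the production aggregator also resolves chained overlaps and per-layer confidence, which this omits.

```python
def merge_overlapping(boxes):
    """Fold each box into the first existing cluster it intersects,
    expanding that cluster to the union rectangle."""
    merged: list[list[int]] = []
    for x1, y1, x2, y2 in boxes:
        for m in merged:
            # Axis-aligned rectangle overlap test.
            if x1 < m[2] and m[0] < x2 and y1 < m[3] and m[1] < y2:
                m[0], m[1] = min(m[0], x1), min(m[1], y1)
                m[2], m[3] = max(m[2], x2), max(m[3], y2)
                break
        else:
            merged.append([x1, y1, x2, y2])
    return [tuple(m) for m in merged]
```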

The output of this pipeline is a set of redaction instructions that are applied to the visual artifact (the screenshot is pixelated or blurred in the identified regions) and the underlying text data (masked or replaced with placeholder text). This multi-stage, multi-modal approach significantly reduces the risk of PII leakage in generated documentation, providing a robust defense against accidental exposure.
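For the text side of those redaction instructions, replacement is simplest applied right to left, so earlier character offsets stay valid as the placeholder changes the string's length. A sketch assuming `(start, end)` spans:

```python
def apply_text_redactions(text: str, spans: list[tuple[int, int]],
                          placeholder: str = "[REDACTED]") -> str:
    """Replace flagged character ranges with a placeholder, processing
    spans right to left so unprocessed offsets are never shifted."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text
```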

Building an automated PII detection and redaction system demands a pragmatic, multi-faceted approach. Relying on a single technology will inevitably leave critical gaps. By combining deterministic pattern matching, DOM-level semantic analysis, and sophisticated vision models, it is possible to construct a highly effective defense against sensitive data exposure. This ensures that essential process documentation, like that generated by Tome Robot, remains both informative and secure, a critical balance for modern operational integrity.

engineering · security · ai

Stop writing docs nobody reads.
Record them instead.

Install the extension, walk through the tool you're tired of explaining. Tome Robot does the rest.