Atlas Stress Test · Vol. 7

Encoding Attack Lab — will the compiler hold?

Every section below carries at least one deliberate weapon: an entity, a bidi mark, a hostile URL, an unquoted attribute, or a non-BMP code point. If a section reads cleanly in your browser but the Atlas output is mangled, that’s the bug.

A1

Named & numeric entities in text

Plain ampersand: Tom & Jerry. Less-than: x < y and greater-than: y > x. Quotes: "double" and apostrophes: it's versus it's versus it's — the same glyph by three different escape paths.

Non-breaking spaces between short tokens collapse into the prose. A copyright glyph © 2026, an em-dash — like this — and a horizontal ellipsis… complete the named-entity menagerie. Numeric variants: — (em-dash, hex), – (en-dash, decimal), … (ellipsis, hex).

A2

Double encoding

Single-encoded: <script>. This should render visibly as the literal string <script>, not as a tag.

Double-encoded: &amp;amp;lt;. After one decode it becomes &amp;lt;. After two decodes it becomes &lt;. After three, it becomes <. We want the original byte sequence preserved verbatim: if the compiler decodes eagerly, this collapses to a literal less-than that could open a tag downstream.
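The decode chain above can be reproduced with a short sketch. It uses Python's standard html.unescape as a stand-in decoder; the helper name decode_layers and the max_rounds cap are illustrative assumptions, not part of the Atlas pipeline.

```python
import html


def decode_layers(text, max_rounds=10):
    """Return the input plus every intermediate string produced by
    repeatedly entity-decoding it, stopping once a round changes nothing.

    decode_layers and max_rounds are hypothetical names for this sketch.
    """
    stages = [text]
    for _ in range(max_rounds):
        decoded = html.unescape(stages[-1])
        if decoded == stages[-1]:
            break  # fixed point reached: nothing left to decode
        stages.append(decoded)
    return stages


if __name__ == "__main__":
    for depth, stage in enumerate(decode_layers("&amp;amp;lt;")):
        print(depth, stage)
```

A serialiser that preserves the source should stay at depth 0; any output matching a deeper stage indicates that a decode pass ran somewhere in the pipeline.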

Triple-encoded entity-stuffing test: &amp;amp;amp;copy;. And the adjacent-entity stuffing case: &&copy; — a stray ampersand followed by a valid entity name; some parsers happily merge them.

A3

Bidi overrides & invisible code points

The next paragraph contains a U+202E RIGHT-TO-LEFT OVERRIDE mid-string, a code point that has historically been used to spoof filenames in phishing payloads. It should be preserved verbatim, not dropped.

filename‮gpj.exe.txt is the classic spoofed-extension trick.

And here is U+202D LEFT-TO-RIGHT OVERRIDE: ‭forced-ltr-تجربة-end.

U+202B + U+202C bracket: ‫مرحبا، world‬ — should remain balanced.

Zero-width zoo: A‎B‏C​D‍E‌F (LRM, RLM, ZWSP, ZWJ, ZWNJ inside one inline code block).

Leading BOM in this paragraph: "the BOM is right before this quote".

Marker  Code point  Inline specimen
LRM     U+200E      before‎after
RLM     U+200F      before‏after
ZWSP    U+200B      be​fore
ZWJ     U+200D      a‍b
ZWNJ    U+200C      a‌b
BOM     U+FEFF      leading
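
Because none of these markers are visible, a round-trip check has to look at code points rather than rendered text. The following is a minimal sketch, assuming that scanning for Unicode general category Cf (format characters) is a good enough filter; the function name find_format_chars is hypothetical.

```python
import unicodedata


def find_format_chars(text):
    """List (index, code point, name) for every Unicode format character
    (general category Cf), which covers the bidi overrides, LRM/RLM,
    ZWSP, ZWJ, ZWNJ and the BOM used in this section."""
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "UNNAMED")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    spoofed = "filename\u202egpj.exe.txt"
    for hit in find_format_chars(spoofed):
        print(hit)
```

Running the same scan over the source and over the compiled output, then comparing the two lists, would show whether any marker was dropped or reordered.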
A4

Combining diacritics — NFC vs NFD

Pre-composed (NFC) form: café — one code point for é.

Decomposed (NFD) form: café — two code points: e (U+0065) followed by combining acute accent (U+0301). Visually identical, byte-different.
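
The byte-level difference between the two forms can be confirmed with Python's standard unicodedata module. This is a sketch for checking a round-trip, not a description of how Atlas handles normalisation; the helper name describe is an assumption.

```python
import unicodedata


def describe(text):
    """Return the code-point count and UTF-8 byte length of text."""
    return {"code_points": len(text), "utf8_bytes": len(text.encode("utf-8"))}


nfc = "caf\u00e9"    # precomposed: U+00E9
nfd = "cafe\u0301"   # decomposed: U+0065 followed by U+0301

if __name__ == "__main__":
    print("NFC", describe(nfc))
    print("NFD", describe(nfd))
    # The strings differ unless one of them is normalised first.
    print("equal as-is:", nfc == nfd)
    print("equal after NFC:", unicodedata.normalize("NFC", nfd) == nfc)
```

A byte-exact round-trip test should compare the raw strings, not normalised copies, since normalising before comparison would hide exactly the change this section is probing for.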

Stack abuse (Zalgo-lite): ñ̈̃́̀ — base letter with multiple combining marks. If the compiler’s attribute serialiser strips combining marks individually, the glyph degrades.

The attribute on this paragraph carries the NFD form of "café" and should round-trip byte-exact.

A5

Encoded attribute payloads

Quotes encoded inside an attribute value, plus a fake script-tag string in another attribute.

The next link uses a numeric-entity colon to disguise a javascript: scheme: numeric-colon javascript link.

The next link uses a hex-entity colon: hex-colon javascript link.

Mixed: case-mixed scheme.
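
One way to see through these disguises is to decode the attribute value before inspecting its scheme. The sketch below assumes the raw attribute text is available and uses a hypothetical helper, effective_scheme; the sample payloads and the set of blocked schemes are illustrative assumptions, not the page's actual markup or a complete denylist.

```python
import html
import re

# Assumed denylist for this sketch only.
BLOCKED_SCHEMES = {"javascript", "vbscript", "data"}

# Characters from U+0000 through U+0020, which URL parsers strip from the front.
_LEADING_JUNK = "".join(chr(code) for code in range(0x21))


def effective_scheme(raw_attr):
    """Return the lower-cased URL scheme a browser would see, or None.

    Entities are decoded first, then ASCII tab, LF and CR are removed and
    leading control characters and spaces are stripped.
    """
    decoded = html.unescape(raw_attr)
    cleaned = re.sub(r"[\t\n\r]", "", decoded).lstrip(_LEADING_JUNK)
    match = re.match(r"([A-Za-z][A-Za-z0-9+.\-]*):", cleaned)
    return match.group(1).lower() if match else None


def is_blocked(raw_attr):
    return effective_scheme(raw_attr) in BLOCKED_SCHEMES


if __name__ == "__main__":
    for sample in ("javascript&#58;alert(1)",
                   "javascript&#x3A;alert(1)",
                   "JaVaScRiPt:alert(1)",
                   "https://example.com/"):
        print(sample, "->", effective_scheme(sample), is_blocked(sample))
```

Checking the scheme only after decoding matters here: a filter that inspects the raw attribute text would see no colon in the first two samples and let them through.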

A6

Mixed quote schemes

The three attribute-quoting forms HTML5 permits, side by side. The fourth form below uses none at all where the parser usually tolerates it.

plain unquoted embedded doubles inside singles embedded singles inside doubles literal ampersand
A7

Whitespace inside attribute values

This paragraph carries (a) a literal newline-separated attribute, (b) a tab-separated attribute, and (c) a CRLF-encoded attribute. HTML5 says these are all valid; the question is whether the codegen normalises, escapes, or preserves them.

A8

CDATA inside SVG & bare HTML

The next SVG carries a <style> with a real <![CDATA[ … ]]> block. Browsers parse this; the question is whether the htmlparser2 / Atlas pipeline treats it as text, comment, or drops it outright.

Inline SVG with CDATA-wrapped style block.

And a stray <![CDATA[unclosed-style fragment in plain-HTML context follows:

A9

Comments that look like markup

Above this paragraph is a normal comment. Below this paragraph is a comment that contains a fake script tag. If the codegen stringifies comments back into the output verbatim, this is a live XSS payload — if it drops the comment, it's safe but lossy.
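
To check how a parser classifies markup-looking text inside a comment, the sketch below uses Python's standard html.parser as a stand-in for the real pipeline. The class name CommentAudit and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class CommentAudit(HTMLParser):
    """Record comment bodies and start tags separately."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.start_tags = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def audit(markup):
    parser = CommentAudit()
    parser.feed(markup)
    parser.close()
    return parser.comments, parser.start_tags


if __name__ == "__main__":
    sample = "<p>before</p><!-- <script>alert(1)</script> --><p>after</p>"
    comments, tags = audit(sample)
    print("comments:", comments)
    print("start tags:", tags)
```

With this parser the fake script tag arrives as comment data and never as a start tag. The risk described above only materialises if a later stage writes that data back out without the comment delimiters.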

You are not using Internet Explorer (IE downlevel-revealed conditional).

Below: a comment wrapping an entire block tag (sometimes used as a poor-man's commented-out section).

A10

Hostile URL schemes & empty hrefs

Each link below carries a different hostile or pathological URL form:

And an image with a hostile src plus a hostile srcset:

A form action with a hostile scheme — should be either dropped, sanitised, or escaped by any responsible serialiser:

A11

Surrogate pairs & non-BMP code points

Single-codepoint emoji: 😀 (U+1F600). Family ZWJ sequence: 👨‍👩‍👧‍👦 (man + ZWJ + woman + ZWJ + girl + ZWJ + boy — one rendered glyph, seven code points).

Skin-tone modifier: 👋🏽 (U+1F44B + U+1F3FD). Flag sequence: 🇦🇺 (regional indicator A + regional indicator U).

Mathematical alphanumeric: 𝕏 (U+1D54F), 𝓒𝓱𝓮𝓼𝓽𝓮𝓻 in script form.

Astral attribute carrier: hover for emoji-loaded title.

And surrogate pair inside a numeric entity: 😀 renders as 😀.
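
The code-point counts above, and the UTF-16 code units behind them, can be verified with a small sketch. The helper names utf16_units and surrogate_pair are hypothetical; the arithmetic is the standard UTF-16 encoding of a supplementary code point.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Split a non-BMP code point into its (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
wave = "\U0001F44B\U0001F3FD"
flag = "\U0001F1E6\U0001F1FA"

if __name__ == "__main__":
    high, low = surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
    for label, text in (("family", family), ("wave", wave), ("flag", flag)):
        print(label, "code points:", len(text), "UTF-16 units:", utf16_units(text))
```

A pipeline that measures or slices strings in UTF-16 units, as JavaScript does, will see 11 units for the family sequence, not 7, so a cut at the wrong offset can split a pair and leave a lone surrogate.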

A12

Malformed & ambiguous entities

Unterminated entity (no semicolon): AT&T — HTML5 says this is a literal ampersand because amp is followed by a non-semicolon, non-alpha char.

Ambiguous-ampersand: foo&bar=1 in flowing prose (should stay literal).

Followed-by-valid-entity: &&copy; — stray ampersand abutting a valid entity name.

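Non-existent entity name: &notarealentity;. No entity has this name, but the prefix &not is a legacy named reference that HTML5 accepts without a semicolon, so a conforming parser decodes this in text content as ¬arealentity; rather than leaving it literal.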

Numeric overflow: &#9999999999; — out-of-range decimal numeric character reference.

Lone surrogate via numeric reference: &#xD83D; — half of a surrogate pair, technically invalid Unicode.
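
As a reference point for these cases, Python's html.unescape is documented as following the HTML5 rules for both valid and invalid character references. The sketch below runs the inputs from this section through it; it shows what one HTML5-style decoder does, not what Atlas does. The CASES list and the decode_all helper are assumptions for illustration.

```python
import html

# Inputs taken from the cases in this section.
CASES = [
    "AT&T",
    "foo&bar=1",
    "&&copy;",
    "&notarealentity;",
    "&#9999999999;",
    "&#xD83D;",
]


def decode_all(cases):
    """Map each raw input to its HTML5-style decoded form."""
    return {raw: html.unescape(raw) for raw in cases}


if __name__ == "__main__":
    for raw, decoded in decode_all(CASES).items():
        # ascii() makes replacement characters visible as escapes.
        print(f"{raw!r:22} -> {ascii(decoded)}")
```

Two results are worth noting: the out-of-range and lone-surrogate references both decode to U+FFFD, and the unknown name is partially decoded because its first three letters match the legacy reference &not.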

A13

Mixed-bag prose paragraph

A single paragraph that combines: an entity (Tom & Jerry), a non-breaking space (a b), an em-dash (here—and there), a U+202E override mid-string (filename‮gpj.exe.txt), a zero-width joiner (a‍b), a surrogate-pair emoji (😀), a non-BMP letter (𝕏), an Arabic substring (مرحبا), a NFD é (café decomposed), and a literal less-than escape: <not-a-tag>. If any one of those round-trips incorrectly, the rendered byte sequence diverges from source.

A14

<pre> with literal angle brackets, tabs, BOM

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>Tom & Jerry & Spike</title>
  </head>
  <body>
	<p>Indented with a tab, leading BOM at top of pre.</p>
	<p data-x="&quot;quoted&quot;">Doubly-escaped quotes.</p>
  </body>
</html>
A15

Boolean & valueless attributes

Three forms of the same boolean attribute on three buttons:

And a checkbox that is checked, required, and readonly via three different syntaxes:
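
The element markup for this section is not shown in the text, so the sketch below assumes the three usual spellings of a boolean attribute (bare, empty string, and name repeated as the value) and shows how Python's standard html.parser reports each. The class name AttrCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attribute dict) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def collect(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


def is_set(attrs, name):
    """A boolean attribute is on whenever it is present, whatever its value."""
    return name in attrs


SAMPLE = (
    '<button disabled>a</button>'
    '<button disabled="">b</button>'
    '<button disabled="disabled">c</button>'
)

if __name__ == "__main__":
    for tag, attrs in collect(SAMPLE):
        print(tag, attrs, is_set(attrs, "disabled"))
```

The three spellings arrive as None, an empty string, and "disabled" respectively. A serialiser that tests the value for truthiness instead of testing for presence would silently drop the first two.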