Encoding Attack Lab — will the compiler hold?
Every section below carries at least one deliberate weapon: an entity, a bidi mark, a hostile URL, an unquoted attribute, or a non-BMP code point. If a section reads cleanly in your browser but the Atlas output is mangled, that’s the bug.
Named & numeric entities in text
Plain ampersand: Tom & Jerry.
Less-than: x < y and greater-than: y > x.
Quotes: "double" and apostrophes: it's versus it's versus it's, the same glyph reached via three different escape paths.
Non-breaking spaces between short tokens collapse into the prose. A copyright glyph © 2026, an em-dash — like this — and a horizontal ellipsis… complete the named-entity menagerie. Numeric variants: — (em-dash, hex), – (en-dash, decimal), … (ellipsis, hex).
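A minimal sketch of the equivalence this section relies on, using Python's standard `html.unescape`. The escape spellings in the tables are assumptions, since the rendered page shows only the decoded glyphs and not the source forms.

```python
import html

# Assumed source spellings for the three apostrophe escape paths.
APOSTROPHE_FORMS = ["&apos;", "&#39;", "&#x27;"]

# Assumed spellings for the other glyphs, mapped to the expected code point.
EXPECTED = {
    "&mdash;": "\u2014",   # named em-dash
    "&#x2014;": "\u2014",  # hex em-dash
    "&#8211;": "\u2013",   # decimal en-dash
    "&#x2026;": "\u2026",  # hex ellipsis
    "&copy;": "\u00a9",    # copyright sign
    "&nbsp;": "\u00a0",    # non-breaking space
}

# Every apostrophe form should decode to the same single character.
apostrophes = [html.unescape(ref) for ref in APOSTROPHE_FORMS]

# Collect any reference that does not decode to its expected code point.
mismatches = {
    ref: html.unescape(ref)
    for ref, want in EXPECTED.items()
    if html.unescape(ref) != want
}
```

If a pipeline's output differs from these decoded values after one round trip, the divergence is in the pipeline rather than in the entity table.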
Double encoding
<script> should render visibly as the literal string <script>, not as a tag.
Double-encoded: &amp;lt;. After one decode it
becomes &lt;. After two decodes it becomes
<. A third decode is a no-op and leaves
<. We want the original byte sequence preserved
verbatim: if the compiler eagerly decodes, this turns into
a literal less-than and could prematurely open a tag downstream.
Triple-encoded entity-stuffing test:
&amp;amp;copy;. And the
adjacent-entity stuffing case:
&© — a stray ampersand followed
by a valid entity name; some parsers happily merge them.
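The decode chain described above can be traced step by step. This is a sketch using Python's `html.unescape` as a stand-in decoder; `decode_rounds` is a hypothetical helper, not part of the Atlas pipeline.

```python
import html


def decode_rounds(text, limit=10):
    """Return [text, decoded once, decoded twice, ...] up to a fixed point."""
    rounds = [text]
    for _ in range(limit):
        nxt = html.unescape(rounds[-1])
        if nxt == rounds[-1]:
            break  # a further decode would change nothing
        rounds.append(nxt)
    return rounds


lt_rounds = decode_rounds("&amp;lt;")
copy_rounds = decode_rounds("&amp;amp;copy;")
stuffed = html.unescape("&&copy;")  # stray ampersand abutting a valid entity
```

A compiler that preserves the source should emit the first element of each list unchanged. Each later element is what one extra, unwanted decode pass would produce.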
Bidi overrides & invisible code points
The next paragraph contains a U+202E RIGHT-TO-LEFT OVERRIDE mid-string, which has historically been used to spoof filenames in phishing payloads. It should be preserved verbatim, not dropped.
filenamegpj.exe.txt is the classic spoofed-extension trick.
And here is U+202D LEFT-TO-RIGHT OVERRIDE: forced-ltr-تجربة-end.
U+202B + U+202C bracket: مرحبا، world — should remain balanced.
Zero-width zoo: ABCDEF
(LRM, RLM, ZWSP, ZWJ, ZWNJ inside one inline code block).
Leading BOM in this paragraph: "the BOM is right before this quote".
| Marker | Code point | Inline specimen |
|---|---|---|
| LRM | U+200E | beforeafter |
| RLM | U+200F | beforeafter |
| ZWSP | U+200B | before |
| ZWJ | U+200D | ab |
| ZWNJ | U+200C | ab |
| BOM | U+FEFF | leading |
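Because none of these code points is visible, a round-trip check has to inspect them explicitly. The sketch below lists every Unicode format character (general category `Cf`) in a string. `reveal_format_chars` and both specimen strings are hypothetical, built from escapes so nothing invisible hides in the example itself.

```python
import unicodedata


def reveal_format_chars(text):
    """Return (index, 'U+XXXX', name) for each format (Cf) character."""
    found = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


# Hypothetical specimen with an RLO mid-string, as in the spoofed filename.
spoofed = "filename\u202egpj.exe.txt"
spoof_report = reveal_format_chars(spoofed)

# LRM, RLM, ZWSP, ZWJ, ZWNJ interleaved with visible letters.
zoo = "A\u200eB\u200fC\u200bD\u200dE\u200cF"
zoo_codes = [code for _, code, _ in reveal_format_chars(zoo)]

# A leading BOM is also a Cf character and shows up at index 0.
bom_report = reveal_format_chars("\ufeffleading")
```

Running the same function over the source and over the compiled output, then comparing the two lists, would show which marks were dropped or moved.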
Combining diacritics — NFC vs NFD
Pre-composed (NFC) form: café, one code point for é.
Decomposed (NFD) form: café, two code points for é: e (U+0065) followed by combining acute accent (U+0301). Visually identical, byte-different.
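The difference between the two forms is easy to confirm with Python's `unicodedata` module. The strings are written with escapes so the example does not depend on how an editor normalises pasted text.

```python
import unicodedata

nfc = "caf\u00e9"    # precomposed: U+00E9
nfd = "cafe\u0301"   # decomposed: U+0065 followed by U+0301


def same_text(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")
```

A byte-exact round-trip test should compare the raw strings. `same_text` would hide exactly the normalisation bug this section is probing for.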
Stack abuse (Zalgo-lite): ñ̈̃́̀ — base
letter with multiple combining marks. If the compiler’s
attribute serialiser strips combining marks individually, the
glyph degrades.
The attribute on this paragraph carries the NFD form of "café" and should round-trip byte-exact.
Encoded attribute payloads
Quotes encoded inside an attribute value, plus a fake script-tag string in another attribute.
The next link uses a numeric-entity colon to disguise a
javascript: scheme:
numeric-colon javascript link.
The next link uses a hex-entity colon: hex-colon javascript link.
Mixed: case-mixed scheme.
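The disguise works only against checks that inspect the raw attribute text. A sketch of a check that decodes character references first follows; `effective_scheme` is a hypothetical helper, and the payload strings are illustrative rather than copied from the fixture.

```python
import html
import re

# URL scheme grammar: a letter, then letters, digits, '+', '.', or '-'.
_SCHEME = re.compile(r"^([a-z][a-z0-9+.-]*):", re.IGNORECASE)


def effective_scheme(raw_attr_value):
    """Scheme of an href after entity decoding, lower-cased, or None."""
    decoded = html.unescape(raw_attr_value)
    match = _SCHEME.match(decoded)
    return match.group(1).lower() if match else None


decimal_colon = "javascript&#58;alert(1)"
hex_colon = "javascript&#x3a;alert(1)"
mixed_case = "JaVaScRiPt:alert(1)"

# A naive raw-text prefix check misses all three payloads.
naive_hits = [
    p.startswith("javascript:")
    for p in (decimal_colon, hex_colon, mixed_case)
]
```

The ordering matters: decoding must happen before the scheme test, because that is the order in which a browser sees the value.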
Mixed quote schemes
The three attribute-quoting forms HTML5 permits, side by side. The fourth form below uses none at all where the parser usually tolerates it.
plain unquoted embedded doubles inside singles embedded singles inside doubles literal ampersand
Whitespace inside attribute values
This paragraph carries (a) a literal newline-separated attribute, (b) a tab-separated attribute, and (c) a CRLF-encoded attribute. HTML5 says these are all valid; the question is whether the codegen normalises, escapes, or preserves them.
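One way a serialiser could preserve such values is to emit tabs, line feeds, and carriage returns as numeric character references. The sketch below illustrates that option. It is not a description of what the Atlas codegen does, and `escape_attr` is a hypothetical name.

```python
import html

# Whitespace controls written as numeric references so they survive re-parsing.
_WS_REFS = {"\t": "&#9;", "\n": "&#10;", "\r": "&#13;"}


def escape_attr(value):
    """Escape an attribute value for a double-quoted context."""
    out = html.escape(value, quote=True)  # handles & < > " '
    for ch, ref in _WS_REFS.items():
        out = out.replace(ch, ref)
    return out


samples = ["line one\nline two", "col1\tcol2", "dos\r\nline", 'say "hi" & go']
escaped = [escape_attr(s) for s in samples]
round_tripped = [html.unescape(e) for e in escaped]
```

Escaping the carriage return matters most here: HTML parsers normalise a literal CRLF in the input stream to LF, so only the reference form can carry the CR through.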
CDATA inside SVG & bare HTML
The next SVG carries a <style> with a real
<![CDATA[ … ]]> block. Browsers parse
this; the question is whether the htmlparser2 / Atlas pipeline
treats it as text, comment, or drops it outright.
And a stray <![CDATA[unclosed-style fragment in
plain-HTML context follows:
Comments that look like markup
Above this paragraph is a normal comment. Below this paragraph is a comment that contains a fake script tag. If the codegen stringifies comments back into the output verbatim, this is a live XSS payload — if it drops the comment, it's safe but lossy.
You are not using Internet Explorer (IE downlevel-revealed conditional).
Below: a comment wrapping an entire block tag (sometimes used as a poor-man's commented-out section).
Hostile URL schemes & empty hrefs
Each link below carries a different hostile or pathological URL form:
- javascript: plain
- JAVASCRIPT: upper-case
- javascript: with leading whitespace
- data: text/html
- data: image/svg+xml (base64 with embedded script)
- vbscript:
- empty href
- bare hash
- no href at all
And an image with a hostile src plus a hostile
srcset:
A form action with a hostile scheme — should be either dropped, sanitised, or escaped by any responsible serialiser:
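A serialiser that chooses to sanitise could classify each URL against an allow-list before emitting it. The sketch below is one possible policy, not the fixture's expected behaviour. `ALLOWED_SCHEMES` and `classify_url` are hypothetical, and the cleaning step mirrors how browsers strip tabs, newlines, and leading control characters before parsing a URL.

```python
import re

ALLOWED_SCHEMES = {"http", "https", "mailto"}  # illustrative policy only

_TAB_OR_NEWLINE = re.compile(r"[\t\n\r]")
_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")
_C0_AND_SPACE = "".join(chr(c) for c in range(0x21))  # U+0000 to U+0020


def classify_url(value):
    """Return one of: missing, empty, fragment, relative, allowed, blocked."""
    if value is None:
        return "missing"            # attribute absent altogether
    cleaned = _TAB_OR_NEWLINE.sub("", value).strip(_C0_AND_SPACE)
    if cleaned == "":
        return "empty"
    if cleaned.startswith("#"):
        return "fragment"
    match = _SCHEME.match(cleaned)
    if match is None:
        return "relative"
    scheme = match.group(1).lower()
    return "allowed" if scheme in ALLOWED_SCHEMES else "blocked"


verdicts = {
    url: classify_url(url)
    for url in [
        "javascript:alert(1)",
        "JAVASCRIPT:alert(1)",
        "  javascript:alert(1)",
        "java\tscript:alert(1)",
        "data:text/html,<b>x</b>",
        "vbscript:msgbox(1)",
        "",
        "#",
        "https://example.com/",
        "page.html",
    ]
}
```

An allow-list is used rather than a block-list so that an unfamiliar scheme is rejected by default. The same function could be applied to `src`, each `srcset` candidate, and a form's `action`.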
Surrogate pairs & non-BMP code points
Single-codepoint emoji: 😀 (U+1F600). Family ZWJ sequence: 👨👩👧👦 (man + ZWJ + woman + ZWJ + girl + ZWJ + boy — one rendered glyph, seven code points).
Skin-tone modifier: 👋🏽 (U+1F44B + U+1F3FD). Flag sequence: 🇦🇺 (regional indicator A + regional indicator U).
Mathematical alphanumeric: 𝕏 (U+1D54F), 𝓒𝓱𝓮𝓼𝓽𝓮𝓻 in script form.
Astral attribute carrier: hover for emoji-loaded title.
And surrogate pair inside a numeric entity:
😀 renders as 😀.
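The code-point counts quoted above, and the UTF-16 surrogate pairs behind them, can be checked directly. In this sketch `utf16_units` and `to_surrogates` are hypothetical helpers, and all specimens are built from escapes.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogates(code_point):
    """Split a non-BMP code point into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the astral range")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


grinning = "\U0001F600"
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
wave = "\U0001F44B\U0001F3FD"   # waving hand + medium skin tone
flag = "\U0001F1E6\U0001F1FA"   # regional indicators A and U
```

A pipeline that measures or slices strings in UTF-16 code units (as JavaScript does) sees 2 units for the single emoji and 11 for the family sequence, so any truncation at an odd offset leaves a lone surrogate.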
Malformed & ambiguous entities
AT&T: HTML5 says this is a literal ampersand because amp is followed by a non-semicolon, non-alpha char.
Ambiguous-ampersand: foo&bar=1 in flowing prose
(should stay literal).
Followed-by-valid-entity: &© —
stray ampersand abutting a valid entity name.
Non-existent entity name: ¬arealentity; (in text content HTML5 matches the legacy prefix &not even without a semicolon and decodes it, so only the remainder stays literal text).
Numeric overflow: � — out-of-range
decimal numeric character reference.
Lone surrogate via numeric reference: � —
half of a surrogate pair, technically invalid Unicode.
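Python's `html.unescape` follows the HTML5 rules for character references in text content, so it can serve as a rough reference for these cases. The input strings below are assumed source spellings, reconstructed from the descriptions above rather than taken from the fixture's markup.

```python
import html

cases = {
    "AT&T": html.unescape("AT&T"),                    # no entity name matches
    "foo&bar=1": html.unescape("foo&bar=1"),          # ambiguous ampersand
    "&&copy;": html.unescape("&&copy;"),              # stray & then valid name
    "&notarealentity;": html.unescape("&notarealentity;"),  # legacy prefix
    "&#1114112;": html.unescape("&#1114112;"),        # 0x110000, out of range
    "&#xD800;": html.unescape("&#xD800;"),            # lone high surrogate
}
```

Both the out-of-range reference and the lone surrogate decode to U+FFFD REPLACEMENT CHARACTER. The `&not` case shows the legacy-prefix rule, under which a semicolon-less match is decoded and the rest of the name stays literal.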
Mixed-bag prose paragraph
A single paragraph that combines: an entity (Tom & Jerry),
a non-breaking space (a b), an em-dash (here—and there),
a U+202E override mid-string (filenamegpj.exe.txt),
a zero-width joiner (ab), a surrogate-pair emoji (😀),
a non-BMP letter (𝕏), an Arabic substring (مرحبا),
a NFD é (café decomposed), and a literal less-than escape:
<not-a-tag>. If any one of those round-trips
incorrectly, the rendered byte sequence diverges from source.
<pre> with literal angle brackets, tabs, BOM
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Tom & Jerry & Spike</title>
</head>
<body>
<p>Indented with a tab, leading BOM at top of pre.</p>
<p data-x="&quot;quoted&quot;">Doubly-escaped quotes.</p>
</body>
</html>
Boolean & valueless attributes
Three forms of the same boolean attribute on three buttons:
And a checkbox that is checked, required, and readonly via three different syntaxes:
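To see how the spellings of a boolean attribute differ at the parser level, the sketch below feeds three buttons to Python's standard `html.parser.HTMLParser`. The markup is an assumed reconstruction of the three forms (bare, empty string, name repeated as value), and `BoolAttrCollector` is a hypothetical class.

```python
from html.parser import HTMLParser


class BoolAttrCollector(HTMLParser):
    """Record the raw value reported for every `disabled` attribute."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "disabled":
                self.seen.append(value)


collector = BoolAttrCollector()
collector.feed(
    "<button disabled>a</button>"
    '<button disabled="">b</button>'
    '<button disabled="disabled">c</button>'
)
raw_values = collector.seen

# Presence alone enables a boolean attribute, whatever the value.
all_disabled = len(raw_values) == 3
```

The three raw values differ (`None`, an empty string, and `"disabled"`) even though the buttons behave identically. A codegen that tests the value for truthiness instead of testing for presence would wrongly drop the first two forms.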