Tokenizer tests
===============

The test format is [JSON](http://www.json.org/). This has the advantage
that the syntax allows backward-compatible extensions to the tests and
the disadvantage that it is relatively verbose.

Basic Structure
---------------

    {"tests": [
        {"description": "Test description",
         "input": "input_string",
         "output": [expected_output_tokens],
         "initialStates": [initial_states],
         "lastStartTag": last_start_tag,
         "ignoreErrorOrder": ignore_error_order
        }
    ]}

Multiple tests per file are allowed simply by adding more objects to the
"tests" list.

`description`, `input` and `output` are always present. The other values
are optional.
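
For example, a minimal test for a single start tag could look like:

    {"description": "Simple start tag",
     "input": "<h1>",
     "output": [["StartTag", "h1", {}]]}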

### Test set-up

`test.input` is a string containing the characters to pass to the
tokenizer. Specifically, it represents the characters of the **input
stream**, and so implementations are expected to perform the processing
described in the spec's **Preprocessing the input stream** section
before feeding the result to the tokenizer.
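
As an illustration, a minimal Python sketch of the newline-normalization
part of that preprocessing (the spec's preprocessing also defines parse
errors for certain code points, which are omitted here) might be:

    def preprocess(raw):
        # Per the spec, CRLF pairs and lone CR characters both become LF.
        return raw.replace("\r\n", "\n").replace("\r", "\n")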

If `test.doubleEscaped` is present and `true`, then `test.input` is not
quite as described above. Instead, it must first be subjected to another
round of unescaping (i.e., in addition to any unescaping involved in the
JSON import), and the result of *that* represents the characters of the
input stream. Currently, the only unescaping required by this option is
to convert each sequence of the form \\uHHHH (where H is a hex digit)
into the corresponding Unicode code point. (Note that this option also
affects the interpretation of `test.output`.)
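
A minimal Python sketch of this extra unescaping pass (the function name
is illustrative, not part of the format) might be:

    import re

    def double_unescape(s):
        # Replace each \uHHHH escape with the code point it names.
        # Lone surrogates are produced as-is; some tests use them deliberately.
        return re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)), s)

For example, `double_unescape("a\\u0041b")` returns `"aAb"`.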

`test.initialStates` is a list of strings, each being the name of a
tokenizer state. The test should be run once for each string, using it
to set the tokenizer's initial state for that run. If
`test.initialStates` is omitted, it defaults to `["data state"]`.
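
In a test harness, this amounts to a loop along these lines (where
`make_tokenizer` is a hypothetical factory for the implementation under
test, not something the format defines):

    def run_test(test, make_tokenizer):
        # One run per requested initial state, defaulting to the data state.
        for state in test.get("initialStates", ["data state"]):
            tokenizer = make_tokenizer(initial_state=state)
            assert tokenizer.tokenize(test["input"]) == test["output"]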

`test.lastStartTag` is a lowercase string that should be used as "the
tag name of the last start tag to have been emitted from this
tokenizer", referenced in the spec's definition of **appropriate end tag
token**. If it is omitted, it is treated as if "no start tag has been
emitted from this tokenizer".

### Test results

`test.output` is a list of tokens, ordered so that the first token
produced by the tokenizer is the first (leftmost) item in the list. The
list must match the **complete** list of tokens that the tokenizer
should produce. Valid tokens are:

    ["DOCTYPE", name, public_id, system_id, correctness]
    ["StartTag", name, {attributes}*, true*]
    ["StartTag", name, {attributes}]
    ["EndTag", name]
    ["Comment", data]
    ["Character", data]
    "ParseError"

`public_id` and `system_id` are either strings or `null`. `correctness`
is either `true` or `false`; `true` corresponds to the force-quirks flag
being false, and vice-versa.
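
For example, tokenizing `<!DOCTYPE html>` should produce a single token
with no public or system identifier and the force-quirks flag unset:

    ["DOCTYPE", "html", null, null, true]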

When the self-closing flag is set, the `StartTag` array has `true` as
its fourth entry. When the flag is not set, the array has only three
entries for backwards compatibility.
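
For example, `<br/>` produces a start tag token with the self-closing
flag set:

    ["StartTag", "br", {}, true]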

All adjacent character tokens are coalesced into a single
`["Character", data]` token.

If `test.doubleEscaped` is present and `true`, then every string within
`test.output` must be further unescaped (as described above) before
comparing with the tokenizer's output.

`test.ignoreErrorOrder` is a boolean value indicating that the order of
`ParseError` tokens relative to other tokens in the output stream is
unimportant, and implementations should ignore such differences between
their output and `expected_output_tokens`. (This is used for errors
emitted by the input stream preprocessing stage, since it is useful to
test that code but it is undefined when the errors occur.) If it is
omitted, it defaults to `false`.
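
One plausible way to implement this relaxed comparison (a sketch, not
something the format prescribes) is to compare the non-error tokens in
order and the number of errors separately:

    def tokens_match(expected, actual, ignore_error_order):
        if not ignore_error_order:
            return expected == actual
        # Order-insensitive for errors: compare non-error tokens in
        # sequence and the ParseError counts on their own.
        def errors(tokens):
            return sum(1 for t in tokens if t == "ParseError")
        def others(tokens):
            return [t for t in tokens if t != "ParseError"]
        return (others(expected) == others(actual)
                and errors(expected) == errors(actual))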

xmlViolation tests
------------------

`tokenizer/xmlViolation.test` differs from the above in a couple of
ways:

- The name of the single member of the top-level JSON object is
  "xmlViolationTests" instead of "tests".
- Each test's expected output assumes that the implementation is
  applying the tweaks given in the spec's "Coercing an HTML DOM into an
  infoset" section.