Parsoid
Group: Features
Team: C. Scott Ananian, Arlo Breault, Marc Ordinas i Llopis
Lead: Subramanya Sastry
Status: See updates
Parsoid is an application which can translate back and forth, at runtime, between MediaWiki's wikitext syntax and an equivalent HTML/RDFa document model with enhanced support for automated processing and rich editing. It has been under development by a team at the Wikimedia Foundation since 2012. It is currently used extensively by VisualEditor and Flow, as well as a growing list of other applications.
Parsoid is structured as a web service, and is written in JavaScript, making use of Node.js. It is intended to provide flawless back-and-forth conversion, i.e. to avoid both "dirty diffs" and any information loss.
For more on the overall project, see this blog post from March 2013. To read about the HTML model being used, see MediaWiki DOM spec. For current and future development information, see the project roadmap.
Getting started
Parsoid is a web service implemented using Node.js, often referred to simply as node. For a quick overview, you can test-drive Parsoid using a node web service. Development happens in the Parsoid service in Git (see tree). If you need help, you can contact us in #mediawiki-parsoid or on the wikitext-l mailing list.
If you use the MediaWiki-Vagrant development environment using a virtual machine, you can simply add the role visualeditor to it and it will set up a working Parsoid along with Extension:VisualEditor.
Parsoid setup
See Parsoid/Setup for detailed instructions.
Troubleshooting
Converting simple wikitext
You can convert simple wikitext snippets using our parse.js script:
cd tests
echo '[[Foo]]' | node parse
More options are available with
node parse --help
The Parsoid web API
See Parsoid/API
Development
Code review happens in Gerrit. See Gerrit/Getting started and ping us in #mediawiki-parsoid.
Running the tests
To run all parser tests:
npm test
parserTests has quite a few options now, which can be listed using node ./parserTests.js --help.
An alternative wrapper taking wikitext on stdin and emitting HTML on stdout is modules/parser/parse.js:
cd tests
echo '{{:Main Page}}' | node parse.js
This example will transclude the English Wikipedia's en:Main Page, including its embedded templates. Also check out node parse.js --help for options.
You can also try to round-trip a page and check for the significance of the differences. For example, try
cd tests
node roundtrip-test.js --wiki mw Parsoid
This example will run the roundtripper on this page (the one you're reading, including all of this text) and report the results. It will also attempt to determine whether the differences in wikitext create any differences in the display of the page. If not, it reports the difference as "syntactic".
Finally, if you really wanted to hammer the Parsoid codebase to see how we're doing, you can try running the roundtrip testing environment on your computer with a list of titles.
As if that weren't enough, we've also added a --selser option, with multiple related options, to the parserTests.js script. The way it works:
cd tests
node parserTests.js --selser
You can also write out change files, read them in, and specify any number of iterations of random changes to go through. There is also a plan to pass actual changes in to the tests, but that work is still in progress.
Debugging Parsoid (for developers)
See Parsoid/Debugging for debugging tips.
Monthly high-level status summary
The GSoC 2014 LintTrap project also wrapped up, and we hope to develop this further over the coming months and go live with it later this year.
With an eye towards supporting Parsoid-driven page views, the Parsoid team worked on a few different tracks: we deployed the visual diff mass testing service at http://parsoid-tests.wikimedia.org/visualdiff/, we added Tidy support to parser tests and updated tests, which makes it easy for Parsoid to target the PHP parser + Tidy combo found in production, and we continued to make CSS and other fixes.

Todo
Our big plans are spelled out in some detail in our roadmap. Smaller-step tasks are tracked in our bug list.
If you have questions, try to ping the team on #mediawiki-parsoid, or send a mail to the wikitext-l mailing list. If all that fails, you can also contact Gabriel Wicke by mail.
Architecture
The broad architecture looks like this:
  | wikitext
  V
PEG wiki/HTML tokenizer (or other tokenizers / SAX-like parsers)
  | Chunks of tokens
  V
Token stream transformations
  | Chunks of tokens
  V
HTML5 tree builder
  | HTML5 DOM tree
  V
DOM Postprocessors
  | HTML5 DOM tree
  V
(X)HTML serialization
  |
  +------------------> Browser
  |
  V
VisualEditor
So this is basically an HTML parser pipeline, with the regular HTML tokenizer replaced by a combined wiki/HTML tokenizer, and with additional functionality implemented as (mostly syntax-independent) token stream transformations.
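The shape of the pipeline can be sketched as a chain of small functions. This is an illustrative toy only: the function names and token shapes are invented here, and Parsoid's real stages are streaming, asynchronous, and far more involved.

```javascript
// Toy pipeline mirroring the diagram: tokenize -> build tree -> serialize.
// Handles only '''bold''' markup; everything here is invented for illustration.

// Tokenizer: turn '''bold''' wikitext into a flat token stream.
function tokenize(wikitext) {
  const tokens = [];
  const parts = wikitext.split("'''");
  parts.forEach((text, i) => {
    if (i > 0) tokens.push({ type: i % 2 ? 'tag-open' : 'tag-close', name: 'b' });
    if (text) tokens.push({ type: 'text', value: text });
  });
  return tokens;
}

// Tree builder: nest open/close tokens into a tiny DOM-like tree.
function buildTree(tokens) {
  const root = { name: 'body', children: [] };
  const stack = [root];
  for (const t of tokens) {
    const top = stack[stack.length - 1];
    if (t.type === 'tag-open') {
      const node = { name: t.name, children: [] };
      top.children.push(node);
      stack.push(node);
    } else if (t.type === 'tag-close') {
      stack.pop();
    } else {
      top.children.push(t);
    }
  }
  return root;
}

// Serializer: emit HTML from the tree.
function serialize(node) {
  if (node.type === 'text') return node.value;
  const inner = node.children.map(serialize).join('');
  return node.name === 'body' ? inner : `<${node.name}>${inner}</${node.name}>`;
}

const html = serialize(buildTree(tokenize("plain '''bold''' text")));
// html is "plain <b>bold</b> text"
```

In the real pipeline, the token stream transformation and DOM postprocessing stages described below sit between these steps.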
- The PEG-based wiki tokenizer produces a combined token stream from wiki and html syntax. The PEG grammar is a context-free grammar that can be ported to different parser generators, mostly by adapting the parser actions to the target language. Currently we use pegjs to build the actual JavaScript tokenizer for us. We try to do as much work as possible in the grammar-based tokenizer, so that the emitted tokens are already mostly syntax-independent.
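For intuition about what the tokenizer emits, here is a hand-written stand-in for a single grammar rule: scanning for [[Title]] wikilinks and emitting syntax-independent tokens. The token shapes are invented for illustration and are not Parsoid's actual token format.

```javascript
// Illustrative stand-in for one tokenizer rule: find [[Title]] wikilinks
// (without pipes) and emit a mixed stream of text and wikilink tokens.
function tokenizeLinks(wikitext) {
  const tokens = [];
  const re = /\[\[([^\]|]+)\]\]/g;
  let last = 0;
  let m;
  while ((m = re.exec(wikitext)) !== null) {
    if (m.index > last) {
      tokens.push({ type: 'text', value: wikitext.slice(last, m.index) });
    }
    tokens.push({ type: 'wikilink', target: m[1] });
    last = re.lastIndex;
  }
  if (last < wikitext.length) {
    tokens.push({ type: 'text', value: wikitext.slice(last) });
  }
  return tokens;
}

const toks = tokenizeLinks('See [[Foo]] now');
```

The real grammar handles piped links, interwiki prefixes, and much more, but the output idea is the same: tokens that no longer care which surface syntax produced them.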
- Token stream transformations are used to implement context-sensitive wiki-specific functionality (wiki lists, quotes for italic/bold etc). Templates are also expanded at this stage, which makes it possible to still render unbalanced templates like table start / row / end combinations.
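A minimal sketch of why template expansion at the token level can tolerate unbalanced output: an expanded template simply splices more tokens into the stream, even a lone table-start, and balancing is left to the tree builder. Template names and token shapes here are invented.

```javascript
// Toy token-stream transformation: expand template tokens in place.
// A template body is just a list of tokens, so it may legally be
// unbalanced (e.g. only a table-start tag); the later HTML5 tree-building
// stage is what enforces proper nesting.
const templates = {
  'table-start': [{ type: 'tag-open', name: 'table' }], // unbalanced on purpose
  'hello': [{ type: 'text', value: 'Hello, world' }],
};

function expandTemplates(tokens) {
  const out = [];
  for (const t of tokens) {
    if (t.type === 'template' && templates[t.name]) {
      out.push(...templates[t.name]);
    } else {
      out.push(t);
    }
  }
  return out;
}

const stream = expandTemplates([
  { type: 'template', name: 'hello' },
  { type: 'text', value: '!' },
]);
```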
- The resulting tokens are then fed to an HTML5-spec-compatible DOM tree builder (currently the 'html5' node.js module), which builds an HTML5 DOM tree from the token soup. This step already sanitizes nesting and enforces some content-model restrictions according to the rules of the HTML5 parsing spec.
- The resulting DOM is further manipulated using postprocessors. Currently, any remaining top-level inline content is wrapped into paragraphs in such a postprocessor. For output for viewing, further document-model sanitization can be added here to get very close to what Tidy does in the production parser.
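The paragraph-wrapping postprocessor mentioned above can be sketched as follows, using a minimal array-of-nodes stand-in for the DOM (the node representation and block-tag list are invented for illustration):

```javascript
// Toy DOM postprocessor: wrap runs of top-level inline content in <p> nodes.
// Block-level siblings are left alone; consecutive inline nodes between
// them are gathered into a paragraph.
const BLOCK = new Set(['table', 'div', 'ul', 'ol', 'p']);

function wrapParagraphs(children) {
  const out = [];
  let run = [];
  const flush = () => {
    if (run.length) {
      out.push({ name: 'p', children: run });
      run = [];
    }
  };
  for (const node of children) {
    if (node.name && BLOCK.has(node.name)) {
      flush();
      out.push(node);
    } else {
      run.push(node);
    }
  }
  flush();
  return out;
}

const wrapped = wrapParagraphs([
  { type: 'text', value: 'intro' },
  { name: 'table', children: [] },
  { type: 'text', value: 'outro' },
]);
// wrapped: <p>intro</p>, <table>, <p>outro</p>
```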
- Finally, the DOM tree can be serialized as XML or HTML.
Technical documents
- Parsoid/Roadmap: What we are up to.
- Parsoid/MediaWiki DOM spec: Wiki content model spec using HTML/XML DOM and RDFa. The external interface for Parsoid, and designed to be useful as a future storage format.
- Parsoid/limitations: Limitations in Parsoid, mainly contrived templating (ab)uses that don't matter in practice. Could be extended to be similar to the preprocessor upgrade notes.
- Parsoid/Round-trip testing: The round-trip testing setup we are using to test the wikitext -> HTML DOM -> wikitext round-trip on actual Wikipedia content.
- Parsoid/Visual Diffs Testing: Info about visual diff testing for comparing Parsoid's HTML rendering with the PHP parser's HTML rendering, plus a testreduce setup for doing mass visual diff tests.
- /test cases: Please add interesting snippets or pages.
- If you feel masochistic, check out our broken wikitext tar pit.
See also
- Future/Parser plan: Early (now relatively old) design ideas and issues
- User:GWicke: Some notes on existing wiki and HTML parsers, should really be moved to general documentation
- Special:PrefixIndex/Parsoid/: Parsoid-related pages on this wiki