#include <Parser.hpp>
// This must be private per thread, as the parser is not reentrant.
// Get an input stream
Stream::FileStream xFS("index.html");
// Create a node allocator if not done yet (this can be as simple as the "new"-based allocator)
HTML::Elements::Allocators::SimpleHeap xAllocator;
// Create the parser now
HTML::Parser xParser(xFS, xAllocator, HTML::Parser::InstantParsing, HTML::LooseDTD);
// Get a reference to the DOM tree now
HTML::DOMTree & xTree = xParser.getDOMTree();
// Do whatever you want with the DOM tree
Moreover, because most documents on the web are badly structured, the parser performs some cleanup, as in the following cases (idea borrowed from HTML Tidy: http://www.w3.org/People/Raggett/tidy/):
<h1>heading
<h2>subheading</h3>
is cleaned to:
<h1>heading</h1>
<h2>subheading</h2>
Strictly, </h3> cannot close h2, but the parser closes h2 at that point anyway.

<p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?
is cleaned to:
<p>here is a para <b>bold <i>bold italic</i></b><i> bold?</i> normal?

<h1><i>italic heading</h1>
<p>new paragraph
is cleaned to:
<h1><i>italic heading</i></h1>
<p>new paragraph

<div><a href="#refs">References<a> below</div>
is cleaned to:
<div><a href="#refs">References</a><a> below</a></div>
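The overlapping-tag repairs above can be sketched with a stack of open elements. This is only an illustration of the general technique, not the parser's actual implementation, which this page does not show. The names `rebalance` and `isInline` are hypothetical, the list of inline tags is an assumption, and the sketch handles only tags without attributes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of HTML::Parser): the inline tag list is an assumption.
static bool isInline(const std::string& name) {
    static const std::vector<std::string> inlineTags = {"b", "i", "em", "strong", "u"};
    return std::find(inlineTags.begin(), inlineTags.end(), name) != inlineTags.end();
}

// Hypothetical sketch: rebalance overlapping tags in an attribute-free fragment.
std::string rebalance(const std::string& html) {
    std::vector<std::string> open;  // stack of currently open element names
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') {
            // Copy text up to the next tag unchanged.
            std::size_t next = html.find('<', pos);
            if (next == std::string::npos) next = html.size();
            out.append(html, pos, next - pos);
            pos = next;
            continue;
        }
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos) {  // unterminated tag: keep the rest as text
            out.append(html, pos, std::string::npos);
            break;
        }
        const bool closing = (end > pos + 1 && html[pos + 1] == '/');
        const std::size_t nameStart = pos + (closing ? 2 : 1);
        const std::string name = html.substr(nameStart, end - nameStart);
        pos = end + 1;

        if (!closing) {
            open.push_back(name);
            out += "<" + name + ">";
            continue;
        }
        // Find the innermost open element with this name.
        const auto it = std::find(open.rbegin(), open.rend(), name);
        if (it == open.rend()) continue;  // stray end tag: drop it
        const std::size_t idx =
            open.size() - 1 - static_cast<std::size_t>(it - open.rbegin());

        // Remember the elements opened after the match so they can be reopened.
        const std::vector<std::string> reopened(open.begin() + idx + 1, open.end());
        // Close the inner elements first, then the matched element itself.
        for (std::size_t i = open.size(); i-- > idx; ) out += "</" + open[i] + ">";
        open.resize(idx);
        // Reopen the inner elements only when an inline element was closed.
        if (isInline(name)) {
            for (const std::string& n : reopened) {
                out += "<" + n + ">";
                open.push_back(n);
            }
        }
    }
    // Close whatever is still open at the end of the input.
    for (std::size_t i = open.size(); i-- > 0; ) out += "</" + open[i] + ">";
    return out;
}
```

Called on the fragments `<b>bold <i>bold italic</b> bold?</i> normal?` and `<h1><i>italic heading</h1>`, this sketch produces the cleaned forms shown above. The heading and unclosed `<a>` cases need extra rules (implicit closing of one element by the start of another) that are left out here.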
Public Types

| Kind | Declaration | Description |
|---|---|---|
| enum | ParsingTime { InstantParsing = 0, DelayedParsing = 1 } | The parsing time. |

Public Member Functions

| Return type | Member | Description |
|---|---|---|
| const HTML::DOMTree & | getDOMTree () const | Get the DOM tree (read-only access). |
| HTML::DOMTree & | getDOMTree () | Get the DOM tree (mutable access). |
| const ParsingError & | getLastParsingError () const | Return the last parsing error, if any. |
| const unsigned char * | getUnboundedAccessToStream (const uint32 startPos) | Get an unbounded memory version of the source stream. |
| const ParsingError & | Parse () | Parse the given stream and update the DOM tree. |
| | Parser (Stream::InputStream &inputStreamRef, HTML::Elements::Allocators::BaseAllocator &allocatorRef, const ParsingTime &whenToParse=InstantParsing, const HTML::DTDType &dtd=HTML::StandardDTD) | The parser constructor. |
| | ~Parser () | The parser destructor. |

Static Public Attributes

| Type | Name | Description |
|---|---|---|
| static const BuildRadix | radixElements | The allowed elements radix shortcut. |
HTML::Parser::Parser (Stream::InputStream & inputStreamRef,
                      HTML::Elements::Allocators::BaseAllocator & allocatorRef,
                      const ParsingTime & whenToParse = InstantParsing,
                      const HTML::DTDType & dtd = HTML::StandardDTD)
The parser constructor.
| inputStreamRef | The reference to the input stream (can be any stream, |
| allocatorRef | The reference to the chosen allocator (can be any allocator, |
| whenToParse | When to parse the given stream |
| dtd | What DTD to use when parsing the input stream |
HTML::Parser::~Parser () [inline]
The parser destructor.
const HTML::DOMTree& HTML::Parser::getDOMTree () const [inline]
Get the DOM tree (read-only access).
HTML::DOMTree& HTML::Parser::getDOMTree () [inline]
Get the DOM tree (mutable access).
const ParsingError& HTML::Parser::getLastParsingError () const [inline]
Return the last parsing error, if any.
const unsigned char* HTML::Parser::getUnboundedAccessToStream (const uint32 startPos) [inline]
Get an unbounded memory version of the source stream.
| startPos | The start position |
const Parser::ParsingError & HTML::Parser::Parse ()
Parse the given stream and update the DOM tree.
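A minimal sketch of deferring the parse with DelayedParsing and calling Parse() explicitly. It assumes that a parser built with DelayedParsing does nothing until Parse() is called, which this page implies but does not state. The declarations at the top are stand-ins that copy only the names listed on this page so the snippet compiles on its own; their bodies, the `built` and `failed` fields, and the `parseLater` function are hypothetical. With the real library, include <Parser.hpp> instead of the stand-ins.

```cpp
#include <string>

// --- Stand-in declarations (placeholders, not the real library) ---
namespace Stream { struct InputStream { std::string data; }; }
namespace HTML {
    enum DTDType { StandardDTD, LooseDTD };         // enumerator order is assumed
    struct DOMTree { bool built = false; };         // `built` is a placeholder field
    namespace Elements { namespace Allocators { struct BaseAllocator {}; } }

    class Parser {
    public:
        enum ParsingTime { InstantParsing = 0, DelayedParsing = 1 };
        struct ParsingError { bool failed = false; };  // real layout is unknown

        Parser(Stream::InputStream & inputStreamRef,
               Elements::Allocators::BaseAllocator & allocatorRef,
               const ParsingTime & whenToParse = InstantParsing,
               const DTDType & dtd = StandardDTD)
            : input(inputStreamRef), allocator(allocatorRef), dtdType(dtd) {
            // Assumed behaviour: parse immediately only for InstantParsing.
            if (whenToParse == InstantParsing) Parse();
        }
        const ParsingError & Parse() { tree.built = true; return lastError; }
        const ParsingError & getLastParsingError() const { return lastError; }
        DOMTree & getDOMTree() { return tree; }

    private:
        Stream::InputStream & input;
        Elements::Allocators::BaseAllocator & allocator;
        DTDType dtdType;
        DOMTree tree;
        ParsingError lastError;
    };
}

// --- Usage pattern (hypothetical function name) ---
// Returns true when the tree was untouched before Parse() and built after it.
bool parseLater(Stream::InputStream & in, HTML::Elements::Allocators::BaseAllocator & alloc) {
    // Construct without parsing, for example to postpone the cost until the tree is needed.
    HTML::Parser parser(in, alloc, HTML::Parser::DelayedParsing, HTML::LooseDTD);
    const bool builtBefore = parser.getDOMTree().built;

    // Trigger the parse explicitly; the returned reference is the last parsing error.
    const HTML::Parser::ParsingError & error = parser.Parse();
    (void)error;  // the members of ParsingError are not documented on this page

    return !builtBefore && parser.getDOMTree().built;
}
```

Parse() returns the same kind of reference as getLastParsingError(), so either can be inspected after parsing once the contents of ParsingError are known.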
const BuildRadix HTML::Parser::radixElements [static]
The allowed elements radix shortcut.
