#include <Parser.hpp>
// This must be private per thread, as the parser is not reentrant.
// Get an input stream
Stream::FileStream xFS("index.html");
// Create a node allocator if not done yet (this can be as simple as the "new"-based allocator)
HTML::Elements::Allocators::SimpleHeap xAllocator;
// Create the parser now
HTML::Parser xParser(xFS, xAllocator, HTML::Parser::InstantParsing, HTML::LooseDTD);
// Get a reference to the DOM tree now
HTML::DOMTree & xTree = xParser.getDOMTree();
// Do whatever you want with the DOM tree
Moreover, because most documents on the web are badly structured, the parser performs some cleanup on its input, as in the following examples (source of the idea: http://www.w3.org/People/Raggett/tidy/):
Input:  <h1>heading <h2>subheading</h3>
Output: <h1>heading</h1> <h2>subheading</h2>
A </h3> tag cannot close an <h2> element, but the parser still closes the <h2> here.

Input:  <p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?
Output: <p>here is a para <b>bold <i>bold italic</i></b><i> bold?</i> normal?

Input:  <h1><i>italic heading</h1> <p>new paragraph
Output: <h1><i>italic heading</i></h1> <p>new paragraph

Input:  <div><a href="#refs">References<a> below</div>
Output: <div><a href="#refs">References</a><a> below</a></div>
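The kind of repair shown above can be approximated with a stack of open elements. The following is a minimal sketch of that general idea, not the algorithm used by HTML::Parser: `balanceTags` is a hypothetical helper introduced only for illustration, it works on a plain string rather than on a DOM tree, and it compares element names case-sensitively. It closes any element still open when an enclosing element is closed, drops closing tags that match nothing, and closes whatever remains open at the end of the input.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of HTML::Parser): rebalances a flat HTML
// fragment using a stack of the currently open element names.
std::string balanceTags(const std::string & input)
{
    std::vector<std::string> open;   // Names of the currently open elements
    std::string output;
    std::size_t pos = 0;

    while (pos < input.size())
    {
        if (input[pos] != '<')
        {
            // Plain text is copied through unchanged
            output += input[pos++];
            continue;
        }

        const std::size_t end = input.find('>', pos);
        if (end == std::string::npos)
        {
            // Unterminated tag: copy the remainder verbatim and stop
            output += input.substr(pos);
            break;
        }

        const bool closing = (end > pos + 1 && input[pos + 1] == '/');
        const std::size_t nameStart = pos + (closing ? 2 : 1);
        // The element name ends at the first whitespace (attributes follow) or at '>'
        const std::size_t nameEnd = input.find_first_of(" \t>", nameStart);
        const std::string name = input.substr(nameStart, nameEnd - nameStart);

        if (!closing)
        {
            // Opening tag: remember it and copy it with its attributes
            open.push_back(name);
            output += input.substr(pos, end - pos + 1);
        }
        else
        {
            // Find the innermost open element with this name;
            // open.size() is used as the "not found" value
            std::size_t match = open.size();
            for (std::size_t i = open.size(); i-- > 0; )
            {
                if (open[i] == name) { match = i; break; }
            }
            // Close the matched element and everything opened after it,
            // innermost first. A stray closing tag (no match) emits nothing.
            while (open.size() > match)
            {
                output += "</" + open.back() + ">";
                open.pop_back();
            }
        }
        pos = end + 1;
    }

    // Close whatever is still open at the end of the input
    while (!open.empty())
    {
        output += "</" + open.back() + ">";
        open.pop_back();
    }
    return output;
}
```

Calling `balanceTags("<h1><i>italic heading</h1>")` returns `<h1><i>italic heading</i></h1>`, matching the third example. The sketch deliberately stays simple: it has no knowledge of the DTD, so it does not know that headings cannot nest (first example), does not reopen inline elements after closing them (second example), and treats void elements such as `<br>` as if they needed a closing tag.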
Public Types
enum ParsingTime { InstantParsing = 0, DelayedParsing = 1 }
    The parsing time.

Public Member Functions
const HTML::DOMTree & getDOMTree () const
    Get the DOM tree.
HTML::DOMTree & getDOMTree ()
    Get the DOM tree.
const ParsingError & getLastParsingError () const
    Return the last parsing error if any.
const unsigned char * getUnboundedAccessToStream (const uint32 startPos)
    Get an unbounded memory version of source stream.
const ParsingError & Parse ()
    Parse the given stream and update the DOM tree.
Parser (Stream::InputStream &inputStreamRef, HTML::Elements::Allocators::BaseAllocator &allocatorRef, const ParsingTime &whenToParse=InstantParsing, const HTML::DTDType &dtd=HTML::StandardDTD)
    The parser constructor.
~Parser ()
    The parser destructor.

Static Public Attributes
static const BuildRadix radixElements
    The allowed elements radix shortcut.
HTML::Parser::Parser (
    Stream::InputStream & inputStreamRef,
    HTML::Elements::Allocators::BaseAllocator & allocatorRef,
    const ParsingTime & whenToParse = InstantParsing,
    const HTML::DTDType & dtd = HTML::StandardDTD
)
The parser constructor.
    inputStreamRef    The reference to the input stream (can be any stream,
    allocatorRef      The reference to the chosen allocator (can be any allocator,
    whenToParse       When to parse the given stream
    dtd               What DTD to use when parsing the input stream
HTML::Parser::~Parser () [inline]
The parser destructor.
const HTML::DOMTree& HTML::Parser::getDOMTree () const [inline]
Get the DOM tree.
HTML::DOMTree& HTML::Parser::getDOMTree () [inline]
Get the DOM tree.
const ParsingError& HTML::Parser::getLastParsingError () const [inline]
Return the last parsing error if any.
const unsigned char* HTML::Parser::getUnboundedAccessToStream (const uint32 startPos) [inline]
Get an unbounded memory version of source stream.
    startPos    The start position
const Parser::ParsingError & HTML::Parser::Parse ()
Parse the given stream and update the DOM tree.
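The relationship between the whenToParse constructor argument and Parse() can be illustrated with a small stand-in class. This is a sketch under a stated assumption: that InstantParsing means the stream is parsed inside the constructor, while DelayedParsing leaves the work until Parse() is called explicitly. The reference above does not spell this out. `SketchParser`, `isParsed` and `getTagCount` are hypothetical names that do not exist in the library, and the "parsing" here is only a count of `<` characters.

```cpp
#include <string>

// Stand-in types for illustration only; they are NOT the HTML::Parser API.
enum ParsingTime { InstantParsing = 0, DelayedParsing = 1 };

class SketchParser
{
public:
    SketchParser(const std::string & text, const ParsingTime whenToParse = InstantParsing)
        : source(text), parsed(false), tagCount(0)
    {
        // Assumed behaviour: InstantParsing does the work in the constructor
        if (whenToParse == InstantParsing) Parse();
    }

    // Stand-in for the real parse step: counts '<' characters in the source
    void Parse()
    {
        tagCount = 0;
        for (const char c : source)
        {
            if (c == '<') ++tagCount;
        }
        parsed = true;
    }

    bool isParsed() const { return parsed; }
    unsigned int getTagCount() const { return tagCount; }

private:
    std::string  source;     // Copy of the input text
    bool         parsed;     // True once Parse() has run
    unsigned int tagCount;   // Result of the stand-in "parse"
};
```

With this stand-in, `SketchParser a("<p>hi</p>")` is already parsed after construction, whereas `SketchParser b("<p>hi</p>", DelayedParsing)` reports `isParsed() == false` until `b.Parse()` is called. If the real parser behaves the same way, a caller using DelayedParsing would call Parse() before getDOMTree() and then inspect the returned ParsingError.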
const BuildRadix HTML::Parser::radixElements [static]
The allowed elements radix shortcut.