UnlimitedZoomingInterface: HTML::Parser Class Reference

Detailed Description

            // Each thread must use its own parser instance, as the parser is not reentrant.
            // Get an input stream
            Stream::FileStream xFS("index.html");
            // Create a node allocator if not done yet (this can be as simple as the "new"-based allocator)
            HTML::Elements::Allocators::SimpleHeap xAllocator;
            // Create the parser now
            HTML::Parser xParser(xFS, xAllocator, HTML::Parser::InstantParsing, HTML::LooseDTD);
            // Get a reference to the DOM tree now
            HTML::DOMTree & xTree = xParser.getDOMTree();
            // Do whatever you want with the DOM tree

Moreover, because most documents on the web are badly structured, the parser performs the following cleanups (source idea: http://www.w3.org/People/Raggett/tidy/):

Missing or mismatched end tags are detected and corrected:

           <h1>heading
           <h2>subheading</h3>

is understood as:

           <h1>heading</h1>
           <h2>subheading</h2>

This only works for elements that are incorrectly placed, such as a block-level element (h2) inside a block-level element that allows only inline content (h1). In the previous example, if an h3 had been opened before the h1, entering the h1 would already have closed it (implicitly).

This means that the h3 end tag cannot legitimately close the h2, but the h2 is still closed here.

Warning:: This might not be what the author intended, but that is how browsers handle it.
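The implicit closing described above can be sketched on a plain stack of open element names. This is a minimal illustration, not the parser's actual code: `openElement`, `closeHeading` and the element classification are hypothetical, and the rule that a stray heading end tag closes the innermost open heading is an assumption chosen to reproduce the example.

```cpp
#include <string>
#include <vector>

static bool isHeading(const std::string& name) {
    return name.size() == 2 && name[0] == 'h' && name[1] >= '1' && name[1] <= '6';
}

// Assumption for this sketch: headings and paragraphs accept inline content only.
static bool allowsOnlyInline(const std::string& name) {
    return isHeading(name) || name == "p";
}

// Assumption for this sketch: a tiny, fixed set of block-level elements.
static bool isBlockLevel(const std::string& name) {
    return allowsOnlyInline(name) || name == "div";
}

// Opens `name` and returns the markup to emit, including implied end tags.
std::string openElement(std::vector<std::string>& open, const std::string& name) {
    std::string emitted;
    // A block-level element cannot live inside an inline-only container,
    // so the innermost such container is closed first.
    while (isBlockLevel(name) && !open.empty() && allowsOnlyInline(open.back())) {
        emitted += "</" + open.back() + ">";
        open.pop_back();
    }
    open.push_back(name);
    emitted += "<" + name + ">";
    return emitted;
}

// Assumed rule: a heading end tag closes the innermost open element if that
// element is a heading, whatever its level. Otherwise nothing is emitted.
std::string closeHeading(std::vector<std::string>& open, const std::string& name) {
    if (open.empty() || !isHeading(open.back()) || !isHeading(name)) return "";
    const std::string emitted = "</" + open.back() + ">";
    open.pop_back();
    return emitted;
}
```

Opening h1, then h2, then closing with h3 emits the h1 end tag before the h2 start tag and closes h2 at the end. The sketch only inspects the innermost open element, so an inline element left open inside the heading is not handled here.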

End tags in the wrong order are corrected:

            <p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?

is understood as:

            <p>here is a para <b>bold <i>bold italic</i></b><i> bold?</i> normal?

This time, when the parser encounters an inline-level end-of-bold tag instead of the expected inline-level end-of-italic tag, it walks up the DOM tree until it reaches the first block-level element. If it finds a matching open tag for the misplaced end tag, it closes every open tag up to and including the matching one, then reopens the other tags it closed on the way.

Warning:: This is not what HTML Tidy does. There is no right solution here, other than having the author write a correct document.
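The close-and-reopen strategy can be sketched as a small string-to-string function. This is only an illustration under stated assumptions: `reorderEndTags` and `isBlockLevel` are hypothetical names, tags carry no attributes, an end tag with no match below the first block-level element is dropped, and block-level end tags are outside the scope of the sketch.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Assumption for this sketch: a tiny, fixed set of block-level elements.
static bool isBlockLevel(const std::string& name) {
    static const std::set<std::string> names = {"p", "div", "h1", "h2", "h3"};
    return names.count(name) != 0;
}

std::string reorderEndTags(const std::string& html) {
    std::vector<std::string> open;  // names of the currently open elements
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') {  // plain text is copied through unchanged
            std::size_t next = html.find('<', pos);
            if (next == std::string::npos) next = html.size();
            out.append(html, pos, next - pos);
            pos = next;
            continue;
        }
        const std::size_t end = html.find('>', pos);
        if (end == std::string::npos) {  // unterminated tag: copy as text
            out.append(html, pos, std::string::npos);
            break;
        }
        const std::string tag = html.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty() || tag[0] != '/') {  // start tag
            open.push_back(tag);
            out += "<" + tag + ">";
            continue;
        }
        // End tag: walk up the open elements, stopping at the first block.
        const std::string name = tag.substr(1);
        std::size_t match = open.size();
        for (std::size_t i = open.size(); i-- > 0;) {
            if (open[i] == name) { match = i; break; }
            if (isBlockLevel(open[i])) break;
        }
        if (match == open.size()) continue;  // no match: drop the end tag
        // Remember what sits above the match, close down to the match,
        // then reopen the remembered elements in their original order.
        const std::vector<std::string> reopen(
            open.begin() + static_cast<std::ptrdiff_t>(match) + 1, open.end());
        for (std::size_t i = open.size(); i-- > match;) out += "</" + open[i] + ">";
        open.resize(match);
        for (const std::string& n : reopen) {
            open.push_back(n);
            out += "<" + n + ">";
        }
    }
    return out;
}
```

Fed the paragraph from the example above, the function closes the italic and bold elements at the misplaced end-of-bold tag and immediately reopens the italic element. Well-nested input passes through unchanged.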

Problems with heading emphasis are fixed:

                <h1><i>italic heading</h1>
                <p>new paragraph

is understood as:

                <h1><i>italic heading</i></h1>
                <p>new paragraph

This case is similar to the previous one. When a block-level end-of-h1 tag is found instead of the expected inline-level end-of-italic tag, the parser walks up the DOM tree until it reaches the first block-level element. It then closes the missing italic tag, followed by the h1 tag. There is no need to reopen the italic tag because the parser is leaving the containing block-level element.
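The block-level variant needs no reopening step, which makes it shorter. The sketch below works on a stack of open element names; `closeBlock` and `isBlockLevel` are hypothetical helpers, and ignoring an end tag that does not match the first block-level element is an assumption of the sketch.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Assumption for this sketch: a tiny, fixed set of block-level elements.
static bool isBlockLevel(const std::string& name) {
    static const std::set<std::string> names = {"p", "div", "h1", "h2", "h3"};
    return names.count(name) != 0;
}

// Handles the end tag of block-level element `name` and returns the end
// tags to emit. Inline elements still open inside the block are closed
// first and are not reopened, since the block itself is being left.
std::string closeBlock(std::vector<std::string>& open, const std::string& name) {
    // Walk up from the innermost element to the first block-level element.
    std::size_t i = open.size();
    while (i > 0 && !isBlockLevel(open[i - 1])) --i;
    if (i == 0 || open[i - 1] != name) return "";  // no matching block: ignored here
    std::string emitted;
    while (open.size() >= i) {  // pops the inline elements, then the block
        emitted += "</" + open.back() + ">";
        open.pop_back();
    }
    return emitted;
}
```

With h1 and i open, closing h1 emits the italic end tag followed by the h1 end tag and leaves the stack empty.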

Adding the missing "/" in end tags for anchors:

            <div><a href="#refs">References<a> below</div>

is understood as:

            <div><a href="#refs">References</a><a> below</a></div>

Anchors are not allowed inside anchors, nor spans inside spans, nor any inline-level element inside itself. Because of this, the open inline-level tag is closed and another one is opened. The new one is then closed when the end of the containing block-level element is encountered.
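The self-nesting rule can be sketched in a few lines on a stack of open element names. `openInline` is a hypothetical helper; attributes are left out, and only the innermost open element is compared, which is a simplification of this sketch.

```cpp
#include <string>
#include <vector>

// Opens inline element `name` and returns the markup to emit. If the
// innermost open element has the same name, it is closed first, because
// an inline-level element may not contain itself.
std::string openInline(std::vector<std::string>& open, const std::string& name) {
    std::string emitted;
    if (!open.empty() && open.back() == name) {
        emitted += "</" + name + ">";
        open.pop_back();
    }
    open.push_back(name);
    emitted += "<" + name + ">";
    return emitted;
}
```

With div and a open, opening a second a emits an end tag for the first anchor before the new start tag, so the stack depth does not grow.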


Public Types
enum	ParsingTime { InstantParsing = 0, DelayedParsing = 1 }
	The parsing time. More...
Public Member Functions
const HTML::DOMTree &	getDOMTree () const
	Get the DOM tree.
HTML::DOMTree &	getDOMTree ()
	Get the DOM tree.
const ParsingError &	getLastParsingError () const
	Return the last parsing error if any.
const unsigned char *	getUnboundedAccessToStream (const uint32 startPos)
	Get an unbounded memory version of source stream.
const ParsingError &	Parse ()
	Parse the given stream and update the DOM tree.
	Parser (Stream::InputStream &inputStreamRef, HTML::Elements::Allocators::BaseAllocator &allocatorRef, const ParsingTime &whenToParse=InstantParsing, const HTML::DTDType &dtd=HTML::StandardDTD)
	The parser constructor.
	~Parser ()
	The parser destructor.
Static Public Attributes
static const BuildRadix	radixElements
	The allowed elements radix shortcut.

Member Enumeration Documentation

enum HTML::Parser::ParsingTime

The parsing time.

Enumerator:

InstantParsing	Parse the stream when constructed.
DelayedParsing	Parse the stream only when accessed (or by calling the Parse() method).
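The difference between the two modes can be illustrated with a self-contained stand-in. `LazyDocument` below is hypothetical and does not use the real HTML::Parser; it only shows the pattern the enumeration suggests: parse in the constructor, or defer the work until the result is first requested.

```cpp
#include <string>
#include <utility>

// Mirrors the values of HTML::Parser::ParsingTime for this sketch only.
enum ParsingTime { InstantParsing = 0, DelayedParsing = 1 };

// Hypothetical stand-in for a parser that may defer its work.
class LazyDocument {
public:
    LazyDocument(std::string source, ParsingTime when) : source_(std::move(source)) {
        if (when == InstantParsing) parse();  // eager: work done on construction
    }

    // Accessing the result triggers parsing if it has not happened yet.
    const std::string& tree() {
        if (!parsed_) parse();
        return tree_;
    }

    // Explicit parsing, also usable to force a delayed parse.
    void parse() {
        tree_ = "tree(" + source_ + ")";  // placeholder for real parsing work
        parsed_ = true;
    }

    bool parsed() const { return parsed_; }

private:
    std::string source_;
    std::string tree_;
    bool parsed_ = false;
};
```

Delayed parsing is useful when a document may never be inspected, at the cost of paying for the parse on first access instead of at construction.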

Constructor & Destructor Documentation

HTML::Parser::Parser	(	Stream::InputStream &	inputStreamRef,
		HTML::Elements::Allocators::BaseAllocator &	allocatorRef,
		const ParsingTime &	whenToParse = `InstantParsing`,
		const HTML::DTDType &	dtd = `HTML::StandardDTD`
	)

The parser constructor.

Parameters:

inputStreamRef	The reference to the input stream (can be any stream; see also Stream::InputFileStream, Stream::InputSocketStream)
allocatorRef	The reference to the chosen allocator (can be any allocator; see also HTML::Elements::Allocators::BaseAllocator, HTML::Elements::Allocators::SimpleHeap)
whenToParse	When to parse the given stream (see also HTML::Parser::ParsingTime)
dtd	What DTD to use when parsing the input stream (see also HTML::DTDType)

HTML::Parser::~Parser ( ) [inline]

The parser destructor.

Member Function Documentation

const HTML::DOMTree& HTML::Parser::getDOMTree ( ) const [inline]

Get the DOM tree.

HTML::DOMTree& HTML::Parser::getDOMTree ( ) [inline]

Get the DOM tree.

const ParsingError& HTML::Parser::getLastParsingError ( ) const [inline]

Return the last parsing error if any.

const unsigned char* HTML::Parser::getUnboundedAccessToStream ( const uint32 startPos ) [inline]

Get an unbounded memory version of source stream.

Parameters:

startPos

The start position

Returns:: A (probably not aligned) pointer to the area specified or 0 if not accessible
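Because the returned pointer may be null and is probably not aligned, callers should avoid casting it to a wider type. A minimal sketch of a safe read follows; `readUnaligned32` is a hypothetical helper, and the caller is assumed to know that at least four bytes are available, since the access is unbounded.

```cpp
#include <cstdint>
#include <cstring>

// Reads a 32-bit value in host byte order from a pointer that may be null
// or unaligned. Returns false when the pointer is null.
// Assumption: the caller guarantees at least 4 readable bytes at `data`.
bool readUnaligned32(const unsigned char* data, std::uint32_t& value) {
    if (data == nullptr) return false;        // area not accessible
    std::memcpy(&value, data, sizeof value);  // well defined for any alignment
    return true;
}
```

Copying with `std::memcpy` sidesteps the undefined behaviour of dereferencing a misaligned `uint32_t` pointer, and compilers typically turn it into a single load where the hardware allows.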

const Parser::ParsingError & HTML::Parser::Parse ( )

Parse the given stream and update the DOM tree.

Returns:: A reference to the parsing error, as described in ParsingError

Member Data Documentation

const BuildRadix HTML::Parser::radixElements [static]

The allowed elements radix shortcut.
