HTML::Parser Class Reference

#include <Parser.hpp>



Detailed Description

The parser class. Instantiate it like this:

            // This must be private per thread, as the parser is not reentrant.
            // Get an input stream
            Stream::FileStream xFS("index.html");
            // Create a node allocator if not done yet (this can be as simple as the "new" based allocator)
            HTML::Elements::Allocators::SimpleHeap xAllocator;
            // Create the parser now 
            HTML::Parser xParser(xFS, xAllocator, HTML::Parser::InstantParsing, HTML::LooseDTD);
            // Get a reference to the DOM tree now
            HTML::DOMTree & xTree = xParser.getDOMTree();
            // Do whatever you want with the DOM tree

Moreover, because most documents on the web are badly structured, the parser performs some cleanup, as described below (source of the idea: http://www.w3.org/People/Raggett/tidy/):

  1. Missing or mismatched end tags are detected and corrected:
               <h1>heading
               <h2>subheading</h3>
            
    is understood as:
               <h1>heading</h1>
               <h2>subheading</h2>
            
    This only works for elements that are incorrectly placed, such as a block-level element (h2) inside a block-level element that allows only inline content (h1). In the previous example, if h3 had been opened before h1, then entering h1 would have closed it implicitly.

    This means that an h3 end tag should not be able to close h2, yet h2 is still closed here.

    Warning:
    This might not be what the author intended, but that is how other browsers handle it.
  2. End tags in the wrong order are corrected:
                <p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?
            
    is understood as:
                <p>here is a para <b>bold <i>bold italic</i></b><i> bold?</i> normal?
            
    This time, when the parser encounters an inline-level end-of-bold tag instead of the expected inline-level end-of-italic tag, it walks up the DOM tree to the first block-level element. If it finds a matching open tag for the misplaced end tag, it closes every open tag up to and including that matching open tag, then reopens the tags that were closed along the way.

    Warning:
    This is not what HTML Tidy does. There is no right solution here, other than having the author write a correct document.
  3. Problems with heading emphasis are fixed:
                    <h1><i>italic heading</h1>
                    <p>new paragraph
            
    is understood as:
                    <h1><i>italic heading</i></h1>
                    <p>new paragraph
            
    This case is similar to the previous one. When a block-level end-of-h1 tag is found instead of the expected inline-level end-of-italic tag, the parser walks up the DOM tree to the first block-level element. It then closes the missing italic tag, followed by the h1 tag. There is no need to reopen the italic tag because the parser is leaving the containing block-level element.

  4. Adding the missing "/" in end tags for anchors:
                <div><a href="#refs">References<a> below</div>
            
    is understood as:
                <div><a href="#refs">References</a><a> below</a></div>
            
    An anchor is not allowed inside an anchor, nor a span inside a span, nor any inline-level element inside itself. Because of this, the open inline-level tag is closed and another one is opened. That one is then closed when the end tag of the containing block-level element is encountered.
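
The repairs described in cases 2 to 4 above can be sketched with a stack of open elements. This is a minimal illustration under stated assumptions, not the parser's actual implementation: the `Token` type, the `isBlockLevel` table and the `repair` function are hypothetical names introduced here, attributes are ignored, and silently dropping an end tag that has no matching open tag is an assumption. Case 1 (implicit closing driven by the DTD) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical token model; these are not the types used by HTML::Parser.
struct Token {
    enum Kind { Open, Close, Text } kind;
    std::string value; // tag name for Open/Close, character data for Text
};

// Assumed classification; a real parser would take this from the chosen DTD.
static bool isBlockLevel(const std::string& tag) {
    static const std::set<std::string> blocks = {
        "div", "p", "h1", "h2", "h3", "h4", "h5", "h6"};
    return blocks.count(tag) != 0;
}

// Returns the repaired markup as text. Elements still open at the end of the
// input are left open, as in the examples above.
std::string repair(const std::vector<Token>& tokens) {
    std::vector<std::string> open; // currently open elements, innermost last
    std::string out;
    for (const Token& t : tokens) {
        switch (t.kind) {
        case Token::Text:
            out += t.value;
            break;
        case Token::Open:
            // Case 4: an inline element may not contain itself, so the
            // innermost copy is closed before the new one is opened.
            if (!isBlockLevel(t.value) && !open.empty() && open.back() == t.value) {
                out += "</" + t.value + ">";
                open.pop_back();
            }
            out += "<" + t.value + ">";
            open.push_back(t.value);
            break;
        case Token::Close: {
            // Walk up from the innermost element to the first block-level
            // element (inclusive), looking for the matching open tag.
            std::size_t match = open.size(); // open.size() means "not found"
            for (std::size_t i = open.size(); i-- > 0;) {
                if (open[i] == t.value) { match = i; break; }
                if (isBlockLevel(open[i])) break;
            }
            if (match == open.size()) break; // assumption: drop unmatched end tag
            assert(match < open.size());
            // Remember what sits inside the match, outermost first.
            std::vector<std::string> closed;
            for (std::size_t i = match + 1; i < open.size(); ++i) closed.push_back(open[i]);
            // Close innermost first, ending with the match itself.
            for (std::size_t i = open.size(); i-- > match;) out += "</" + open[i] + ">";
            open.resize(match);
            // Case 2: after an inline end tag, reopen what was closed early.
            // Case 3: after a block-level end tag, nothing is reopened.
            if (!isBlockLevel(t.value)) {
                for (const std::string& tag : closed) {
                    out += "<" + tag + ">";
                    open.push_back(tag);
                }
            }
            break;
        }
        }
    }
    return out;
}
```

Feeding this sketch the token sequences of cases 2, 3 and 4 (without the href attribute) yields the corrected forms listed for those cases.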


Public Types

enum  ParsingTime { InstantParsing = 0, DelayedParsing = 1 }
 The parsing time.

Public Member Functions

const HTML::DOMTree & getDOMTree () const
 Get the DOM tree.
HTML::DOMTree & getDOMTree ()
 Get the DOM tree.
const ParsingError & getLastParsingError () const
 Return the last parsing error if any.
const unsigned char * getUnboundedAccessToStream (const uint32 startPos)
 Get an unbounded memory version of source stream.
const ParsingError & Parse ()
 Parse the given stream and update the DOM tree.
 Parser (Stream::InputStream &inputStreamRef, HTML::Elements::Allocators::BaseAllocator &allocatorRef, const ParsingTime &whenToParse=InstantParsing, const HTML::DTDType &dtd=HTML::StandardDTD)
 The parser constructor.
 ~Parser ()
 The parser destructor.

Static Public Attributes

static const BuildRadix radixElements
 The allowed elements radix shortcut.


Member Enumeration Documentation

enum HTML::Parser::ParsingTime

The parsing time.

Enumerator:
InstantParsing  Parse the stream when constructed.
DelayedParsing  Parse the stream only when accessed (or by calling the Parse() method).
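
The difference between the two enumerators can be illustrated with a small stand-in class. This is a sketch only: `LazyParserSketch`, `parse()`, `tree()` and `parseCount()` are hypothetical names, the "tree" is a placeholder string, and treating the tree accessor as the access that triggers delayed parsing is an assumption. Only the enumerator names and values come from the documentation above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in used only to illustrate the two ParsingTime values.
class LazyParserSketch {
public:
    enum ParsingTime { InstantParsing = 0, DelayedParsing = 1 };

    explicit LazyParserSketch(const std::string& source,
                              ParsingTime whenToParse = InstantParsing)
        : source_(source), parsed_(false), parseCount_(0) {
        // InstantParsing: the work happens in the constructor.
        if (whenToParse == InstantParsing) parse();
    }

    // Explicit request, in the spirit of Parse().
    void parse() {
        tree_ = "tree(" + source_ + ")"; // placeholder for real DOM construction
        parsed_ = true;
        ++parseCount_;
    }

    // Accessor in the spirit of getDOMTree(): parses on first use if needed.
    const std::string& tree() {
        if (!parsed_) parse();
        assert(parsed_);
        return tree_;
    }

    // Number of times parse() has run; present only to make the timing visible.
    int parseCount() const { return parseCount_; }

private:
    std::string source_;
    std::string tree_;
    bool parsed_;
    int parseCount_;
};
```

In this sketch a delayed instance does no work at construction, parses once on the first call to `tree()`, and reuses the result afterwards.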


Constructor & Destructor Documentation

HTML::Parser::Parser ( Stream::InputStream & inputStreamRef,
HTML::Elements::Allocators::BaseAllocator & allocatorRef,
const ParsingTime & whenToParse = InstantParsing,
const HTML::DTDType & dtd = HTML::StandardDTD
)

The parser constructor.

Parameters:
inputStreamRef The reference to the input stream (can be any stream; see also Stream::InputFileStream, Stream::InputSocketStream)
allocatorRef The reference to the chosen allocator (can be any allocator; see also HTML::Elements::Allocators::BaseAllocator, HTML::Elements::Allocators::SimpleHeap)
whenToParse When to parse the given stream (see also HTML::Parser::ParsingTime)
dtd Which DTD to use when parsing the input stream (see also HTML::DTDType)

HTML::Parser::~Parser (  )  [inline]

The parser destructor.


Member Function Documentation

const HTML::DOMTree& HTML::Parser::getDOMTree (  )  const [inline]

Get the DOM tree.

HTML::DOMTree& HTML::Parser::getDOMTree (  )  [inline]

Get the DOM tree.

const ParsingError& HTML::Parser::getLastParsingError (  )  const [inline]

Return the last parsing error if any.

const unsigned char* HTML::Parser::getUnboundedAccessToStream ( const uint32  startPos  )  [inline]

Get an unbounded memory version of source stream.

Parameters:
startPos The start position
Returns:
A pointer (probably not aligned) to the specified area, or 0 if it is not accessible
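
Because the returned pointer may be null and is probably not aligned, a caller should check it and avoid casting it to a wider integer type. The helper below is a sketch of one safe way to read a multi-byte value from such a pointer. `readUint32` is a hypothetical name, not part of HTML::Parser, and since the access is unbounded the caller is assumed to know that at least 4 bytes are valid at that position.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not part of HTML::Parser.
// Copies 4 bytes starting at p into out, in host byte order.
// Returns false when p is null (the "not accessible" case).
// The caller must guarantee that 4 bytes are readable at p.
bool readUint32(const unsigned char* p, std::uint32_t& out) {
    if (p == 0) return false;
    // memcpy is valid for any alignment, unlike dereferencing a cast pointer.
    std::memcpy(&out, p, sizeof out);
    return true;
}
```

A call such as `readUint32(xParser.getUnboundedAccessToStream(pos), value)` would then fail cleanly when the stream cannot be accessed as memory.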

const Parser::ParsingError & HTML::Parser::Parse (  ) 

Parse the given stream and update the DOM tree.

Returns:
A reference to the parsing error, as described in ParsingError


Member Data Documentation

const BuildRadix HTML::Parser::radixElements [static]

The allowed elements radix shortcut.


(C) An X-Ryl669 project 2007

This document describes Unlimited Zooming Interface source code. UZI stands for Unlimited Zooming Interface, and source code license is