NAME

XML::DOM - A Perl module for building DOM Level 1 compliant document
structures

----------------------------------------------------------------------------

SYNOPSIS

 use XML::DOM;

 my $parser = XML::DOM::Parser->new;
 my $doc = $parser->parsefile ("file.xml");

 # print all HREF attributes of all CODEBASE elements
 my $nodes = $doc->getElementsByTagName ("CODEBASE");
 my $n = $nodes->getLength;

 for (my $i = 0; $i < $n; $i++)
 {
     my $node = $nodes->item ($i);
     my $href = $node->getAttributeNode ("HREF");
     # getAttributeNode returns undef when the element has no HREF attribute
     print $href->getValue . "\n" if defined $href;
 }

 # Print doc file
 $doc->printToFile ("out.xml");

 # Print to string
 print $doc->toString;

 # Avoid memory leaks - cleanup circular references for garbage collection
 $doc->dispose;

----------------------------------------------------------------------------

DESCRIPTION

This module extends the XML::Parser module by Clark Cooper. The XML::Parser
module is built on top of XML::Parser::Expat, which is a lower level
interface to James Clark's expat library.

XML::DOM::Parser is derived from XML::Parser. It parses XML strings or files
and builds a data structure that conforms to the API of the Document Object
Model as described at http://www.w3.org/TR/REC-DOM-Level-1. See the
XML::Parser manpage for other available features of the XML::DOM::Parser
class. Note that the 'Style' property should not be used (it is set
internally).

The XML::Parser NoExpand option is more or less supported, in that it will
generate EntityReference objects whenever an entity reference is encountered
in character data. I'm not sure how useful this is. Any comments are
welcome.

As described in the synopsis, when you create an XML::DOM::Parser object,
the parse and parsefile methods create an XML::DOM::Document object from the
specified input. This Document object can then be examined, modified and
written back out to a file or converted to a string.
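
The parse method works the same way on a string. The following is a minimal sketch of the parse, examine, modify and print cycle described above; the sample XML and the element and attribute names in it are made up for illustration.

```perl
use strict;
use warnings;
use XML::DOM;

# Parse from a string instead of a file (sample document is an assumption).
my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse('<config><item name="a"/></config>');

# Examine: read an attribute of the first <item> element.
my $items = $doc->getElementsByTagName('item');
my $item  = $items->item(0);
my $name  = $item->getAttribute('name');

# Modify: change the attribute and append a new child element.
$item->setAttribute('name', 'b');
my $extra = $doc->createElement('extra');
$extra->addText('hello');
$doc->getDocumentElement->appendChild($extra);

# Convert the root element back to a string.
my $xml = $doc->getDocumentElement->toString;
print "original name: $name\n";
print "$xml\n";

# Break circular references so the document can be garbage collected.
$doc->dispose;
```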

When using XML::DOM with XML::Parser version 2.19 and up, setting the
XML::DOM::Parser option KeepCDATA to 1 will store CDATASections in
CDATASection nodes, instead of converting them to Text nodes. Adjacent
CDATASection nodes will be merged into one. Let me know if this is a
problem.
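
As a rough illustration of the KeepCDATA option, the sketch below parses the same (made-up) input twice and compares the type of the first child of the root element. It assumes an XML::Parser version that supports the option, as noted above.

```perl
use strict;
use warnings;
use XML::DOM;

my $xml = '<doc><![CDATA[1 < 2]]></doc>';

# Default: the CDATA section is converted to an ordinary Text node.
my $plain      = XML::DOM::Parser->new->parse($xml);
my $plain_type = $plain->getDocumentElement->getFirstChild->getNodeType;

# KeepCDATA => 1: the section is stored as a CDATASection node.
my $kept      = XML::DOM::Parser->new(KeepCDATA => 1)->parse($xml);
my $kept_type = $kept->getDocumentElement->getFirstChild->getNodeType;

print "default:   ",
      ($plain_type == TEXT_NODE ? "Text" : "other"), "\n";
print "KeepCDATA: ",
      ($kept_type == CDATA_SECTION_NODE ? "CDATASection" : "other"), "\n";

$_->dispose for $plain, $kept;
```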

When using XML::Parser 2.27 and above, you can suppress expansion of
parameter entity references (e.g. %pent;) in the DTD, by setting
ParseParamEnt to 1 and ExpandParamEnt to 0. See Hidden Nodes for details.

A Document has a tree structure consisting of Node objects. A Node may
contain other nodes, depending on its type. A Document may have Element,
Text, Comment, and CDATASection nodes. Element nodes may have Attr, Element,
Text, Comment, and CDATASection nodes. The other nodes may not have any
child nodes.

This module adds several node types that are not part of the DOM spec (yet).
These are: ElementDecl (for <!ELEMENT ...> declarations), AttlistDecl (for
<!ATTLIST ...> declarations), XMLDecl (for <?xml ...?> declarations) and
AttDef (for attribute definitions in an AttlistDecl).

----------------------------------------------------------------------------

XML::DOM Classes

The XML::DOM module stores XML documents in a tree structure with a root
node of type XML::DOM::Document. Different nodes in the tree represent different
parts of the XML file. The DOM Level 1 Specification defines the following
node types:

   * XML::DOM::Node - Super class of all node types
   * XML::DOM::Document - The root of the XML document
   * XML::DOM::DocumentType - Describes the document structure: <!DOCTYPE
     root [ ... ]>
   * XML::DOM::Element - An XML element: <elem attr="val"> ... </elem>
   * XML::DOM::Attr - An XML element attribute: name="value"
   * XML::DOM::CharacterData - Super class of Text, Comment and CDATASection
   * XML::DOM::Text - Text in an XML element
   * XML::DOM::CDATASection - Escaped block of text: <![CDATA[ text ]]>
   * XML::DOM::Comment - An XML comment: <!-- comment -->
   * XML::DOM::EntityReference - Refers to an ENTITY: &ent; or %ent;
   * XML::DOM::Entity - An ENTITY definition: <!ENTITY ...>
   * XML::DOM::ProcessingInstruction - A processing instruction: <?target data?>
   * XML::DOM::DocumentFragment - Lightweight node for cut & paste
   * XML::DOM::Notation - A NOTATION definition: <!NOTATION ...>

In addition, the XML::DOM module contains the following nodes that are not
part of the DOM Level 1 Specification:

   * XML::DOM::ElementDecl - Defines an element: <!ELEMENT ...>
   * XML::DOM::AttlistDecl - Defines one or more attributes in an <!ATTLIST
     ...>
   * XML::DOM::AttDef - Defines one attribute in an <!ATTLIST ...>
   * XML::DOM::XMLDecl - An XML declaration: <?xml version="1.0" ...>

Other classes that are part of the DOM Level 1 Spec:

   * XML::DOM::Implementation - Provides information about this
     implementation. Currently it doesn't do much.
   * XML::DOM::NodeList - Used internally to store a node's child nodes.
     Also returned by getElementsByTagName.
   * XML::DOM::NamedNodeMap - Used internally to store an element's
     attributes.

Other classes that are not part of the DOM Level 1 Spec:

   * XML::DOM::Parser - A non-validating XML parser that creates
     XML::DOM::Documents
   * XML::DOM::ValParser - A validating XML parser that creates
     XML::DOM::Documents. It uses XML::Checker to check against the
     DocumentType (DTD)
   * XML::Handler::BuildDOM - A PerlSAX handler that creates
     XML::DOM::Documents.

----------------------------------------------------------------------------

XML::DOM package

Constant definitions
     The following predefined constants identify the type of a node.

 UNKNOWN_NODE (0)                The node type is unknown (not part of DOM)

 ELEMENT_NODE (1)                The node is an Element.
 ATTRIBUTE_NODE (2)              The node is an Attr.
 TEXT_NODE (3)                   The node is a Text node.
 CDATA_SECTION_NODE (4)          The node is a CDATASection.
 ENTITY_REFERENCE_NODE (5)       The node is an EntityReference.
 ENTITY_NODE (6)                 The node is an Entity.
 PROCESSING_INSTRUCTION_NODE (7) The node is a ProcessingInstruction.
 COMMENT_NODE (8)                The node is a Comment.
 DOCUMENT_NODE (9)               The node is a Document.
 DOCUMENT_TYPE_NODE (10)         The node is a DocumentType.
 DOCUMENT_FRAGMENT_NODE (11)     The node is a DocumentFragment.
 NOTATION_NODE (12)              The node is a Notation.

 ELEMENT_DECL_NODE (13)          The node is an ElementDecl (not part of DOM)
 ATT_DEF_NODE (14)               The node is an AttDef (not part of DOM)
 XML_DECL_NODE (15)              The node is an XMLDecl (not part of DOM)
 ATTLIST_DECL_NODE (16)          The node is an AttlistDecl (not part of DOM)

 Usage:

   if ($node->getNodeType == ELEMENT_NODE)
   {
       print "It's an Element";
   }

Not In DOM Spec: The DOM Spec does not mention UNKNOWN_NODE and, quite
frankly, you should never encounter it. The last 4 node types were added to
support the 4 added node classes.
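
The constants are handy when walking a tree. Here is a small sketch that tallies the direct children of the root element by node type; the input document and the %count hash are only illustrations.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc = XML::DOM::Parser->new->parse('<a>hi<!-- note --><b/><c/></a>');

# Tally the direct children of the root element by node type.
# In list context getChildNodes returns a regular Perl list.
my %count = (element => 0, text => 0, comment => 0, other => 0);
for my $kid ($doc->getDocumentElement->getChildNodes) {
    my $type = $kid->getNodeType;
    if    ($type == ELEMENT_NODE) { $count{element}++ }
    elsif ($type == TEXT_NODE)    { $count{text}++ }
    elsif ($type == COMMENT_NODE) { $count{comment}++ }
    else                          { $count{other}++ }
}

print "$_: $count{$_}\n" for sort keys %count;

$doc->dispose;
```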

Global Variables

$VERSION
     The variable $XML::DOM::VERSION contains the version number of this
     implementation, e.g. "1.27".

METHODS

These methods are not part of the DOM Level 1 Specification.

getIgnoreReadOnly and ignoreReadOnly (readOnly)
     The DOM Level 1 Spec does not allow you to edit certain sections of the
     document, e.g. the DocumentType, so by default this implementation
     throws DOMExceptions (i.e. NO_MODIFICATION_ALLOWED_ERR) when you try to
     edit a readonly node. These readonly checks can be disabled by
     (temporarily) setting the global IgnoreReadOnly flag.

     The ignoreReadOnly method sets the global IgnoreReadOnly flag and
     returns its previous value. The getIgnoreReadOnly method simply returns
     its current value.

      my $oldIgnore = XML::DOM::ignoreReadOnly (1);
      eval {
      # ... do whatever you want, catching any other exceptions ...
      };
      XML::DOM::ignoreReadOnly ($oldIgnore);     # restore previous value

     Another way to do it, using a local variable:

      { # start new scope
         local $XML::DOM::IgnoreReadOnly = 1;
         # ... do whatever you want, don't worry about exceptions ...
      } # end of scope ($IgnoreReadOnly is set back to its previous value)


isValidName (name)
     Whether the specified name is a valid "Name" as specified in the XML
     spec. Characters with Unicode values > 127 are now also supported.
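
     A quick sketch of how isValidName might be used to screen names before
     creating nodes. The candidate names are arbitrary examples.

```perl
use strict;
use warnings;
use XML::DOM;

# Check a few candidate names against the XML "Name" production.
my %valid;
for my $candidate ('title', 'my-tag', '1st', 'two words') {
    $valid{$candidate} = XML::DOM::isValidName($candidate) ? 1 : 0;
    print "'$candidate' is ", ($valid{$candidate} ? "valid" : "invalid"), "\n";
}
```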

getAllowReservedNames and allowReservedNames (boolean)
     The first method returns whether reserved names are allowed. The second
     takes a boolean argument and sets whether reserved names are allowed.
     The initial value is 1 (i.e. reserved names are allowed).

     The XML spec states that "Names" starting with (X|x)(M|m)(L|l) are
     reserved for future use. (Amusingly enough, the XML version of the XML
     spec (REC-xml-19980210.xml) breaks that very rule by defining an ENTITY
     with the name 'xmlpio'.) A "Name" in this context means the Name token
     as found in the BNF rules in the XML spec.

     XML::DOM only checks for errors when you modify the DOM tree, not when
     the DOM tree is built by the XML::DOM::Parser.
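
     The sketch below shows one way to switch the check on temporarily while
     editing a document. It assumes that createElement raises an exception
     for a rejected name; the tag name 'xmlThing' is made up.

```perl
use strict;
use warnings;
use XML::DOM;

my $doc = XML::DOM::Document->new;
my $old = XML::DOM::getAllowReservedNames();   # remember current setting

# Disallow names that start with "xml" (in any case) while editing.
XML::DOM::allowReservedNames(0);

# The eval catches the exception that is expected for a reserved name.
my $elem     = eval { $doc->createElement('xmlThing') };
my $rejected = defined $elem ? 0 : 1;
print "createElement('xmlThing') ",
      ($rejected ? "was rejected" : "succeeded"), "\n";

XML::DOM::allowReservedNames($old);            # restore previous setting
```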

setTagCompression (funcref)
     There are 3 possible styles for printing empty Element tags:

     Style 0

      <empty/> or <empty attr="val"/>

          XML::DOM uses this style by default for all Elements.

     Style 1

       <empty></empty> or <empty attr="val"></empty>

     Style 2

       <empty /> or <empty attr="val" />

          This style is sometimes desired when using XHTML. (Note the extra
          space before the slash "/") See http://www.w3.org/TR/xhtml1
          Appendix C for more details.

     By default XML::DOM compresses all empty Element tags (style 0). You
     can control which style is used for a particular Element by calling
     XML::DOM::setTagCompression with a reference to a function that takes 2
     arguments. The first is the tag name of the Element, the second is the
     XML::DOM::Element that is being printed. The function should return 0,
     1 or 2 to indicate which style should be used to print the empty tag.
     E.g.

      XML::DOM::setTagCompression (\&my_tag_compression);

      sub my_tag_compression
      {
         my ($tag, $elem) = @_;

         # Print empty br, hr and img tags like this: <br />
         return 2 if $tag =~ /^(br|hr|img)$/;

         # Print other empty tags like this: <empty></empty>
         return 1;
      }

----------------------------------------------------------------------------

IMPLEMENTATION DETAILS

* Perl Mappings
     The value undef is used wherever the DOM Spec says null.

     The DOM Spec says: Applications must encode DOMString using UTF-16
     (defined in Appendix C.3 of [UNICODE] and Amendment 1 of [ISO-10646]).
     In this implementation we use plain old Perl strings encoded in UTF-8
     instead of UTF-16.

* Text and CDATASection nodes
     The Expat parser expands EntityReferences and CDATASection sections to
     raw strings and does not indicate where they were found. By default,
     this implementation therefore converts both to Text nodes at parse time
     (see the NoExpand and KeepCDATA options in the DESCRIPTION).
     CDATASection and EntityReference nodes that are added to an existing
     Document (by the user) will be preserved.

     Also, adjacent Text nodes are always merged at parse time. Text nodes
     that are added later can be merged with the normalize method. Consider
     using the addText method when adding Text nodes.
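
     To illustrate the difference, here is a small sketch that adds Text
     nodes by hand, merges them with normalize, and then uses addText. The
     child_count helper is a hypothetical convenience, not part of XML::DOM.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper: number of direct children of a node.
sub child_count {
    my ($node) = @_;
    my @kids = $node->getChildNodes;
    return scalar @kids;
}

my $doc  = XML::DOM::Document->new;
my $elem = $doc->createElement('p');
$doc->appendChild($elem);

# Appending Text nodes one by one leaves two adjacent Text children.
$elem->appendChild($doc->createTextNode('Hello, '));
$elem->appendChild($doc->createTextNode('world'));
my $before = child_count($elem);

# normalize merges adjacent Text nodes into one.
$elem->normalize;
my $after = child_count($elem);

# addText appends to the last child when that child is a Text node.
$elem->addText('!');
my $final = child_count($elem);
my $text  = $elem->getFirstChild->getData;

print "children: $before, then $after, then $final\n";
print "text: $text\n";

$doc->dispose;
```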

* Printing and toString
     When printing (and converting an XML Document to a string) the strings
     have to be encoded differently depending on where they occur. E.g. in a
     CDATASection all substrings are allowed except for "]]>". In regular
     text, certain characters are not allowed, e.g. "<" has to be converted
     to "&lt;". These routines should be verified by someone who knows the
     details.
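
     A short sketch of the difference, printing the same (arbitrary) string
     once as a Text node and once as a CDATASection:

```perl
use strict;
use warnings;
use XML::DOM;

my $doc = XML::DOM::Document->new;
my $raw = 'a < b & c';

# In a Text node, markup characters are escaped when printing.
my $text_out = $doc->createTextNode($raw)->toString;

# In a CDATASection, the same string is written out unescaped.
my $cdata_out = $doc->createCDATASection($raw)->toString;

print "$text_out\n";
print "$cdata_out\n";
```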

* Quotes
     Certain sections in XML are quoted, like attribute values in an
     Element. XML::Parser strips these quotes and the print methods in this
     implementation always use double quotes, so when parsing and printing
     a document, single quotes may be converted to double quotes. The
     default value of an attribute definition (AttDef) in an AttlistDecl,
     however, will maintain its quotes.
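
     For example, this sketch parses an attribute written with single quotes
     and prints the element again. The input is made up.

```perl
use strict;
use warnings;
use XML::DOM;

# The attribute value is single-quoted in the input.
my $doc = XML::DOM::Parser->new->parse("<a x='1'/>");

# When printed, the attribute value is written with double quotes.
my $out = $doc->getDocumentElement->toString;
print "$out\n";

$doc->dispose;
```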

* AttlistDecl
     Attribute declarations for a certain Element are always merged into a
     single AttlistDecl object.

* Comments
     Comments in the DOCTYPE section are not kept in the right place. They
     will become child nodes of the Document.

* Hidden Nodes
     Previous versions of XML::DOM would expand parameter entity references
     (like %pent;), so when printing the DTD, it would print the contents of
     the external entity, instead of the parameter entity reference. With
     this release (1.27), you can prevent this by setting the
     XML::DOM::Parser options ParseParamEnt => 1 and ExpandParamEnt => 0.

     When it is parsing the contents of the external entities, it *DOES*
     still add the nodes to the DocumentType, but it marks these nodes by
     setting the 'Hidden' property. In addition, it adds an EntityReference
     node to the DocumentType node.

     When printing the DocumentType node (or when using to_expat() or
     to_sax()), the 'Hidden' nodes are suppressed, so you will see the
     parameter entity reference instead of the contents of the external
     entities. See test case t/dom_extent.t for an example.

     The reason for adding the 'Hidden' nodes to the DocumentType node, is
     that the nodes may contain <!ENTITY> definitions that are referenced
     further in the document. (Simply not adding the nodes to the
     DocumentType could cause such entity references to be expanded
     incorrectly.)

     Note that you need XML::Parser 2.27 or higher for this to work
     correctly.

----------------------------------------------------------------------------

SEE ALSO

The Japanese version of this document by Takanori Kawai (Hippo2000) at
http://member.nifty.ne.jp/hippo2000/perltips/xml/dom.htm

The DOM Level 1 specification at http://www.w3.org/TR/REC-DOM-Level-1

The XML spec (Extensible Markup Language 1.0) at
http://www.w3.org/TR/REC-xml

The XML::Parser and XML::Parser::Expat manual pages.

----------------------------------------------------------------------------

CAVEATS

The method getElementsByTagName() does not return a "live" NodeList. Whether
this is an actual caveat is debatable, but a few people on the www-dom
mailing list seemed to think so. I haven't decided yet. It's a pain to
implement, it slows things down and the benefits seem marginal. Let me know
what you think.
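
To make the caveat concrete, here is a small sketch (with a made-up document) showing that a NodeList obtained earlier does not pick up elements added afterwards, so the query has to be repeated:

```perl
use strict;
use warnings;
use XML::DOM;

my $doc = XML::DOM::Parser->new->parse('<list><item/></list>');

# Take a NodeList, then add another matching element to the tree.
my $items  = $doc->getElementsByTagName('item');
my $before = $items->getLength;
$doc->getDocumentElement->appendChild($doc->createElement('item'));

# The old list is a snapshot; a new query sees the added element.
my $stale   = $items->getLength;
my $requery = $doc->getElementsByTagName('item');
my $fresh   = $requery->getLength;

print "before: $before, old list: $stale, new query: $fresh\n";

$doc->dispose;
```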

(To subscribe to the www-dom mailing list send an email with the subject
"subscribe" to www-dom-request@w3.org. I only look here occasionally, so
don't send bug reports or suggestions about XML::DOM to this list, send them
to enno@att.com instead.)

----------------------------------------------------------------------------

AUTHOR

Send bug reports, hints, tips, suggestions to Enno Derksen at
<enno@att.com>.

Thanks to Clark Cooper for his help with the initial version.

----------------------------------------------------------------------------
Last updated: Wed Feb 23 13:37:18 2000