
Differences Between HTML and XHTML

This page is currently being revised. Some information is incomplete or missing. The information here is based on the current spec for (X)HTML5; some of the issues technically do not apply to previous versions of HTML. Although HTML and XHTML appear similar in syntax, they differ significantly in many ways. Note: As the current WHATWG document is a draft, this section will need to track a moving target.

Overlap Language
There is a community who find it valuable to be able to serve HTML5 documents which are also valid XML documents. They may, for example, use XML tools to generate the document, and they and others may process the document using XML tools. These documents are served as text/html. This language is sometimes called "polyglot". It is the overlap language of documents which are both HTML5 documents and XML documents. Guidelines are listed below for how one can construct such a polyglot document which will work in either environment. Besides following the well-formedness rules of XML, there are some other restrictions to which one must adhere (for the sake of text/html documents). This wiki web page is an example of such a document. You can parse it with an XML parser or an HTML parser.
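As a rough illustration of the "parse it with either parser" claim, the following Python sketch feeds one small polyglot-style document to an XML parser and to an HTML tokenizer and compares the element names each one reports. The sample document and the helper names (POLYGLOT, TagCollector, xml_tags, html_tags) are made up for this example, and Python's html.parser is only a tokenizer, not a conforming HTML5 parser, so this is a sanity check under those assumptions rather than a validation tool.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample document written to the polyglot guidelines.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n'
    '<head><meta charset="UTF-8" /><title>Polyglot sample</title></head>\n'
    '<body><p>Line one<br />line two</p></body>\n'
    '</html>\n'
)


class TagCollector(HTMLParser):
    """Records every start tag reported by Python's HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <br />; record them once.
        self.tags.append(tag)


def xml_tags(text):
    """Element names in document order, as seen by an XML parser."""
    root = ET.fromstring(text)
    # ElementTree reports names as '{namespace}local'; keep the local part.
    return [el.tag.rsplit('}', 1)[-1] for el in root.iter()]


def html_tags(text):
    """Start-tag names in document order, as seen by the HTML tokenizer."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

If both functions return the same list for a document, the two parsers at least agree on its element structure; a well-formedness error would instead surface as an ET.ParseError from xml_tags.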

MIME Types
MIME type
HTML requirement: Must use text/html.
XHTML requirement: Must use an XML MIME type, such as application/xml or application/xhtml+xml.
Notes: It is the MIME type that determines what type of document you are using. Any document, including a document authored with the intention of being XHTML, served as text/html is technically an HTML document. Note that XHTML 1.0 previously defined that documents adhering to the compatibility guidelines were allowed to be served as text/html, but HTML 5 now defines that such documents are HTML, not XHTML.
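The rule that the MIME type, not the author's intent, selects the syntax can be sketched as a small dispatch function. The function name syntax_for is hypothetical, and treating text/xml and any type ending in "+xml" as XML MIME types is an assumption that goes slightly beyond the two examples named above.

```python
def syntax_for(content_type):
    """Return 'html' or 'xml' for the value of a Content-Type header.

    Assumption: besides application/xml, the types text/xml and anything
    ending in '+xml' (such as application/xhtml+xml) count as XML types.
    """
    # Drop parameters such as '; charset=utf-8' and normalize the case.
    mime = content_type.split(';', 1)[0].strip().lower()
    if mime == 'text/html':
        return 'html'
    if mime in ('application/xml', 'text/xml') or mime.endswith('+xml'):
        return 'xml'
    raise ValueError('not an HTML or XML media type: ' + content_type)
```

Under this sketch a document written as XHTML but served with "text/html; charset=utf-8" is classified as HTML, which is the point made in the notes above.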

Syntax and Parsing


XHTML uses XML parsing requirements. HTML uses its own, which are defined much more closely to the way browsers actually handle HTML today. The following table describes the differences between how each is parsed. The "Guidance for XHTML-HTML compatibility" entry for each feature lists ways in which a document can be crafted to work in either XHTML or HTML. The item will be bolded if it is a requirement for XHTML-compliant code to be changed, since XHTML will otherwise usually work as HTML, at least if its full features are constrained.

Parsing modes
HTML requirement: Three parsing modes are defined: no quirks mode, quirks mode and limited quirks mode. The mode is only ever changed from the default by the HTML parser, based on the presence, absence, or value of the DOCTYPE string.
XHTML requirement: XML parsing rules are used. There is only one mode.
Notes: The parsing modes in HTML also have an effect upon script and stylesheet processing. XHTML is considered to be in no quirks mode for these purposes.
Guidance for XHTML-HTML compatibility: Use an explicit <!DOCTYPE html> (case insensitively) or the legacy-compat version <!DOCTYPE html SYSTEM "about:legacy-compat"> for the sake of HTML and thus trigger no quirks parsing.

Error handling
HTML requirement: HTML does not have a well-formedness constraint; no errors are fatal. Graceful error handling and recovery procedures are thoroughly defined.
XHTML requirement: Well-formedness errors are fatal.
Guidance for XHTML-HTML compatibility: Ensure there are no well-formedness errors.

Character encoding (including XML declaration, meta)
HTML requirement: The XML declaration is forbidden (treated as a bogus comment, but such style of comments are deprecated), but the meta element with a charset attribute may be used instead. If the encoding is unspecified in HTML, it should be determined through implementation-specific heuristics or fall back to a default value (Note: this section of the spec is not yet finished).
XHTML requirement: The XML declaration may be used to specify the character encoding, while meta is only allowed as case-insensitive "UTF-8" (and is ignored if included). The default character encoding for XHTML is, according to XML rules, UTF-8 or UTF-16.
Guidance for XHTML-HTML compatibility: If you need to include XML 1.1-only markup, if you do not wish to convert the encoding of the document to UTF-8 or UTF-16 (since use of other encodings also requires a declaration), or if you wish to define an external SYSTEM DTD in the DOCTYPE but use standalone=yes (redundant?), you must use an XML declaration for XHTML, but this may not be allowable in the future in HTML. For future compatibility, it would be best to avoid XML 1.1-only markup, convert to UTF-8 or UTF-16 (probably UTF-8, which could allow use of a meta tag), and avoid use of a SYSTEM DTD (rendering the standalone=yes unnecessary), respectively. Do not use a meta tag, unless it is UTF-8 (and included in the first 512 bytes of the document), in which case it is probably a good idea to include it for the sake of HTML (as <meta charset="UTF-8">) in case you cannot specify such in a content header.

Namespaced elements
HTML requirement: Elements and attributes for known vocabularies (HTML, SVG and MathML) are implicitly assigned to appropriate namespaces, according to the rules specified in the parsing algorithm. Elements in the HTML, SVG, or MathML namespaces may have an xmlns attribute explicitly specified, if, and only if, it has the exact value "http://www.w3.org/1999/xhtml" (see namespace declaration). The attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the xmlns attribute itself ends up in no namespace. Foreign elements are also not treated as being in another namespace and will have no effect except for displaying by default as inline elements (and be aware that self-closing elements cannot be used as such since unrecognized elements will be treated as though they are non-void; thus one cannot, for example, type <caesura /> in HTML or it will be treated as though there is no immediate closing tag). Namespaced prefixes are not allowed on HTML elements; a prefixed xmlns attribute cannot be used even if it is defined in the XHTML namespace.
XHTML requirement: The XHTML namespace must be declared for HTML elements according to the rules defined by the Namespaces in XML specification. Namespaces must be explicitly declared. The xmlns attribute ends up in the "http://www.w3.org/2000/xmlns/" namespace. Foreign elements can be used independently of HTML elements, as long as they are assigned to their own namespace.
Guidance for XHTML-HTML compatibility: Declare HTML namespaces (or other namespaces) explicitly and do not prefix XHTML elements. Do not depend on the behavior of foreign namespaced elements in an HTML setting; if you need to include these, you will probably wish to set this foreign markup via CSS to display:none. You should explicitly close (not self-close) all empty elements defined in a non-XHTML namespace, since otherwise when used in HTML, HTML will treat them as though they have not been closed.

Namespaced attributes on HTML elements
HTML requirement: Attributes of the form xmlns:prefix may not be used on HTML elements.
XHTML requirement: The xmlns:prefix attributes end up in the "http://www.w3.org/2000/xmlns/" namespace.
Guidance for XHTML-HTML compatibility: Do not use namespaced attributes on HTML elements. Do not depend on the behavior of foreign attributes in an HTML setting.

Namespace attributes on foreign elements
HTML requirement: Elements in the SVG namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/2000/svg". The attribute is optional because the namespace is implied during parsing. Elements in the MathML namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/1998/Math/MathML". The attribute is optional because the namespace is implied during parsing. Foreign elements may also have an xmlns:xlink attribute specified, if, and only if, it has the exact value "http://www.w3.org/1999/xlink". This attribute is optional, even if XLink attributes are used, because the namespace for XLink attributes is implied during parsing. When parsed by an HTML parser, the xmlns and xmlns:xlink attributes end up in the "http://www.w3.org/2000/xmlns/" namespace.
XHTML requirement: The SVG and MathML namespaces must be declared for SVG and MathML elements, respectively, according to the rules defined by Namespaces in XML. The xmlns and xmlns:prefix attributes end up in the "http://www.w3.org/2000/xmlns/" namespace.

XLink attributes
HTML requirement: Foreign elements may use the attributes xlink:actuate, xlink:arcrole, xlink:href, xlink:role, xlink:show, xlink:title and xlink:type. These attributes are placed in the "http://www.w3.org/1999/xlink" namespace. The prefix used must be "xlink".
XHTML requirement: XLink attributes may be specified on foreign elements using any prefix, subject to the conformance rules defined by Namespaces in XML. The XLink namespace must be declared according to the conformance rules defined by Namespaces in XML if XLink attributes are used within the document.
Guidance for XHTML-HTML compatibility: Do not use XLink attributes on HTML elements and do not depend on them on foreign elements, as they will not work as such in HTML. If being used, ensure they have the appropriate XLink namespace defined.

XML attributes
HTML requirement: Foreign elements may use the attributes xml:lang, xml:id, xml:base and xml:space. These attributes are placed in the "http://www.w3.org/XML/1998/namespace" namespace. The prefix used must be "xml". HTML elements may use the xml:lang attribute. The attribute in no namespace with no prefix and with the literal localname "xml:lang" has no effect on language processing (unlike "lang"). HTML elements must not use the xml:base, xml:space, or xml:id attributes.
XHTML requirement: Any element, including HTML elements, may use the attributes xml:lang, xml:id, xml:base and xml:space. These attributes are placed in the "http://www.w3.org/XML/1998/namespace" namespace. The prefix used must be "xml".
Guidance for XHTML-HTML compatibility: Though they can be used on foreign elements, do not use xml:base, xml:id, or xml:space on HTML elements; use both xml:lang and lang attributes whenever one is needed on HTML elements.

Attributes
HTML requirement: Names are not case sensitive. Attribute minimization is allowed (i.e. omitting the equals sign and the value).
XHTML requirement: Names are case sensitive (and lower case). Attribute minimization is not allowed.
Guidance for XHTML-HTML compatibility: Use lower case attribute names. Do not minimize attributes. Non-namespaced attributes not belonging to HTML will be included in the DOM tree and accessible to script and stylesheets, but it is discouraged to use these due to the potential for future naming conflicts; data- attributes can be used instead, or if in an XML-only environment, namespaced attributes.

Attribute values
HTML requirement: White space characters are not normalized. Unquoted attribute values are allowed. Fixed or default attribute values ...?
XHTML requirement: White space characters are normalized to single spaces (unless attribute is of CDATA type?). Unquoted attribute values are not allowed. Default attribute values could conceivably be defined with a DTD.
Guidance for XHTML-HTML compatibility: Create whitespace in attribute values which is already normalized (converted to single spaces). Always quote attribute values. Do not rely on defining default or fixed attribute values (or elements with exclusively element content) in a DTD (unless it matches HTML behavior).

Space characters
HTML requirement: The space characters are defined as: U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000C FORM FEED, U+000D CARRIAGE RETURN, U+0020 SPACE.
XHTML requirement: The space characters are defined as: U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000D CARRIAGE RETURN, U+0020 SPACE.
Notes: The difference is the inclusion of Form Feed. Form feed characters are discouraged in XML 1.1.
Guidance for XHTML-HTML compatibility: Do not use the form feed character.

DOCTYPE
HTML requirement: A DOCTYPE is a mostly useless, but required, header. The DOCTYPE is used during parsing to determine the parsing mode. The keywords "DOCTYPE", "PUBLIC" and "SYSTEM", and the name "html" are treated case insensitively. The system identifier "about:legacy-compat" (and the public and system identifiers for previous versions of HTML) are case sensitive. Conforming HTML documents are required to use <!DOCTYPE html> (case insensitively) or the legacy-compat version <!DOCTYPE html SYSTEM "about:legacy-compat">. When using the obsolete but conforming DOCTYPEs based on the HTML 4.0 and 4.01 Strict DTDs, the system identifier is optional. The obsolete but conforming DOCTYPEs based on XHTML 1.0 Strict and XHTML 1.1 may also be specified. Use of an internal subset is forbidden. The system identifier is never dereferenced by HTML implementations.
XHTML requirement: The DOCTYPE is optional. XML rules for case sensitivity apply (everything is case sensitive). Either of the DOCTYPEs defined in HTML5 may be used, or any other custom DOCTYPE. If the public identifier is specified, the system identifier must also be specified. The obsolete status of the obsolete permitted DOCTYPEs defined for HTML does not apply to XHTML. Any DOCTYPE may be used, subject to the conformance rules defined by XML. Use of an internal subset is permitted according to the requirements of XML. Some validating XML processors may dereference the system identifier, if used, but most browsers use non-validating processors.
Guidance for XHTML-HTML compatibility: Use the empty DOCTYPE with no SYSTEM or PUBLIC identifiers and no use of an internal subset.

Element names
HTML requirement: Element names are case insensitive.
XHTML requirement: Element names are case sensitive and lower-case.
Guidance for XHTML-HTML compatibility: Only use lower-case element names (as with attributes).

Void vs. non-void elements
HTML requirement: Void elements only have a start tag; end tags must not be specified for void elements, and it is impossible for them to contain any content. A trailing slash may optionally be inserted at the end of the element's tag, immediately before the closing greater-than sign. For non-void elements (e.g., <script>), the trailing slash is a parsing error (ignored and thus treated as unclosed).
XHTML requirement: Void elements may use either the empty-element tag syntax (EmptyElemTag) or use a start tag immediately followed by an end tag, with no content in between. While it is possible for the element to contain content, this is non-conforming.
Guidance for XHTML-HTML compatibility: For void elements (e.g., <br />), do not include content or use a closing tag; only use a self-closing element with closing slash at the end (with a space preceding it for the sake of older browsers). For non-void elements, i.e., where content can exist (e.g., <script>), always use an explicit closing tag (not a self-closing tag) even if there is no content.

Unexpected end tags
HTML requirement: In HTML, an unexpected </br> or </p> can cause the start tag to be implied before it.
XHTML requirement: Unexpected end tags are well-formedness errors.
Guidance for XHTML-HTML compatibility: Do not add end tags unless there is an explicit and properly nested open tag before it.

End tag with attributes
HTML requirement: ?
XHTML requirement: An end tag with attributes is not allowed.
Guidance for XHTML-HTML compatibility: Do not use end tags with attributes.

Raw text elements, RCDATA elements, foreign elements, normal elements
(No entries.)

Optional tags
HTML requirement: For some elements, the start tags and/or end tags are optional and are implied by certain specified conditions. For example, the end tag for the p element is implied by a subsequent p element. Omitting the end tag for other elements is a parse error and various error recovery procedures are applied appropriately.
XHTML requirement: End tags must be explicitly included for all elements, except empty elements using the EmptyElemTag syntax.
Guidance for XHTML-HTML compatibility: Always use end tags (or self-closing tags for void elements).

Comment syntax
HTML requirement: Comments must start with the four character sequence "<!--" and must be ended by the three character sequence "-->" (bogus comments such as those beginning with "<?" are deprecated). The content of comments must not start with a single U+003E GREATER-THAN SIGN ('>') character, nor start with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character, nor contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Violating these constraints is a parse error and various error recovery procedures are applied appropriately.
XHTML requirement: The content of comments must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a hyphen. Violating this is a well-formedness error.
Guidance for XHTML-HTML compatibility: Only use comments of the "<!--...-->" variety. Do not use two consecutive U+002D HYPHEN-MINUS (-) characters in comment content or end with such a hyphen (especially for the sake of XML). Do not begin comments with a single U+003E GREATER-THAN SIGN ('>') character, nor with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character.

Processing instructions
HTML requirement: HTML does not allow processing instructions and deprecates the bogus comments which appear in their form, whether in the form <?foo ...> (without a closing '?') or <?foo ...?>.
XHTML requirement: XHTML allows the use of XML processing instructions, which are only closed by "?>".
Guidance for XHTML-HTML compatibility: Avoid ">" inside processing instructions (as these will close the "instruction" (comment) prematurely) (or one must strip out processing instructions entirely). Processing instructions might need to be avoided entirely in case HTML may in future disallow them completely.

CDATA sections
HTML requirement: <![CDATA[...]]> is a bogus comment. The sequence of characters "]]>" in content when it does not mark the end of a CDATA section is just regular character data.
XHTML requirement: <![CDATA[...]]> is a CDATA section. The sequence of characters "]]>" in content when it does not mark the end of a CDATA section is a well-formedness error.
Guidance for XHTML-HTML compatibility: Ensure the sequence "]]>" in content is escaped (not necessary to escape in attribute values). Do not use CDATA sections.

Unescaped special characters
HTML requirement: Unescaped ampersands (U+0026 AMPERSAND - &, instead of &amp;) are permitted within the content of normal elements, RCDATA elements, foreign elements and attribute values where they are not considered to be ambiguous ampersands, and within raw text elements. Unescaped less-than signs (U+003C LESS-THAN SIGN - <, instead of &lt;) are permitted in raw text elements, RCDATA elements and attribute values, excluding the unquoted attribute value syntax.
XHTML requirement: Unescaped ampersands and less-than signs may not appear within CharData or AttValue (basically, the normal text content of elements and attribute values). Violation of this constraint is a well-formedness error.
Guidance for XHTML-HTML compatibility: Always escape ampersands and less-than signs in text content and attribute values. See CDATA sections for the need to escape the sequence "]]>" in text content.

Character references
(No entries.)

Entity references
HTML requirement: In HTML, all entity references are predefined and do not require a DTD.
XHTML requirement: There is no formal DTD for XHTML5, but one could provide an external DTD (if not an internal subset?) for use with one's entity-checking (or validating) parser, but be aware that browsers do not universally use external entity-checking (or validating) parsers and may not read the external DTD. (Some still have bugs in that they mistakenly create a well-formedness error out of such missing entities instead of showing them as missing, making them clickable, or using an entity-checking or validating parser.)
Guidance for XHTML-HTML compatibility: Do not use entity references in XHTML (except for the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;); use the equivalent Unicode or numeric character reference sequence instead.

Character data
Notes: The valid set of Unicode characters in XML 1.0 is limited beyond that in HTML (we need to specify this here).
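The error-handling contrast in the table (fatal well-formedness errors in XML, recovery in HTML) can be observed with Python's standard library. This is only a sketch: html.parser is a lenient tokenizer rather than the HTML5 parsing algorithm, and the sample markup and helper names (BROKEN, TextCollector, xml_accepts, html_text) are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input with two XML well-formedness errors:
# an unclosed <b> element and an unescaped ampersand.
BROKEN = '<p>An <b>unclosed element and a bare & ampersand</p>'


class TextCollector(HTMLParser):
    """Collects the character data reported by the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_accepts(text):
    """True if an XML parser accepts the text, False on a fatal error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def html_text(text):
    """Text content recovered by the HTML tokenizer, errors and all."""
    collector = TextCollector()
    collector.feed(text)
    collector.close()
    return ''.join(collector.chunks)
```

Here xml_accepts(BROKEN) reports a failure, while html_text(BROKEN) still yields the text, which is why the compatibility guidance asks authors to remove every well-formedness error before serving a document as XHTML.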

Element-specific parsing

In HTML, the script and style elements are parsed as CDATA elements. (Note: the definition of CDATA differs from that in XML.) In XML, they're parsed as normal elements (which means that things that look like comments are treated as real comments, and things that look like start tags actually are start tags).

In HTML, the title and textarea elements are parsed as RCDATA elements. (Note: the definition of RCDATA differs from that in SGML and there is no RCDATA in XML.)

In HTML, if scripting is enabled, the noscript element is parsed as a CDATA element. If scripting is disabled, it's parsed as a normal element. In XHTML, the element is always parsed as a normal element, and can't really be used to stop content from being present when script is disabled.

In HTML, the iframe, noembed and noframes elements are parsed as CDATA elements. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.

In HTML, tags for certain elements, which appear out of context, are ignored. This includes caption, col, colgroup, frame, frameset, head, option, optgroup, tbody, td, tfoot, th, thead, tr.

In XHTML, table elements may contain child tr elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).

The plaintext element has a special parsing requirement in HTML. (It is, however, forbidden.)

In HTML, a line feed that immediately follows a pre, listing or textarea start tag is ignored.

Many other special handling of edge cases and error conditions, not all of which are listed here, occur in HTML. (such as?)

The following are void elements in HTML (see void elements in table): in head (base, link, meta), in body (area, br, col, embed, hr, img, input, param).
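The script-element difference described in this section can be made concrete with a short Python sketch. The snippet is chosen to be well-formed XML so that both parsers accept it; what differs is whether the markup inside script becomes an element. The names SNIPPET, Recorder, html_view and xml_view are hypothetical, and html.parser only approximates HTML's treatment of script content.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: the script body contains text that looks like markup.
SNIPPET = '<div><script>document.title = "<b>hi</b>";</script></div>'


class Recorder(HTMLParser):
    """Records start tags and character data from the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def html_view(source):
    """(start tags, text) as reported by the HTML tokenizer."""
    recorder = Recorder()
    recorder.feed(source)
    recorder.close()
    return recorder.tags, ''.join(recorder.text)


def xml_view(source):
    """(element names, text) as reported by an XML parser."""
    root = ET.fromstring(source)
    return [el.tag for el in root.iter()], ''.join(root.itertext())
```

With the HTML tokenizer the b markup stays inside the script text and no b start tag is reported; with the XML parser, b becomes a real child element of script and disappears from the text.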

HTML Elements with Optional Tags


Element     Start Tag   End Tag
html        optional    optional
head        optional    optional
body        optional    optional
li          required    optional
dt          required    optional
dd          required    optional
p           required    optional
colgroup    optional    optional
thead       required    optional
tbody       optional    optional
tfoot       required    optional
tr          required    optional
th          required    optional
td          required    optional
rt          required    optional
rp          required    optional
optgroup    required    optional
option      required    optional

Scripts

document.write() and document.writeln() cannot be used in XHTML; they can in HTML.
In XHTML, the use of the innerHTML property requires that the string be a well-formed fragment of XML.
DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace.)
o Element.tagName and Node.nodeName return the value in uppercase in HTML but lower-case in XHTML (Node.localName is consistent now, as of HTML5).
o Document.createElement() is case insensitive (the canonical form is lowercase).
o Element.setAttributeNode() will change the attribute name to lowercase.
o Element.setAttribute() is case insensitive (the canonical form is lowercase).
o Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
o Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name will be lowercased before the rename takes place.
In HTML, Document.createElement() will create an element in the HTML namespace. In XML (including XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
o In XHTML, browsers lack interoperability in this area. In Firefox and Safari, the namespace is dependent upon the MIME type. In Opera, it's dependent upon the root element.
XPath expressions targeted at pre-HTML5 browsers need to use the XHTML namespace for XHTML and null for HTML. (HTML5 browsers would use the XHTML namespace even in HTML.)
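The point about XPath and the XHTML namespace can be illustrated outside a browser. The sketch below uses Python's ElementTree, which implements only a small subset of XPath, so it stands in for a browser's XPath engine under that assumption; the prefix "h", the sample document and the function names are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# Hypothetical XHTML document; every element is in the XHTML namespace.
DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>first</p><p>second</p></body>'
    '</html>'
)


def paragraphs_without_namespace(source):
    """Look up p elements in no namespace (the null-namespace query)."""
    root = ET.fromstring(source)
    return root.findall('.//p')


def paragraphs_with_namespace(source):
    """Look up p elements in the XHTML namespace via a bound prefix."""
    root = ET.fromstring(source)
    # 'h' is an arbitrary prefix bound to the XHTML namespace for the query.
    return root.findall('.//h:p', {'h': XHTML_NS})
```

The unqualified query finds nothing in the XML-parsed document, while the namespace-qualified query finds both paragraphs, which mirrors why expressions written for one serialization may need adjusting for the other.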

Stylesheets

Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
CSS requires special handling of the body element in HTML for painting backgrounds on the canvas, which does not apply to XHTML.
For polyglot documents, use lower-case element selectors and style the html and body elements appropriately (?).
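The case-sensitivity difference behind the selector rule can be sketched in Python. This is not a CSS engine: matches_type_selector is a hypothetical helper that only models the comparison of an element name against a type selector, and html.parser stands in for an HTML parser's lower-casing of tag names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical markup with mixed-case names (well-formed as XML).
MIXED = '<DIV><Span>x</Span></DIV>'


class NameProbe(HTMLParser):
    """Records tag names as normalized by the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def html_names(source):
    probe = NameProbe()
    probe.feed(source)
    probe.close()
    return probe.names


def xml_names(source):
    return [el.tag for el in ET.fromstring(source).iter()]


def matches_type_selector(element_name, selector, is_html):
    """Model of type-selector matching: ASCII case-insensitive for HTML
    documents, exact (case-sensitive) for XML documents."""
    if is_html:
        return element_name.lower() == selector.lower()
    return element_name == selector
```

Because the XML parser preserves case, a selector such as DIV written against a lower-case div element matches only in the HTML case, which is why lower-case names and selectors are the safe choice for polyglot documents.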
