microformats2-parsing: Difference between revisions

From Microformats Wiki
m (Changed protection level for "microformats2-parsing": allow registered users [edit=autoconfirmed:move=autoconfirmed])
(Proposed for e- parsing, normalized absolute URLs in all URL attributes except those with fragment only values, to solve issue 38 and allow for relative links to *inside* e-* properties to keep working in embedded use-cases)
 
(105 intermediate revisions by 11 users not shown)
Line 1: Line 1:
<entry-title>microformats2 parsing specification</entry-title>
{{DISPLAYTITLE:microformats2 parsing specification}}
 
<dfn style="font-style:normal;font-weight:bold">[[microformats2]]</dfn> is a simple, open format for marking up data in HTML. The microformats2 parsing specification describes how to [[#implementations|implement]] a microformats2 parser, independent of any specific vocabularies.
<span class="h-card vcard"><span class="p-name fn">[[User:Tantek|Tantek Çelik]]</span> (<span class="p-role role">Editor</span>)</span>
----
<dfn style="font-style:normal;font-weight:bold">[[microformats2]]</dfn> is a simple, open format for marking up data in HTML. The microformats2 parsing specification describes how to [[#implementations|implement]] a microformats2 parser.
 
One of the goals of [[microformats2]] is to greatly simplify parsing of microformats, in particular, by making parsing independent of any one vocabulary. This specification documents the microformats2 parsing algorithm for doing so.
 
{{cc0-owfa-license}}


;<span id="status">Status</span>
:This is a '''Living Specification''' with several interoperable [[#implementations|implementations]]. This specification is stable, subject to editorial changes only for improving clarity of existing meaning. While substantive changes are unexpected, it is a living specification subject to substantive change by issues and errata filed in response to implementation experience, requiring consensus among participating implementers (since 2015-01-21) as part of an explicit [[#change_control|change control]] process. There are currently no draft or proposed new features in this specification, and if any were to be added, they would be explicitly labeled as such.<br/>Note: This specification is only marked as a "Draft Specification" because of pending edits from [[microformats2-parsing-issues|resolved issues before 2016-06-20]]. Once those edits have been completed, the link to <nowiki>[[Category:Draft Specifications]]</nowiki> at the bottom of this document should be changed to <nowiki>[[Category:Specifications]]</nowiki>.
;Participate
:<span id="issues">[https://github.com/microformats/microformats2-parsing/issues Open Issues]</span>
:[[microformats2-parsing-issues|Resolved issues before 2016-06-20]]
:[[IRC]]: [irc://irc.libera.chat/microformats #microformats on Libera]
<div class="p-author h-card vcard">
;<span class="p-role role">Editor</span>
:<span class="p-name fn">[[User:Tantek|Tantek Çelik]]</span>
</div>
;License
: {{cc0-owfa-license}}
__TOC__
== algorithm ==
== algorithm ==
=== parse a document for microformats ===
=== parse a document for microformats ===
To parse a document for microformats:
To parse a document for microformats, follow the HTML parsing rules and do the following:
* start with an empty JSON "items" array and "rels" hash:  
* start with an empty JSON "items" array and "rels" & "rel-urls" hashes:  
<source lang=javascript>
<syntaxhighlight lang=json>
{
{
  "items": [],
"items": [],
  "rels": {}
"rels": {},
"rel-urls": {}
}
}
</source>
</syntaxhighlight>
* parse the root element for class microformats, adding to the JSON items array accordingly
* parse the root element for class microformats, adding to the JSON items array accordingly
* parse all hyperlink (<code>&lt;link> &lt;a></code>) elements for rel microformats, adding to the JSON rels hash accordingly
* parse all hyperlink (<code>&lt;a> &lt;area> &lt;link></code>) elements for rel microformats, adding to the JSON rels & rel-urls hashes accordingly
* return the resulting JSON
* return the resulting JSON
Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).
Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).
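As a minimal sketch in Python, the starting structure of the newer revision and the set of hyperlink elements named above might look like the following. The names <code>empty_parse_result</code> and <code>HYPERLINK_ELEMENTS</code> are illustrative assumptions, not part of the specification.

```python
import json

# Elements treated as hyperlinks for rel parsing, per the newer revision.
HYPERLINK_ELEMENTS = frozenset({"a", "area", "link"})


def empty_parse_result():
    """Return the empty structure a parse starts from (hypothetical helper)."""
    return {"items": [], "rels": {}, "rel-urls": {}}


if __name__ == "__main__":
    # Serialise the empty result the way a parser would return it.
    print(json.dumps(empty_parse_result()))
```

A fresh dictionary is built on every call so that two parses never share the same "items" array or hashes.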
Line 26: Line 33:
=== parse an element for class microformats ===
=== parse an element for class microformats ===
To parse an element for class microformats:
To parse an element for class microformats:
* parse element class for root class name(s) "h-x" (and backcompat)
* parse element class for root class name(s) "h-*" and if none, backcompat root classes
** if not found, parse child elements for microformats (depth first, doc order)
** if none found, parse child elements for microformats (depth first, doc order)
** else if found, start parsing a new microformat
** else if found, start parsing a new microformat
*** keep track of whether the root class name(s) was from backcompat
*** create a new { } structure with:
**** <code>type: <nowiki>[array of unique microformat "h-*" type(s) on the element sorted alphabetically]</nowiki>,</code>
**** <code>properties: { } </code> - to be filled in when that element itself is parsed for microformats properties
**** if the element has a non-empty <code>id</code> attribute:
***** <code>id:</code> string value of element's id attribute
*** parse child elements (document order) by:
*** parse child elements (document order) by:
**** parse a child element for properties (p-,u-,dt-,e-)
**** if parsing a backcompat root, parse child element class name(s) for backcompat properties
***** add properties found to current microformat
**** else parse a child element class for property class name(s) "p-*,u-*,dt-*,e-*"
**** if such class(es) are found, it is a property element
***** add properties found to current microformat's <code>properties: { } </code> structure
**** parse a child element for microformats (recurse)
**** parse a child element for microformats (recurse)
***** if that child element itself has a microformat and is a property element, add it into the array of values for that property
***** if that child element itself has a microformat ("h-*" or backcompat roots) and is a property element, add it into the array of values for that property as a { } structure, add to that { } structure:
****** <code>value</code>:
******* if it's a <code>p-*</code> property element, use the first p-name of the h-* child
******* else if it's an <code>e-*</code> property element, re-use its { } structure with existing <code>value:</code> inside.
******* else if it's a <code>u-*</code> property element and the h-* child has a u-url, use the first such u-url
******* else use the parsed property value per p-*,u-*,dt-* parsing respectively
***** else add found elements that are microformats to the "children" array
***** else add found elements that are microformats to the "children" array
*** imply properties for the found microformat (see below)
*** imply properties for the found microformat (see below)
The "*" for root (and property) class names consists of an optional vendor prefix (series of 1+ number or lowercase a-z characters i.e. <code>[0-9a-z]+</code>, followed by '-'), then one or more '-' separated lowercase a-z words.
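The class name pattern described above can be sketched as a regular expression. This is one possible reading of the rule, written in Python; the helper names <code>root_types</code> and <code>property_names</code> are assumptions made for illustration.

```python
import re

# Optional vendor prefix ([0-9a-z]+ followed by '-'), then one or more
# '-'-separated lowercase a-z words.
_NAME = r"(?:[0-9a-z]+-)?[a-z]+(?:-[a-z]+)*"
ROOT_RE = re.compile(r"h-" + _NAME)
PROPERTY_RE = re.compile(r"(p|u|dt|e)-" + _NAME)


def root_types(class_attr):
    """Return the unique h-* class names of an element, sorted alphabetically."""
    names = {c for c in class_attr.split() if ROOT_RE.fullmatch(c)}
    return sorted(names)


def property_names(class_attr):
    """Return (prefix, name) pairs for each property class name, in order."""
    found = []
    for c in class_attr.split():
        match = PROPERTY_RE.fullmatch(c)
        if match:
            prefix = match.group(1)
            found.append((prefix, c[len(prefix) + 1:]))
    return found
```

Sorting and de-duplicating in <code>root_types</code> mirrors the "array of unique microformat h-* type(s) ... sorted alphabetically" wording of the algorithm.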


=== parse an element for properties ===
=== parse an element for properties ===
==== parsing a p- property ====
To parse an element for a p-x property value:
* parse the element for the [[value-class-pattern]], if a value is found then return it.
* if abbr.p-x[title], then return the title attribute
* else if data.p-x[value], then return the value attribute
* else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
* else return the innertext of the element, replacing any nested <code>&lt;img></code> elements with their <code>alt</code> attribute if present, or otherwise their <code>src</code> attribute if present.


==== parsing a u- property ====
==== parsing a <code>p-</code> property ====
To parse an element for a u-x property value:
 
* if a.u-x[href] or area.u-x[href], then get the href attribute
To parse an element for a <code>p-x</code> property value (whether explicit <code>p-*</code> or backcompat equivalent):
* else if img.u-x[src], then get the src attribute
 
* else if object.u-x[data], then get the data attribute
* Parse the element for the [[value-class-pattern]]. If a value is found, return it.
* if there is a gotten value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element if any).
* If <code>abbr.p-x[title]</code> or <code>link.p-x[title]</code>, return the <code>title</code> attribute.
* else parse the element for the [[value-class-pattern]], if a value is found then return it.
* else if <code>data.p-x[value]</code> or <code>input.p-x[value]</code>, then return the <code>value</code> attribute
* else if abbr.u-x[title], then return the title attribute
* else if <code>img.p-x[alt]</code> or <code>area.p-x[alt]</code>, then return the <code>alt</code> attribute
* else if data.u-x[value], then return the value attribute
* else return the <code>textContent</code> of the element after:
* else return the innertext of the element.
** dropping any nested <code>&lt;script&gt;</code> &amp; <code>&lt;style&gt;</code> elements;
** replacing any nested <code>&lt;img&gt;</code> elements with their <code>alt</code> attribute, if present; otherwise their <code>src</code> attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative;
** removing all leading/trailing spaces
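The attribute precedence of the newer <code>p-</code> rules can be sketched as below. This is a simplified illustration, not a full implementation: the value-class-pattern step is omitted, and the function assumes the caller passes the lowercase element name, a dictionary of attributes, and a <code>textContent</code> string that has already had <code>script</code>, <code>style</code> and <code>img</code> handling applied. The name <code>parse_p_value</code> is hypothetical.

```python
HTML_SPACES = " \t\n\f\r"  # ASCII whitespace as used by HTML


def parse_p_value(tag, attrs, text_content):
    """Pick a p-* value by attribute precedence (value-class-pattern omitted)."""
    if tag in ("abbr", "link") and "title" in attrs:
        return attrs["title"]
    if tag in ("data", "input") and "value" in attrs:
        return attrs["value"]
    if tag in ("img", "area") and "alt" in attrs:
        return attrs["alt"]
    # Fallback: the prepared text content with leading/trailing spaces removed.
    return text_content.strip(HTML_SPACES)
```

For example, <code>parse_p_value("abbr", {"title": "Doctor"}, "Dr.")</code> selects the <code>title</code> attribute rather than the visible text.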
 
==== parsing a <code>u-</code> property ====
 
To parse an element for a <code>u-x</code> property value (whether explicit <code>u-*</code> or backcompat equivalent):
 
* if <code>a.u-x[href]</code> or <code>area.u-x[href]</code> or <code>link.u-x[href]</code>, then get the <code>href</code> attribute
* else if <code>img.u-x[src]</code> return the result of "parse an img element for src and alt" (see Sec.1.5)
* else if <code>audio.u-x[src]</code> or <code>video.u-x[src]</code> or <code>source.u-x[src]</code> or <code>iframe.u-x[src]</code>, then get the <code>src</code> attribute
* else if <code>video.u-x[poster]</code>, then get the <code>poster</code> attribute
* else if <code>object.u-x[data]</code>, then get the <code>data</code> attribute
* else parse the element for the [[value-class-pattern]]. If a value is found, get it
* else if <code>abbr.u-x[title]</code>, then get the <code>title</code> attribute
* else if <code>data.u-x[value]</code> or <code>input.u-x[value]</code>, then get the <code>value</code> attribute
* else get the <code>textContent</code> of the element after removing all leading/trailing spaces and nested <code>&lt;script></code> &amp; <code>&lt;style></code> elements
* return the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element, if any).
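A rough Python sketch of the newer <code>u-</code> rules follows. It leaves out the value-class-pattern step and the <code>img</code> case (which returns a structure rather than a plain URL), and it assumes <code>base_url</code> already reflects any <code>&lt;base&gt;</code> element. <code>urllib.parse.urljoin</code> resolves relative URLs but does not perform every possible normalization, so treat this as an approximation. The name <code>parse_u_value</code> is hypothetical.

```python
from urllib.parse import urljoin

HTML_SPACES = " \t\n\f\r"
_SRC_ELEMENTS = ("audio", "video", "source", "iframe")


def parse_u_value(tag, attrs, text_content, base_url):
    """Get a u-* value by attribute precedence, then resolve it against base_url."""
    if tag in ("a", "area", "link") and "href" in attrs:
        value = attrs["href"]
    elif tag in _SRC_ELEMENTS and "src" in attrs:
        value = attrs["src"]
    elif tag == "video" and "poster" in attrs:
        value = attrs["poster"]
    elif tag == "object" and "data" in attrs:
        value = attrs["data"]
    elif tag == "abbr" and "title" in attrs:
        value = attrs["title"]
    elif tag in ("data", "input") and "value" in attrs:
        value = attrs["value"]
    else:
        value = text_content.strip(HTML_SPACES)
    # Every branch ends with the same resolution step.
    return urljoin(base_url, value)
```

Keeping the resolution as a single final step matches the structure of the list above, where each rule "gets" a value and only the last rule returns.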


==== parsing a dt- property ====
==== parsing a <code>dt-</code> property ====
To parse an element for a dt-x property value:
 
* parse the element for the [[value-class-pattern]] including the date and time parsing rules, if a value is found then return it.
To parse an element for a <code>dt-x</code> property value (whether explicit <code>dt-*</code> or backcompat equivalent):
* if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
 
* else if abbr.dt-x[title], then return the title attribute
* parse the element for the [[value-class-pattern]], including the date and time parsing rules. If a value is found, then return it.
* else if data.dt-x[value], then return the value attribute
* if <code>time.dt-x[datetime]</code> or <code>ins.dt-x[datetime]</code> or <code>del.dt-x[datetime]</code>, then return the <code>datetime</code> attribute
* else return the innertext of the element.
* else if <code>abbr.dt-x[title]</code>, then return the <code>title</code> attribute
* else if <code>data.dt-x[value]</code> or <code>input.dt-x[value]</code>, then return the <code>value</code> attribute
* else return the <code>textContent</code> of the element after removing all leading/trailing spaces and nested <code>&lt;script&gt;</code> &amp; <code>&lt;style&gt;</code> elements.
 
==== parsing an <code>e-</code> property ====
To parse an element for an <code>e-x</code> property value (whether explicit <code>e-*</code> or backcompat equivalent):
 
* return a dictionary with two keys:
** <code>html</code>: the <code>innerHTML</code> of the element by using the [https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments HTML spec: Serializing HTML Fragments algorithm], with leading/trailing spaces removed. Proposed: also normalize to absolute URLs all URL attributes except those whose values are fragment-only, i.e. start with '#' ([https://github.com/microformats/microformats2-parsing/issues/38 issue 38]).
** <code>value</code>: the <code>textContent</code> of the element after:
*** dropping any nested <code>&lt;script&gt;</code> &amp; <code>&lt;style&gt;</code> elements;
*** replacing any nested <code>&lt;img&gt;</code> elements with their <code>alt</code> attribute, if present; otherwise their <code>src</code> attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative;
*** removing all leading/trailing spaces
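The <code>value</code> steps above (drop <code>script</code> and <code>style</code>, substitute <code>img</code> elements, trim spaces) can be sketched with Python's standard <code>html.parser</code>. This sketch assumes the spaces are added around a substituted <code>src</code> only, which is one reading of the rule, and it works on a serialized fragment rather than a real DOM. The names <code>text_value</code> and <code>_TextValue</code> are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

HTML_SPACES = " \t\n\f\r"


class _TextValue(HTMLParser):
    """Collect text, skipping script/style and substituting img elements."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "img" and self.skip_depth == 0:
            attrs = dict(attrs)
            if "alt" in attrs:
                self.parts.append(attrs["alt"] or "")
            elif attrs.get("src"):
                # Assumption: only the src substitution is padded with spaces.
                self.parts.append(" " + urljoin(self.base_url, attrs["src"]) + " ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def text_value(inner_html, base_url):
    """Return the plain-text value for an HTML fragment."""
    parser = _TextValue(base_url)
    parser.feed(inner_html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip(HTML_SPACES)
```

Stripping with an explicit set of HTML space characters, rather than a bare <code>strip()</code>, avoids also removing non-breaking spaces, which the rule does not mention.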


==== parsing an e- property ====
To parse an element for a e-x property value:
* return the innerHTML of the element by using the [http://www.whatwg.org/specs/web-apps/current-work/multipage/the-end.html#serializing-html-fragments HTML spec: Serializing HTML Fragments algorithm].


==== parsing for implied properties ====
==== parsing for implied properties ====
To imply properties: (where h-x is the root microformat element being parsed)
 
* if no explicit "name" property,  
Imply properties only on explicit <code>h-x</code> class name root microformat element (no backcompat roots):
 
* if no explicit "<code>name</code>" property, and no other <code>p-*</code> or <code>e-*</code> properties, and no nested microformats,
* then imply by:
* then imply by:
** if img.h-x then use its alt attribute for name
** if <code>img.h-x</code> or <code>area.h-x</code>, then use its <code>alt</code> attribute for name
** else if abbr.h-x[title] then use its title attribute for name
** else if <code>abbr.h-x[title]</code> then use its <code>title</code> attribute for name
** else if .h-x>img:only-child then use that img alt for name
** else if <code>.h-x>img:only-child[alt]:not([alt=""]):not[.h-*]</code> then use that <code>img</code>’s <code>alt</code> for name
** else if .h-x>abbr:only-child[title] then use that abbr title for name
** else if <code>.h-x>area:only-child[alt]:not([alt=""]):not[.h-*]</code> then use that <code>area</code>’s <code>alt</code> for name
** else if .h-x>:only-child>img:only-child use that img alt for name
** else if <code>.h-x>abbr:only-child[title]:not([title=""]):not[.h-*]</code> then use that <code>abbr</code> <code>title</code> for name
** else if .h-x>:only-child>abbr:only-child[title] use that abbr title for name
** else if <code>.h-x>:only-child:not[.h-*]>img:only-child[alt]:not([alt=""]):not[.h-*]</code> then use that <code>img</code>’s <code>alt</code> for name
** else use the innertext of the .h-x for name
** else if <code>.h-x>:only-child:not[.h-*]>area:only-child[alt]:not([alt=""]):not[.h-*]</code> then use that <code>area</code>’s <code>alt</code> for name
** drop leading & trailing white-space from name, including nbsp
** else if <code>.h-x>:only-child:not[.h-*]>abbr:only-child[title]:not([title=""]):not[.h-*]</code> use that <code>abbr</code>’s <code>title</code> for name
* if no explicit "photo" property,  
** else use the <code>textContent</code> of the <code>.h-x</code> for <code>name</code> after:
*** dropping any nested <code>&lt;script&gt;</code> &amp; <code>&lt;style&gt;</code> elements;
*** replacing any nested <code>&lt;img&gt;</code> elements with their <code>alt</code> attribute, if present;
** remove all leading/trailing spaces
* if no explicit "<code>photo</code>" property, and no other explicit <code>u-*</code> (Proposed: change to: <code>u-*</code> or <code>e-*</code>) properties, and no nested microformats,
* then imply by:
* then imply by:
** if img.h-x[src] then use src for photo
** if <code>img.h-x[src]</code>, then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo
** else if object.h-x[data] then use data for photo
** else if <code>object.h-x[data]</code> then use <code>data</code> for photo
** else if .h-x>img[src]:only-of-type then use that img src for photo
** else if <code>.h-x>img[src]:only-of-type:not[.h-*]</code> then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo
** else if .h-x>object[data]:only-of-type then use that object data for photo
** else if <code>.h-x>object[data]:only-of-type:not[.h-*]</code> then use that <code>object</code>’s <code>data</code> for photo
** else if .h-x>:only-child>img[src]:only-of-type then use that img src for photo
** else if <code>.h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*]</code>, then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo
** else if .h-x>:only-child>object[data]:only-of-type then use that object data for photo
** else if <code>.h-x>:only-child:not[.h-*]>object[data]:only-of-type:not[.h-*]</code>, then use that <code>object</code>’s <code>data</code> for photo
* if no explicit "url" property,
** if there is a gotten photo value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element, if any).
* if no explicit "<code>url</code>" property, and no other explicit <code>u-*</code> (Proposed: change to: <code>u-*</code> or <code>e-*</code>) properties, and no nested microformats,
* then imply by:
* then imply by:
** if a.h-x[href] then use href for url
** if <code>a.h-x[href]</code> or <code>area.h-x[href]</code> then use that <code>[href]</code> for url
** else if .h-x>a[href]:only-of-type then use that a[href] for url
** else if <code>.h-x>a[href]:only-of-type:not[.h-*]</code>, then use that <code>[href]</code> for url
** else if <code>.h-x>area[href]:only-of-type:not[.h-*]</code>, then use that <code>[href]</code> for url
** else if <code>.h-x>:only-child:not[.h-*]>a[href]:only-of-type:not[.h-*]</code>, then use that <code>[href]</code> for url
** else if <code>.h-x>:only-child:not[.h-*]>area[href]:only-of-type:not[.h-*]</code>, then use that <code>[href]</code> for url
** if there is a gotten url value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element, if any).
 
<strong>Note:</strong> The same markup for a property should not be causing that property to occur in <em>both</em> a microformat and one embedded inside - such a property should only be showing up on one of them. The parsing algorithm has details to prevent that, such as the <code>:not[.h-*]</code> tests above.
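To illustrate how the selector-style rules translate into code, here is a hedged sketch of the newer implied <code>url</code> rules using Python's <code>xml.etree.ElementTree</code>. It assumes a well-formed fragment, and it assumes the caller has already checked the preconditions (no explicit <code>url</code>, no other <code>u-*</code> properties, no nested microformats). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def has_root_class(el):
    """True if the element has any class name starting with "h-"."""
    return any(c.startswith("h-") for c in el.get("class", "").split())


def only_of_type(parent, tag):
    """Return the child with this tag if it is the only one of its type."""
    matches = [child for child in parent if child.tag == tag]
    return matches[0] if len(matches) == 1 else None


def implied_url(root, base_url):
    """Imply url for the root microformat element, or return None."""
    if root.tag in ("a", "area") and root.get("href") is not None:
        return urljoin(base_url, root.get("href"))
    # Search the root first, then its only child if that child is not h-*.
    parents = [root]
    children = list(root)
    if len(children) == 1 and not has_root_class(children[0]):
        parents.append(children[0])
    for parent in parents:
        for tag in ("a", "area"):
            el = only_of_type(parent, tag)
            if el is not None and el.get("href") is not None and not has_root_class(el):
                return urljoin(base_url, el.get("href"))
    return None
```

For instance, <code>implied_url(ET.fromstring('&lt;div class="h-card"&gt;&lt;a href="/me"&gt;Me&lt;/a&gt;&lt;/div&gt;'), "http://example.com/")</code> resolves the single link, while a root with two <code>a</code> children yields no implied url.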


=== parse a hyperlink element for rel microformats ===
=== parse a hyperlink element for rel microformats ===
To parse a hyperlink element for rel microformats: (where * is the hyperlink element)
To parse a hyperlink element (e.g. <code>a</code> or <code>link</code>) for rel microformats, use the following algorithm or one that produces equivalent results:
* if the "rel" attribute of the element is empty then exit
* if the "rel" attribute of the element is empty then exit
* set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element if any).
* set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element if any).
* treat the "rel" attribute of the element as a space-separated set of rel values
* treat the "rel" attribute of the element as a space-separated set of rel values
* if the set of rel values does NOT have "alternate" then
* for each rel value (rel-value)
** for each rel value (rel-value)
** if there is no key rel-value in the rels hash then create it with an empty array as its value
*** if there is no key rel-value in the rels hash then create it with an empty array as its value
** if url is not in the array of the key rel-value in the rels hash then add url to the array
*** add url to the array of the key rel-value in the rels hash
* end for
** end for
* if there is no key with name url in the top-level "rel-urls" hash then add a key with name url there, with an empty hash value
* else
* add keys to the hash of the key with name url for each of these attributes (if present) and key not already set:
** if there is no top level "alternates" key in the JSON, then create it with an empty array as its value
** "hreflang": the value of the "hreflang" attribute
** add a new hash to the array with keys for each of these attributes when present:
** "media": the value of the "media" attribute
*** "url": url
** "title": the value of the "title" attribute
*** "rel": the set of rel values appended with spaces, except "alternate"
** "type": the value of the "type" attribute
*** "media": the value of the "media" attribute
** "text": the text content of the element if any
*** "hreflang": the value of the "hreflang" attribute
* if there is no "rels" key in that hash, add it with an empty array value
*** "type": the value of the "type" attribute
* set the value of that "rels" key to an array of all unique items in the set of rel values unioned with the current array value of the "rels" key, sorted alphabetically.
* end if
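The newer revision of this algorithm (the one that fills "rels" and "rel-urls", not the older "alternates" variant) can be sketched in Python as follows. The function name <code>parse_rel</code> and its argument shapes are assumptions; treating a missing <code>href</code> as a reason to exit is also an assumption, since the steps above do not cover that case.

```python
from urllib.parse import urljoin


def parse_rel(result, attrs, text, base_url):
    """Add one hyperlink element's rel data to result["rels"] and result["rel-urls"]."""
    rel_values = (attrs.get("rel") or "").split()
    href = attrs.get("href")
    if not rel_values or href is None:  # assumption: no href means nothing to add
        return
    url = urljoin(base_url, href)

    for rel in rel_values:
        urls = result["rels"].setdefault(rel, [])
        if url not in urls:
            urls.append(url)

    entry = result["rel-urls"].setdefault(url, {})
    for key in ("hreflang", "media", "title", "type"):
        if key in attrs and key not in entry:
            entry[key] = attrs[key]
    if text and "text" not in entry:
        entry["text"] = text
    # Union with any rel values recorded earlier for the same URL.
    entry["rels"] = sorted(set(entry.get("rels", [])) | set(rel_values))
```

Calling it once per <code>a</code>, <code>area</code> or <code>link</code> element, in document order, builds both hashes incrementally, so the first occurrence of an attribute for a given URL wins.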


==== rel parse examples ====
==== rel parse examples ====
Line 118: Line 171:


E.g. parsing this markup:
E.g. parsing this markup:
<source lang=xml>
<syntaxhighlight lang=html>
<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="author" href="http://example.com/b">author b</a>
Line 127: Line 180:
   media="handheld"
   media="handheld"
   hreflang="fr">French mobile homepage</a>
   hreflang="fr">French mobile homepage</a>
</source>
</syntaxhighlight>


Would generate this JSON:
Would generate this JSON:
<source lang=javascript>
<syntaxhighlight lang=json>
{
{
   "items": [],
   "items": [],
   "rels": {  
   "rels": {  
     "author": [ "http://example.com/a", "http://example.com/b" ],
     "author": [ "http://example.com/a", "http://example.com/b" ],
     "in-reply-to": [ "http://example.com/1", "http://example.com/2" ]  
     "in-reply-to": [ "http://example.com/1", "http://example.com/2" ],
    "alternate": [ "http://example.com/fr" ],
    "home": [ "http://example.com/fr" ]  
   },
   },
   "alternates": [{
   "rel-urls": {
    "url": "http://example.com/fr",
    "http://example.com/a": {
    "rel": "home",  
      "rels": ["author"],
    "media": "handheld",  
      "text": "author a"
    "hreflang": "fr"  
    },
   }]
    "http://example.com/b": {
      "rels": ["author"],
      "text": "author b"
    },
    "http://example.com/1": {
      "rels": ["in-reply-to"],
      "text": "post 1"
    },
    "http://example.com/2": {
      "rels": ["in-reply-to"],
      "text": "post 2"
    },
    "http://example.com/fr": {
      "rels": ["alternate", "home"],
      "media": "handheld",  
      "hreflang": "fr",
      "text": "French mobile homepage"
    }
   }
}
}
</source>
</syntaxhighlight>
 
 
=== parse an img element for src and alt ===
To parse an <code>img</code> element for <code>src</code> and <code>alt</code> attributes:
* if <code>img[alt]</code>
** return a new <code>{}</code> structure with
*** <code>value</code>: the element's <code>src</code> attribute as a normalized absolute URL, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element, if any).
*** <code>alt</code>: the element's <code>alt</code> attribute
* else
** return the element's <code>src</code> attribute as a normalized absolute URL, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <code>&lt;base&gt;</code> element, if any).
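A small Python sketch of this rule is given below. It assumes the caller only reaches it for an <code>img</code> that has a <code>src</code> attribute, and that <code>base_url</code> already accounts for any <code>&lt;base&gt;</code> element. The name <code>parse_img</code> is hypothetical.

```python
from urllib.parse import urljoin


def parse_img(attrs, base_url):
    """Return {"value", "alt"} when alt is present, else the resolved src string."""
    src = urljoin(base_url, attrs.get("src", ""))
    if "alt" in attrs:
        return {"value": src, "alt": attrs["alt"]}
    return src
```

Note that the test is for the presence of <code>alt</code>, so an empty <code>alt=""</code> still produces the structured form.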


Another parse output example can be found here:
* https://gist.github.com/barnabywalters/5480962


== what do the CSS selector expressions mean ==
== what do the CSS selector expressions mean ==
Line 153: Line 234:


Use [http://gallery.theopalgroup.com/selectoracle/ SelectORacle] to expand any of the above CSS selector expressions into longform English prose.
Use [http://gallery.theopalgroup.com/selectoracle/ SelectORacle] to expand any of the above CSS selector expressions into longform English prose.
Exception:
* ''':not[.h-*]''' is not a valid CSS selector but is used here to mean:
** does not have any class names that start with "h-"
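Read literally, the exception above amounts to a simple check on an element's class names. A minimal Python sketch, with the hypothetical name <code>matches_not_h</code>:

```python
def matches_not_h(class_attr):
    """True if no class name in the attribute value starts with "h-"."""
    return not any(name.startswith("h-") for name in class_attr.split())
```

An element with no <code>class</code> attribute can be passed as an empty string and trivially satisfies the test.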
== note HTML parsing rules ==
''This section is non-normative.''
microformats2 parsers are expected to follow HTML parsing rules, which include, for example:
* ignore <code>&lt;template&gt;</code> elements - content between <code>&lt;template&gt;</code> tags does not end up in the DOM
** test-case in the wild: http://sixtwothree.org/blog/now-accepting-webmentions/
'spaces' in the algorithm refers to HTML collapsible blank spaces of any kind, whether ' ' or linefeed, carriage return, etc.
== note backward compatibility details ==
The parsing algorithm and details refer to "backcompat root classes" (backcompat roots for short) and "backcompat properties". These conditions and steps in the algorithm document how to parse pre-microformats2 microformats which all defined their own specific root class names and explicit sets of properties.
Some details to be aware of (these are explicitly in the algorithm; this is just an informal summary):
* If an element has one or more microformats2 root class name(s) (<code>h-*</code>)
** all backcompat root class names are ignored on that element.
** all backcompat properties, without an intervening root class name, are ignored inside that element
* If an element has only a backcompat root class name (or names)
** all microformats2 property class names (p-* u-* dt-* e-*), without an intervening element with root class name, are ignored inside that element
** there is no implied property value parsing (p-name, u-url, u-photo) for that element
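The first decision in that summary, whether an element is parsed as a microformats2 root or as a backcompat root, can be sketched as below. The mapping shown is only a small illustrative subset of classic root class names, and the function name <code>classify_root</code> is an assumption.

```python
# Illustrative subset of backcompat root class names and their h-* equivalents.
BACKCOMPAT_ROOTS = {
    "vcard": "h-card",
    "hentry": "h-entry",
    "vevent": "h-event",
    "adr": "h-adr",
    "geo": "h-geo",
}


def classify_root(class_attr):
    """Return (types, is_backcompat) for an element's class attribute."""
    names = class_attr.split()
    mf2_types = sorted({n for n in names if n.startswith("h-")})
    if mf2_types:
        # Any h-* root class means backcompat root names are ignored.
        return mf2_types, False
    classic = sorted({BACKCOMPAT_ROOTS[n] for n in names if n in BACKCOMPAT_ROOTS})
    return classic, bool(classic)
```

The returned flag corresponds to "keep track of whether the root class name(s) was from backcompat": when it is true, only backcompat properties are parsed inside the element and no properties are implied.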
=== backward compatibility mappings ===
Note: several parser implementations have encoded backward compatible mappings into source and data files. Implementers of parsers may find these useful:
* search for "modules.maps[" in https://github.com/glennjones/microformat-shiv/blob/master/microformat-shiv.js
* search for "$classicPropertyMap" in https://github.com/indieweb/php-mf2/blob/master/Mf2/Parser.php
* https://github.com/tommorris/mf2py/blob/master/mf2py/backcompat.py


== questions ==
== questions ==
See the FAQ:
See the FAQ:
* [[microformats2-parsing-faq]]
* [[microformats2-parsing-faq]]
== issues ==
See the issues page:
* [[microformats2-parsing-issues]]


== implementations ==
== implementations ==
{{main|microformats2#Implementations}}
{{main|microformats2#Implementations}}
There are open source [[microformats2#Implementations|microformats2 parsers]] available for Javascript, node.js, PHP, and Ruby.
There are open source [[microformats2#Implementations|microformats2 parsers]] available for JavaScript, Node.js, PHP, Ruby, and Python.


== test suite ==
== test suite ==
See:
See:
* https://github.com/microformats/tests
* https://github.com/microformats/tests
* https://github.com/indieweb/php-mf2/tree/master/tests/mf2
* https://github.com/indieweb/php-mf2/tree/master/tests/Mf2


Ports to/for other languages encouraged.
Ports to/for other languages encouraged.
== change control ==
Minor editorial changes (e.g. fixing minor typos or punctuation) that do not change and preferably clarify the structure and existing intended meaning may be done by anyone without filing issues, requiring only a sufficient "Summary" description field entry for the edit. More than minor but still purely editorial changes may be made by an editor. Anyone may question such editorial changes by undoing corresponding edits without filing an issue. Any further reversion or iteration on such an editorial change must be done by filing an issue.
Per the stable status of this document, substantive issue filing, resolution, and edits are done with the following change control steps, which may nearly all be done asynchronously once an issue is filed to reach the required state of "Resolve by implementation verified rough consensus". All steps should be openly documented (e.g. on this wiki or GitHub issues) such that others may later verify the history of an issue, and all steps are encouraged to be announced on #microformats [[IRC]] with a link to the issue.
* '''File an issue.''' Anyone may file a new [[microformats2-parsing-issues|issue]] and is encouraged to do so, with the caveat that per the stable status, only issues originating as a result of implementation experience are likely to have a substantive impact on this specification.
* '''Propose a resolution.''' Anyone may propose resolutions to existing issues, and should encourage others in the community, especially implementers, to provide feedback. Proposed resolutions should include what specific text of the spec needs changing, preferably with replacement text, and test cases if applicable (e.g. a resolution could just document the current state more precisely without needing to provide any new test cases).
* '''Provide feedback on proposed resolution(s).''' Anyone may provide feedback on proposed resolutions with their name attached, in summary (e.g. +1/0/-1 opinions), and additionally with reasoning (required for objections, optional otherwise), or optionally with suggested improvements, or counter-proposals (per "Propose a resolution" above).
* '''Iterate to resolve objections if any.''' If there are any objections to a proposed resolution, proposer(s) and anyone agreeing should work to iterate on the proposal to resolve objections to the satisfaction (or at least withdrawal of objection) of anyone objecting. The more objections resolved the better, and incremental progress is forward progress.
* '''Broaden implementer consensus.''' Proposer(s) and anyone agreeing should reach out (e.g. via #microformats [[IRC]]) to multiple implementers of the specification to get their opinions and feedback on proposal(s). The more implementers providing feedback the better. Iterate to resolve any new objections per "Iterate to resolve objections if any" above.
* '''Encourage and get 1+ implementation(s).''' Encourage, get, and document 1+ implementation(s) of implementation affecting aspects of a proposed resolution, preferably with a test case if applicable.
* '''Resolve by implementation verified rough consensus.''' Once there is rough consensus on a proposal (where <dfn>rough consensus</dfn> means <strong>either no objections, or at a minimum no objections by implementers, and explicit positive opinions by 2+ implementers</strong>) and a proposal's feasibility is verified by at least 1 implementation of aspects of the proposal that affect implementations (none needed if there are none, e.g. purely editorial), cite those in a "Resolution:" statement on the issue (e.g. in a comment), and explicitly share this resolution statement and link to issue in the #microformats [[IRC]] channel.
* '''Edit specification.''' Normally the spec editor(s) will make edits per implementation verified rough consensus resolutions, however anyone (though especially issue discussion participants) may make a specification edit per a resolution if they are able to verify the citations that the resolution has achieved rough consensus, and has 1+ implementation(s) per "Encourage and get 1+ implementation(s)" above. Edits must contain a "Summary" field entry that at a minimum mentions the issue by name, should provide a URL to the issue resolution, and preferably be done by a spec editor or an implementer. Once the edit is made, the issue should be closed, or at least a comment made on the issue requesting that the opener of the issue close the issue.
These change control steps are inspired by the tradition of "Rough consensus and running code" as exhibited by example by IETF and W3C processes, and in that regard, seek to be a philosophically compatible approach to specification iteration. They have been in rough practice since 2015-01-21, increasingly strictly applied since then with consensus of issue discussion participants, and explicitly documented based on issue resolving and spec editing experience.


== see also ==

Latest revision as of 19:23, 8 February 2023

microformats2 is a simple, open format for marking up data in HTML. The microformats2 parsing specification describes how to implement a microformats2 parser, independent of any specific vocabularies.

Status
This is a Living Specification with several interoperable implementations. This specification is stable, subject to editorial changes only for improving clarity of existing meaning. While substantive changes are unexpected, it is a living specification subject to substantive change by issues and errata filed in response to implementation experience, requiring consensus among participating implementers (since 2015-01-21) as part of an explicit change control process. There are currently no draft or proposed new features in this specification, and if any were to be added, they would be explicitly labeled as such.
Note: This specification is only marked as a "Draft Specification" because of pending edits from resolved issues before 2016-06-20. Once those edits have been completed, the link to [[Category:Draft Specifications]] at the bottom of this document should be changed to [[Category:Specifications]].
Participate
Open Issues
Resolved issues before 2016-06-20
IRC: #microformats on Libera
Editor
Tantek Çelik
License
Per CC0, to the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work. In addition, as of 2024-11-28, the editors have made this specification available under the Open Web Foundation Agreement Version 1.0.

algorithm

parse a document for microformats

To parse a document for microformats, follow the HTML parsing rules and do the following:

  • start with an empty JSON "items" array and "rels" & "rel-urls" hashes:
{
 "items": [],
 "rels": {},
 "rel-urls": {}
}
  • parse the root element for class microformats, adding to the JSON items array accordingly
  • parse all hyperlink (<a> <area> <link>) elements for rel microformats, adding to the JSON rels & rel-urls hashes accordingly
  • return the resulting JSON

Parsers may simultaneously parse the document for both class and rel microformats (e.g. in a single tree traversal).

parse an element for class microformats

To parse an element for class microformats:

  • parse element class for root class name(s) "h-*" and if none, backcompat root classes
    • if none found, parse child elements for microformats (depth first, doc order)
    • else if found, start parsing a new microformat
      • keep track of whether the root class name(s) was from backcompat
      • create a new { } structure with:
        • type: [array of unique microformat "h-*" type(s) on the element sorted alphabetically],
        • properties: { } - to be filled in when that element itself is parsed for microformats properties
        • if the element has a non-empty id attribute:
          • id: string value of element's id attribute
      • parse child elements (document order) by:
        • if parsing a backcompat root, parse child element class name(s) for backcompat properties
        • else parse a child element class for property class name(s) "p-*,u-*,dt-*,e-*"
        • if such class(es) are found, it is a property element
          • add properties found to current microformat's properties: { } structure
        • parse a child element for microformats (recurse)
          • if that child element itself has a microformat ("h-*" or backcompat roots) and is a property element, add it into the array of values for that property as a { } structure, add to that { } structure:
            • value:
              • if it's a p-* property element, use the first p-name of the h-* child
              • else if it's an e-* property element, re-use its { } structure with existing value: inside.
              • else if it's a u-* property element and the h-* child has a u-url, use the first such u-url
              • else use the parsed property value per p-*,u-*,dt-* parsing respectively
          • else add found elements that are microformats to the "children" array
      • imply properties for the found microformat (see below)

The "*" for root (and property) class names consists of an optional vendor prefix (series of 1+ number or lowercase a-z characters i.e. [0-9a-z]+, followed by '-'), then one or more '-' separated lowercase a-z words.
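The class name grammar above can be sketched as a pair of regular expressions. This is only an illustrative sketch, not part of the specification: the names ROOT_RE, PROPERTY_RE, root_class_names and property_class_names are hypothetical, and the class attribute is split on whitespace with Python's str.split(), which is slightly more permissive than HTML's own rules.

```python
import re

# Optional vendor prefix ([0-9a-z]+ followed by '-'), then one or more
# '-'-separated lowercase a-z words, as described in the paragraph above.
_SUFFIX = r"(?:[0-9a-z]+-)?[a-z]+(?:-[a-z]+)*"
ROOT_RE = re.compile(rf"h-{_SUFFIX}")
PROPERTY_RE = re.compile(rf"(p|u|dt|e)-{_SUFFIX}")


def root_class_names(class_attr):
    """Return the unique h-* class names, sorted alphabetically."""
    return sorted({c for c in class_attr.split() if ROOT_RE.fullmatch(c)})


def property_class_names(class_attr):
    """Return (prefix, name) pairs for p-*, u-*, dt-* and e-* class names."""
    pairs = []
    for c in class_attr.split():
        m = PROPERTY_RE.fullmatch(c)
        if m:
            prefix = m.group(1)
            pairs.append((prefix, c[len(prefix) + 1:]))
    return pairs
```

For example, root_class_names("h-card vcard h-adr h-card") yields the "type" array for an element carrying both h-card and h-adr, while a class such as "h-Card" is not treated as a root class name.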

parse an element for properties

parsing a p- property

To parse an element for a p-x property value (whether explicit p-* or backcompat equivalent):

  • Parse the element for the value-class-pattern. If a value is found, return it.
  • If abbr.p-x[title] or link.p-x[title], return the title attribute.
  • else if data.p-x[value] or input.p-x[value], then return the value attribute
  • else if img.p-x[alt] or area.p-x[alt], then return the alt attribute
  • else return the textContent of the element after:
    • dropping any nested <script> & <style> elements;
    • replacing any nested <img> elements with their alt attribute, if present; otherwise their src attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative;
    • removing all leading/trailing spaces
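As a rough illustration of the steps above, the following sketch uses Python's xml.etree.ElementTree as a stand-in for a real HTML DOM, so it only works on well-formed markup. The function names text_with_img and parse_p and the base_url parameter are hypothetical. The value-class-pattern step is not implemented, and the sketch assumes that the "adding a space at the beginning and end" wording applies to the src replacement only.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def text_with_img(el, base_url):
    """textContent with <script>/<style> dropped and <img> replaced."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "img":
            if child.get("alt") is not None:
                parts.append(child.get("alt"))
            elif child.get("src") is not None:
                # Assumption: only the src replacement is space-padded.
                parts.append(" " + urljoin(base_url, child.get("src")) + " ")
        elif child.tag not in ("script", "style"):
            parts.append(text_with_img(child, base_url))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_p(el, base_url):
    """Return the p-* value of el (value-class-pattern not implemented)."""
    if el.tag in ("abbr", "link") and el.get("title") is not None:
        return el.get("title")
    if el.tag in ("data", "input") and el.get("value") is not None:
        return el.get("value")
    if el.tag in ("img", "area") and el.get("alt") is not None:
        return el.get("alt")
    return text_with_img(el, base_url).strip()
```

The same text extraction is what the "value" key of an e-* property uses, so a real parser would likely share one helper between the two.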

parsing a u- property

To parse an element for a u-x property value (whether explicit u-* or backcompat equivalent):

  • if a.u-x[href] or area.u-x[href] or link.u-x[href], then get the href attribute
  • else if img.u-x[src] return the result of "parse an img element for src and alt" (see Sec.1.5)
  • else if audio.u-x[src] or video.u-x[src] or source.u-x[src] or iframe.u-x[src], then get the src attribute
  • else if video.u-x[poster], then get the poster attribute
  • else if object.u-x[data], then get the data attribute
  • else parse the element for the value-class-pattern. If a value is found, get it
  • else if abbr.u-x[title], then get the title attribute
  • else if data.u-x[value] or input.u-x[value], then get the value attribute
  • else get the textContent of the element after removing all leading/trailing spaces and nested <script> & <style> elements
  • return the normalized absolute URL of the gotten value, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
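The attribute precedence above can be sketched as a small lookup table. This is a minimal sketch under several assumptions: xml.etree.ElementTree stands in for an HTML DOM (well-formed markup only), the names U_ATTRS, plain_text and parse_u are hypothetical, the caller is assumed to have already folded any <base> element into base_url, the value-class-pattern step is not implemented, and urllib.parse.urljoin is used as an approximation of "normalized absolute URL".

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# (element names, attribute) pairs, in the order the steps above check them.
U_ATTRS = [
    (("a", "area", "link"), "href"),
    (("audio", "video", "source", "iframe"), "src"),
    (("video",), "poster"),
    (("object",), "data"),
]


def plain_text(el):
    """textContent of el, skipping nested <script> and <style> elements."""
    parts = [el.text or ""]
    for child in el:
        if child.tag not in ("script", "style"):
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def parse_u(el, base_url):
    """Return the u-* value of el (value-class-pattern not implemented)."""
    if el.tag == "img" and el.get("src") is not None:
        # "parse an img element for src and alt"
        src = urljoin(base_url, el.get("src"))
        alt = el.get("alt")
        return src if alt is None else {"value": src, "alt": alt}
    value = None
    for tags, attr in U_ATTRS:
        if el.tag in tags and el.get(attr) is not None:
            value = el.get(attr)
            break
    if value is None:
        if el.tag == "abbr" and el.get("title") is not None:
            value = el.get("title")
        elif el.tag in ("data", "input") and el.get("value") is not None:
            value = el.get("value")
        else:
            value = plain_text(el).strip()
    return urljoin(base_url, value)
```

Keeping the precedence in a table makes the video case explicit: a video element with both src and poster yields its src, because the src row is checked first.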

parsing a dt- property

To parse an element for a dt-x property value (whether explicit dt-* or backcompat equivalent):

  • parse the element for the value-class-pattern, including the date and time parsing rules. If a value is found, then return it.
  • if time.dt-x[datetime] or ins.dt-x[datetime] or del.dt-x[datetime], then return the datetime attribute
  • else if abbr.dt-x[title], then return the title attribute
  • else if data.dt-x[value] or input.dt-x[value], then return the value attribute
  • else return the textContent of the element after removing all leading/trailing spaces and nested <script> & <style> elements.

parsing an e- property

To parse an element for an e-x property value (whether explicit "e-*" or backcompat equivalent):

  • return a dictionary with two keys:
    • html: the innerHTML of the element by using the HTML spec: Serializing HTML Fragments algorithm, with leading/trailing spaces removed. Proposed: and with normalized absolute URLs in all URL attributes, except those that are fragment-only, e.g. start with '#' (issue 38).
    • value: the textContent of the element after:
      • dropping any nested <script> & <style> elements;
      • replacing any nested <img> elements with their alt attribute, if present; otherwise their src attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative;
      • removing all leading/trailing spaces


parsing for implied properties

Imply properties only on explicit h-x class name root microformat element (no backcompat roots):

  • if no explicit "name" property, and no other p-* or e-* properties, and no nested microformats,
  • then imply by:
    • if img.h-x or area.h-x, then use its alt attribute for name
    • else if abbr.h-x[title] then use its title attribute for name
    • else if .h-x>img:only-child[alt]:not([alt=""]):not[.h-*] then use that img’s alt for name
    • else if .h-x>area:only-child[alt]:not([alt=""]):not[.h-*] then use that area’s alt for name
    • else if .h-x>abbr:only-child[title]:not([title=""]):not[.h-*] then use that abbr’s title for name
    • else if .h-x>:only-child:not[.h-*]>img:only-child[alt]:not([alt=""]):not[.h-*] then use that img’s alt for name
    • else if .h-x>:only-child:not[.h-*]>area:only-child[alt]:not([alt=""]):not[.h-*] then use that area’s alt for name
    • else if .h-x>:only-child:not[.h-*]>abbr:only-child[title]:not([title=""]):not[.h-*] then use that abbr’s title for name
    • else use the textContent of the .h-x for name after:
      • dropping any nested <script> & <style> elements;
      • replacing any nested <img> elements with their alt attribute, if present;
    • remove all leading/trailing spaces
  • if no explicit "photo" property, and no other explicit u-* (Proposed: change to: u-* or e-*) properties, and no nested microformats,
  • then imply by:
    • if img.h-x[src], then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo
    • else if object.h-x[data] then use data for photo
    • else if .h-x>img[src]:only-of-type:not[.h-*] then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo
    • else if .h-x>object[data]:only-of-type:not[.h-*] then use that object’s data for photo
    • else if .h-x>:only-child:not[.h-*]>img[src]:only-of-type:not[.h-*], then use the result of "parse an img element for src and alt" (see Sec.1.5) for photo
    • else if .h-x>:only-child:not[.h-*]>object[data]:only-of-type:not[.h-*], then use that object’s data for photo
    • if there is a gotten photo value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
  • if no explicit "url" property, and no other explicit u-* (Proposed: change to: u-* or e-*) properties, and no nested microformats,
  • then imply by:
    • if a.h-x[href] or area.h-x[href] then use that [href] for url
    • else if .h-x>a[href]:only-of-type:not[.h-*], then use that [href] for url
    • else if .h-x>area[href]:only-of-type:not[.h-*], then use that [href] for url
    • else if .h-x>:only-child:not[.h-*]>a[href]:only-of-type:not[.h-*], then use that [href] for url
    • else if .h-x>:only-child:not[.h-*]>area[href]:only-of-type:not[.h-*], then use that [href] for url
    • if there is a gotten url value, return the normalized absolute URL of it, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).

Note: The same markup for a property should not be causing that property to occur in both a microformat and one embedded inside - such a property should only be showing up on one of them. The parsing algorithm has details to prevent that, such as the :not[.h-*] tests above.
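To make the selector-style rules more concrete, here is a sketch of the implied "url" steps only. It assumes xml.etree.ElementTree as a stand-in for an HTML DOM (well-formed markup only), and it assumes the caller has already checked the preconditions (no explicit "url" property, no other explicit u-* properties, no nested microformats). The names has_h_class, only_of_type and implied_url are hypothetical.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


def has_h_class(el):
    """The :not[.h-*] test, inverted: any class name starting with "h-"."""
    return any(c.startswith("h-") for c in (el.get("class") or "").split())


def only_of_type(parent, tag):
    """Return the child with this tag if it is the only such child."""
    matches = [c for c in parent if c.tag == tag]
    return matches[0] if len(matches) == 1 else None


def implied_url(root, base_url):
    """Implied url for an h-x root, or None (preconditions not checked)."""
    if root.tag in ("a", "area") and root.get("href") is not None:
        return urljoin(base_url, root.get("href"))
    # First look at direct children, then inside a sole non-h-* child.
    parents = [root]
    children = list(root)
    if len(children) == 1 and not has_h_class(children[0]):
        parents.append(children[0])
    for parent in parents:
        for tag in ("a", "area"):
            el = only_of_type(parent, tag)
            if (el is not None and el.get("href") is not None
                    and not has_h_class(el)):
                return urljoin(base_url, el.get("href"))
    return None
```

The has_h_class check is what keeps a link that is itself a nested microformat from also supplying the parent's url, which is the situation the note above warns about.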

parse a hyperlink element for rel microformats

To parse a hyperlink element (e.g. a or link) for rel microformats: use the following algorithm or an algorithm that produces equivalent results:

  • if the "rel" attribute of the element is empty then exit
  • set url to the value of the "href" attribute of the element, normalized to be an absolute URL following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element if any).
  • treat the "rel" attribute of the element as a space-separated set of rel values
  • for each rel value (rel-value)
    • if there is no key rel-value in the rels hash then create it with an empty array as its value
    • if url is not in the array of the key rel-value in the rels hash then add url to the array
  • end for
  • if there is no key with name url in the top-level "rel-urls" hash then add a key with name url there, with an empty hash value
  • add keys to the hash of the key with name url for each of these attributes (if present) and key not already set:
    • "hreflang": the value of the "hreflang" attribute
    • "media": the value of the "media" attribute
    • "title": the value of the "title" attribute
    • "type": the value of the "type" attribute
    • "text": the text content of the element if any
  • if there is no "rels" key in that hash, add it with an empty array value
  • set the value of that "rels" key to an array of all unique items in the set of rel values unioned with the current array value of the "rels" key, sorted alphabetically.
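The steps above could be sketched with Python's standard html.parser module as follows. This is one possible arrangement, not a reference implementation: the class name RelParser and its base_url argument are hypothetical, the caller is assumed to have resolved any <base> element into base_url, urllib.parse.urljoin stands in for full URL normalization, and elements without an href attribute are skipped, which is an assumption the steps do not spell out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RelParser(HTMLParser):
    """Collect the "rels" and "rel-urls" hashes from hyperlink elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.result = {"items": [], "rels": {}, "rel-urls": {}}
        self._open = None  # (attrs, text parts) of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs), [])
        elif tag in ("area", "link"):
            self._record(dict(attrs), "")  # void elements have no text

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            attrs, parts = self._open
            self._open = None
            self._record(attrs, "".join(parts))

    def _record(self, attrs, text):
        rel_values = (attrs.get("rel") or "").split()
        if not rel_values or attrs.get("href") is None:
            return  # empty rel; missing href is skipped (assumption)
        url = urljoin(self.base_url, attrs["href"])
        for rel in rel_values:
            urls = self.result["rels"].setdefault(rel, [])
            if url not in urls:
                urls.append(url)
        info = self.result["rel-urls"].setdefault(url, {})
        for key in ("hreflang", "media", "title", "type"):
            if attrs.get(key) is not None and key not in info:
                info[key] = attrs[key]
        if text and "text" not in info:
            info["text"] = text
        info["rels"] = sorted(set(info.get("rels", [])) | set(rel_values))
```

Usage is to construct the parser with the document URL, call feed() with the markup, and read the result attribute. When the same URL appears in several links, the first link's attributes and text win and only the "rels" array grows.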

rel parse examples

Here are some examples to show how parsed rels may be reflected into the JSON (empty items key).

E.g. parsing this markup:

<a rel="author" href="http://example.com/a">author a</a>
<a rel="author" href="http://example.com/b">author b</a>
<a rel="in-reply-to" href="http://example.com/1">post 1</a>
<a rel="in-reply-to" href="http://example.com/2">post 2</a>
<a rel="alternate home"
   href="http://example.com/fr"
   media="handheld"
   hreflang="fr">French mobile homepage</a>

Would generate this JSON:

{
  "items": [],
  "rels": { 
    "author": [ "http://example.com/a", "http://example.com/b" ],
    "in-reply-to": [ "http://example.com/1", "http://example.com/2" ],
    "alternate": [ "http://example.com/fr" ], 
    "home": [ "http://example.com/fr" ] 
  },
  "rel-urls": {
    "http://example.com/a": {
      "rels": ["author"], 
      "text": "author a"
    },
    "http://example.com/b": {
      "rels": ["author"], 
      "text": "author b"
    },
    "http://example.com/1": {
      "rels": ["in-reply-to"], 
      "text": "post 1"
    },
    "http://example.com/2": {
      "rels": ["in-reply-to"], 
      "text": "post 2"
    },
    "http://example.com/fr": {
      "rels": ["alternate", "home"],
      "media": "handheld", 
      "hreflang": "fr", 
      "text": "French mobile homepage"
    }
  }
}


parse an img element for src and alt

To parse an img element for src and alt attributes:

  • if img[alt]
    • return a new {} structure with
      • value: the element's src attribute as a normalized absolute URL, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).
      • alt: the element's alt attribute
  • else
    • return the element's src attribute as a normalized absolute URL, following the containing document's language's rules for resolving relative URLs (e.g. in HTML, use the current URL context as determined by the page, and first <base> element, if any).


what do the CSS selector expressions mean

This section is non-normative.

Use SelectORacle to expand any of the above CSS selector expressions into longform English prose.

Exception:

  • :not[.h-*] is not a valid CSS selector but is used here to mean:
    • does not have any class names that start with "h-"

note HTML parsing rules

This section is non-normative.

microformats2 parsers are expected to follow HTML parsing rules, which includes for example:

'spaces' in the algorithm refers to HTML collapsible blank spaces of any kind, whether ' ' or linefeed, carriage return, etc.

note backward compatibility details

The parsing algorithm and details refer to "backcompat root classes" (backcompat roots for short) and "backcompat properties". These conditions and steps in the algorithm document how to parse pre-microformats2 microformats which all defined their own specific root class names and explicit sets of properties.

Some details to be aware of (these are explicit in the algorithm; this is just an informal summary):

  • If an element has one or more microformats2 root class name(s) (h-*)
    • all backcompat root class names are ignored on that element.
    • all backcompat properties, without an intervening root class name, are ignored inside that element
  • If an element has only a backcompat root class name (or names)
    • all microformats2 property class names (p-* u-* dt-* e-*), without an intervening element with root class name, are ignored inside that element
    • there is no implied property value parsing (p-name, u-url, u-photo) for that element

backward compatibility mappings

Note: several parser implementations have encoded backward compatible mappings into source and data files. Implementers of parsers may find these useful:

questions

See the FAQ:

implementations

Main article: microformats2#Implementations

There are open source microformats2 parsers available for JavaScript, Node.js, PHP, Ruby and Python.

test suite

See:

Ports to/for other languages encouraged.

change control

Minor editorial changes (e.g. fixing minor typos or punctuation) that do not change and preferably clarify the structure and existing intended meaning may be done by anyone without filing issues, requiring only a sufficient "Summary" description field entry for the edit. More than minor but still purely editorial changes may be made by an editor. Anyone may question such editorial changes by undoing corresponding edits without filing an issue. Any further reversion or iteration on such an editorial change must be done by filing an issue.

Per the stable status of this document, substantive issue filing, resolution, and edits are done with the following change control steps, which may nearly all be done asynchronously once an issue is filed to reach the required state of "Resolve by implementation verified rough consensus". All steps should be openly documented (e.g. on this wiki or GitHub issues) such that others may later verify the history of an issue, and all steps are encouraged to be announced on #microformats IRC with a link to the issue.

  • File an issue. Anyone may file a new issue and is encouraged to do so, with the caveat that per the stable status, only issues originating as a result of implementation experience are likely to have a substantive impact on this specification.
  • Propose a resolution. Anyone may propose resolutions to existing issues, and should encourage others in the community, especially implementers, to provide feedback. Proposed resolutions should include what specific text of the spec needs changing, preferably with replacement text, and test cases if applicable (e.g. a resolution could just document the current state more precisely without needing to provide any new test cases).
  • Provide feedback on proposed resolution(s). Anyone may provide feedback on proposed resolutions with their name attached, in summary (e.g. +1/0/-1 opinions), and additionally with reasoning (required for objections, optional otherwise), or optionally with suggested improvements, or counter-proposals (per "Propose a resolution" above).
  • Iterate to resolve objections if any. If there are any objections to a proposed resolution, proposer(s) and anyone agreeing should work to iterate on the proposal to resolve objections to the satisfaction (or at least withdrawal of objection) of anyone objecting. The more objections resolved the better, and incremental progress is forward progress.
  • Broaden implementer consensus. Proposer(s) and anyone agreeing should reach out (e.g. via #microformats IRC) to multiple implementers of the specification to get their opinions and feedback on proposal(s). The more implementers providing feedback the better. Iterate to resolve any new objections per "Iterate to resolve objections if any" above.
  • Encourage and get 1+ implementation(s). Encourage, get, and document 1+ implementation(s) of implementation affecting aspects of a proposed resolution, preferably with a test case if applicable.
  • Resolve by implementation verified rough consensus. Once there is rough consensus on a proposal (where rough consensus means either no objections, or at a minimum no objections by implementers, and explicit positive opinions by 2+ implementers) and a proposal's feasibility is verified by at least 1 implementation of aspects of the proposal that affect implementations (none needed if there are none, e.g. purely editorial), cite those in a "Resolution:" statement on the issue (e.g. in a comment), and explicitly share this resolution statement and link to issue in the #microformats IRC channel.
  • Edit specification. Normally the spec editor(s) will make edits per implementation verified rough consensus resolutions, however anyone (though especially issue discussion participants) may make a specification edit per a resolution if they are able to verify the citations that the resolution has achieved rough consensus, and has 1+ implementation(s) per "Encourage and get 1+ implementation(s)" above. Edits must contain a "Summary" field entry that at a minimum mentions the issue by name, should provide a URL to the issue resolution, and preferably be done by a spec editor or an implementer. Once the edit is made, the issue should be closed, or at least a comment made on the issue requesting that the opener of the issue close the issue.

These change control steps are inspired by the tradition of "Rough consensus and running code" as exhibited by example by IETF and W3C processes, and in that regard, seek to be a philosophically compatible approach to specification iteration. They have been in rough practice since 2015-01-21, increasingly strictly applied since then with consensus of issue discussion participants, and explicitly documented based on issue resolving and spec editing experience.

see also