XProc 2.0 – Go with the Flow

The new draft of XProc 2.0 won’t have much in common with version 1.0. A look over Norman Walsh’s shoulder between sessions at the last XML Amsterdam had already hinted at this. At this year’s XML Prague, the W3C Working Group revealed a slimmed-down version of XProc, as Alex Milowski announced in a tweet.

XProc was published as a W3C Recommendation entitled XProc: An XML Pipeline Language in May 2010. Enough time has passed since then to learn the strengths and weaknesses of the first version. Let’s start with the advantages of XProc.

The processing of XML files frequently involves multiple steps. For example, an XML file is validated with a schema and converted with XSLT to other output formats. Before XProc, some glue code was necessary to stick the steps together. This code usually required a specific system with preinstalled software to run.

In contrast, XProc provides a declarative vocabulary to specify XML-based data flows. You just have to meet the requirements of the XProc processor to run XProc pipelines. Basically, an XProc pipeline consists of input and output declarations and a set of steps. Below you can find a brief example that is taken from the W3C Recommendation and includes a validation and a subsequent XSLT transformation.

<?xml version="1.0" encoding="UTF-8"?>
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <p:choose>
    <p:when test="/*[@version &lt; 2.0]">
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="v1schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:when>

    <p:otherwise>
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="v2schema.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
    </p:otherwise>
  </p:choose>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="stylesheet.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>

In addition to the declarative vocabulary and interoperability, XProc provides mechanisms for reusing pipelines. It’s possible to import one XProc pipeline into another, so that the imported pipeline can be used as a single XProc step. Our getting started guide on http://transpect.io provides an example: it imports our docx2hub converter as an XProc module. XProc’s import mechanism enables developers to write modular applications and avoid redundancy.
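
As a minimal sketch of the import mechanism in XProc 1.0, assume a hypothetical file normalize.xpl that declares a step with the type ex:normalize. The file name, the namespace http://example.org/ns/steps and the step itself are placeholders for illustration and are not taken from transpect or docx2hub.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- normalize.xpl (hypothetical): the type attribute makes the step importable -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:ex="http://example.org/ns/steps"
                version="1.0" type="ex:normalize">
  <!-- a single input and a single output port are primary by default -->
  <p:input port="source"/>
  <p:output port="result"/>
  <!-- placeholder body: passes the document through unchanged -->
  <p:identity/>
</p:declare-step>
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- main pipeline: imports the file above and calls the step by its type -->
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:ex="http://example.org/ns/steps" version="1.0">
  <p:import href="normalize.xpl"/>

  <!-- used like a built-in step; primary ports are connected implicitly -->
  <ex:normalize/>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="stylesheet.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>
```

Because the imported step declares a type, the importing pipeline does not need to know anything about its internals beyond its ports and options.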

As mentioned above, XProc also has some disadvantages. First, XProc is hard for beginners to understand because of the implicit and explicit connections between input and output ports. Without modularization, the code also quickly becomes long and therefore difficult to follow. Troubleshooting is considered difficult as well, and obscure error messages from XProc processors make it even harder.

XProc is also constrained by its design. The original purpose of XProc was to process XML-based documents. In practice, however, publishing workflows include binary files such as images and videos. Extracting information from text-based formats such as CSS, YAML and CSV is a common task. Furthermore, JSON has become very important for exchanging data with web services.

There are workarounds that allow an XProc processor to ingest and emit such data, too. Binary files are commonly accompanied by XML metadata. CSS, YAML and CSV can be parsed with regular expressions, and extensions exist for JSON. Unfortunately, XProc 1.0 lacks a standard way to deal with these kinds of data.
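
One such workaround can be sketched in XProc 1.0 as follows. The p:data binding wraps a non-XML resource in a c:data element, so that a stylesheet can then split the text with regular expressions. The file names table.csv and csv2xml.xsl are hypothetical, and the stylesheet is assumed to do the actual parsing (for example with the XSLT 2.0 function tokenize()). Whether the content arrives as plain text or base64-encoded depends on the media type and the processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:output port="result"/>

  <p:xslt>
    <!-- the CSV text is wrapped in a c:data element before it reaches the step -->
    <p:input port="source">
      <p:data href="table.csv" content-type="text/csv"/>
    </p:input>
    <!-- hypothetical stylesheet that turns the wrapped text into XML rows and cells -->
    <p:input port="stylesheet">
      <p:document href="csv2xml.xsl"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>
</p:declare-step>
```

This works, but every project ends up with its own wrapper and parsing conventions, which is exactly the gap a standard mechanism would close.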

In this context, the next version should not only be easier to use but also serve as a general-purpose language for data flows that include different kinds of data. The proposal was presented by Alex Milowski at the Public XProc WG Meeting at XML Prague. The announced slimming-down of XProc includes a new text-based syntax which is no longer based on XML but shares some similarities with XQuery. Here is an example from the GitHub page of the Working Group, which implements the pipeline above in XProc 2.0. For deeper insight, you may want to look at the README and the example code on that page.

xproc version = "2.0";

inputs $source as document-node(); 1
outputs $result as document-node();

$source → {if (xs:decimal($1/*/@version) < 2.0)
        then [$1,"v1schema.xsd"] → validate-with-xml-schema() ≫ @1
        else [$1,"v2schema.xsd"] → validate-with-xml-schema() ≫ @1}
        → [$1,"stylesheet.xsl"] → xslt() 2
≫ $result 3

What stands out immediately is that the input and output ports are declared like variables 1. Steps such as xslt() 2 resemble a function invocation. The output is a variable which can be used 3 by other steps across the pipeline.

The proposal for XProc 2.0 also introduced the concept of block expressions, which can be described as inlined expressions between regular steps, used for small amounts of computation. Each block expression may be replaced by regular steps and shouldn’t be essential for a pipeline. But they will allow developers to write less code.

{if (xs:decimal($1/*/@version) < 2.0)
 then [$1 1,"v1schema.xsd"] → validate-with-xml-schema() ≫ @1 2
 else [$1,"v2schema.xsd"] → validate-with-xml-schema() ≫ @1}

Due to the new nature of the steps, ports are now ordered like function arguments. The variable reference $1 refers to the first input. The output binding operator (>> or U+226B) binds the output of the expression to the first output port @1.

The new arrow operator (-> or U+2192) is used to connect a linear sequence of steps. In the new terminology, such a sequence is called a step chain.

[source="document.xml", stylesheet="style.xs"] → xslt()

The example above shows that the proposal simplifies the declaration of ports, too. It is no longer necessary to distinguish between primary and secondary ports. In XProc 1.0, a port is considered primary if it is the only input or output port of a step or if it is explicitly marked with primary="true".

Originally, primary ports were intended as a convenience for pipeline authors, because it was not necessary to connect each input and output port of a step explicitly. In practice, however, these implicit port connections proved to be a real source of errors.
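
To illustrate the difference, here is a sketch of an XProc 1.0 step in which every connection is spelled out with p:pipe instead of relying on primary ports. The step names main and transform are arbitrary choices for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                name="main" version="1.0">
  <!-- with two input ports, none is primary unless it is marked explicitly -->
  <p:input port="source" primary="true"/>
  <p:input port="stylesheet"/>
  <p:output port="result" primary="true">
    <!-- explicit connection to the output of the XSLT step -->
    <p:pipe step="transform" port="result"/>
  </p:output>

  <p:xslt name="transform">
    <!-- explicit connections: each port names the step and port it reads from -->
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
    <p:input port="stylesheet">
      <p:pipe step="main" port="stylesheet"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>
</p:declare-step>
```

If the source binding and the p:pipe inside p:output were omitted, the processor would connect them implicitly through the primary ports. That is shorter to write, but the data flow is no longer visible in the code.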

One downside of XProc 1.0 was that you could only specify whether a port expects a single document or a sequence of documents. According to the draft of XProc 2.0, you are able to assign datatypes to options 1 and ports 2. The number of expected documents can be expressed with the common quantifiers ?, *, +.

step p:xslt(
    $initial-mode as xs:string?, 1
    $template-name as xs:string?,
    $output-base-uri as xs:string?,
    $parameters as map()? = (),
    $version as xs:string = "2.0"
  )
     inputs $source as document-node()+, 2
            $stylesheet as document-node()
     outputs $result as document-node()?,
             $secondary as document-node()*;
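
For comparison, the XProc 1.0 counterpart of this declaration looks roughly as follows. It follows the step signature given in the W3C Recommendation; the namespace declaration is added here only to make the fragment well-formed. Ports can only state sequence="true" or not, and the option types appear merely as comments.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" type="p:xslt">
  <!-- cardinality is limited to "one document" or "a sequence of documents" -->
  <p:input port="source" sequence="true" primary="true"/>
  <p:input port="stylesheet"/>
  <p:input port="parameters" kind="parameter"/>
  <p:output port="result" primary="true"/>
  <p:output port="secondary" sequence="true"/>
  <!-- option types are informal annotations, not part of the declaration -->
  <p:option name="initial-mode"/>     <!-- QName -->
  <p:option name="template-name"/>    <!-- QName -->
  <p:option name="output-base-uri"/>  <!-- anyURI -->
  <p:option name="version"/>          <!-- string -->
</p:declare-step>
```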

However, to many attendees the slimming of XProc looked more like starvation. Given the early stage of the draft, it’s not surprising that some questions remain open. For instance, the draft does not mention to what extent XPath would be part of the next version. Since XProc 2.0 claims to deal with non-XML markup as well, XPath would presumably be part of the XML steps only, even though it might be worth considering XPath as a query language for other tree-based formats.

The question remains how backwards compatibility will be achieved in XProc 2.0. Will an XProc 2.0 processor still be capable of running XProc 1.0 pipelines? Will it be feasible to use steps authored in version 1.0 in 2.0 pipelines? Will there be a migration tool? This issue is connected to the question of whether XProc 2.0 should be considered a successor to XProc 1.0 or something else.

Despite the improvements, the radical change of syntax drew criticism. Even if the new syntax is pretty straightforward to write, XML provides an established syntax with a broad ecosystem of technologies for transforming and analyzing the pipelines themselves, such as XSLT, RelaxNG and Schematron. Because XProc 1.0 is XML-based, it’s easier to process pipelines with tools other than the XProc processor. For example, we use an XProc pipeline called transpect-doc that orchestrates XSLT steps to generate documentation for complex XProc projects.

No doubt the new syntax is clean and easy to learn, but an alternative XML syntax would be highly appreciated. XProc 2.0 could follow RelaxNG, which provides both a compact and an XML-based syntax. Two compatible syntaxes could meet the conflicting requirements of authoring and processing.

An XML-based syntax need not be as verbose as XProc 1.0. The use of XPath would allow users to express business logic with very compact expressions. A fictional XML-based XProc 2.0 could look like the sample code below. XPath is used to declare the input as a conditional expression 2 and the datatype declaration is borrowed from XSLT 1. The select attribute can be used as shorthand for p:document, p:empty, p:pipe or p:data 3. A possible drawback is that flow logic is wrapped in XPath expressions, which makes it harder to parse the pipeline.

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="2.0">
  
  <p:input  port="source" as="document-node()"/> 1
  <p:output port="result" as="document-node()"/>
  
  <p:validate-with-xml-schema name="validate"
       select="$source, if(xs:decimal($source/*/@version) &lt; 2.0) 
                        then doc('v1schema.xsd')
                        else doc('v2schema.xsd')"> 2
  </p:validate-with-xml-schema>
  
  <p:xslt name="stylesheet">
    <p:input port="source"     select="p:pipe('validate')"/> 3
    <p:input port="stylesheet" select="doc('stylesheet.xsl')"/>
    <p:input port="parameters" select="p:empty()"/>
  </p:xslt>
  
</p:declare-step>

Another approach would be to use XProc 1.0 as the alternative XML syntax for XProc 2.0. New steps could be declared as extension steps in a specific namespace. The same namespace would also apply to new attributes (e.g. to add data types). Block expressions and other structures which include subpipelines could be expressed with the corresponding XProc 1.0 compound and multi-container steps. Following the concept of 2.0, each port connection would be declared explicitly.
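
A rough sketch of this idea, under the assumption that data types are added as attributes from a separate namespace: the prefix x2, its namespace URI http://example.org/ns/xproc2 and the attribute x2:as are invented for this example and do not appear in any draft. XProc 1.0 permits attributes from foreign namespaces as extension attributes, so the pipeline itself remains valid XProc 1.0.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:x2="http://example.org/ns/xproc2"
                name="main" version="1.0">
  <!-- x2:as is a hypothetical extension attribute carrying a 2.0-style type -->
  <p:input port="source" primary="true" x2:as="document-node()"/>
  <p:input port="schema" sequence="true" x2:as="document-node()+"/>
  <p:output port="result" x2:as="document-node()">
    <p:pipe step="validate" port="result"/>
  </p:output>

  <!-- every connection is explicit, following the concept of 2.0 -->
  <p:validate-with-xml-schema name="validate">
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
    <p:input port="schema">
      <p:pipe step="main" port="schema"/>
    </p:input>
  </p:validate-with-xml-schema>
</p:declare-step>
```

A 2.0-aware processor could check the declared types, while the pipeline stays accessible to XSLT, Schematron and other XML tooling.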

Despite the open questions regarding an alternative XML syntax and XPath, many improvements have been introduced. The new syntax is straightforward and focuses more on the data flow than on steps and their implicit and explicit port connections. Step chains, block expressions and output variables make writing pipelines easier. Data types for ports and variables are another improvement, providing more control over the data flow. Furthermore, a W3C working group was founded to gather additional use cases and to provide a forum for discussion. With all this in mind, I am looking forward to future iterations of XProc (or whatever it will be called).
