DITA pre-processing architecture

This document describes the different steps in preprocessing for DITA topics.

The DITA Toolkit implements a multi-stage, map-driven architecture to process DITA content. Each step in the process examines some or all of the content; some steps result in temporary files used by later steps, while others result in updated copies of the DITA content. Most of the processing takes place in a temporary working directory (the source files themselves are never modified).

Most of this process is common to all output formats, and is known as the "preprocess" stage. In general, any DITA process begins with this common set of pre-processing routines, after which processing specific to the output format (such as PDF or XHTML) takes over. The stages of the pre-processing pipeline are described below, after the following note on processing order.

A note on processing order

The order of the steps below is significant. Although the DITA specification does not mandate a specific order for processing, experience with the toolkit over time has shown that the current order best meets user expectations. Switching the order of processing may give different results.

For example, if conref is evaluated before filtering, it is possible to reuse content that will later be filtered out of its original location. However, we have found that filtering first provides several benefits. For example, the following <note> element uses conref, but also contains a product attribute:
<note conref="documentA.dita#doc/note" product="MyProd"></note>

If the conref attribute is evaluated first, then documentA must be parsed in order to retrieve the note content. That content is then stored in the current document (or in a representation of that document in memory). However, if all content with product="MyProd" is filtered out, then that work is all discarded later in the build.

If filtering is done first, as in the toolkit, this element is discarded immediately, and documentA is never examined. This provides several important benefits:
  • Time is saved simply by discarding unused content as early as possible; all future steps can load the document without this extra content.
  • More significant time is saved in this case by not evaluating the conref attribute; in fact, documentA does not even need to be parsed.
  • Any user reproducing this build does not need documentA. If the content is sent to a translation team, that team can reproduce an error-free build without documentA; this means documentA can be kept back from translation, preventing accidental translation and increased costs.
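The effect of filtering before conref resolution can be sketched in a few lines of Python. This is only an illustration, not the toolkit's code: the `SOURCES` table, `load`, `filter_out`, and `conref_targets` are hypothetical names, and the filter matches a single attribute value rather than a full DITAVAL rule set.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "file system"; documentA.dita is deliberately absent.
SOURCES = {
    "topic.dita": (
        '<topic id="t"><body>'
        '<note conref="documentA.dita#doc/note" product="MyProd"/>'
        '<p>Kept paragraph</p>'
        '</body></topic>'
    ),
}

parsed = []  # records every file that gets parsed


def load(name):
    """Parse a file from SOURCES; raises KeyError if the file is missing."""
    parsed.append(name)
    return ET.fromstring(SOURCES[name])


def filter_out(root, attr, excluded):
    """Remove every element whose `attr` value is in `excluded` (simplified)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get(attr) in excluded:
                parent.remove(child)


def conref_targets(root):
    """List the files that conref resolution would still need to open."""
    return [el.get("conref").split("#")[0]
            for el in root.iter() if el.get("conref")]


topic = load("topic.dita")
filter_out(topic, "product", {"MyProd"})   # filtering runs first
needed = conref_targets(topic)             # nothing left to resolve
```

Because the note is removed before any conref is examined, `needed` is empty and the missing documentA.dita never causes an error.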

gen-list: generate lists for use by later steps

The gen-list step examines the input files and creates lists of topics, images, or other content. These lists are used by later steps in the pipeline. For example, one list includes all topics that make use of the conref attribute; only those files are processed during the conref stage of the build. This step is implemented in Ant and Java.

The result of this step is a set of several list files in the temporary directory, including dita.list and dita.xml.properties. A non-exhaustive selection of the lists is given in the table below.

Table 1. Sample lists included in the dita.list file

Property name            Usage
conreflist               Documents that contain conref attributes that need to be resolved during preprocessing.
fullditamapandtopiclist  All of the ditamap and topic files that are referenced during the transformation. These may be referenced by href or conref attributes.
fullditamaplist          All of the ditamap files in dita.list.
hrefditatopiclist        All of the topic files that are referenced with an href attribute.
fullditatopiclist        All of the topic files in dita.list.
imagelist                Image files that are referenced in the content.
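As a rough illustration of how a later step might consume these lists, the sketch below parses a small excerpt and selects only the topics that need conref processing. It assumes a simple key=value layout with comma-separated file names; the sample content and the `parse_list_file` helper are hypothetical, and the real file may use escaping that this sketch ignores.

```python
def parse_list_file(text):
    """Parse key=value lines; each value is split on commas into a file list."""
    lists = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        lists[key.strip()] = [v for v in value.strip().split(",") if v]
    return lists


# Hypothetical excerpt; file names are invented for the example.
sample = """\
# sample excerpt of dita.list
conreflist=topics/a.dita
fullditamaplist=main.ditamap
fullditatopiclist=topics/a.dita,topics/b.dita
imagelist=images/logo.png
"""

lists = parse_list_file(sample)

# Only topics named in conreflist need to go through the conref stage.
to_resolve = [t for t in lists["fullditatopiclist"] if t in lists["conreflist"]]
```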

Debug and filter

This stage processes all referenced DITA content, and creates copies in a temporary directory for use during the remainder of the build. This step is implemented in Java.

As the files are copied, the following modifications are made:
  • The files are filtered according to entries in any specified DITAVAL file.
  • Debug information is inserted into each element (using the xtrf and xtrc attributes). These values allow messages later in the build to reliably indicate the original source of the error; for example, the fifth <ph> element in a specific file.
  • Column names in tables are adjusted to use a common naming scheme. This is done only to simplify later conref processing; for example, if a table row is pulled into another table, this ensures that a reference to "column 5" will continue to work in the fifth column of the new table.
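The debug annotation can be pictured with the following sketch. The xtrf and xtrc names come from the description above, but the value format used here (element name plus a running count) is an assumption made for illustration, as is the `add_debug_attributes` helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def add_debug_attributes(root, source_file):
    """Tag every element with its source file and a per-name counter."""
    counts = Counter()
    for el in root.iter():
        counts[el.tag] += 1
        el.set("xtrf", source_file)                   # where the element came from
        el.set("xtrc", f"{el.tag}:{counts[el.tag]}")  # assumed format: name:count


root = ET.fromstring(
    "<topic id='t'><body><p><ph>a</ph> and <ph>b</ph></p></body></topic>"
)
add_debug_attributes(root, "src/topic.dita")
second_ph = root.findall(".//ph")[1]
```

With values like these, a later message can point at "the second ph element in src/topic.dita" even after the content has been copied or merged elsewhere.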

Copy related files

This step copies related non-DITA resources to the output directory, including referenced HTML files and images referenced by DITAVAL files.

Conref push

This step resolves "conref push" references. Conref push is a new feature in the DITA 1.2 specification. This step only processes documents that use conref push (or that are updated due to the push action). The step is implemented in Java.

Resolve conref

This step resolves traditional conref attributes. The step only runs on documents that use the conref attribute. It is implemented in XSLT.
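The toolkit performs this step in XSLT; the Python sketch below only illustrates the idea of replacing a referencing element's content with the content of its target. The `DOCS` table and both helper functions are hypothetical, and the sketch copies text and child elements only, ignoring attribute merging and other rules a full implementation would need.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical store of already-parsed documents.
DOCS = {
    "documentA.dita": ET.fromstring(
        '<topic id="doc"><body><note id="note">Reusable note</note></body></topic>'
    ),
}


def find_target(conref):
    """Look up a 'file#topicid/elementid' reference in DOCS."""
    file_name, _, fragment = conref.partition("#")
    topic_id, _, element_id = fragment.partition("/")
    root = DOCS[file_name]
    if root.get("id") != topic_id:
        raise LookupError(conref)
    for el in root.iter():
        if el.get("id") == element_id:
            return el
    raise LookupError(conref)


def resolve_conrefs(root):
    """Replace the content of each conref element with its target's content."""
    for el in list(root.iter()):
        conref = el.attrib.pop("conref", None)
        if conref is None:
            continue
        target = find_target(conref)
        el.text = target.text
        el[:] = [copy.deepcopy(child) for child in target]


topic = ET.fromstring(
    '<topic id="t"><body><note conref="documentA.dita#doc/note"/></body></topic>'
)
resolve_conrefs(topic)
note = topic.find("body/note")
```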

Move metadata entries

This step pushes metadata back and forth between maps and topics. For example, index entries and copyrights in the map are pushed into affected topics, so that topics may be processed later in isolation while retaining all relevant metadata. This step is implemented in Java.

Keyref

Keyref is perhaps the most significant new feature in the DITA 1.2 standard. This step examines all keys defined in the source material, and updates content appropriately. Links that make use of keys are updated so that any href value is replaced by the appropriate target; key-based text replacement is also evaluated. This step is implemented in Java.
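A minimal sketch of both behaviors (href replacement and key-based text replacement) follows. It is not the toolkit's Java implementation: `collect_keys` and `resolve_keyrefs` are hypothetical helpers, the sample map and topic are invented, and scoping across nested maps is ignored.

```python
import xml.etree.ElementTree as ET


def collect_keys(map_root):
    """Map each key name to the element that defines it."""
    keys = {}
    for ref in map_root.iter():
        for name in (ref.get("keys") or "").split():
            keys.setdefault(name, ref)  # keep the first definition found
    return keys


def resolve_keyrefs(topic_root, keys):
    """Fill in href values and replacement text for elements that use keyref."""
    for el in topic_root.iter():
        definition = keys.get(el.get("keyref"))
        if definition is None:
            continue
        if definition.get("href"):
            el.set("href", definition.get("href"))
        keyword = definition.find("topicmeta/keywords/keyword")
        is_empty = not (el.text or "").strip() and len(el) == 0
        if keyword is not None and is_empty:
            el.text = keyword.text  # key-based text replacement


ditamap = ET.fromstring(
    '<map>'
    '<keydef keys="install" href="topics/install.dita"/>'
    '<keydef keys="product"><topicmeta><keywords>'
    '<keyword>Acme Widget</keyword>'
    '</keywords></topicmeta></keydef>'
    '</map>'
)
topic = ET.fromstring(
    '<topic id="t"><body><p><xref keyref="install"/> for '
    '<ph keyref="product"/></p></body></topic>'
)
resolve_keyrefs(topic, collect_keys(ditamap))
xref = topic.find(".//xref")
ph = topic.find(".//ph")
```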

Coderef

The coderef element was added in DITA 1.2, and allows references to non-DITA code for use in a codeblock element. This step reads the referenced document, and adds the content to the referencing DITA topic. This step is implemented in Java.
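The following sketch shows the general idea of replacing a coderef with the text of the file it points to. The `FILES` table stands in for the file system, and `resolve_coderefs` is a hypothetical helper; character encoding and path resolution are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for files on disk.
FILES = {"samples/hello.py": 'print("hello")\n'}


def resolve_coderefs(root, read_text):
    """Replace each coderef inside a codeblock with the referenced text."""
    for block in list(root.iter("codeblock")):
        for ref in list(block):
            if ref.tag != "coderef":
                continue
            code = read_text(ref.get("href")) + (ref.tail or "")
            index = list(block).index(ref)
            if index == 0:
                # No earlier sibling: the code joins the block's leading text.
                block.text = (block.text or "") + code
            else:
                # Otherwise it follows the previous sibling element.
                previous = block[index - 1]
                previous.tail = (previous.tail or "") + code
            block.remove(ref)


topic = ET.fromstring(
    '<topic id="t"><body><codeblock>'
    '<coderef href="samples/hello.py"/>'
    '</codeblock></body></topic>'
)
resolve_coderefs(topic, FILES.__getitem__)
block = topic.find(".//codeblock")
```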

Mapref

The mapref step resolves references to other maps, such as
<topicref href="other.ditamap" format="ditamap"></topicref>

This step is implemented in XSLT.
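Conceptually, resolving a map reference means replacing the referencing topicref with the contents of the referenced map. The sketch below illustrates that in Python rather than XSLT; the `MAPS` table and `resolve_maprefs` are hypothetical, all maps are assumed to sit in one directory (so href values need no rebasing), and circular references are not detected.

```python
import xml.etree.ElementTree as ET

# Hypothetical maps, keyed by file name.
MAPS = {
    "main.ditamap": (
        '<map><topicref href="intro.dita"/>'
        '<topicref href="other.ditamap" format="ditamap"/></map>'
    ),
    "other.ditamap": '<map><topicref href="a.dita"/><topicref href="b.dita"/></map>',
}


def load_map(href):
    return ET.fromstring(MAPS[href])


def resolve_maprefs(map_root, load):
    """Inline the children of every referenced map in place of the reference."""
    for parent in list(map_root.iter()):
        new_children = []
        for child in parent:
            if child.get("format") == "ditamap":
                sub_map = load(child.get("href"))
                resolve_maprefs(sub_map, load)   # handle nested references
                new_children.extend(sub_map)     # children of the sub-map
            else:
                new_children.append(child)
        parent[:] = new_children


main = load_map("main.ditamap")
resolve_maprefs(main, load_map)
hrefs = [ref.get("href") for ref in main.iter("topicref")]
```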

Mappull

The mappull step pulls content from referenced topics into maps. For example, by default, the navigation title on a topicref is replaced by a title from the referenced topic. Short descriptions and link text are also established during this stage. Finally, cascading attributes are made explicit; for example, if a topicref does not set the toc attribute while its parent topicref sets toc="no", that toc="no" value is explicitly set for the child. This step is implemented in XSLT.
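The two behaviors described above, pulling titles and making cascaded attributes explicit, can be sketched as follows. The `TITLES` lookup stands in for reading each referenced topic, and `pull` is a hypothetical helper; only the toc attribute is cascaded in this simplified version.

```python
import xml.etree.ElementTree as ET

# Hypothetical titles, as if read from the referenced topics.
TITLES = {"install.dita": "Installing", "config.dita": "Configuring"}


def pull(ref, inherited_toc=None):
    """Set navtitle from the target topic and make the toc value explicit."""
    toc = ref.get("toc", inherited_toc)
    if toc is not None:
        ref.set("toc", toc)  # an inherited value becomes explicit
    href = ref.get("href")
    if href in TITLES:
        ref.set("navtitle", TITLES[href])
    for child in ref.findall("topicref"):
        pull(child, toc)


ditamap = ET.fromstring(
    '<map><topicref href="install.dita" toc="no">'
    '<topicref href="config.dita"/>'
    '</topicref></map>'
)
for top_level in ditamap.findall("topicref"):
    pull(top_level)
child = ditamap.find("topicref/topicref")
```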

Chunk

The chunk step breaks apart and assembles referenced DITA content based on the chunk attribute in maps. For example, the chunk attribute may be used to assemble all referenced topics into a single document; alternatively, it may do the opposite and break one document with multiple topics into several individual documents. This step is implemented in Java.

Maplink and move-links

These steps work together to add links to topics based on the location of those topics in a map. This is where family links (such as parent, child, next, and previous) are created, as well as links based on relationship tables. The maplink step (implemented in XSLT) generates all of the links, and places them in one large file in the temporary directory. Next, the move-links step (implemented in Java) pushes all of the generated links into the proper topics.
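As an illustration of how family links follow from map position, the sketch below derives parent, child, next, and previous links from a small map. It covers only the link generation half of the work, and it treats every group of siblings as an ordered sequence; a real implementation would likely consult map attributes before doing so. The `family_links` helper and the sample map are hypothetical, and relationship tables are not covered.

```python
import xml.etree.ElementTree as ET


def family_links(map_root):
    """Return {href: {link role: target href(s)}} based on map position."""
    links = {}

    def visit(parent_href, refs):
        for i, ref in enumerate(refs):
            href = ref.get("href")
            entry = links.setdefault(href, {})
            if parent_href:
                entry["parent"] = parent_href
            if i > 0:
                entry["previous"] = refs[i - 1].get("href")
            if i + 1 < len(refs):
                entry["next"] = refs[i + 1].get("href")
            children = ref.findall("topicref")
            if children:
                entry["child"] = [c.get("href") for c in children]
            visit(href, children)

    visit(None, map_root.findall("topicref"))
    return links


ditamap = ET.fromstring(
    '<map><topicref href="parent.dita">'
    '<topicref href="a.dita"/><topicref href="b.dita"/>'
    '</topicref></map>'
)
links = family_links(ditamap)
```

A second pass, corresponding to move-links, would then write each entry of `links` into the topic it belongs to.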

Topicpull

The topicpull step is similar to the mappull step, except that it runs on topics. This is used to pull in titles for <xref> and <link> elements that do not already specify link text. The step is implemented in XSLT.
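A small sketch of this behavior is shown below, again in Python for illustration rather than XSLT. The `TOPICS` table and `pull_link_text` helper are hypothetical; the sketch uses the target topic's title as link text and leaves any existing link text untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical target topics, keyed by file name.
TOPICS = {"install.dita": "<topic id='install'><title>Installing</title></topic>"}


def load_topic(name):
    return ET.fromstring(TOPICS[name])


def pull_link_text(root, load):
    """Fill in text for xref and link elements that have none."""
    for el in list(root.iter()):
        if el.tag not in ("xref", "link"):
            continue
        if (el.text or "").strip() or len(el):
            continue  # link text already specified
        href = el.get("href")
        if not href:
            continue
        title = load(href.split("#")[0]).findtext("title")
        if not title:
            continue
        if el.tag == "xref":
            el.text = title
        else:
            # For link elements the text is held in a linktext child.
            ET.SubElement(el, "linktext").text = title


topic = ET.fromstring(
    '<topic id="t"><body><p>See <xref href="install.dita"/> or '
    '<xref href="install.dita">the guide</xref>.</p></body>'
    '<related-links><link href="install.dita#install"/></related-links></topic>'
)
pull_link_text(topic, load_topic)
xrefs = topic.findall(".//xref")
```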