Shell automated scripts

From ioChem-BD wiki

JUMBO-Converters are modules that transform inputs into outputs, usually 1:1 such as Foo2CML and CML2Foo.

Primary site: [1]

Running the converters

This page is concerned with the philosophy and design of the converters. If you are just interested in running them, please go to the Tutorials and problems page.

Overview of the parsing philosophy

The approach adopted by the parsers is to break the monolithic text block of the logfile into a series of separate chunks, each of which encapsulates a coherent piece of data.

There may be many repeated chunks within a log file. For example, if a chunk is an SCF calculation, for a single-point energy calculation there would just be a single SCF chunk, whereas for a geometry optimisation calculation, there would be as many SCF chunks as there were SCF calculations.

Chunks are often nested, so using the geometry optimisation example, a single geometry optimisation step would itself be a chunk, and this in turn would contain (one or more) SCF chunks. There would then be as many geometry optimisation step chunks as there were geometry optimisation steps.
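As a rough illustration of the nested-chunk idea (this is not the JUMBO implementation), the Python sketch below splits a made-up optimisation log into geometry-step chunks and then splits each step into SCF chunks. The log text, the marker strings and the split_chunks helper are all invented for the example.

```python
import re


def split_chunks(lines, start_pattern):
    """Group lines into chunks, opening a new chunk at each line that matches start_pattern.

    Lines before the first match are ignored in this simplified sketch.
    """
    start = re.compile(start_pattern)
    chunks = []
    for line in lines:
        if start.match(line):
            chunks.append([line])
        elif chunks:
            chunks[-1].append(line)
    return chunks


# Hypothetical log text; real log files look different.
LOG = """\
Geometry step 1
SCF start
energy -1.0
SCF start
energy -1.1
Geometry step 2
SCF start
energy -1.2
"""

# Outer chunks: one per geometry optimisation step.
steps = split_chunks(LOG.splitlines(), r"Geometry step")
# Inner chunks: the SCF calculations nested inside each step.
nested = [split_chunks(step[1:], r"SCF start") for step in steps]
scf_counts = [len(scf_chunks) for scf_chunks in nested]
print(scf_counts)
```

Each geometry step ends up holding as many SCF chunks as there were SCF calculations in that step, which mirrors the nesting described above.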

For a more detailed explanation, please see the pages on ["Chunkers"] and ["Block"], and also the older ["How converters work"].

Parsing is currently a multi-stage process. The parser reads the log file and converts it into "raw" XML. This process splits the file up into modules, each module corresponding to a chunk in the file. Within each module the text is preserved, so there is no loss of data from the log file; additional structure and annotation have just been added.

Each module is then parsed separately. The module may be parsed into a number of sub-modules or have data extracted into records.

The process of parsing a module into a record is the only process that removes text from the log file, but this is also the process of marking up data, so again nothing should be lost.

At the end of the parsing process we have a raw XML file that contains all of the information from the original log file, separated up into modules and with known quantities marked up with XML.

The terms in the raw XML are defined in the code-specific dictionary, which describes what each quantity is, its units, etc.

This raw XML is then transformed into CML in a second step, where the quantities in the code-specific dictionary are mapped onto either CML or domain-specific dictionaries, and additional annotations or properties can be added (e.g. bond lengths could be calculated).

Actual implementation

Jumbo converters are written in Java, although the template parsing technology is described entirely in XML, so that once a new parser module has been created, only XML files need to be edited in order to extend and develop the parser.

The reference parser for computational chemistry is the NWChem parser, so any examples will refer to it.

The Java class that controls the two-stage parsing for NWChem is NWChemLog2CompchemConverter.java.

The first stage (controlled by the NWChemLog2XMLConverter.java class) uses the topTemplate.xml file to include the various XML templates that parse the different chunks of the logfile.

The second stage (controlled by the NWChemLogXML2CompchemConverter.java class) uses the transforms in nwchem2compchem.xml to manipulate the raw XML into a convention-compliant form.

See ["Declarative parsing syntax"] for a complete list of the rules followed by the parsers and their relations to the template XML files.

Parsing a module with a template

The structure of a typical template is shown below with comments to explain the various sections.

<!-- The template is contained in an XML element, with the behaviour controlled by various attributes of the form ATTRIBUTE="VALUE". See the 'Template Attributes' section below for more information -->
<template id="foo" pattern="…">

<!-- The templates contain their own unit-testing framework in the comments. See the 'Unit Testing Framework' section below for more information -->
<comment class="example.input" id="foo">
EXAMPLE LOGFILE TEXT
</comment>

<!-- Templates can themselves include other templates using a templateList. Only templates, or include directives to include other templates, should be in a templateList -->
<templateList xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="basis.summary.xml"/>
</templateList>

<!-- The record is the mechanism to extract text into XML. See the 'Records' section below for further details -->
<record id="iter" repeat="*">\s*{I,compchem:iterationIndex}\s+{F,compchem:totalEnergy}\s+{E,n:gnorm}\s+{E,n:gmax}\s+{F,compchem:wallTime}</record>

<!-- The XML elements created with the records can be manipulated with transforms. See the 'Transforming the raw XML' section below for more information -->
<transform process="addUnits" xpath=".//cml:scalar[@dictRef='compchem:totalEnergy']" value="nonsi:hartree" />

<!-- This is part of the unit testing framework, and contains the marked-up text that should be created from the EXAMPLE LOGFILE TEXT above. See the 'Unit Testing Framework' section below for more information -->
<comment class="example.output" id="foo">
PARSED OUTPUT
</comment>

</template>

Template Attributes

The possible ATTRIBUTES on a template are:

  • id - this should be a unique identifier. The text that is parsed by the template will be extracted into a cml module with a templateRef (in the cmlx namespace) of the id. In other words the text parsed by the template with id="foo" will end up in a module as shown below:
<module xmlns="http://www.xml-cml.org/schema" xmlns:cmlx="http://www.xml-cml.org/schema/cmlx" cmlx:templateRef="foo">

PARSED TEXT

</module>
  • name - used to give a name to the template and is currently unused.
  • repeat - the number of times that a template will be matched within the file. If repeat="1" then the template will only be matched once, regardless of how many times the pattern is matched in the file. repeat="*" means the template will be matched as many times as the pattern is matched.
  • newline - the character that is used to indicate a new line in the regular expression used in the pattern or endPattern. The default is the dollar character, i.e. newline="$"
  • pattern, pattern2, pattern3… - the regular expression used to trigger this template to start parsing text. The pattern may extend over more than one line if the newline character (see above) features in the expression. For example, pattern="\Number One$\s*The Larch\s*" would only match the line "Number One" if it were followed by "The Larch". Multiple patterns can be specified using the attributes pattern2, pattern3...
  • endPattern, endPattern2, endPattern3… - as for pattern (see above), but this matches where the template stops parsing. If endPattern="~" and the end of the text that this template is parsing is reached, the entire text will be included in the template. If the endPattern is anything other than "~" and the end of the text is reached without a match, then no text will be included in the template and the entire text will be available for matching by another template within the parent.
  • offset, offset2, offset3… - the number of lines either side of the match to include within the template. With the default of "0", all the text from (and including) the first matched line is included in the template. An offset of "-2" includes the two lines before the match; an offset of "3" excludes the first line of the match and the two lines following it. If no offset is specified (or only offset is specified) the offset will apply to all matches (i.e. pattern, pattern2, pattern3…). If, for example, offset2 is specified, then this is the offset that will be applied when pattern2 is matched.
  • endOffset, endOffset2, endOffset3… - the number of lines to include in the template after the endPattern match. With the default of "0", the line matched by endPattern is NOT included, and this line is pushed into the containing template, where it may be matched by the pattern of another template. An endOffset of "2" includes the endPattern line and the one after. An endOffset of "-1" excludes the line preceding the match as well.
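To make the interplay of pattern, newline and offset more concrete, here is a small Python sketch. It is only a model of the behaviour described above, not the converter code: find_pattern and chunk_start are hypothetical helpers, and the assumption that each piece of the pattern must match a whole line is mine.

```python
import re


def find_pattern(lines, pattern, newline="$"):
    """Return the index of the first line where a multi-line pattern matches, or -1.

    The pattern is split on the newline character; each piece must match one
    line of a run of consecutive lines (whole-line matching is an assumption).
    """
    pieces = [re.compile(piece) for piece in pattern.split(newline)]
    for i in range(len(lines) - len(pieces) + 1):
        if all(rx.fullmatch(lines[i + k]) for k, rx in enumerate(pieces)):
            return i
    return -1


def chunk_start(hit, offset=0):
    """First line included in the template: the matched line shifted by offset."""
    return max(0, hit + offset)


lines = ["Number One", "The Oak", "Number One", "  The Larch"]
# A two-line pattern in the spirit of the example above, using "$" as the newline character.
hit = find_pattern(lines, r"\s*Number One$\s*The Larch\s*")
print(hit, chunk_start(hit, -2), chunk_start(hit, 3))
```

The first "Number One" is skipped because the line after it is not "The Larch"; a negative offset pulls earlier lines into the template and a positive offset skips lines from the match onwards.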

Records

Records are the machinery used to extract text from a file and mark it up into XML.

A record is an XML element that can have a number of attributes (see below) and may contain a string written in a simple regular-expression-like language, which determines what will be extracted and how it will be marked up.

Unlike the templates, where each template is tried in turn against each line of the file, records are processed sequentially. Each record is processed in turn until it fails, at which point the next record is processed until all records in the module have been processed.

An empty record (such as <record repeat="2"/>) can be used to "gobble" lines (which are discarded).

If the record has content, then the text of the line is parsed into a CML list with a templateRef as specified by the id of the record.

A simple example to read the XYZ format geometry printed in an NWChem output is shown below. The text that is to be parsed is:

 XYZ format geometry
 -------------------
 11
 geometry
 fe 0.00000000 0.00000000 0.00000000
 c 0.00000000 0.00000000 1.80680057
 o 0.77109980 -2.87778364 0.00000000

The records to parse this are:

<!-- Read 2 lines. The record has no content, so the lines are discarded. -->
<record repeat="2"/>

<!-- Read a line with a single integer. The integer will be placed in a CML scalar with the dictRef "compchem:numAtoms". The scalar will itself be within a CML list with the templateRef of "atoms". -->
<record id="atoms">\s*{I,compchem:numAtoms}\s*</record>

<!-- Read a line with a single character string. The string will be placed in a CML scalar with the dictRef "n:geomtype". The scalar will itself be within a CML list with the templateRef of "geo". -->
<record id="geo">\s*{A,n:geomtype}\s*</record>

<!-- Keep reading lines while they contain a character string, followed by 3 floats. Make an array of all matching variables. The arrays will be held in a CML list with the templateRef of "mol". -->
<record makeArray="true" repeat="*" id="mol">\s*{A,compchem:elementType}\s*{F,compchem:x3}\s*{F,compchem:y3}\s*{F,compchem:z3}\s*</record>

The result of this parsing is as follows:

<list cmlx:templateRef="atoms">
  <scalar dataType="xsd:integer" dictRef="compchem:numAtoms">11</scalar>
</list>
<list cmlx:templateRef="geo">
  <scalar dataType="xsd:string" dictRef="n:geomtype">geometry</scalar>
</list>
<list cmlx:lineCount="3" cmlx:templateRef="mol">
  <array dataType="xsd:string" dictRef="compchem:elementType" size="3">fe c o</array>
  <array dataType="xsd:double" dictRef="compchem:x3" size="3">0.0 0.0 0.7710998</array>
  <array dataType="xsd:double" dictRef="compchem:y3" size="3">0.0 0.0 -2.87778364</array>
  <array dataType="xsd:double" dictRef="compchem:z3" size="3">0.0 1.80680057 0.0</array>
</list>
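One way to picture what a repeating record with makeArray="true" does is to translate the record string into an ordinary regular expression and collect one column per field. The Python sketch below does this for the "mol" record; it is an illustration only. The regex chosen for each field code (I, F, A) and the helper names are assumptions, not the definitions used by the converters, and other codes such as E are not handled.

```python
import re

# Assumed regex and Python type for each field code; JUMBO's own definitions may differ.
FIELD_TYPES = {
    "I": (r"[-+]?\d+", int),
    "F": (r"[-+]?\d+\.\d+", float),
    "A": (r"\S+", str),
}
FIELD = re.compile(r"\{([A-Z]),([^}]+)\}")


def compile_record(record):
    """Translate a record string into a compiled regex plus its (code, dictRef) fields."""
    fields = [(m.group(1), m.group(2)) for m in FIELD.finditer(record)]
    pattern = FIELD.sub(lambda m: "(" + FIELD_TYPES[m.group(1)][0] + ")", record)
    return re.compile(pattern), fields


def parse_repeating(record, lines):
    """Apply one record to consecutive lines until it fails, collecting one column per field."""
    regex, fields = compile_record(record)
    columns = {ref: [] for _, ref in fields}
    consumed = 0
    for line in lines:
        match = regex.fullmatch(line)
        if match is None:
            break  # the record has failed; the next record would take over here
        for (code, ref), text in zip(fields, match.groups()):
            columns[ref].append(FIELD_TYPES[code][1](text))
        consumed += 1
    return columns, consumed


MOL_RECORD = r"\s*{A,compchem:elementType}\s*{F,compchem:x3}\s*{F,compchem:y3}\s*{F,compchem:z3}\s*"
LINES = [
    " fe 0.00000000 0.00000000 0.00000000",
    " c 0.00000000 0.00000000 1.80680057",
    " o 0.77109980 -2.87778364 0.00000000",
    " Some other text",
]
columns, consumed = parse_repeating(MOL_RECORD, LINES)
print(consumed, columns["compchem:elementType"])
```

The columns correspond to the four arrays in the "mol" list: one entry per matched line, grouped by dictRef.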

Unit Testing Framework

The templates contain their own internal testing framework, in the form of one or more pairs of comment blocks within them.

A comment block with the class attribute "example.input" should contain a small representative chunk of text that the parsers can be tested with. The id attribute is used to match the example input with the representative output that should be produced when the template acts on the sample text.

An input comment is shown below:

<comment class="example.input" id="l601.fermi">
 Isotropic Fermi Contact Couplings
 Atom a.u. MegaHertz Gauss 10(-4) cm-1
 1 C(13) 0.02539 28.54777 10.18656 9.52251
 2 C(13) 0.00582 6.54434 2.33518 2.18296
 13 Cl(35) 0.05688 24.94015 8.89927 8.31914
 --------------------------------------------------------
 Center ---- Spin Dipole Couplings ----
 3XX-RR 3YY-RR 3ZZ-RR
 --------------------------------------------------------
 1 Atom 0.005300 -0.061839 0.056540
 2 Atom -0.039723 -0.068059 0.107782
 13 Atom 0.621221 -2.038530 1.417309
 --------------------------------------------------------
 XY XZ YZ
 --------------------------------------------------------
 1 Atom 0.000010 0.095387 0.000013
 2 Atom 0.005157 0.081893 0.006262
 13 Atom 0.000344 3.043747 0.000390
 --------------------------------------------------------
</comment>


The matching example.output comment is below:

<comment class="example.output" id="l601.fermi">
<module cmlx:templateRef="l601.fermi" xmlns="http://www.xml-cml.org/schema" xmlns:cmlx="http://www.xml-cml.org/schema/cmlx">
  <list cmlx:lineCount="3" cmlx:templateRef="fermi.atom">
    <array dataType="xsd:integer" dictRef="cc:serial" size="3">1 2 13</array>
    <array dataType="xsd:string" dictRef="x:elementType" size="3">C C Cl</array>
    <array dataType="xsd:integer" dictRef="x:isotopeNumber" size="3">13 13 35</array>
    <array dataType="xsd:double" dictRef="cc:coupling" size="3">0.02539 0.00582 0.05688</array>
    <array dataType="xsd:double" dictRef="cc:coupling" size="3">28.54777 6.54434 24.94015</array>
    <array dataType="xsd:double" dictRef="cc:coupling" size="3">10.18656 2.33518 8.89927</array>
    <array dataType="xsd:double" dictRef="cc:coupling" size="3">9.52251 2.18296 8.31914</array>
  </list>
  <list cmlx:lineCount="3" cmlx:templateRef="fermi.spindipole">
    <array dataType="xsd:integer" dictRef="cc:serial" size="3">1 2 13</array>
    <array dataType="xsd:double" dictRef="g:spindipole.xx" size="3">0.0053 -0.039723 0.621221</array>
    <array dataType="xsd:double" dictRef="g:spindipole.yy" size="3">-0.061839 -0.068059 -2.03853</array>
    <array dataType="xsd:double" dictRef="g:spindipole.zz" size="3">0.05654 0.107782 1.417309</array>
  </list>
  <list cmlx:lineCount="3" cmlx:templateRef="fermi.spindipole">
    <array dataType="xsd:integer" dictRef="cc:serial" size="3">1 2 13</array>
    <array dataType="xsd:double" dictRef="g:spindipole.xy" size="3">1.0E-5 0.005157 3.44E-4</array>
    <array dataType="xsd:double" dictRef="g:spindipole.xz" size="3">0.095387 0.081893 3.043747</array>
    <array dataType="xsd:double" dictRef="g:spindipole.yz" size="3">1.3E-5 0.006262 3.9E-4</array>
  </list>
</module>
</comment>

It is possible for the templates to contain multiple examples, provided that each pair has matching id attributes. In this case, each matching pair will be tested in turn and all must pass for the unit test to be successful.
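The pairing of example.input and example.output comments by id can be sketched in a few lines of Python. This is a simplified model rather than the code in the test classes: the example_pairs helper is hypothetical, and the expected output is held as plain text here, whereas a real template holds XML elements inside the output comment.

```python
import xml.etree.ElementTree as ET

# A cut-down, hypothetical template with two example pairs.
TEMPLATE = """\
<template id="foo">
  <comment class="example.input" id="foo">EXAMPLE LOGFILE TEXT</comment>
  <comment class="example.output" id="foo">PARSED OUTPUT</comment>
  <comment class="example.input" id="bar">SECOND EXAMPLE</comment>
  <comment class="example.output" id="bar">SECOND OUTPUT</comment>
</template>
"""


def example_pairs(template_xml):
    """Map each example id to its (input text, expected output text)."""
    root = ET.fromstring(template_xml)
    inputs, outputs = {}, {}
    by_class = {"example.input": inputs, "example.output": outputs}
    for comment in root.iter("comment"):
        target = by_class.get(comment.get("class"))
        if target is not None:
            target[comment.get("id")] = (comment.text or "").strip()
    # Every input needs an output with the same id, and vice versa.
    unpaired = set(inputs) ^ set(outputs)
    if unpaired:
        raise ValueError("unpaired examples: %s" % sorted(unpaired))
    return {key: (inputs[key], outputs[key]) for key in inputs}


pairs = example_pairs(TEMPLATE)
print(sorted(pairs))
```

A test runner built on this idea would parse each input with the template and compare the result against the paired output, failing the template if any pair differs.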

For the NWChem logfile templates, the code that runs these tests lives in the file:

TemplateUnitTests.java

To test and develop an individual template (using the xyz template as an example), the following line needs to be added to the TemplateTest.java file.

@Test
public void testXyz() {
    runTemplateTest("xyz");
}

The individual test can be run from within Eclipse, but from the command line it only appears possible to run all of the TemplateTests (see note below), using the following command from within the jumboconverters-compchem/jc-compchem-nwchem directory:

mvn -Dtest="log.TemplateUnitTests" test

If you are developing a template, the first time this is run, it will fail. However, it will print out the output of running the test, and something like the following:

==============template===================

Error: template expected:<3> but was:<4> XMLDIFF reference


test---------------------

<?xml version="1.0" encoding="UTF-8"?>
<module cmlx:templateRef="xyz" xmlns="http://www.xml-cml.org/schema" xmlns:cmlx="http://www.xml-cml.org/schema/cmlx">
  <list cmlx:lineCount="3" cmlx:templateRef="fermi.atom">
    <array dataType="xsd:integer" dictRef="cc:serial" size="3">1 2 13</array>

The chunk of text after the ------------test--------------------- line, and excluding the <?xml version="1.0" encoding="UTF-8"?> line, is the output of the test. This should be checked and, if correct, placed in the <comment class="example.output" id="xyz"> tag in the template. Re-running the test should then lead to a successful result.

Note: The discussion on Stack Overflow and the Maven documentation suggest that the following syntax should work:

mvn -Dtest="log.TemplateTest#testXyz" test

But this appears not to be the case. Are we using JUnit < 4.7?

Transforming the raw XML

As has been mentioned, the parsing is a two-stage process, consisting of marking up the file with XML and then converting the raw XML to valid CML. In some cases the raw XML may already be valid CML, but in most cases transforms will need to be applied.

The transforms can either be applied within the template, after the text has been parsed and marked up, or as an entirely separate step, once the whole file has been parsed.

The transformation process relies heavily on the powerful XPath language. A short tutorial on xpath can be found here.

The philosophy of the transforms is very similar to the idea of templates in xslt, using the idea of "nodeset" to which operations are applied.

The transforms are carried out by elements like the following:

<transform process="addAttribute" xpath="./cml:module[@cmlx:templateRef='job']" name="id" value="job" />

In this case, the attribute id="job" will be added to all cml modules that are direct children of the document, and have the templateRef "job".

Each transform has a process attribute, which defines the operation that will be carried out. Almost all have an xpath attribute, an XPath expression indicating the elements the process will be applied to (the nodeset), plus a variable number of further arguments, depending on the process being carried out.
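The "select a nodeset, then apply an operation to each node" idea behind the addAttribute example can be mimicked in Python with the standard library. The sketch below is an analogy, not the transform engine: add_attribute is a hypothetical helper that hard-codes the selection made by ./cml:module[@cmlx:templateRef='job'], and the raw XML is invented.

```python
import xml.etree.ElementTree as ET

CML = "http://www.xml-cml.org/schema"
CMLX = "http://www.xml-cml.org/schema/cmlx"

# Hypothetical raw XML with three child modules, two of them from the "job" template.
RAW = """\
<module xmlns="http://www.xml-cml.org/schema"
        xmlns:cmlx="http://www.xml-cml.org/schema/cmlx">
  <module cmlx:templateRef="job"/>
  <module cmlx:templateRef="other"/>
  <module cmlx:templateRef="job"/>
</module>
"""


def add_attribute(context, template_ref, name, value):
    """Add name=value to the direct child modules of context with the given templateRef."""
    changed = 0
    for child in context.findall("{%s}module" % CML):
        if child.get("{%s}templateRef" % CMLX) == template_ref:
            child.set(name, value)
            changed += 1
    return changed


root = ET.fromstring(RAW)
count = add_attribute(root, "job", "id", "job")
ids = [child.get("id") for child in root]
print(count, ids)
```

Only the modules in the nodeset are touched; the module with a different templateRef is left without an id.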

A brief overview of the key transformations follows below. For those with a strong constitution, more comprehensive documentation can be found by examining the code in the file TransformElement.java.

The text from ~ line 160, starting with the comment // process values lists the processes that are available.

Various miscellaneous notes will be added in the section below, which will be merged into the documentation in due course.

Key Transforms

  • addAttribute - add an attribute of type name and value value to all nodes in the xpath nodeset. If value is a string of the form "$string(XPATH)" or $number(XPATH), where XPATH is a valid XPATH, then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by xpath, in string or number form. If name consists of two strings separated by a colon, SOMETHING EXCITING HAPPENS...

<transform process="addAttribute" xpath=".//cml:molecule" name="formalCharge" value="$number(.//cml:scalar[@dictRef='g:charge'])" />

  • addChild - this will create a child element of the nodes specified by the xpath. The only required argument is elementName, which specifies the type of element to create. Other supported arguments are: id, dictRef and value. The position argument specifies where the child will be created in the list of children. position="0" creates it as the first child, "1", the second etc. With no position argument, the child is added as the last child. If value is a string of the form "$string(XPATH)" or $number(XPATH), where XPATH is a valid XPATH, then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by xpath, in string or number form.

<transform process="addChild" xpath="." elementName="cml:module" id="jobList1" position="0" dictRef="cc:jobList" />


  • addDictRef - this will add a dictRef attribute with the specified value to the nodes defined by xpath.

<transform process="addDictRef" xpath="//cml:property[cml:module[@cmlx:templateRef='l601.popanal']]" value="cc:popanal"/>

  • addId - this adds an id with the value specified by the value argument to the nodeset specified in the xpath.

<transform process="addId" value="mol9999" xpath=".//cml:molecule[starts-with(@id,'a')]" />

  • addMap - for every node in the nodeset specified by xpath, this creates a cml:map with the specified id that links the values of the nodes in the from nodeset to that in the to nodeset.

<transform process="addMap" xpath="." id="variableConstantMap" from=".//cml:scalar[@dictRef='g:variable' or @dictRef='g:const']" to=".//cml:scalar[@dictRef='g:value']" />

  • addNamespace - this will add a namespace element of the form xmlns:name="value" to every element in the nodeset returned by xpath.

<transform process="addNamespace" xpath="." name="convention" value="http://www.xml-cml.org/convention/" />


  • addSibling - this will add a sibling element to each node in the xpath nodeset, with the type of element being that specified in elementName and the element's id attribute as specified by the id argument. The position argument indicates where the element will be created: "0" creates the element before the current node, "1" creates it after the current node. If there are multiple siblings to the current node, "-2" would create it 2 nodes down from the current node, "2", one up from it etc. If value is a string of the form "$string(XPATH)" or $number(XPATH), where XPATH is a valid XPATH, then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by xpath, in string or number form.

<transform process="addSibling" xpath="./cml:module[@id='calculation']/cml:module[@cmlx:templateRef='l202.rotconst']" elementName="cml:module" id="l202.group" position="1" />

  • addUnits - this will add a units attribute to the element with the value specified in the value argument. The value should be of the form namespace:id, where namespace refers to one of the units dictionaries and the id points to the actual unit. In the example below, the namespace refers to the non-si unit dictionary and the id links to the entry for the hartree.

<transform process="addUnits" xpath=".//cml:scalar[@dictRef='compchem:total_energy']" value="nonsi:hartree" />

  • copy - this copies the nodes defined by xpath to the xpath defined by the to argument, which is relative to the element being copied. e.g. if to is ".", then the element and its children will be copied to become children of itself. If the element has an id attribute, this will have the string ".copy" appended to it, if not, an id of "n.copy" will be created, where n is the index of the node in the original xpath.

<transform process="copy" xpath="(//cml:list[@cmlx:templateRef='l914_excit2'])[1]" to="."/>

  • createAngle - TODO

<transform process="createAngle" xpath=".//cml:list/cml:list[cml:atom]" atomRefs="$string(cml:scalar[3]) $string(cml:scalar[1]) $string(cml:atom/@id)" value="$string(cml:scalar[4])" />

  • createArray - this will create a cml:array at each of the nodes in the xpath query from the cml:scalar nodes generated by the from xpath query. If only one node is supplied, the contents of the node will be separated by whitespace and the array created from these. Arrays can only be created for integer or double data types. The scalar nodes will then be discarded.

<transform process="createArray" xpath="." from="./cml:list[@cmlx:templateRef='length']/cml:scalar[@dictRef='g:symbol']"/>

  • createAtom - TODO

<transform process="createAtom" xpath=".//cml:scalar[@dictRef='cc:elementType']" />

  • createDate - TODO

<transform process="createDate" xpath=".//cml:list[@dictRef='g:archive1']/cml:scalar[9]" format="dd-MMM-yyyy" dictRef="cc:date"/>

  • createDouble - TODO

<transform process="createDouble" xpath=".//cml:list[@dictRef='g:archive.namevalue']/cml:scalar[@dictRef='x:HF']" dictRef="cc:hfenergy" />

  • createFormula - TODO

<transform process="createFormula" xpath=".//cml:list[@dictRef='g:archive1']/cml:scalar[7]"/>

  • createLength - TODO

<transform process="createLength" xpath=".//cml:list/cml:list[cml:atom]" atomRefs="$string(cml:scalar[1]) $string(cml:atom/@id)" value="$string(cml:scalar[2])"/>

  • createList - this will take a list of nodes from the xpath and, if they are cml modules, it will convert them to cml lists.

<transform process="createList" xpath=".//cml:module[@cmlx:templateRef='multipole']"/>

  • createMatrix - TODO

<transform process="createMatrix33" xpath="." dictRef="g:axis" from=".//cml:scalar[contains(@dictRef,':x.') or contains(@dictRef,':y.') or contains(@dictRef,':z.')]" />

  • createMatrix33 - TODO

<transform process="createMatrix33" xpath="." dictRef="g:axis" from=".//cml:scalar[contains(@dictRef,':x.') or contains(@dictRef,':y.') or contains(@dictRef,':z.')]" />

  • createMolecule - this will create a molecule from the list of cml:arrays generated by the 'xpath' query. The length of the arrays indicates the number of atoms, and the dictRef attribute of each array determines the property of the atom it will be used for. Supported types are: x3, y3, z3 for the coordinates, id, elementType, label and atomTypeRef. The molecule will be created as a child of the parent of the first array, and the arrays will then be discarded. The gaussian template l202.orient.xml is shown below as an example.

<template id="l202.orient" name="input or standard orientation" repeat="*" pattern="\s*(Input|Standard)\s*orientation:\s*$\s*\-+\s*" endPattern="\s*\d.*$\s*\-+\s*" endOffset="2">

<comment class="example.input" id="l202.orient">
 Input orientation:
 ---------------------------------------------------------------------
 Center Atomic Atomic Coordinates (Angstroms)
 Number Number Type X Y Z
 ---------------------------------------------------------------------
 1 6 0 0.000000 0.000000 0.000000
 2 1 0 0.000000 0.000000 1.093266
 3 1 0 1.030741 0.000000 -0.364422
 4 1 0 -0.515370 -0.892648 -0.364422
 5 1 0 -0.515371 0.892648 -0.364422
 ---------------------------------------------------------------------
</comment>

<record repeat="5"/>
<record repeat="*" makeArray="true" id="atom">{I,cc:serial}{I,cc:elementType}{I,g:atomicType}{F,cc:x3}{F,cc:y3}{F,cc:z3}</record>
<record/>

<transform process="createMolecule" xpath="./cml:list[@cmlx:templateRef='atom']/cml:array" id="mol.l202.orient"/>
<transform process="pullupSingleton" xpath="./cml:list"/>

<comment class="example.output" id="l202.orient">
<module cmlx:templateRef="l202.orient" xmlns="http://www.xml-cml.org/schema" xmlns:cmlx="http://www.xml-cml.org/schema/cmlx">
  <molecule id="mol.l202.orient" cmlx:templateRef="atom">
    <atomArray>
      <atom id="a1" elementType="C" x3="0.0" y3="0.0" z3="0.0">
        <scalar dataType="xsd:integer" dictRef="cc:serial">1</scalar>
        <scalar dataType="xsd:integer" dictRef="g:atomicType">0</scalar>
        <scalar dataType="xsd:integer" dictRef="cc:atomicNumber">6</scalar>
      </atom>
      <atom id="a2" elementType="H" x3="0.0" y3="0.0" z3="1.093266">
        <scalar dataType="xsd:integer" dictRef="cc:serial">2</scalar>
        <scalar dataType="xsd:integer" dictRef="g:atomicType">0</scalar>
        <scalar dataType="xsd:integer" dictRef="cc:atomicNumber">1</scalar>
      </atom>
      <atom id="a3" elementType="H" x3="1.030741" y3="0.0" z3="-0.364422">
        <scalar dataType="xsd:integer" dictRef="cc:serial">3</scalar>
        <scalar dataType="xsd:integer" dictRef="g:atomicType">0</scalar>
        <scalar dataType="xsd:integer" dictRef="cc:atomicNumber">1</scalar>
      </atom>
      <atom id="a4" elementType="H" x3="-0.51537" y3="-0.892648" z3="-0.364422">
        <scalar dataType="xsd:integer" dictRef="cc:serial">4</scalar>
        <scalar dataType="xsd:integer" dictRef="g:atomicType">0</scalar>
        <scalar dataType="xsd:integer" dictRef="cc:atomicNumber">1</scalar>
      </atom>
      <atom id="a5" elementType="H" x3="-0.515371" y3="0.892648" z3="-0.364422">
        <scalar dataType="xsd:integer" dictRef="cc:serial">5</scalar>
        <scalar dataType="xsd:integer" dictRef="g:atomicType">0</scalar>
        <scalar dataType="xsd:integer" dictRef="cc:atomicNumber">1</scalar>
      </atom>
    </atomArray>
    <formula formalCharge="0" concise="C 1 H 4">
      <atomArray elementType="C H" count="1.0 4.0"/>
    </formula>
    <bondArray>
      <bond atomRefs2="a1 a2" id="a1_a2" order="S"/>
      <bond atomRefs2="a1 a3" id="a1_a3" order="S"/>
      <bond atomRefs2="a1 a4" id="a1_a4" order="S"/>
      <bond atomRefs2="a1 a5" id="a1_a5" order="S"/>
    </bondArray>
    <property dictRef="cml:molmass">
      <scalar dataType="xsd:double" units="unit:dalton" xmlns:unit="http://www.xml-cml.org/unit/si/">16.04246</scalar>
    </property>
  </molecule>
</module>
</comment>
</template>
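The core of createMolecule, turning parallel arrays into one atom per index, can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions: create_molecule is a hypothetical helper, namespaces are omitted, element symbols are given directly, and the extra steps visible in the example output above (atomic number to element symbol, formula, bonds, molecular mass) are not attempted.

```python
import xml.etree.ElementTree as ET


def create_molecule(arrays, mol_id):
    """Build molecule/atomArray/atom elements from parallel column arrays.

    arrays maps an atom property name (elementType, x3, y3, z3) to a
    whitespace-separated string, as held in a cml:array.
    """
    columns = {name: text.split() for name, text in arrays.items()}
    sizes = {len(values) for values in columns.values()}
    if len(sizes) != 1:
        raise ValueError("arrays must all have the same length")
    molecule = ET.Element("molecule", {"id": mol_id})
    atom_array = ET.SubElement(molecule, "atomArray")
    for i in range(sizes.pop()):
        # Atom ids follow the a1, a2, ... scheme seen in the example output.
        atom = ET.SubElement(atom_array, "atom", {"id": "a%d" % (i + 1)})
        for name, values in columns.items():
            atom.set(name, values[i])
    return molecule


mol = create_molecule(
    {"elementType": "C H", "x3": "0.0 0.0", "y3": "0.0 0.0", "z3": "0.0 1.093266"},
    "mol.l202.orient",
)
atoms = mol.findall("atomArray/atom")
print(ET.tostring(mol, encoding="unicode"))
```

The length of the arrays fixes the number of atoms, and each array name decides which atom attribute its values populate.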

  • createNameValue - TODO

<transform process="createNameValue" xpath="./cml:list/cml:list" name=".//cml:scalar[@dictRef='x:name']" value=".//cml:scalar[@dictRef='x:value']" />

  • createString - if xpath returns a list of arrays, then each array will be converted into a cml:scalar with a dataType xsd:string, with the value of the scalar being the values in the array, concatenated as strings and separated by whitespace. If xpath returns a list of cml:scalars, the first scalar will be converted to type xsd:string, the value of which will be the concatenation of all the values in the remaining scalar nodes. The remaining scalar nodes will then be deleted. If a single node is returned by xpath and it is a text node, then a new cml:scalar node will be created in its place with an optional id attribute as specified in the id argument.

<transform process="createString" xpath="./cml:list/cml:scalar"/>

  • createTable - TODO

<transform process="createTable" xpath=".//cml:list[@cmlx:templateRef='symmadapt']" />

  • createTorsion - TODO

<transform process="createTorsion" xpath=".//cml:list/cml:list[cml:atom]" atomRefs="$string(cml:scalar[5]) $string(cml:scalar[3]) $string(cml:scalar[1]) $string(cml:atom/@id)" value="$string(cml:scalar[6])" />

  • createVector3 - for each node specified in the xpath this will take the nodes selected by the from argument and create a cml:vector from them, and give it the specified dictRef. The from argument must return 3 cml:scalar nodes for this to work.

<transform process="createVector3" xpath="." dictRef="g:coupling.ten" from="./cml:list/cml:list/cml:scalar[contains(@dictRef,'.a.t') or contains(@dictRef,'.b.t') or contains(@dictRef,'.c.t')]" />

  • createWrapper - for each node in the xpath nodeset, this will create an enveloping element of type elementName that will become the child of the node's parent, and hold the node and all of its children. id and dictRef arguments are supported.

<transform process="createWrapper" xpath=".//cml:module/text()" elementName="UNPARSED"/>

  • createWrapperMetadata - for each node in the xpath nodelist, if the node is one of a cml:scalar, cml:array, cml:list, cml:table or cml:matrix, and has a "dictRef" attribute, it will remove the dictRef attribute and instead wrap the element in a cml:metadata element, so that, e.g.

<scalar dataType="xsd:string" dictRef="n:basis_type">ao basis</scalar>

becomes:

<metadata name="n:basis_type">
  <scalar dataType="xsd:string">ao basis</scalar>
</metadata>

<transform process="createWrapperMetadata" xpath=".//cml:scalar[@dictRef='cc:version' or @dictRef='cc:date' or @dictRef='cc:title']"/>
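The wrapping step can be modelled in a few lines of Python. This is a sketch of the behaviour described for createWrapperMetadata, not the converter code: wrap_in_metadata is a hypothetical helper, and namespaces are left out for brevity.

```python
import xml.etree.ElementTree as ET


def wrap_in_metadata(parent, index):
    """Wrap parent[index] in a metadata element named after its dictRef."""
    element = parent[index]
    dict_ref = element.attrib.pop("dictRef", None)
    if dict_ref is None:
        return False  # nothing to do for elements without a dictRef
    wrapper = ET.Element("metadata", {"name": dict_ref})
    # Keep any trailing whitespace with the wrapper rather than the wrapped element.
    wrapper.tail, element.tail = element.tail, None
    wrapper.append(element)
    parent[index] = wrapper
    return True


root = ET.fromstring(
    '<module><scalar dataType="xsd:string" dictRef="n:basis_type">ao basis</scalar></module>'
)
wrapped = wrap_in_metadata(root, 0)
print(ET.tostring(root, encoding="unicode"))
```

The dictRef moves from the scalar onto the name of the new wrapper, and the scalar keeps its remaining attributes and its text.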

  • createWrapperParameter - this performs the same operation as createWrapperMetadata, but wraps the element in a cml:parameter with the dictRef of the target element.

<transform process="createWrapperParameter" xpath=".//cml:scalar[@dictRef='cc:hostname' or @dictRef='cc:jobname' or @dictRef='cc:method' or @dictRef='cc:basis' ]"/>

  • createWrapperProperty - this performs the same operation as createWrapperMetadata, but wraps the element in a cml:property with the dictRef of the target element.

<transform process="createWrapperProperty" xpath=".//*[@dictRef='cc:electronicstate' or @dictRef='cc:hfenergy' or @dictRef='cc:dipole' or @dictRef='cc:dipolederiv' or @dictRef='cc:polarizability' or @dictRef='cc:pointgroup' or @dictRef='cc:rmsd' or @dictRef='cc:rmsf']"/>

  • createZMatrix - TODO

<transform process="createZMatrix" xpath="." id="zinitial"/>

  • delete - this will delete the list of nodes defined by the xpath, along with all of their child nodes.

<transform process="delete" xpath="(//cml:list[@cmlx:templateRef='l914_excit2'])[1]"/>

  • debugNodes - this just prints out the nodes selected by the xpath and is only useful for developing and debugging the transforms.

<transform process="debugNodes" xpath=".//cml:module[not(cml:array)]"/>

  • groupSiblings - TODO

<transform process="groupSiblings" xpath=".//cml:module[@id='l202.group']" />

  • joinArrays - with a single xpath argument, this will take the first array in the nodelist and join all the others to it, deleting the other arrays and leaving a single array with the dictRef of the original array. With an additional key argument, SOMETHING ELSE HAPPENS. With an additional from argument, SOMETHING ELSE HAPPENS.

<transform process="joinArrays" xpath=".//cml:list[@cmlx:templateRef='atom']/cml:array" />
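As a rough illustration of the single-xpath case, the Python sketch below joins sibling arrays with the standard-library ElementTree. It is not the JUMBO code. It assumes that the arrays share one parent, that values are whitespace-delimited, and that a size attribute should be updated to the new length; the function name join_arrays is hypothetical, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

def join_arrays(parent, arrays):
    """Append the values of arrays[1:] to arrays[0], then delete the others."""
    first, rest = arrays[0], arrays[1:]
    values = (first.text or "").split()
    for other in rest:
        values.extend((other.text or "").split())
        parent.remove(other)
    first.text = " ".join(values)
    first.set("size", str(len(values)))  # assumption: size tracks the value count
    return first

atom_list = ET.fromstring(
    '<list><array dictRef="x:a" size="2">1 2</array>'
    '<array dictRef="x:b" size="1">3</array></list>'
)
joined = join_arrays(atom_list, atom_list.findall("array"))
print(ET.tostring(atom_list, encoding="unicode"))
```

The surviving array keeps the dictRef of the first array, as described above.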

  • move - this takes one or more nodes and moves them into the node defined by the to argument. The to argument is an xpath that must return a single element. The position argument indicates where among the children of the target the element will be placed: "1" makes it the first child, "2" the second, and so on.

<transform process="move" to="." xpath=".//*[contains(@dictRef,':serial') or contains(@dictRef,':elementType') or contains(@dictRef,':isotop') or contains(@dictRef,':coupling')]" />

  • moveRelative - this is similar to move, but the to argument is an xpath that is evaluated relative to the element being moved. If the xpath returns a list of elements scattered throughout the document, each one is moved to the target found relative to itself.

<transform process="moveRelative" xpath="//cml:module[@cmlx:templateRef='l4601.virtual']" to="parent::*/parent::*/parent::*"/>

  • pullup - this takes one or more elements defined by an xpath and moves them up out of their current containing element, so that they become children of their current grandparent. The only argument required is the xpath of the nodes to be pulled up.

<transform process="pullup" xpath=".//cml:module[@cmlx:templateRef='l1.version']/cml:*"/>
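The effect of pullup can be sketched with the standard-library ElementTree, which has no parent pointers, so a child-to-parent map is built first. This is an emulation of the behaviour described above, not the JUMBO code; the function name pullup is reused only for readability, and the position at which the node lands in the grandparent (just before its old parent) is an assumption.

```python
import xml.etree.ElementTree as ET

def pullup(root, nodes):
    """Move each node out of its parent so it becomes a child of its grandparent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for node in nodes:
        parent = parent_of[node]
        grandparent = parent_of[parent]
        parent.remove(node)
        # Assumption: the node is placed immediately before its former parent
        grandparent.insert(list(grandparent).index(parent), node)

job = ET.fromstring(
    '<job><module id="version"><scalar>A</scalar><scalar>B</scalar></module></job>'
)
version_module = job[0]
pullup(job, list(version_module))
print(ET.tostring(job, encoding="unicode"))
```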

  • pullupSingleton - this takes one or more elements defined by an xpath and, if the element only has one child, replaces the element with the child, thereby deleting the original element and "pulling up" the child.

<transform process="pullupSingleton" xpath=".//cml:list"/>
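A minimal Python emulation of pullupSingleton, again using the standard-library ElementTree rather than JUMBO itself. The function name pullup_singleton is hypothetical, and processing the selected nodes in document order (so that nested single-child lists collapse in one pass) is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

def pullup_singleton(root, nodes):
    """Replace every selected element that has exactly one child by that child."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for node in nodes:
        parent = parent_of.get(node)
        if parent is None or len(node) != 1:
            continue  # the root itself, or not a singleton: leave untouched
        child = node[0]
        index = list(parent).index(node)
        parent.remove(node)
        parent.insert(index, child)
        parent_of[child] = parent  # keep the map valid for nested singletons

module = ET.fromstring(
    "<module>"
    "<list><list><scalar>1</scalar></list></list>"
    "<list><scalar>2</scalar><scalar>3</scalar></list>"
    "</module>"
)
pullup_singleton(module, module.findall(".//list"))
print(ET.tostring(module, encoding="unicode"))
```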

  • reparse - TODO

<transform process="reparse" xpath=".//cml:scalar[@id='scraped']" regexPath=".//record[@id='natoms']"/>

  • setDataType - sets the dataType attribute of the nodes in the xpath to the string given as value. This can be useful if a <record> initially extracted data as xsd:string that should actually contain data of type xsd:integer or xsd:double.

<transform process="setDataType" value="xsd:integer" xpath=".//cml:scalar[@dictRef='x:formalCharge']" />

  • setValue - with a simple string as the value argument (e.g. value="foo"), this will set the value of all nodes in the xpath to that string. If value is a string of the form "$string(XPATH)" or "$number(XPATH)", where XPATH is a valid XPath, then the value will be the result of evaluating XPATH relative to the current node in the nodeset selected by xpath, in string or number form. With a map argument ...TODO

<transform process="setValue" xpath=".//cml:list/cml:scalar[2] | .//cml:list/cml:scalar[4] | .//cml:list/cml:scalar[6]" map="//cml:map[@id='variableMap']" value="$string(.)"/>

  • split - this will take the nodes in the xpath and split them in place according to their type: a scalar will be split by whitespace and turned into a list, a 1D array will be split into a cml list, and a 2D array will be split into a list of separate arrays.

<transform process="split" xpath=".//cml:array[@dictRef='cc:mulliken']"/>


Notes on Transforms

  • where possible, id's should always be added to nodesets to facilitate later operations.
  • the < and > symbols should not be used in xpath comparisons; however, &lt; and &gt; can be used, as shown below:

<transform process="debugNodes" xpath=".//cml:array[position() &gt; 1 and position() &lt; 4]"/>
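The reason for this rule is ordinary XML well-formedness: a literal < is not allowed inside an attribute value, while the entity forms &lt; and &gt; are decoded back into the comparison operators when the file is read. The short Python check below demonstrates this with the standard-library parser; the helper name parses is hypothetical and the check is independent of JUMBO.

```python
import xml.etree.ElementTree as ET

raw = '<transform process="debugNodes" xpath=".//cml:array[position() < 4]"/>'
escaped = (
    '<transform process="debugNodes" '
    'xpath=".//cml:array[position() &gt; 1 and position() &lt; 4]"/>'
)

def parses(text):
    """Return True if `text` is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses(raw))      # the literal < makes the element not well-formed
print(parses(escaped))  # the entity forms are accepted
print(ET.fromstring(escaped).get("xpath"))  # entities decoded by the parser
```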

  • use "..." for quoting attribute values
  • Rather than using relative namespaces (e.g. g:charge), the more reliable namespace-uri syntax can be used:

foo[@dictRef[namespace-uri()='http://www.xml-cml.org/dict/gaussian' and .='charge']]


JUMBO-Converters filesystem structure

(I guess some of this is standard for Maven projects but my ignorance forces me to document everything. The bright side is that other newbies like me will feel happy!)

The main folder is

jumboconverters-compchem/

Under this is:

jumboconverters-compchem/
  jc-compchem-nwchem/

The two most important subfolders of this are

jumboconverters-compchem/
  jc-compchem-nwchem/
    src/
    target/

The second one is where the final compiled Java classes are located (any more stuff?) and we will not care about it for the moment. The src subfolder, as its name indicates, contains the source code associated with the compchem part of JUMBOconverters (i.e., the one most related to the Quixote project). Inside the src subfolder, we have the following chain of folders, at the bottom of which all Java source code is located:

jumboconverters-compchem/
  jc-compchem-nwchem/
    src/
      main/
        java/
          org/
            xmlcml/
              cml/
                converters/

Inside converters, we have two main subfolders:

jumbo-converters/
  jumbo-converters-compchem/
    src/
      main/
        java/
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                  marker/

The most specific compchem code is in compchem (as you might have guessed!), ordered by the name of the compchem package (gamessus, gaussian, nwchem, etc.), and marker contains more general source code to support the former.

If you are Java-savvy, you might want to check these folders and read the code, but one of the great things about the declarative approach that PMR has built into JUMBOconverters, and that we describe on this page, is that you don't need to! If you know regular expressions and some very basic XPath (both of which you could even infer from existing examples), that should be sufficient.

One important thing to remember though, even if you don't plan to read the Java source code, is that the above folder structure translates into the names of the classes that do all the magic, so if you want to call these classes from the command line, as in

mvn -e exec:java -Dexec.mainClass=org.xmlcml.cml.converters.compchem.nwchem.log.NWChemLog2XMLConverter -Dexec.args="./src/test/resources/compchem/nwchem/log/in/test1.out ./test.cml"

you need to have this structure in mind.
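The relation between a class name and its location follows the usual Java convention: each dot-separated package component is a folder below the source root. The Python sketch below spells this out; the function name class_to_source_path is hypothetical, and src/main/java is assumed as the source root, as in the folder chain shown earlier.

```python
from pathlib import PurePosixPath

def class_to_source_path(class_name, source_root="src/main/java"):
    """Map a fully qualified Java class name to its source file path."""
    return PurePosixPath(source_root, *class_name.split(".")).with_suffix(".java")

main_class = "org.xmlcml.cml.converters.compchem.nwchem.log.NWChemLog2XMLConverter"
print(class_to_source_path(main_class))
```

Reading the mapping backwards gives the value to pass as -Dexec.mainClass for any converter found in the source tree.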

The declarative bits of the parsing infrastructure (i.e., what you, as a parser developer, will have to check, understand and probably adapt for your favourite compchem code) are inside a similar folder tree under src/main:

jumbo-converters/
  jumbo-converters-compchem/
    src/
      main/
        resources/
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                    amber/
                    gamessus/
                    gaussian/
                    nwchem/
                    ...

Inside each code folder, one can find subfolders for the different types of file, and inside each one of them a templates subfolder, e.g.,

jumbo-converters/
  jumbo-converters-compchem/
    src/
      main/
        resources/
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                    gaussian/
                      in/
                        templates/
                      log/
                        templates/
                      ...

In the rest of the sections and in some of the tutorials, we explain in detail how the different bits of declarative parsing are related and how everything works. For now, let us mention that the top-level parsing template list file templateList.xml can be found in the filetype folders (i.e., in or log), while each of the smaller templates included in this list is located in templates.

Now, branching out at the same level as main, still inside src, we have a test subfolder, which contains, on the one hand (under java), the Java source code for performing automatic tests and, on the other hand (under resources), a number of example files produced by the compchem codes that Quixote wants to tackle. The scheme of the folder tree is as follows:

jumbo-converters/
  jumbo-converters-compchem/
    src/
      test/
        java/
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                    amber/
                    gamessus/
                    gaussian/
                    nwchem/
                    ...
        resources/
          compchem/
            amber/
            gamessus/
            gaussian/
              in/
              log/
              ...
            nwchem/
            ...
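To make the layout concrete, the Python sketch below builds the locations described above for a given compchem code and filetype. The helper names and the default module_root are hypothetical; the paths simply follow the folder trees on this page and are not checked against an actual checkout.

```python
from pathlib import PurePosixPath

# Package folder shared by the Java sources and the declarative resources
PACKAGE_DIR = PurePosixPath("org/xmlcml/cml/converters/compchem")
MODULE_ROOT = "jumbo-converters/jumbo-converters-compchem"  # assumed checkout layout

def template_list_path(code, filetype, module_root=MODULE_ROOT):
    """Top-level template list for one code and filetype, e.g. gaussian/log."""
    base = PurePosixPath(module_root, "src/main/resources") / PACKAGE_DIR
    return base / code / filetype / "templateList.xml"

def templates_dir(code, filetype, module_root=MODULE_ROOT):
    """Folder holding the smaller templates referenced by templateList.xml."""
    return template_list_path(code, filetype, module_root).parent / "templates"

def example_files_dir(code, filetype, module_root=MODULE_ROOT):
    """Folder with example files used by the automatic tests."""
    return PurePosixPath(module_root, "src/test/resources/compchem", code, filetype)

print(template_list_path("gaussian", "log"))
print(templates_dir("gaussian", "log"))
print(example_files_dir("gaussian", "log"))
```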

A general scheme summarizing all the details commented above is the following:


jumbo-converters/                        *** Main JUMBOconverters folder
  jumbo-converters-compchem/             *** Compchem JUMBOconverters
    src/                                 *** Source code and test files
      main/                              *** Source code for the parsing machinery
        java/                            *** Java source code
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                  marker/
        resources/                       *** Declarative parsing source code
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                    amber/
                    gamessus/
                    gaussian/
                      in/                *** Top level parsing directives
                        templates/       *** Subparsers templates
                      log/
                        templates/
                      ...
                    nwchem/
                    ...
      test/                              *** Source code for the automatic testing
        java/
          org/
            xmlcml/
              cml/
                converters/
                  compchem/
                    amber/
                    gamessus/
                    gaussian/
                    nwchem/
                    ...
        resources/                       *** Example test files
          compchem/
            amber/
            gamessus/
            gaussian/
              in/
              log/
              ...
            nwchem/
            ...
    target/                              *** Compiled classes