A new approach to validation
CML has previously been defined using DTDs and schemas. As part of the Chem4Word project, a new approach to the validation of Chemical Markup Language (CML) has been taken. Validation is now performed in a series of steps using schemas, XSLT and code. Each step of the validation process puts progressively tighter restrictions on the structure and content of the document.
For further information see the paper "CMLLite: a design philosophy for CML" in the "Visions of a Semantic Molecular Future" thematic issue of the Journal of Cheminformatics.
CMLLite Schema - Vocabulary
We are currently using Schema 3 for the schema validation step of the CMLLite process. This schema is based on the long-stable Schema 2.4, but the content model has been largely removed, and deprecated elements and attributes have also been removed. We are still cleaning up the schema (especially the documentation) and intend to denormalise the attributes.
The content model determines what types of content elements and attributes can hold. Previously the schema allowed mixed content (both text and elements) for some elements; this is no longer allowed. Each element is now specified as being allowed either no content, text content or element content. In some cases the text content or element content is mandatory; in others it is optional. Any element which is allowed element content may now hold any other CML element or any element from a different namespace.
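The three content kinds, plus the mixed content that is no longer allowed, can be illustrated with a small sketch. This is not part of the CMLLite tooling; the function name content_kind and the sample document are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def content_kind(element):
    """Classify an element's content as 'empty', 'text', 'element' or 'mixed'."""
    has_children = len(element) > 0
    # Text may sit before the first child (element.text) or after any child (child.tail).
    has_text = bool((element.text or "").strip()) or any(
        (child.tail or "").strip() for child in element
    )
    if has_children and has_text:
        return "mixed"  # the case that Schema 3 no longer allows
    if has_children:
        return "element"
    if has_text:
        return "text"
    return "empty"


# Hypothetical sample document, used only to exercise the function.
doc = ET.fromstring(
    f'<molecule xmlns="{CML_NS}" id="m1">'
    '<name>ethanol</name>'
    '<atomArray><atom id="a1" elementType="C"/></atomArray>'
    '</molecule>'
)

# Map each local element name to its content kind.
kinds = {el.tag.split("}")[1]: content_kind(el) for el in doc.iter()}
```

In this sample every element falls into exactly one of the three permitted kinds; an element such as `<a>text<b/></a>` would be classified as mixed.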
Schema 3 has added the explicit unknown value to many enumerations. In Schema 2.4, elements and attributes specified an empty string as allowed content, which was interpreted to mean unknown or unspecified. This has been replaced by the string value unknown, which allows the absence of the element or attribute to be interpreted as unspecified.
Previous schemas allowed values to be either from an enumeration or a QName. For example, the list below shows some of the allowed values of the order attribute on a bond.
- S (single)
- D (double)
- T (triple)
- A (aromatic)
- QName: allows users to point to another bond type which is not supported by CML
The order attribute could therefore have either string or QName content. Schema 3 now specifies the allowed values as:
- S (single)
- D (double)
- T (triple)
- A (aromatic)
- unknown: an unambiguous statement that the order of the bond is unknown
- other: any other bond order. The dictRef attribute may be used to add further information.
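The difference between an absent order attribute, an explicit unknown and the other value can be sketched as follows. This is a minimal illustration rather than the schema itself: the function describe_order, the set ALLOWED_ORDERS (limited to the values listed above) and the dictRef value my:quadruple are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Only the values listed above; the schema itself is the authority.
ALLOWED_ORDERS = {"S", "D", "T", "A", "unknown", "other"}


def describe_order(bond):
    """Return a human-readable interpretation of a bond's order attribute."""
    order = bond.get("order")
    if order is None:
        # No attribute at all: the order is simply unspecified.
        return "unspecified"
    if order not in ALLOWED_ORDERS:
        # Includes the empty string, which Schema 2.4 used to accept.
        return f"invalid: {order!r}"
    if order == "unknown":
        return "explicitly unknown"
    if order == "other":
        # dictRef may carry further information about the bond order.
        return f"other (dictRef={bond.get('dictRef')})"
    return order


# 'my:quadruple' is a hypothetical dictionary reference.
example = ET.fromstring('<bond order="other" dictRef="my:quadruple"/>')
summary = describe_order(example)
```

Keeping unspecified and unknown as separate outcomes is the point of the change: a consumer can tell whether the author left the order out or deliberately stated that it is not known.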
Conventions and Constraints - Grammar
Different domains of chemistry think about chemistry differently; often this means a very tight specification
of rules in your area of expertise and very little if any applied to the rest. The loosening of the
content model in Schema 3 allows users to combine the elements and attributes as they need to represent data.
However, users still need to be able to specify a set of rules (constraints) which model their particular
domain. This can be likened to thinking of the elements and attributes of CML as representing the allowed
vocabulary and the set of rules as a grammar specifying how these words are allowed to be put together.
The entire set of constraints which the CML should conform to is called a convention. There are currently three well-developed conventions: molecular, compchem and dictionary.
Constraints are defined using XSL Transformations (XSLT). These allow users to put more specific constraints and co-constraints on the allowed structure of CML documents than is possible using schemas alone. We use an output based on SVRL (Schematron Validation Report Language), the XML report language of the ISO Schematron standard, to indicate errors and warnings in the document. A major advantage of this approach is that all the errors and warnings are reported, rather than the validation process stopping as soon as the first error has been found.
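The report-everything behaviour can be sketched in Python as a stand-in for the XSLT, which is not reproduced here. The sketch runs every check and records each failure as an svrl:failed-assert element instead of stopping at the first one. The two checks are taken from the molecular convention examples given in this section; the function names, the sample document and the simplified location strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
SVRL_NS = "http://purl.oclc.org/dsdl/svrl"


def tag(ns, name):
    """Build an ElementTree qualified name."""
    return f"{{{ns}}}{name}"


def check_atom_arrays(doc):
    """Yield (location, test, message) for each atomArray without an atom child."""
    for i, array in enumerate(doc.iter(tag(CML_NS, "atomArray")), start=1):
        if array.find(tag(CML_NS, "atom")) is None:
            yield (f"(//cml:atomArray)[{i}]", "cml:atom",
                   "atomArray must have at least one atom child")


def check_bonds(doc):
    """Yield (location, test, message) for each bond without atomRefs2."""
    for i, bond in enumerate(doc.iter(tag(CML_NS, "bond")), start=1):
        if bond.get("atomRefs2") is None:
            yield (f"(//cml:bond)[{i}]", "@atomRefs2",
                   "bond must have an atomRefs2 attribute")


def validate(doc, checks):
    """Run all checks; every failure is recorded, none aborts the run."""
    report = ET.Element(tag(SVRL_NS, "schematron-output"))
    for check in checks:
        for location, test, message in check(doc):
            failed = ET.SubElement(report, tag(SVRL_NS, "failed-assert"),
                                   {"test": test, "location": location})
            ET.SubElement(failed, tag(SVRL_NS, "text")).text = message
    return report


# Hypothetical document that violates both constraints.
doc = ET.fromstring(
    f'<molecule xmlns="{CML_NS}" id="m1">'
    '<atomArray/>'
    '<bondArray><bond id="b1"/></bondArray>'
    '</molecule>'
)
report = validate(doc, [check_atom_arrays, check_bonds])
messages = [t.text for t in report.iter(tag(SVRL_NS, "text"))]
```

Both problems appear in the report from a single pass, which is the advantage described above; a schema validator that stops at the first error would have reported only the empty atomArray.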
Examples of constraints implemented in the molecular convention are:
- an atomArray must have at least one atom child
- the value of an atom's id must be unique within the eldest containing molecule
- a bond element must have an atomRefs2 attribute
- a bond must be between atoms within the same molecule
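The two constraints that depend on molecule scope (unique atom ids, and bonds between atoms of the same molecule) can be sketched in Python. This is an illustrative reading, not the convention's XSLT: it assumes that "eldest containing molecule" means the outermost molecule element and applies that same scope to the bond check. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def tag(name):
    """Qualified name of a CML element."""
    return f"{{{CML_NS}}}{name}"


def eldest_molecules(element):
    """Yield molecule elements that are not nested inside another molecule."""
    if element.tag == tag("molecule"):
        yield element
        return  # do not descend: nested molecules belong to this one
    for child in element:
        yield from eldest_molecules(child)


def check_molecule(molecule):
    """Return error messages for one outermost molecule."""
    errors = []
    mol_id = molecule.get("id")
    seen = set()
    for atom in molecule.iter(tag("atom")):
        atom_id = atom.get("id")
        if atom_id is None:
            continue
        if atom_id in seen:
            errors.append(f"duplicate atom id {atom_id!r} in molecule {mol_id!r}")
        seen.add(atom_id)
    for bond in molecule.iter(tag("bond")):
        refs = (bond.get("atomRefs2") or "").split()
        missing = [ref for ref in refs if ref not in seen]
        if missing:
            errors.append(
                f"bond in molecule {mol_id!r} refers to atoms outside it: "
                + ", ".join(missing)
            )
    return errors


# Hypothetical document: m1 repeats an atom id and bonds to an atom of m2.
doc = ET.fromstring(
    f'<cml xmlns="{CML_NS}">'
    '<molecule id="m1">'
    '<atomArray><atom id="a1"/><atom id="a1"/></atomArray>'
    '<bondArray><bond atomRefs2="a1 a9"/></bondArray>'
    '</molecule>'
    '<molecule id="m2"><atomArray><atom id="a9"/></atomArray></molecule>'
    '</cml>'
)
errors = [e for mol in eldest_molecules(doc) for e in check_molecule(mol)]
```

Co-constraints of this kind relate several parts of the document to each other, which is why they are expressed as convention rules rather than in the schema.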
There are also examples of files which conform to the various levels of validation available with some explanation here.