A new approach to validation

CML has previously been defined using DTDs and schemas. As part of the Chem4Word project, a new approach to the validation of Chemical Markup Language (CML) has been taken. The validation is now performed by a combination of schema, XSLT and code. Each step of the validation process puts progressively tighter restrictions on the structure and content of the document.

For further information see the paper "CMLLite: a design philosophy for CML" in the "Visions of a Semantic Molecular Future" thematic issue of the Journal of Cheminformatics.

CMLLite Schema - Vocabulary

We are currently using Schema 3 for the schema validation step of the CMLLite process. This schema is based on the long-stable Schema 2.4, but the content model has been largely removed, and deprecated elements and attributes have also been removed. We are still cleaning up the schema (especially the documentation) and intend to denormalise the attributes.

The content model determines what types of content elements and attributes can hold. Previously the schema allowed mixed content (both text and elements) for some elements; this is no longer allowed. Each element is now specified as allowing no content, text content or element content. In some cases the text content or element content is mandatory; in others it is optional. Any element which is allowed element content may now hold any other CML element or any element from a different namespace.
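The distinction between empty, text, element and mixed content can be illustrated with a minimal Python sketch. It is not part of the CMLLite tooling: the function names and the sample fragment are hypothetical, and treating whitespace-only text as insignificant is an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def content_kind(element):
    """Classify an element's content as 'empty', 'text', 'element' or 'mixed'."""
    has_children = len(element) > 0
    # Text may sit before the first child (element.text) or after any child (child.tail).
    pieces = [element.text] + [child.tail for child in element]
    # Assumption: whitespace-only text (indentation) does not count as text content.
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "text"
    return "empty"


def find_mixed_content(root):
    """Return the tags of every element that mixes text with child elements."""
    return [el.tag for el in root.iter() if content_kind(el) == "mixed"]


# Hypothetical fragment for illustration only; it is not taken from the CML schema.
SAMPLE = """<molecule id="m1">
  <name>ethanol</name>
  <label>stray text <atom id="a1"/></label>
  <atomArray/>
</molecule>"""

root = ET.fromstring(SAMPLE)
kinds = {el.tag: content_kind(el) for el in root.iter()}
mixed = find_mixed_content(root)
print(kinds)
print(mixed)
```

In this sketch only the label element is reported, because it is the only one that holds both text and a child element.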

Schema 3 has added the explicit value "unknown" to many enumerations. In Schema 2.4, elements and attributes specified an empty string as allowed content, which was interpreted to mean unknown or unspecified. This has been replaced by the string value "unknown", which allows the absence of the element or attribute to be interpreted as unspecified.
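The difference between "unknown" and unspecified can be sketched in a few lines of Python. The element name item and the attribute name kind are hypothetical placeholders rather than CML vocabulary, and flagging an empty string as a legacy value is an assumption about how a consumer might treat Schema 2.4 style data.

```python
import xml.etree.ElementTree as ET


def read_enumerated(element, attribute):
    """Return a (status, value) pair describing an enumerated attribute."""
    value = element.get(attribute)
    if value is None:
        # The attribute is absent: nothing was said about it.
        return ("unspecified", None)
    if value == "unknown":
        # The author stated explicitly that the value is not known.
        return ("unknown", None)
    if value == "":
        # Schema 2.4 style empty string; flagged here for migration (an assumption).
        return ("legacy-empty", None)
    return ("value", value)


# Hypothetical element and attribute names, for illustration only.
SAMPLE = '<items><item kind="unknown"/><item/><item kind="a"/><item kind=""/></items>'

root = ET.fromstring(SAMPLE)
results = [read_enumerated(item, "kind") for item in root]
print(results)
```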

Previous schemas allowed values to be either from an enumeration or a QName. For example the list below shows some of the allowed values of the order attribute on a bond.

The order attribute could therefore have either string or QName content. Schema 3 now specifies the allowed values as:

Conventions and Constraints - Grammar

Different domains of chemistry think about chemistry differently; often this means a very tight specification of rules in one's own area of expertise and very few, if any, applied to the rest. The loosening of the content model in Schema 3 allows users to combine the elements and attributes as they need in order to represent their data. However, users still need to be able to specify a set of rules (constraints) which model their particular domain. This can be likened to thinking of the elements and attributes of CML as the allowed vocabulary and the set of rules as a grammar specifying how these words may be put together. The entire set of constraints to which a CML document should conform is called a convention. There are currently three well-developed conventions: molecular, compchem and dictionary.

Constraints are defined using XSL Transformations (XSLT). These allow users to put more specific constraints and co-constraints on the allowed structure of CML documents than is possible using schemas alone. We use an output format based on SVRL (Schematron Validation Report Language), the XML report language of the ISO Schematron standard, to indicate errors and warnings in the document. A major advantage of this approach is that all the errors and warnings are reported, rather than the validation process stopping as soon as the first error has been found.
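Reading such a report can be sketched in Python as follows. The sketch parses an SVRL document and collects every message instead of stopping at the first. The SVRL namespace and the failed-assert, successful-report and text elements come from the Schematron standard; the sample report, its messages and the use of the role attribute to separate errors from warnings are assumptions made for illustration, not the output of the actual CML stylesheets.

```python
import xml.etree.ElementTree as ET

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hypothetical report; real reports are produced by running the convention's XSLT.
SAMPLE_REPORT = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:failed-assert test="@id" location="/cml/molecule[1]" role="error">
    <svrl:text>hypothetical: a molecule must have an id</svrl:text>
  </svrl:failed-assert>
  <svrl:successful-report test="not(@title)" location="/cml/molecule[2]" role="warning">
    <svrl:text>hypothetical: a molecule should have a title</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>"""


def collect_messages(report_xml):
    """Return every failed assertion and successful report found in an SVRL document."""
    root = ET.fromstring(report_xml)
    messages = []
    for tag in ("failed-assert", "successful-report"):
        for node in root.iterfind(f"svrl:{tag}", SVRL_NS):
            text = node.findtext("svrl:text", default="", namespaces=SVRL_NS)
            messages.append({
                "kind": tag,
                # Assumption: the role attribute carries the severity.
                "role": node.get("role", "error"),
                "location": node.get("location"),
                "text": " ".join(text.split()),
            })
    return messages


messages = collect_messages(SAMPLE_REPORT)
for message in messages:
    print(f'{message["role"]}: {message["location"]}: {message["text"]}')
```

Because the report is an ordinary XML document, a caller can show all problems at once or filter them by severity as needed.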

Examples of constraints implemented in the molecular convention are:

Examples of files which conform to the various levels of validation are also available, with some explanation.