Sunday, September 21, 2014

Lower-case prefix notation of tag names for MolDifML leaf nodes

Tag names of MolDifML leaf nodes are prefixed by one or more lower-case letters that indicate the data type of the node's content. For example, the tag name urlAffiliation indicates that the node content is a uniform resource locator (URL). The tag name pmStdDev encodes the standard deviation associated with an experimental value:

<pmStdDev>0.035</pmStdDev> encodes ± 0.035 in a standard-deviation context.

The tag name sCurlySMILES captures a molecular structure or architecture. The basic conceptual encoding framework for it has been defined in an open-access J. Chem. Inf. article and is maintained and enhanced by the CurlySMILES Project at Axeleratio.

A MolDifML leaf node is a child node of a MolDifML element node. A leaf node itself contains a text node, but cannot contain other element nodes. The nature of the text node content is indicated by the prefix in the tag name of the leaf node. Here is the list of MolDifML prefix notations along with a description of the type of node content each one designates:
  • date: a Gregorian calendar date in the format CCYY-MM-DD, where CC represents the century, YY the year, MM the month and DD the day.
  • e: an enumeration term to be selected from a fixed set of predefined notations.
  • ge: a numerical value associated with a leading greater-than-or-equal sign (≥).
  • gt: a numerical value associated with a leading greater-than sign (>).
  • key: a string representing a unique identifier that serves as an internal key within a MolDifML instance (file) to reference, for example, extracted data to a <Citation> block or to cross-link property values.
  • le: a numerical value associated with a leading less-than-or-equal sign (≤).
  • lt: a numerical value associated with a leading less-than sign (<).
  • n: a numerical value including whole numbers and floating-point numbers.
  • pm: a numerical value associated with a leading plus-minus sign (±), which indicates that the value represents, depending on context, an experimental range, more precisely a standard deviation, standard error or confidence interval.
  • s: a sequence of characters (string) that can legally be held inside an XML text node.
  • url: a string representing a Web address.
  • yr: a four-digit-long whole number representing a year.
The one-letter prefixes e, n and s designate enumeration, numerical-value and string element nodes. The other prefixes associate elements with node content that belongs to one of these types, but has a very specific meaning within the context of a particular MolDifML instance (MolDifML file).
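
The prefix convention listed above can be checked mechanically. The following Python sketch is only an illustration, not part of any MolDifML tooling: the function names split_tag_name and check_content are hypothetical, and the sketch assumes that the part of a tag name following the prefix always starts with an upper-case letter, as in the examples urlAffiliation, pmStdDev and sCurlySMILES.

```python
import re
from datetime import date

# Prefixes taken from the list in this post.
PREFIXES = {"date", "e", "ge", "gt", "key", "le", "lt", "n", "pm", "s", "url", "yr"}

# Prefixes whose node content is a plain numerical value.
NUMERIC_PREFIXES = {"n", "ge", "gt", "le", "lt", "pm"}


def split_tag_name(tag):
    """Split a leaf-node tag name into (prefix, remainder).

    Assumption: the prefix is the leading run of lower-case letters and
    the remainder starts with an upper-case letter.
    """
    match = re.match(r"([a-z]+)([A-Z].*)", tag)
    if match is None or match.group(1) not in PREFIXES:
        raise ValueError(f"no known prefix in tag name {tag!r}")
    return match.group(1), match.group(2)


def check_content(tag, text):
    """Return the prefix of `tag` if `text` fits it; raise ValueError otherwise."""
    prefix, _ = split_tag_name(tag)
    if prefix in NUMERIC_PREFIXES:
        float(text)  # raises ValueError for non-numerical content
    elif prefix == "date":
        date.fromisoformat(text)  # expects CCYY-MM-DD
    elif prefix == "yr":
        if re.fullmatch(r"\d{4}", text) is None:
            raise ValueError(f"not a four-digit year: {text!r}")
    # e, key, s and url content is accepted as a string without further checks here.
    return prefix


print(split_tag_name("urlAffiliation"))
print(check_content("pmStdDev", "0.035"))
```

Checking enumeration terms (prefix e) or key references would need the fixed term sets and the surrounding document, which this sketch deliberately leaves out.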

The prefixes date, e, n, s, url and yr also apply to tag names in ThermoML instances. MolDifML additionally employs ge, gt, le, lt and pm for numerical values in special contexts. The prefix key supports data cross-linking within the scope of a particular MolDifML document.

Sunday, September 14, 2014

MolDifML: capturing and applying molecular similarity with respect to its molecular-pair difference correlating molecular structure and chemical property values

MolDifML is an XML-based standard, currently being developed by Axeleratio, for the representation of differences between molecular structures and related properties.

The key concept for designing MolDifML is that a formally recognized and captured difference between two molecules can be associated with differences in respective physical and chemical property values. When considering a substructurally additive (atom or group additive) property, we expect pairs of molecules, assuming they exhibit the same structural difference, to share the same—or approximately the same—difference value for that property. Derivation of such difference values, based on the rapidly growing pool of experimental property data available today, is envisioned for in silico property estimation and virtual compound and materials design as well as informed intermolecular navigation and rational cross-validation of chemical property data.  
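
The difference idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MolDifML or GIM implementation: the function names are hypothetical, and the numbers are made-up placeholders rather than measured property values.

```python
from statistics import mean


def mean_difference(pairs):
    """Average property difference over molecule pairs that share the
    same structural difference.

    Each pair is (value_of_first_molecule, value_of_second_molecule).
    """
    return mean(second - first for first, second in pairs)


def estimate_query(candidate_value, difference):
    """Estimate the query's property value from a similar candidate
    compound and the difference value for their structural difference."""
    return candidate_value + difference


# Placeholder values only (not experimental data): three pairs that are
# assumed to exhibit the same structural difference.
pairs = [(1.0, 1.5), (2.0, 2.5), (3.0, 3.5)]
difference = mean_difference(pairs)

# A candidate with a known value; the query differs from it by the same
# structural change.
print(estimate_query(4.0, difference))
```

For a strictly additive property the differences within such a set of pairs would coincide; with real data they only agree approximately, which is why the sketch averages them.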

Group additivity methods have been known for a diverse spectrum of physicochemical and environmental-risk-related properties for some time: a property value can be calculated or approximated by addition of group contributions (group increments) including all the groups (submolecular parts of a molecule) that constitute a molecule. Instead of addition, a more complex functional computation may apply to predict property values from contributions, including corrections for certain group arrangements and interactions. Group contributions are derived statistically—or via an artificial neural network approach—from a set of molecular compounds, for which experimental values of the property of interest have been published and compiled.
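
As a minimal sketch of the simplest (purely additive) case described above, the following Python function sums group contributions weighted by how often each group occurs in a molecule. The function name, the group labels and the contribution values are hypothetical placeholders, not published increments, and correction terms for group arrangements and interactions are left out.

```python
def additive_estimate(group_counts, contributions, base=0.0):
    """Estimate a property value as a sum of group contributions.

    group_counts:  mapping from group label to its number of occurrences
                   in the molecule
    contributions: mapping from group label to its group increment
    base:          optional constant term used by some additivity schemes
    """
    return base + sum(
        count * contributions[group] for group, count in group_counts.items()
    )


# Placeholder increments (not taken from any published method).
contributions = {"A": 0.5, "B": -1.25}

# A hypothetical molecule made of two A groups and one B group.
print(additive_estimate({"A": 2, "B": 1}, contributions))
```

A group without an entry in contributions raises a KeyError, which makes gaps in the increment table visible instead of silently treating them as zero.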

In 1993, Axel Drefahl and Martin Reinhard explored the combination of group additivity and molecular similarity. They developed a linear notation to capture and represent molecular structure differences and applied their formal intermolecular group exchange method (group interchange method, or GIM for short) to the systematic comparison and prediction of log Kow for organic compounds [1,2].

Axeleratio is now initiating MolDifML as a standard to store and exchange information on GIM-related chemical compounds and their properties. The present MolDifML Blog has been created to discuss the MolDifML implementation and to receive critical comments and constructive suggestions.

The primary purpose of MolDifML is to present and communicate property information of virtual (not yet synthesized) and data-deficient compounds. A compound- and property-specific MolDifML file will describe a compound of interest, called the query compound, along with a collection of molecularly similar compounds, called candidate compounds. A MolDifML file will encode experimental property data for the candidates and estimated property values for the query, including method and property reference information.

Appropriately generated and maintained MolDifML files are expected not only to enhance chemical search and fill property data gaps, but also to provide a robust network of unambiguously interrelated molecular structures for property-based screening and molecular design.

Keywords: cheminformatics, virtual chemistry, chemical property estimation, molecular difference, XML.

References and more to explore
[1] A. Drefahl and M. Reinhard: Similarity-based Search and Evaluation of Environmentally Relevant Properties for Organic Compounds in Combination with the Group Contribution Approach. J. Chem. Inf. Comput. Sci. 1993, 33, pp. 886-895. DOI: 10.1021/ci00016a011.
[2] Similarity-Based and Group Interchange Models, pp. 16-21, in M. Reinhard and A. Drefahl:  Handbook for Estimating Physicochemical Properties of Organic Compounds. John Wiley & Sons, Inc., New York, 1999.