The UniversalIsomorphismTester
class in the CDK can be used
for substructure searching. It allows you to determine if some
structure is a substructure and what the
matching substructures are. As such, this can also be used to determine
if two structures are identical.
In this chapter we will see how the class returns all possible substructure matches, and we’ll notice that redundancy occurs due to symmetrically equivalent matches, and how these redundant matches can be removed.
Before we look at isomorphism checking, it should be noted that canonical SMILES and InChI generation (see elsewhere in this book) may be an appropriate way to determine identity and use this for database lookup.
But historically, this below approach was the way to go and is included are prelude to the substructure search.
The UniversalIsomorphismTester
class implements an algorithm
that was originally developed for isomorphism checking.
However, it can be used for substructure search too.
This section will first show how the class is used to
check if two classes are identical:
Script 20.1 code/Isomorphism.groovy
butane = MoleculeFactory.makeAlkane(4);
isomorphismTester = new UniversalIsomorphismTester()
println "Is isomorphic: " +
isomorphismTester.isIsomorph(
butane, butane
)
This algorithm works by looking the how bonds are connected to each other. This is important to realize, because it explains a typical corner case for this algorithm: it cannot distinguish cyclopropane from isobutane (see Figure 17.1) when they are hydrogen depleted:
Script code/UITLimitation.groovy
isomorphismTester = new UniversalIsomorphismTester()
println "Is isomorphic: " +
isomorphismTester.isIsomorph(
cyclopropane, isobutane
)
Fortunately, the CDK implementation has a workaround for this so that they are still considered different, based on the fact that they have different atom counts:
Is isomorphic: false
However, for substructure searching we’re less lucky, as we will see shortly.
Figure 17.1: Cyclopropane (left) and isobutane (right).
Starting from the above code to match two structures, the step to substructure searching
is made via the isSubgraph()
method:
Script code/IsSubgraph.groovy
butane = MoleculeFactory.makeAlkane(4);
propane = MoleculeFactory.makeAlkane(3);
isomorphismTester = new UniversalIsomorphismTester()
println "Propane part of Butane: " +
isomorphismTester.isSubgraph(
butane, propane
)
println "Butane part of Propane: " +
isomorphismTester.isSubgraph(
propane, butane
)
It gives this output:
Propane part of Butane: true
Butane part of Propane: false
Now, you may wonder why propane is a subgraph of butane, because it is
indeed not. But while the variable names suggest that that is what we have been testing,
we have been testing something else: this code works because of the fact that the MoleculeFactory
returns hydrogen depleted graphs (see Section ??).
Therefore, butane is a chain of four carbons, and propane is a chain
of three carbons. Then, the latter is a chemical subgraph of the
former.
If we now return to our previous cyclopropane-isobutane example, we can run a subgraph analysis on them too:
Script code/UITSubgraphLimitation.groovy
isomorphismTester = new UniversalIsomorphismTester()
println "Cyclopropane part of isobutane: " +
isomorphismTester.isSubgraph(
isobutane, cyclopropane
)
println "Isobutane part of cyclopropane: " +
isomorphismTester.isSubgraph(
cyclopropane, isobutane
)
Here we do see the intrinsic limitation of the algorithm reflected. While it is possible to see that isobutane has more atoms then cyclobutane and therefore cannot be a substructure, that conclusion cannot be derived for cyclobutane as substructure as isobutane, visualizing that algorithmic limitation:
Cyclopropane part of isobutane: true
Isobutane part of cyclopropane: false
Substructure searching is finding in a target molecule the atoms that
match the given searched substructure. With the UniversalIsomorphismTester
we can do:
Script code/Overlap.groovy
butane = MoleculeFactory.makeAlkane(4)
ccc = MoleculeFactory.makeAlkane(3)
isomorphismTester = new UniversalIsomorphismTester()
hits = isomorphismTester.getOverlaps(
butane, ccc
)
println "Number of hits: " + hits.size()
hits.each { substructure ->
println "Substructure in AtomContainer:"
println " #atoms: " + substructure.atomCount
}
However, this only returns us one match, selected as being the largest:
Number of hits: 1
Substructure in AtomContainer:
#atoms: 3
There is an alternative:
Script code/Substructure.groovy
butane = MoleculeFactory.makeAlkane(4);
ccc = MoleculeFactory.makeAlkane(3);
isomorphismTester = new UniversalIsomorphismTester()
hits = isomorphismTester.getSubgraphAtomsMaps(
butane, ccc
)
println "Number of hits: " + hits.size()
hits.each { substructure ->
println "Atoms in substructure: " +
substructure.size()
}
The getSubgraphAtomsMaps()
methods returns a List<List<RMap>>
object, where each List<RMap>
represents on substructure match.
When we look at the outer list, we see that the subgraph of three carbon atoms
is found 4 times in butane, each with 3 atoms:
Number of hits: 4
Atoms in substructure: 3
Atoms in substructure: 3
Atoms in substructure: 3
Atoms in substructure: 3
This is caused by the symmetrical nature of the substructure. It can map twice onto the same three atoms in butane: once in the forward direction, and once in the backward direction.
A common method to find substructures in cheminformatics is the
SMiles ARbitrary Target Specification (SMARTS). The CDK has a
SMARTSParser
class to parse SMARTS strings and a convenience tool to perform
SMARTS substructure searches. This is a typical use case:
Script code/SMARTSSearching.groovy
atomContainer = sp.parseSmiles("CC(=O)OC(=O)C");
querytool = new SMARTSQueryTool(
"O=CO", atomContainer.getBuilder()
);
found = querytool.matches(atomContainer);
if (found) {
int nmatch = querytool.countMatches();
mappings = querytool.getMatchingAtoms();
for (int i = 1; i <= nmatch; i++) {
atomIndices = mappings.get(i-1);
println "match $i: $atomIndices"
}
}
This shows us that the SMARTS-encoded carboxylic acid substructure is found twice and which atoms in the input structure form that match:
match 1: [2, 1, 3]
match 2: [5, 4, 3]
Symmetry can cause identical hits to match multiple times, in different ways. This is, for example, the case when we loosen the above substructure search to a carbon connected to two oxygens, whatever the bond order is:
Script code/SMARTSUniqueSearching.groovy
atomContainer = sp.parseSmiles("CC(=O)OC(=O)C");
querytool = new SMARTSQueryTool(
"O~C~O", atomContainer.getBuilder()
);
found = querytool.matches(atomContainer);
if (found) {
mappings = querytool.getMatchingAtoms();
for (int i = 1; i <= mappings.size(); i++) {
atomIndices = mappings.get(i-1);
println "match $i: $atomIndices"
}
mappings = querytool.getUniqueMatchingAtoms();
for (int i = 1; i <= mappings.size(); i++) {
atomIndices = mappings.get(i-1);
println "unique match $i: $atomIndices"
}
}
This shows the different between the getMatchingAtoms
and getUniqueMatchingAtoms
method:
match 1: [2, 1, 3]
match 2: [3, 1, 2]
match 3: [3, 4, 5]
match 4: [5, 4, 3]
unique match 1: [2, 1, 3]
unique match 2: [3, 4, 5]
Substructure searching is a relatively slow algorithm, and the time required to compare two molecules scales with the number of atoms in each molecule. To reduce the computation time, molecular fingerprints were invented. There are two key aspects to fingerprints that make them efficient: first, they have a fixed length so that the time to compare two molecule is independent of the size of the two structures; secondly, the fingerprint of a substructure always matches the fingerprint of any molecules that has that substructure.
In this section we will see two fingerprint types available in the CDK:
a substructure based fingerprint, and a path based fingerprint.
Before I will explain how these fingerprints are created, we will first
look at the BitSet
class that is used by the CDK to
represent these fingerprints. Consider this code:
Script 15.1 code/BitSetDemo.groovy
bitset = new BitSet(10);
println "Empty bit set: $bitset";
bitset.set(3);
bitset.set(7);
println "Two bits set: $bitset";
If we analyze the output, we see that all set bits are listed, and that all other bits are not:
Empty bit set: {}
Two bits set: {3, 7}
Let us now consider a simple substructure fingerprint of length four with the following bit definitions:
Let’s call this fingerprinter SimpleFingerprinter
:
Script 15.2 code/SimpleFingerprinter.java
public class SimpleFingerprinter implements IFingerprinter {
Map<String,Integer> map = new HashMap<String,Integer>() ;
public BitSet getFingerprint(IAtomContainer molecule) {
BitSet bitSet = new BitSet(getSize());
for (IAtom atom : molecule.atoms()) {
if (map.containsKey(atom.getSymbol()))
bitSet.set(map.get(atom.getSymbol()));
}
return bitSet;
}
public Map<String,Integer> getRawFingerprint(
IAtomContainer molecule
) {
Map<String,Integer> fingerprint =
new HashMap<String,Integer>();
for (String key : map.keySet()) {
fingerprint.put(key, 0);
}
for (IAtom atom : molecule.atoms()) {
int count = map.get(atom.getSymbol());
fingerprint.put(atom.getSymbol(), count+1);
}
return fingerprint;
}
public ICountFingerprint getCountFingerprint(
IAtomContainer molecule
) throws CDKException {
return new IntArrayCountFingerprint(
getRawFingerprint(molecule)
);
}
public IBitFingerprint getBitFingerprint(
IAtomContainer molecule
) throws CDKException {
return new BitSetFingerprint(
getFingerprint(molecule)
);
}
public int getSize() {
return 4;
}
public String getVersionDescription() { return ""; }
}
We can then calculate the fingerprints for ethanol and benzene:
Script 15.3 code/SimpleFingerprintDemo.groovyl
fingerprinter = new SimpleFingerprinter();
println "ethanol: " + fingerprinter.getFingerprint(ethanol)
println "benzene: " + fingerprinter.getFingerprint(benzene)
and we get these bit sets:
ethanol: {1, 3}
benzene: {1}
Now, we can replace the presence of a particular atom, by the presence of a substructure, such as a phenyl or a carbonyl group. We have then defined a substructure fingerprint.
The CDK has several kinds of fingerprints, including path-based
fingerprints (Fingerprinter
and HybridizationFingerprinter
), a MACSS fingerprint
(MACSSFingerprinter
) [1], and the PubChem fingerprint
(PubChemFingerprinter
).
These fingerprints have been used for various tasks, including ligand
classification [2], and databases like BRENDA [3] and TIN [4].
One substructure-based fingerprinter is the MACCSFingerprinter
which partly implements the MACSS fingerprint specification [1]. The
substructures are defined as SMARTS substructure specifications,
inherited from RDKit (http://rdkit.org/). For this fingerprint it is required the implicit hydrogen
counts are first set:
Script 15.4 code/MACCSFingerprint.groovy
fingerprinter = new MACCSFingerprinter();
println "ethanol: " +
fingerprinter.getBitFingerprint(ethanol).
asBitSet()
The object returned by the getBitFingerprint
method is the IBitFingerprint
which we can convert into a Java BitSet
with the asBitSet
method:
ethanol: {81, 108, 113, 138, 152, 154, 156, 159, 163}
The CDK also has an implementation for the circular ECFP and FCFP
fingerprints [5]. These are developed by Alex M. Clark at
Collaborative Drug Discovery, Inc in the
CircularFingerprinter
[6].
It implements both in four variants:
ECFP0, ECFP2, ECFP4, ECFP6, FCFP0, FCFP2, FCFP4, and FCFP6. The code is quite similar
as for other fingerprints, but we do have to indicate what variant we want:
Script 15.5 code/ECFPFingerprint.groovy
fingerprinter = new CircularFingerprinter(
CircularFingerprinter.CLASS_ECFP6
);
println "ethanol: " +
fingerprinter.getBitFingerprint(ethanol).
asBitSet()
Again we get an IBitFingerprint
resulting in a BitSet
of bits:
ethanol: {152, 325, 625, 740, 947, 993}