Class FunctionalGroupsFinder


  • public class FunctionalGroupsFinder
    extends Object
    Finds and extracts a molecule's functional groups in a purely rule-based manner; it is not a classical functional group identification based on substructure matching. This class implements Peter Ertl's algorithm for the automated detection and extraction of functional groups in organic molecules (Ertl P. An algorithm to identify functional groups in organic molecules. J Cheminform. 2017; 9:36.) and has been described in a scientific publication (Fritsch, S., Neumann, S., Schaub, J. et al. ErtlFunctionalGroupsFinder: automated rule-based functional group detection with the Chemistry Development Kit (CDK). J Cheminform. 2019; 11:37.).
    In brief, the algorithm iterates through all atoms in the input molecule and marks heteroatoms and specific carbon atoms (among others, those in non-aromatic double or triple bonds) as being part of a functional group. Connected groups of marked atoms are extracted as individual functional groups, together with their unmarked, "environmental" carbon atoms. These environments can be important, e.g. to differentiate an alcohol from a phenol, but are less important in other cases.
    To account for this, Ertl also devised a "generalization" scheme that generalizes the functional group environments in a way that accounts for their varying significance in different cases. Most environmental atoms are exchanged with pseudo ("R") atoms there. All these functionalities are available in FunctionalGroupsFinder. Additionally, only the marked atoms, completely without their environments, can be extracted.
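    The grouping step described above ("connected groups of marked atoms are extracted as individual functional groups") can be illustrated without CDK. The following is a minimal sketch under stated assumptions, not the code used by FunctionalGroupsFinder: the marking itself is supplied by hand, and the class MarkedAtomGrouping, the method labelGroups, and the inputs marked and bonds are hypothetical names chosen for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class MarkedAtomGrouping {

    /**
     * Assigns one group number to every connected set of marked atoms.
     * marked[i] is true if atom i was marked (assumed to be decided elsewhere);
     * bonds holds {begin, end} atom index pairs.
     * Returns a group number (starting at 0) per marked atom and -1 for unmarked atoms.
     */
    public static int[] labelGroups(boolean[] marked, int[][] bonds) {
        int n = marked.length;
        // Build an adjacency list from the bond list
        List<List<Integer>> neighbours = new ArrayList<>();
        for (int i = 0; i < n; i++) neighbours.add(new ArrayList<>());
        for (int[] bond : bonds) {
            neighbours.get(bond[0]).add(bond[1]);
            neighbours.get(bond[1]).add(bond[0]);
        }
        int[] groups = new int[n];
        Arrays.fill(groups, -1);
        int nextGroup = 0;
        for (int start = 0; start < n; start++) {
            if (!marked[start] || groups[start] != -1) continue;
            // Breadth-first search that only walks over marked atoms
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            groups[start] = nextGroup;
            while (!queue.isEmpty()) {
                int atom = queue.poll();
                for (int nbr : neighbours.get(atom)) {
                    if (marked[nbr] && groups[nbr] == -1) {
                        groups[nbr] = nextGroup;
                        queue.add(nbr);
                    }
                }
            }
            nextGroup++;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Heavy-atom graph of 3-hydroxypropanoic acid, OCCC(=O)O
        // atoms: 0:O 1:C 2:C 3:C 4:O 5:O
        // Hand-set marking: the three oxygens and the carbonyl carbon
        boolean[] marked = {true, false, false, true, true, true};
        int[][] bonds = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {3, 5}};
        System.out.println(Arrays.toString(labelGroups(marked, bonds)));
        // prints [0, -1, -1, 1, 1, 1]
    }
}
```

    In this toy input the hydroxyl oxygen forms one group and the carboxylic acid atoms form a second one, because the two unmarked carbon atoms separate them.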
    To apply functional group detection to an input molecule, its atom types need to be set and aromaticity needs to be detected beforehand:
    
     //Prepare input
     SmilesParser smiPar = new SmilesParser(SilentChemObjectBuilder.getInstance());
     IAtomContainer inputMol = smiPar.parseSmiles("C[C@@H]1CN(C[C@H](C)N1)" +
             "C2=C(C(=C3C(=C2F)N(C=C(C3=O)C(=O)O)C4CC4)N)F"); //PubChem CID 5257
     AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(inputMol);
     Aromaticity aromaticity = new Aromaticity(Aromaticity.Model.CDK_1x,
             Cycles.cdkAromaticSet());
     aromaticity.apply(inputMol);
     //Identify functional groups
     FunctionalGroupsFinder fgFinder = FunctionalGroupsFinder.withGeneralEnvironment();
     List<IAtomContainer> functionalGroupsList = fgFinder.extract(inputMol);
     
    If you want to only identify functional groups in standardised, organic structures, FunctionalGroupsFinder can be configured to only accept molecules that do *not* contain any metal, metalloid, or pseudo (R) atoms or formal charges.
    Structures consisting of more than one unconnected component (e.g. ion and counter-ion) are also rejected, but only if the strict input restrictions are turned on (they are turned off by default). The restrictions can be enabled via a boolean parameter in a variant of the central extract(org.openscience.cdk.interfaces.IAtomContainer) method, or the input can be pre-checked using checkConstraints(org.openscience.cdk.interfaces.IAtomContainer). Please note that structural properties like formal charges and the others mentioned above are not expected to cause exceptions when processed by this class, but they are not explicitly considered by the Ertl algorithm and hence not by this implementation either. They might therefore cause unexpected behavior in functional group identification. For example, a formal charge is not listed as a reason to mark a carbon atom, and pseudo atoms are simply ignored.
    To identify molecules that do not fulfill these constraints and should be filtered or preprocessed/standardised, you can use CDK utilities like the ConnectivityChecker class, utility methods in the Elements class, and query IAtom instances for their formal charge. Pseudo atoms can be detected in multiple ways, e.g. by checking for atomic numbers equal to 0 or checking instanceof IPseudoAtom.
    Author:
    Sebastian Fritsch, John Mayfield, Jonas Schaub
    • Method Detail

      • withGeneralEnvironment

        public static FunctionalGroupsFinder withGeneralEnvironment()
        Constructs a new FunctionalGroupsFinder instance with generalization of returned functional groups turned ON.
        Returns:
        new FunctionalGroupsFinder instance that generalizes returned functional groups
      • withFullEnvironment

        public static FunctionalGroupsFinder withFullEnvironment()
        Constructs a new FunctionalGroupsFinder instance with generalization of returned functional groups turned OFF. The functional groups will have their full environments.
        Returns:
        new FunctionalGroupsFinder instance that does NOT generalize returned functional groups
      • withNoEnvironment

        public static FunctionalGroupsFinder withNoEnvironment()
        Constructs a new FunctionalGroupsFinder instance that extracts only the marked atoms of the functional groups, no attached environmental atoms.
        Returns:
        new FunctionalGroupsFinder instance that extracts only marked atoms
      • extract

        public List<IAtomContainer> extract(IAtomContainer mol)
        Find all functional groups in a molecule. The strict input restrictions (no charged atoms, pseudo atoms, metals, metalloids or unconnected components) do not apply by default. They can be turned on via the extract(IAtomContainer, boolean) variant of this method. The returned (marked) functional group atoms will be copies of the input molecule atoms and their environmental carbon atoms will be new atom instances.
        Parameters:
        mol - the molecule to identify functional groups in
        Returns:
        a list with all functional groups found in the molecule
        See Also:
        extract(IAtomContainer, boolean)
      • find

        public int find(int[] funGroups,
                        IAtomContainer mol)
        Find all functional groups in a molecule and extract them as group indices placed in the provided atom index array. This allows you to, for example, generate SMILES strings with functional group annotations or depictions with functional group highlights, e.g.:
        
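         //fgf is a FunctionalGroupsFinder instance, mol a prepared input molecule
         int[] groups = new int[mol.getAtomCount()];
         fgf.find(groups, mol);
         //shift by one so that atoms outside any group (-1) get map index 0
         for (IAtom atom : mol.atoms())
           atom.setMapIdx(groups[atom.getIndex()]+1);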
         String smi = new SmilesGenerator(SmiFlavor.AtomAtomMap).create(mol);
         //example output (for PubChem CID 118705975): 
         // CC1=C(C(=CC=C1)[NH:1]C2=CC=CC=C2[C:2](=[O:2])[NH:2]C(CC[S:3](=[O:3])C)[C:4](=[O:4])[NH:4]C(C)C3=CC=C(C=C3)[F:5])C
         
        (Check out the "Color Map" option on the CDK depict web app).
        NOTE: this method extracts only the atoms of each functional group that are marked according to the Ertl algorithm; environmental carbon atoms are disregarded here, independent of the environment setting.
        Parameters:
        funGroups - int array that is at least as large as the number of atoms in the given molecule; elements at the individual atom indices will be set to a functional group number (starting at 0) or -1 if the respective atom is not part of a functional group
        mol - the molecule to identify functional groups in
        Returns:
        the number of functional groups found
        Throws:
        IllegalArgumentException - if the given int array is smaller than the number of atoms in the given molecule
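        The array filled by find can be regrouped into per-group atom index lists with plain Java. The following is a minimal sketch, assuming the array follows the convention described above (group numbers starting at 0, -1 for atoms outside any group). The class GroupIndexDemo and the method atomsPerGroup are hypothetical names and not part of CDK; the input array is a hand-written stand-in for a real result.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupIndexDemo {

    /**
     * Collects, for each functional group number, the indices of the atoms assigned to it.
     * Only the first atomCount entries are read, since the array may be larger than the molecule.
     */
    public static List<List<Integer>> atomsPerGroup(int[] funGroups, int atomCount, int groupCount) {
        List<List<Integer>> result = new ArrayList<>();
        for (int g = 0; g < groupCount; g++) result.add(new ArrayList<>());
        for (int atomIdx = 0; atomIdx < atomCount; atomIdx++) {
            int group = funGroups[atomIdx];
            // -1 means the atom is not part of any functional group
            if (group >= 0 && group < groupCount) result.get(group).add(atomIdx);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for an array filled by find(...) on a six-atom molecule with two groups
        int[] funGroups = {0, -1, -1, 1, 1, 1};
        System.out.println(atomsPerGroup(funGroups, 6, 2)); // prints [[0], [3, 4, 5]]
    }
}
```

        In real use, atomCount would come from mol.getAtomCount() and groupCount from the return value of find.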
      • extract

        public List<IAtomContainer> extract(IAtomContainer mol,
                                            boolean strict)
        Find all functional groups in a molecule.
        Parameters:
        mol - the molecule to identify functional groups in
        strict - if true, the input must consist of one connected structure and must not contain charged atoms, pseudo atoms, metals or metalloids; if the input molecule violates one of these constraints, an empty list is returned
        Returns:
        a list with all functional groups found in the molecule
        See Also:
        checkConstraints(IAtomContainer), extract(IAtomContainer)
      • checkConstraints

        public static boolean checkConstraints(IAtomContainer mol)
        Checks input molecule for formal charges, metal or metalloid atoms, pseudo (R) atoms, and multiple unconnected structures. The molecule may be empty (returns true) but not null.
        Parameters:
        mol - the molecule to check
        Returns:
        false if the molecule contains charged atoms, metal or metalloid atoms, pseudo (R) atoms, or multiple unconnected structures; true if none of these conditions applies