Class Bayesian

java.lang.Object
org.openscience.cdk.fingerprint.model.Bayesian

public class Bayesian extends Object
  • Bayesian models using fingerprints: provides model creation, analysis, prediction and serialisation.
  • Uses a variation of the classic Bayesian model with a Laplacian correction, which sums the logarithms of ratios rather than multiplying the ratios together. This makes it practical to work with large numbers of fingerprints without running into extreme numerical precision issues, but it also means that the resulting predictor is an arbitrary value rather than a probability, so an additional calibration step is needed before interpretation.
  • For more information about the method, see: J. Chem. Inf. Model., v.46, pp.1124-1133 (2006); J. Biomol. Screen., v.10, pp.682-686 (2005); Molec. Divers., v.10, pp.283-299 (2006).
  • Currently only the CircularFingerprinter fingerprints are supported (i.e. ECFP_n and FCFP_n).
  • Model building is done by selecting the fingerprinting method and folding size, then providing a series of molecules and responses. Individual model contributions are kept around in order to produce the analysis data (e.g. the ROC curve), but are discarded during serialise/deserialise cycles.
  • Fingerprint "folding" is optional but recommended, because it places an upper limit on the model size. If folding is not used (folding=0) then the full 32-bit hash range is used, which means that in the worst case the number of Bayesian contributions that need to be stored is about 4 billion. In practice the improvement in predictivity tends to plateau at around 1024 bits, so values of 2048 or 4096 are generally safe. Folding values must be integer powers of 2.
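The log-sum scheme described above can be sketched in plain Java. This is a minimal illustration and not the source of this class: it assumes the commonly published Laplacian-corrected form, in which a fingerprint bit found in T training molecules, A of them active, contributes log((A + 1) / (T * P + 1)), where P is the overall fraction of actives. The class name, method names and the int[] bit lists are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a Laplacian-corrected Bayesian model over fingerprint bits.
class LaplacianBayesSketch {

    // Builds one log-ratio contribution per bit. Each int[] is assumed to hold
    // the unique fingerprint bits of one training molecule.
    static Map<Integer, Double> buildContribs(List<int[]> training, List<Boolean> activity) {
        int numActives = 0;
        Map<Integer, int[]> counts = new HashMap<>(); // bit -> {total, actives}
        for (int i = 0; i < training.size(); i++) {
            boolean active = activity.get(i);
            if (active) numActives++;
            for (int bit : training.get(i)) {
                int[] c = counts.computeIfAbsent(bit, k -> new int[2]);
                c[0]++;
                if (active) c[1]++;
            }
        }
        double baseRate = (double) numActives / training.size();
        Map<Integer, Double> contribs = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
            int total = e.getValue()[0];
            int actives = e.getValue()[1];
            // Assumed Laplacian correction: log((A + 1) / (T * P + 1)).
            contribs.put(e.getKey(), Math.log((actives + 1.0) / (total * baseRate + 1.0)));
        }
        return contribs;
    }

    // Sums the contributions of the bits present; unseen bits contribute nothing.
    // The result is an unscaled score, not a probability.
    static double predict(Map<Integer, Double> contribs, int[] bits) {
        double sum = 0;
        for (int bit : bits) sum += contribs.getOrDefault(bit, 0.0);
        return sum;
    }
}
```

Because logarithms are added rather than ratios multiplied, a molecule with many bits does not push the score towards underflow or overflow, which is the numerical benefit the description refers to.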
Author:
am.clark
Source code:
main
Belongs to CDK module:
standard
Keywords:
fingerprint, bayesian, model
Created on:
2015-01-05
  • Field Summary

    Fields
    Modifier and Type                     Field
    protected final ArrayList<Boolean>    activity
    protected final Map<Integer,Double>   contribs
    protected double[]                    estimates
    protected double                      highThresh
    protected final Map<Integer,int[]>    inHash
    protected double                      invRange
    protected double                      lowThresh
    protected double                      range
    protected double                      rocAUC
    protected String                      rocType
    protected float[]                     rocX
    protected float[]                     rocY
    protected final ArrayList<int[]>      training
    protected int                         trainingActives
    protected int                         trainingSize
  • Constructor Summary

    Constructors
    Constructor
    Description
    Bayesian(int classType)
    Instantiate a Bayesian model with no data.
    Bayesian(int classType, int folding)
    Instantiate a Bayesian model with no data.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    addMolecule(IAtomContainer mol, boolean active)
    Appends a new row to the model source data, which consists of a molecule and whether or not it is considered active.
    void
    build()
    Performs the Bayesian model generation, using the {molecule:activity} pairs that have been submitted up to this point.
    void
    clearTraining()
    Clears out the training set, to free up memory.
    static Bayesian
    deserialise(BufferedReader rdr)
    Reads the incoming stream and attempts to convert it into an instantiated model.
    static Bayesian
    deserialise(String str)
    Converts a given string into a Bayesian model instance, or throws an exception if it is not valid.
    int
    getClassType()
    Access to the fingerprint type.
    int
    getFolding()
    Access to the fingerprint folding extent.
    String[]
    getNoteComments()
    Returns the optional comments, which is a list of arbitrary text strings.
    String
    getNoteOrigin()
    Returns the optional description of the source for the model.
    String
    getNoteTitle()
    Returns the optional title used to describe the model.
    double
    getROCAUC()
    Returns the integral of the area-under-the-curve of the receiver-operator-characteristic.
    String
    getROCType()
    Returns a string description of the method used to create the ROC curve (e.g. "leave-one-out" or "five-fold").
    float[]
    getRocX()
    Returns X-values that can be used to plot the ROC-curve.
    float[]
    getRocY()
    Returns Y-values that can be used to plot the ROC-curve.
    int
    getTrainingActives()
    Returns the number of actives in the training set that was used to create the model.
    int
    getTrainingSize()
    Returns the size of the training set, i.e. the total number of molecules used to create the model.
    double
    predict(IAtomContainer mol)
    For a given molecule, determines its fingerprints and uses them to calculate a Bayesian prediction.
    double
    scalePredictor(double pred)
    Converts a raw Bayesian prediction into a probability-like range, i.e. most values within the domain are between 0..1.
    String
    serialise()
    Converts the current model into a serialised string representation.
    void
    setNoteComments(String[] comments)
    Sets the comments for the model, which is a list of strings containing arbitrary content.
    void
    setNoteOrigin(String origin)
    Provides an arbitrary string that briefly describes the model origin, which may include authors, data source keywords, or other pertinent information.
    void
    setNoteTitle(String title)
    Provides an arbitrary title string that briefly summarises the model.
    void
    setPerceiveStereo(boolean val)
    Sets whether stereochemistry should be re-perceived from 2D/3D coordinates.
    void
    validateFiveFold()
    Produces a ROC validation set by partitioning the inputs into 5 groups, and performing five separate 80% in/20% out model simulations.
    void
    validateLeaveOneOut()
    Produces a ROC validation set, using the inputs provided prior to the model building, using leave-one-out.
    void
    validateThreeFold()
    Produces a ROC validation set by partitioning the inputs into 3 groups, and performing three separate 66% in/33% out model simulations.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • inHash

      protected final Map<Integer,int[]> inHash
    • training

      protected final ArrayList<int[]> training
    • activity

      protected final ArrayList<Boolean> activity
    • contribs

      protected final Map<Integer,Double> contribs
    • lowThresh

      protected double lowThresh
    • highThresh

      protected double highThresh
    • range

      protected double range
    • invRange

      protected double invRange
    • estimates

      protected double[] estimates
    • rocX

      protected float[] rocX
    • rocY

      protected float[] rocY
    • rocType

      protected String rocType
    • rocAUC

      protected double rocAUC
    • trainingSize

      protected int trainingSize
    • trainingActives

      protected int trainingActives
  • Constructor Details

    • Bayesian

      public Bayesian(int classType)
      Instantiate a Bayesian model with no data.
      Parameters:
      classType - one of the CircularFingerprinter.CLASS_* constants
    • Bayesian

      public Bayesian(int classType, int folding)
      Instantiate a Bayesian model with no data.
      Parameters:
      classType - one of the CircularFingerprinter.CLASS_* constants
      folding - the maximum number of fingerprint bits, which must be a power of 2 (e.g. 1024, 2048) or 0 for no folding
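The power-of-2 requirement on folding can be illustrated with a small sketch. It assumes that folding reduces a 32-bit hash by masking off its low-order bits, which is one common approach; this page does not state how the class folds internally, and the class and method names below are hypothetical.

```java
// Hypothetical sketch: validating a folding size and folding a 32-bit hash.
class FoldingSketch {

    // Accepts 0 (no folding) or any positive power of 2.
    static boolean isValidFolding(int folding) {
        return folding == 0 || (folding > 0 && (folding & (folding - 1)) == 0);
    }

    // Maps a hash into the range 0..folding-1 by masking; this only gives a
    // uniform mapping because folding is a power of 2 (assumed scheme).
    static int fold(int hash, int folding) {
        if (!isValidFolding(folding)) {
            throw new IllegalArgumentException("folding must be 0 or a power of 2: " + folding);
        }
        return folding == 0 ? hash : hash & (folding - 1);
    }
}
```

With a folding size of 1024 at most 1024 distinct bits can occur, which is what bounds the number of stored contributions.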
  • Method Details

    • setPerceiveStereo

      public void setPerceiveStereo(boolean val)
      Sets whether stereochemistry should be re-perceived from 2D/3D coordinates. By default, stereochemistry encoded as IStereoElements is used.
      Parameters:
      val - true to re-perceive stereochemistry from 2D/3D coordinates
    • getClassType

      public int getClassType()
      Access to the fingerprint type.
      Returns:
      fingerprint class, one of CircularFingerprinter.CLASS_*
    • getFolding

      public int getFolding()
      Access to the fingerprint folding extent.
      Returns:
      folding extent, either 0 (for none) or a power of 2
    • addMolecule

      public void addMolecule(IAtomContainer mol, boolean active) throws CDKException
      Appends a new row to the model source data, which consists of a molecule and whether or not it is considered active.
      Parameters:
      mol - molecular structure, which must be non-blank
      active - whether active or not
      Throws:
      CDKException
    • build

      public void build() throws CDKException
      Performs the Bayesian model generation, using the {molecule:activity} pairs that have been submitted up to this point. Once this method has finished, the object can be used to generate predictions, validation data or to serialise for later use.
      Throws:
      CDKException
    • predict

      public double predict(IAtomContainer mol) throws CDKException
      For a given molecule, determines its fingerprints and uses them to calculate a Bayesian prediction. Note that this value is unscaled, and so it only has relative meaning within the confines of the model, i.e. higher is more likely to be active.
      Parameters:
      mol - molecular structure which cannot be blank or null
      Returns:
      predictor value
      Throws:
      CDKException
    • scalePredictor

      public double scalePredictor(double pred)
      Converts a raw Bayesian prediction into a probability-like range, i.e. most values within the domain are between 0..1, and assigning a cutoff of active = scaled_prediction > 0.5 is reasonable. The transform (scale/translation) is determined by the ROC-analysis, if any. The resulting value can be used as a probability by capping the values so that 0 ≤ p ≤ 1.
      Parameters:
      pred - raw prediction, as provided by the predict(..) method
      Returns:
      scaled prediction
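The scale/translation and the capping step can be sketched as follows. This is an assumption-laden illustration: it supposes a linear transform between a low and a high threshold, loosely mirroring the lowThresh and highThresh field names listed above, and the 0.5 fallback for an empty range is an arbitrary choice of this sketch. The actual transform used by scalePredictor is not specified on this page.

```java
// Hypothetical sketch of a linear calibration followed by capping to 0..1.
class ScalingSketch {

    // Maps lowThresh to 0 and highThresh to 1; values outside that interval
    // fall outside 0..1, matching the "probability-like" description.
    static double scale(double pred, double lowThresh, double highThresh) {
        double range = highThresh - lowThresh;
        if (range == 0) return 0.5; // arbitrary fallback when no calibration exists
        return (pred - lowThresh) / range;
    }

    // Caps a scaled value so that it can be treated as a probability.
    static double toProbability(double scaled) {
        return Math.max(0.0, Math.min(1.0, scaled));
    }
}
```

A caller would apply the cutoff to the scaled value, for example treating a molecule as active when toProbability(scale(...)) is greater than 0.5.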
    • validateLeaveOneOut

      public void validateLeaveOneOut()
      Produces a ROC validation set from the inputs provided prior to the model building, using leave-one-out. Note that this should only be used for small datasets: it is very thorough, but scales as O(N^2) relative to training set size.
    • validateFiveFold

      public void validateFiveFold()
      Produces a ROC validation set by partitioning the inputs into 5 groups, and performing five separate 80% in/20% out model simulations. This is quite efficient, and takes approximately 5 times as long as building the original model: it should be used for larger datasets.
    • validateThreeFold

      public void validateThreeFold()
      Produces a ROC validation set by partitioning the inputs into 3 groups, and performing three separate 66% in/33% out model simulations. This is quite efficient, and takes approximately 3 times as long as building the original model: it should be used for larger datasets.
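The partitioning behind the k-fold validation methods can be sketched in plain Java. How this class actually assigns molecules to groups is not stated here, so the round-robin assignment below is an assumption, and the class and method names are hypothetical. Setting k equal to the number of molecules gives the leave-one-out case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splitting n training rows into k validation groups.
class FoldPartitionSketch {

    // Assigns row indices 0..n-1 to k groups in round-robin order.
    static List<List<Integer>> partition(int n, int k) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < k; g++) groups.add(new ArrayList<>());
        for (int i = 0; i < n; i++) groups.get(i % k).add(i);
        return groups;
    }

    // Returns the rows used to build the model when one group is held out;
    // the held-out group is then predicted to produce ROC data points.
    static List<Integer> trainingFor(List<List<Integer>> groups, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int g = 0; g < groups.size(); g++) {
            if (g != heldOut) train.addAll(groups.get(g));
        }
        return train;
    }
}
```

With k = 5 each simulation builds on about 80% of the rows, and with k = 3 on about 66%, which is why the cost is roughly k times that of a single model build.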
    • clearTraining

      public void clearTraining()
      Clears out the training set, to free up memory.
    • getTrainingSize

      public int getTrainingSize()
      Returns the size of the training set, i.e. the total number of molecules used to create the model.
      Returns:
      training set size
    • getTrainingActives

      public int getTrainingActives()
      Returns the number of actives in the training set that was used to create the model.
      Returns:
      actives in training set
    • getROCAUC

      public double getROCAUC()
      Returns the integral of the area-under-the-curve of the receiver-operator-characteristic. A value of 1 means perfect recall; a value of 0.5 is no better than random.
      Returns:
      ROC area under the curve, between 0 and 1
    • getROCType

      public String getROCType()
      Returns a string description of the method used to create the ROC curve (e.g. "leave-one-out" or "five-fold").
      Returns:
      validation method
    • getRocX

      public float[] getRocX()
      Returns X-values that can be used to plot the ROC-curve.
    • getRocY

      public float[] getRocY()
      Returns Y-values that can be used to plot the ROC-curve.
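As an illustration of how the plotted points relate to the area under the curve, the sketch below integrates a ROC curve with the trapezoidal rule. It assumes the X and Y arrays have equal length, are ordered by ascending X, and include the end points; whether the arrays returned by getRocX() and getRocY() satisfy this is not stated here. The class and method names are hypothetical, and this is not necessarily how getROCAUC() is computed.

```java
// Hypothetical sketch: trapezoidal area under a ROC curve given as point arrays.
class RocAucSketch {

    static double auc(float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double area = 0;
        for (int i = 1; i < x.length; i++) {
            // Width of the step times the mean height of its two end points.
            area += (x[i] - x[i - 1]) * 0.5 * (y[i] + y[i - 1]);
        }
        return area;
    }
}
```

A diagonal curve from (0,0) to (1,1) gives 0.5, the random baseline, while a curve that rises straight to Y = 1 before moving along X gives 1.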
    • getNoteTitle

      public String getNoteTitle()
      Returns the optional title used to describe the model.
      Returns:
      title (may be null)
    • setNoteTitle

      public void setNoteTitle(String title)
      Provides an arbitrary title string that briefly summarises the model.
      Parameters:
      title - short text description (no newlines or tabs); use null if none
    • getNoteOrigin

      public String getNoteOrigin()
      Returns the optional description of the source for the model.
      Returns:
      origin (may be null)
    • setNoteOrigin

      public void setNoteOrigin(String origin)
      Provides an arbitrary string that briefly describes the model origin, which may include authors, data source keywords, or other pertinent information.
      Parameters:
      origin - short text description (no newlines or tabs); use null if none
    • getNoteComments

      public String[] getNoteComments()
      Returns the optional comments, which is a list of arbitrary text strings.
      Returns:
      comment list (may be null)
    • setNoteComments

      public void setNoteComments(String[] comments)
      Sets the comments for the model, which is a list of strings containing arbitrary content. This may embellish upon the title or origin, or provide other human-readable information, such as references and links.
      Parameters:
      comments - list of strings; use null or empty array if none
    • serialise

      public String serialise()
      Converts the current model into a serialised string representation. The serialised form omits the original data that was used to build the model, but otherwise contains all of the information necessary to recreate the model and use it to make predictions against new molecules. The format used is a concise text-based format that is easy to recognise by its prefix, and is reasonably efficient with regard to storage space.
      Returns:
      serialised model
    • deserialise

      public static Bayesian deserialise(String str) throws IOException
      Converts a given string into a Bayesian model instance, or throws an exception if it is not valid.
      Parameters:
      str - string containing the serialised model
      Returns:
      instantiated model that can be used for predictions
      Throws:
      IOException
    • deserialise

      public static Bayesian deserialise(BufferedReader rdr) throws IOException
      Reads the incoming stream and attempts to convert it into an instantiated model. The input must be compatible with the format used by the serialise() method, otherwise an exception will be thrown.
      Parameters:
      rdr - reader
      Returns:
      instantiated model that can be used for predictions
      Throws:
      IOException