Class Bayesian
java.lang.Object
org.openscience.cdk.fingerprint.model.Bayesian
 Bayesian models using fingerprints: provides model creation, analysis, prediction and serialisation.
 Uses a variation of the classic Bayesian model with a Laplacian correction, which sums the log values of ratios rather than multiplying the ratios together. This is an effective way to work with large numbers of fingerprints without running into extreme numerical precision issues, but it also means that the resulting predictor is an arbitrary value rather than a probability, so an additional calibration step is needed before interpretation.
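 The log-sum idea can be sketched in plain Java without any CDK types. This is a minimal illustration, not the CDK implementation: the class name LaplacianBayesSketch, its method names, and the specific correction formula log((A + 1) / (T * P + 1)) are assumptions, where A is the number of actives containing a bit, T is the number of molecules containing it, and P is the overall fraction of actives.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a Laplacian-corrected Bayesian score; not CDK code. */
class LaplacianBayesSketch {

    /**
     * Builds one log contribution per fingerprint bit.
     * fingerprints.get(i) holds the distinct bit hashes of molecule i,
     * and active[i] is the response for that molecule.
     */
    static Map<Integer, Double> contributions(List<int[]> fingerprints, boolean[] active) {
        int total = fingerprints.size();
        int numActive = 0;
        for (boolean a : active) {
            if (a) numActive++;
        }
        double baseRate = (double) numActive / total;

        // counts[0] = molecules containing the bit, counts[1] = actives containing the bit
        Map<Integer, int[]> counts = new HashMap<>();
        for (int i = 0; i < total; i++) {
            for (int bit : fingerprints.get(i)) {
                int[] c = counts.computeIfAbsent(bit, k -> new int[2]);
                c[0]++;
                if (active[i]) c[1]++;
            }
        }

        Map<Integer, Double> contribs = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : counts.entrySet()) {
            int t = e.getValue()[0];
            int a = e.getValue()[1];
            // Assumed form of the Laplacian-corrected ratio: (A + 1) / (T * P + 1)
            contribs.put(e.getKey(), Math.log((a + 1.0) / (t * baseRate + 1.0)));
        }
        return contribs;
    }

    /** Sums the log contributions of the given bits; bits never seen in training add nothing. */
    static double predict(Map<Integer, Double> contribs, int[] bits) {
        double sum = 0;
        for (int bit : bits) {
            sum += contribs.getOrDefault(bit, 0.0);
        }
        return sum;
    }
}
```

 Because the score is a sum of logs, it is unbounded and only meaningful relative to other scores from the same model, which is why a calibration step is needed before it can be read as a probability.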
 For more information about the method, see:
 J. Chem. Inf. Model., v.46, pp.1124-1133 (2006)
 J. Biomol. Screen., v.10, pp.682-686 (2005)
 Molec. Divers., v.10, pp.283-299 (2006)
 Currently only the CircularFingerprinter fingerprints are supported (i.e. ECFP_n and FCFP_n).
 Model building is done by selecting the fingerprinting method and folding size, then providing a series of molecules and responses. Individual model contributions are kept in order to produce the analysis data (e.g. the ROC curve), but are discarded during serialise/deserialise cycles.
 Fingerprint "folding" is optional but recommended, because it places an upper limit on the model size. If folding is not used (folding=0), the entire 32 bits are used, which means that in the worst case the number of Bayesian contributions that need to be stored is about 4 billion (2^32). In practice the improvement in predictivity tends to plateau at around 1024 bits, so values of 2048 or 4096 are generally safe. Folding values must be integer powers of 2.
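 The folding constraint can be illustrated with a small sketch. It assumes that folding is done by masking the 32-bit hash with (folding - 1), which is the usual reason for requiring a power of 2; this page does not confirm that the class works this way, and the names FoldingSketch, isValidFolding and fold are hypothetical.

```java
/** Hypothetical sketch of fingerprint folding by bit mask; not CDK code. */
class FoldingSketch {

    /** True when folding is 0 (no folding) or a positive power of 2. */
    static boolean isValidFolding(int folding) {
        return folding == 0 || (folding > 0 && (folding & (folding - 1)) == 0);
    }

    /**
     * Maps a 32-bit hash into the range [0, folding) by masking.
     * With folding == 0 the hash is returned unchanged.
     */
    static int fold(int hash, int folding) {
        if (!isValidFolding(folding)) {
            throw new IllegalArgumentException("folding must be 0 or a power of 2: " + folding);
        }
        return folding == 0 ? hash : hash & (folding - 1);
    }
}
```

 With folding set to 1024, for example, every hash lands in one of 1024 slots, so the model can never hold more than 1024 contributions regardless of how many distinct hashes the training set produces.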
 Author:
 am.clark
 Source code:
 main
 Belongs to CDK module:
 standard
 Keywords:
 fingerprint, bayesian, model
 Created on:
 2015-01-05

Field Summary

Constructor Summary

Method Summary
Modifier and Type, Method, Description
void addMolecule(IAtomContainer mol, boolean active)
 Appends a new row to the model source data, which consists of a molecule and whether or not it is considered active.
void build()
 Performs the Bayesian model generation, using the {molecule:activity} pairs that have been submitted up to this point.
void clearTraining()
 Clears out the training set, to free up memory.
static Bayesian deserialise(rdr)
 Reads the incoming stream and attempts to convert it into an instantiated model.
static Bayesian deserialise(String str)
 Converts a given string into a Bayesian model instance, or throws an exception if it is not valid.
int getClassType()
 Access to the fingerprint type.
int getFolding()
 Access to the fingerprint folding extent.
String[] getNoteComments()
 Returns the optional comments, which is a list of arbitrary text strings.
String getNoteOrigin()
 Returns the optional description of the source for the model.
String getNoteTitle()
 Returns the optional title used to describe the model.
double getROCAUC()
 Returns the area under the curve of the receiver operating characteristic.
String getROCType()
 Returns a string description of the method used to create the ROC curve.
float[] getRocX()
 Returns X-values that can be used to plot the ROC curve.
float[] getRocY()
 Returns Y-values that can be used to plot the ROC curve.
int getTrainingActives()
 Returns the number of actives in the training set that was used to create the model.
int getTrainingSize()
 Returns the size of the training set.
double predict(IAtomContainer mol)
 For a given molecule, determines its fingerprints and uses them to calculate a Bayesian prediction.
double scalePredictor(double pred)
 Transforms a raw Bayesian prediction into a probability-like range.
String serialise()
 Converts the current model into a serialised string representation.
void setNoteComments(String[] comments)
 Sets the comments for the model, which is a list of strings containing arbitrary content.
void setNoteOrigin(String origin)
 Provides an arbitrary string that briefly describes the model origin, which may include authors, data source keywords, or other pertinent information.
void setNoteTitle(String title)
 Provides an arbitrary title string that briefly summarises the model.
void setPerceiveStereo(boolean val)
 Sets whether stereochemistry should be re-perceived from 2D/3D coordinates.
void validateFiveFold()
 Produces a ROC validation set by partitioning the inputs into 5 groups, and performing five separate 80% in/20% out model simulations.
void validateLeaveOneOut()
 Produces a ROC validation set, using the inputs provided prior to the model building, using leave-one-out.
void validateThreeFold()
 Produces a ROC validation set by partitioning the inputs into 3 groups, and performing three separate 66% in/33% out model simulations.

Field Details

inHash

training

activity

contribs

lowThresh
protected double lowThresh 
highThresh
protected double highThresh 
range
protected double range 
invRange
protected double invRange 
estimates
protected double[] estimates 
rocX
protected float[] rocX 
rocY
protected float[] rocY 
rocType

rocAUC
protected double rocAUC 
trainingSize
protected int trainingSize 
trainingActives
protected int trainingActives


Constructor Details

Bayesian
public Bayesian(int classType) Instantiate a Bayesian model with no data. Parameters:
classType
 one of the CircularFingerprinter.CLASS_* constants

Bayesian
public Bayesian(int classType, int folding) Instantiate a Bayesian model with no data. Parameters:
classType
 one of the CircularFingerprinter.CLASS_* constants
folding
 the maximum number of fingerprint bits, which must be a power of 2 (e.g. 1024, 2048) or 0 for no folding


Method Details

setPerceiveStereo
public void setPerceiveStereo(boolean val) Sets whether stereochemistry should be re-perceived from 2D/3D coordinates. By default, stereochemistry encoded as IStereoElements is used. Parameters:
val
 true to re-perceive stereochemistry from 2D/3D coordinates

getClassType
public int getClassType() Access to the fingerprint type. Returns:
 fingerprint class, one of CircularFingerprinter.CLASS_*

getFolding
public int getFolding() Access to the fingerprint folding extent. Returns:
 folding extent, either 0 (for none) or a power of 2

addMolecule
Appends a new row to the model source data, which consists of a molecule and whether or not it is considered active. Parameters:
mol
 molecular structure, which must be non-blank
active
 whether active or not Throws:
CDKException

build
Performs the Bayesian model generation, using the {molecule:activity} pairs that have been submitted up to this point. Once this method has finished, the object can be used to generate predictions, validation data or to serialise for later use. Throws:
CDKException

predict
For a given molecule, determines its fingerprints and uses them to calculate a Bayesian prediction. Note that this value is unscaled, and so it only has relative meaning within the confines of the model, i.e. higher is more likely to be active. Parameters:
mol
 molecular structure which cannot be blank or null Returns:
 predictor value
 Throws:
CDKException

scalePredictor
public double scalePredictor(double pred) Transforms a raw Bayesian prediction into a probability-like range, i.e. most values within the domain are between 0..1, and assigning a cutoff of active = scaled_prediction > 0.5 is reasonable. The transform (scale/translation) is determined by the ROC analysis, if any. The resulting value can be used as a probability by capping the values so that 0 ≤ p ≤ 1. Parameters:
pred
 raw prediction, as provided by the predict(..) method Returns:
 scaled prediction
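
 The scale/translation and the capping step can be sketched as follows. This is an illustration under assumptions rather than the class's actual code: it supposes the transform is linear, sending lowThresh to 0 and highThresh to 1 (names borrowed from the protected fields listed above), and the handling of a zero-width range is a choice made for this sketch. ScaleSketch and its methods are hypothetical.

```java
/** Hypothetical sketch of predictor calibration; not CDK code. */
class ScaleSketch {

    /**
     * Linear map that sends lowThresh to 0 and highThresh to 1 (assumed form).
     * Values outside the thresholds fall outside 0..1.
     */
    static double scale(double pred, double lowThresh, double highThresh) {
        double range = highThresh - lowThresh;
        if (range == 0) {
            // Sketch's own choice for a degenerate calibration: treat it as a hard cutoff.
            return pred >= highThresh ? 1.0 : 0.0;
        }
        return (pred - lowThresh) / range;
    }

    /** Caps a scaled prediction so that 0 <= p <= 1. */
    static double toProbability(double scaled) {
        return Math.max(0.0, Math.min(1.0, scaled));
    }
}
```

 Under these assumptions a raw prediction midway between the two thresholds scales to 0.5, which is the cutoff suggested above for calling a molecule active.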

validateLeaveOneOut
public void validateLeaveOneOut() Produces a ROC validation set, using the inputs provided prior to the model building, using leave-one-out. Note that this should only be used for small datasets, since it is very thorough, and scales as O(N^2) relative to training set size.

validateFiveFold
public void validateFiveFold() Produces a ROC validation set by partitioning the inputs into 5 groups, and performing five separate 80% in/20% out model simulations. This is quite efficient, and takes approximately 5 times as long as building the original model: it should be used for larger datasets.

validateThreeFold
public void validateThreeFold() Produces a ROC validation set by partitioning the inputs into 3 groups, and performing three separate 66% in/33% out model simulations. This is quite efficient, and takes approximately 3 times as long as building the original model: it should be used for larger datasets.

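
 The partitioning behind the three validation methods can be sketched generically. How the class actually assigns molecules to groups is not described on this page, so the modulo assignment below is an assumption, and FoldSketch with its methods is hypothetical. Leave-one-out corresponds to using as many groups as there are rows.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of k-fold partitioning; not CDK code. */
class FoldSketch {

    /** Assigns row i to group i % k, so group sizes differ by at most one. */
    static List<List<Integer>> partition(int n, int k) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < k; g++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            groups.get(i % k).add(i);
        }
        return groups;
    }

    /** Rows used to train one fold: every group except the held-out one. */
    static List<Integer> trainingRows(List<List<Integer>> groups, int heldOut) {
        List<Integer> rows = new ArrayList<>();
        for (int g = 0; g < groups.size(); g++) {
            if (g != heldOut) {
                rows.addAll(groups.get(g));
            }
        }
        return rows;
    }
}
```

 Each fold builds a model on the training rows and scores the held-out group, so k folds cost roughly k model builds. With leave-one-out k equals N, which is consistent with the O(N^2) scaling noted above.
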
clearTraining
public void clearTraining() Clears out the training set, to free up memory.

getTrainingSize
public int getTrainingSize() Returns the size of the training set, i.e. the total number of molecules used to create the model. Returns:
 training set size

getTrainingActives
public int getTrainingActives() Returns the number of actives in the training set that was used to create the model. Returns:
 actives in training set

getROCAUC
public double getROCAUC() Returns the area under the curve (AUC) of the receiver operating characteristic (ROC). A value of 1 means perfect recall; 0.5 is essentially random. Returns:
 ROC area under the curve, between 0 and 1
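
 For readers who want to see how an area under the curve relates to plotted ROC points, here is a trapezoidal-rule sketch. It assumes the points are supplied as parallel float arrays ordered by non-decreasing X, running from (0,0) to (1,1); this page does not state how the class itself computes the integral, and RocAucSketch is a hypothetical name.

```java
/** Hypothetical sketch of ROC integration by the trapezoidal rule; not CDK code. */
class RocAucSketch {

    /** Area under the polyline through (x[i], y[i]); x must be non-decreasing. */
    static double auc(float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double area = 0;
        for (int i = 1; i < x.length; i++) {
            // width of the strip times the mean height of its two ends
            area += (x[i] - x[i - 1]) * 0.5 * (y[i] + y[i - 1]);
        }
        return area;
    }
}
```

 A diagonal line from (0,0) to (1,1) gives 0.5, the random baseline, while a curve that rises straight to Y = 1 before moving along X gives 1.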

getROCType
Returns a string description of the method used to create the ROC curve (e.g. "leave-one-out" or "five-fold"). Returns:
 validation method

getRocX
public float[] getRocX() Returns X-values that can be used to plot the ROC curve.

getRocY
public float[] getRocY() Returns Y-values that can be used to plot the ROC curve.

getNoteTitle
Returns the optional title used to describe the model. Returns:
 title (may be null)

setNoteTitle
Provides an arbitrary title string that briefly summarises the model. Parameters:
title
 short text description (no newlines or tabs); use null if none

getNoteOrigin
Returns the optional description of the source for the model. Returns:
 origin (may be null)

setNoteOrigin
Provides an arbitrary string that briefly describes the model origin, which may include authors, data source keywords, or other pertinent information. Parameters:
origin
 short text description (no newlines or tabs); use null if none

getNoteComments
Returns the optional comments, which is a list of arbitrary text strings. Returns:
 comment list (may be null)

setNoteComments
Sets the comments for the model, which is a list of strings containing arbitrary content. This may elaborate on the title or origin, or provide other human-readable information, such as references and links. Parameters:
comments
 list of strings; use null or empty array if none

serialise
Converts the current model into a serialised string representation. The serialised form omits the original data that was used to build the model, but otherwise contains all of the information necessary to recreate the model and use it to make predictions against new molecules. The format used is a concise text-based format that is easy to recognise by its prefix, and is reasonably efficient with regard to storage space. Returns:
 serialised model

deserialise
Converts a given string into a Bayesian model instance, or throws an exception if it is not valid. Parameters:
str
 string containing the serialised model Returns:
 instantiated model that can be used for predictions
 Throws:
IOException

deserialise
Reads the incoming stream and attempts to convert it into an instantiated model. The input must be compatible with the format used by the serialise() method, otherwise an exception will be thrown. Parameters:
rdr
 reader Returns:
 instantiated model that can be used for predictions
 Throws:
IOException
