DatasetGenerator
Load/create shapes from the protein pool and compute interactions for IP (docking pose prediction) and FI (fact-of-interaction) datasets. Interactions that score below specified decision thresholds are added to their respective dataset. Dataset generation figures and statistics can be generated and saved to file.
- class Dock2D.DatasetGeneration.DatasetGenerator.DatasetGenerator
- __init__()
Initialize and modify all dataset generation parameters here. Creates both training set and testing set protein pools if they do not already exist in the specified pool_savepath.

    ### Initializations START
    self.plotting = True
    self.plot_freq = 100
    self.show = True
    self.trainset_pool_stats = None
    self.testset_pool_stats = None

    ## Paths
    self.pool_savepath = 'PoolData/'
    self.poolstats_savepath = 'PoolData/stats/'
    self.data_savepath = '../Datasets/'
    self.datastats_savepath = '../Datasets/stats/'
    self.log_savepath = 'Log/losses/'

    ## initialize FFT
    self.FFT = TorchDockingFFT()

    ## number of unique protein shapes to generate in pool
    self.trainpool_num_proteins = 10
    self.testpool_num_proteins = 10

    ## proportion of training set kept for validation
    self.validation_set_cutoff = 0.8

    ## shape feature scoring coefficients
    self.weight_bound, self.weight_crossterm, self.weight_bulk = 10, 20, 200

    ## energy cutoff for deciding if a shape interacts or not
    self.docking_decision_threshold = -90
    self.interaction_decision_threshold = -90

    ## string of scoring coefficients for plot titles and filenames
    self.weight_string = str(self.weight_bound) + ',' + str(self.weight_crossterm) + ',' \
                         + str(self.weight_bulk)

    ## training and testing set pool filenames
    self.trainvalidset_protein_pool = 'trainvalidset_protein_pool' + str(self.trainpool_num_proteins) + '.pkl'
    self.testset_protein_pool = 'testset_protein_pool' + str(self.testpool_num_proteins) + '.pkl'

    ### Generate training/validation set protein pool
    ## dataset parameters (value, relative frequency)
    self.train_alpha = [(0.80, 1), (0.85, 2), (0.90, 1)]  # concavity level [0-1)
    self.train_num_points = [(60, 1), (80, 2), (100, 1)]  # number of points for shape generation
    self.train_params = ParamDistribution(alpha=self.train_alpha, num_points=self.train_num_points)

    ### Generate testing set protein pool
    ## dataset parameters (value, relative frequency)
    self.test_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
    self.test_num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]
    self.test_params = ParamDistribution(alpha=self.test_alpha, num_points=self.test_num_points)
    ### Initializations END
- generate_datasets(protein_pool, num_proteins=None)
Generate docking and interaction dataset based on protein pool.
- Parameters
protein_pool – protein pool .pkl filename
num_proteins – size of the protein pool to use when generating pairwise interactions; None uses the entire protein pool.
- Returns
energies_list (used only in plotting); docking_set, the docking dataset (IP) as a list of lists [receptor, ligand, rot, trans]; and interaction_set, a list of [protein_shapes, indices_list, labels_list]
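The way scored pairs end up in the two datasets can be sketched as follows. This is a minimal illustration, not the library's code: `build_datasets` and the `score_fn` callback are hypothetical names, and the sketch assumes that a pair enters the docking set when its minimum energy is below the docking threshold, that it receives a positive interaction label when its free energy is below the interaction threshold, and that self-pairs are included.

```python
from itertools import combinations_with_replacement


def build_datasets(shapes, score_fn, docking_threshold=-90, interaction_threshold=-90):
    """Hypothetical sketch: split scored shape pairs into IP and FI datasets.

    score_fn(receptor, ligand) is assumed to return
    (rot, trans, minimum_energy, free_energy) for the pair.
    """
    docking_set = []   # [receptor, ligand, rot, trans] for accepted poses
    indices_list = []  # (i, j) index pairs into `shapes`
    labels_list = []   # 1 if the pair is labelled as interacting, else 0

    # Assumption: every unordered pair is scored once, self-pairs included.
    for i, j in combinations_with_replacement(range(len(shapes)), 2):
        rot, trans, minimum_energy, free_energy = score_fn(shapes[i], shapes[j])
        if minimum_energy < docking_threshold:
            docking_set.append([shapes[i], shapes[j], rot, trans])
        indices_list.append((i, j))
        labels_list.append(1 if free_energy < interaction_threshold else 0)

    interaction_set = [shapes, indices_list, labels_list]
    return docking_set, interaction_set
```

The returned structures mirror the shapes described under Returns: a list of [receptor, ligand, rot, trans] entries and a three-element interaction list.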
- generate_interactions(receptor, ligand)
Generate pairwise interactions through FFT scoring of shape bulk and boundary.
- Parameters
receptor – a shape assigned as receptor from the protein pool
ligand – a shape assigned as ligand from the protein pool
- Returns
receptor, ligand, and their fft_score
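To make the idea of scoring bulk and boundary features concrete, here is a small sketch that scans translations only. It is not the TorchDockingFFT implementation, which this page does not show. It uses a brute-force circular cross-correlation in place of the FFT (the FFT yields the same values faster). The function names, the dict layout of the shapes, the sign convention (lower energy is better), and the way the three weights are combined are all assumptions made for illustration.

```python
def circular_correlation(a, b):
    """Brute-force circular cross-correlation of two equal-size 2D grids.

    out[dy][dx] is the overlap of `a` with `b` circularly shifted by (dy, dx).
    An FFT-based correlation produces the same grid in O(N log N).
    """
    n, m = len(a), len(a[0])
    out = [[0.0] * m for _ in range(n)]
    for dy in range(n):
        for dx in range(m):
            out[dy][dx] = float(sum(
                a[y][x] * b[(y - dy) % n][(x - dx) % m]
                for y in range(n) for x in range(m)
            ))
    return out


def translation_energies(rec, lig, weight_bound=10, weight_crossterm=20, weight_bulk=200):
    """Energy for every ligand translation (assumed scoring form).

    rec and lig are dicts with 'bulk' and 'bound' feature grids.
    Assumption: bulk-bulk overlap is penalised, while boundary-boundary
    and boundary-bulk contacts are rewarded.
    """
    bound = circular_correlation(rec["bound"], lig["bound"])
    cross_a = circular_correlation(rec["bulk"], lig["bound"])
    cross_b = circular_correlation(rec["bound"], lig["bulk"])
    bulk = circular_correlation(rec["bulk"], lig["bulk"])
    n, m = len(bulk), len(bulk[0])
    return [[weight_bulk * bulk[y][x]
             - weight_bound * bound[y][x]
             - weight_crossterm * (cross_a[y][x] + cross_b[y][x])
             for x in range(m)] for y in range(n)]


def best_translation(energies):
    """Return (minimum_energy, (dy, dx)) over the translation grid."""
    return min((e, (dy, dx))
               for dy, row in enumerate(energies)
               for dx, e in enumerate(row))
```

A full docking scan would repeat this for each ligand rotation and keep the lowest-energy (rotation, translation) pair.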
- generate_pool(params, pool_savefile, num_proteins)
Generate the protein pool using parameters for concavity and number of points.
- Parameters
params – pool generation parameters as lists of tuples for alpha and number of points, each tuple given as (value, relative freq). For example:

    shape_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
    num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]
pool_savefile – protein pool .pkl filename
num_proteins – number of unique protein shapes to make
- Returns
stats, the observed sampling of alphas and num_points used in protein pool creation
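One way to read the (value, relative freq) tuples is as weights for random sampling. The sketch below draws one (alpha, num_points) pair per protein and records what was actually drawn. It does not reproduce ParamDistribution or generate_pool, whose internals are not shown here; `sample_pool_params`, its `seed` argument, and the layout of the returned stats are hypothetical.

```python
import random
from collections import Counter


def sample_pool_params(alpha, num_points, num_proteins, seed=None):
    """Hypothetical sketch: sample shape parameters by relative frequency.

    alpha and num_points are lists of (value, relative_frequency) tuples.
    Returns the sampled (alpha, num_points) pairs and the observed counts.
    """
    rng = random.Random(seed)
    alpha_values, alpha_weights = zip(*alpha)
    point_values, point_weights = zip(*num_points)

    # random.choices normalises the weights, so relative frequencies
    # such as 1:2:1 can be passed directly.
    alphas = rng.choices(alpha_values, weights=alpha_weights, k=num_proteins)
    points = rng.choices(point_values, weights=point_weights, k=num_proteins)

    params = list(zip(alphas, points))
    stats = {"alpha": Counter(alphas), "num_points": Counter(points)}
    return params, stats
```

With a small pool such as 10 proteins, the observed counts can differ noticeably from the requested relative frequencies, which is one reason to return the observed sampling alongside the pool.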
- plot_accepted_rejected_shapes(receptor, ligand, rot, trans, minimum_energy, free_energy, fft_score, protein_pool_prefix, plot_count)
Plot examples of accepted and rejected shape pairs, based on the docking and interaction decision thresholds.
- Parameters
receptor – receptor shape image
ligand – ligand shape image
rot – rotation to apply to ligand
trans – translation to apply to ligand
minimum_energy – minimum energy from FFT score
free_energy – free energy computed from the FFT score
fft_score – overall FFT score
protein_pool_prefix – filename prefix
plot_count – used as index in plotting
- plot_energy_distributions(energies_list, free_energies, protein_pool_prefix)
Plot histograms of all pairwise energies and free energies, within training and testing set.
- Parameters
energies_list – all pairwise energies (E = fft_scores)
free_energies – all pairwise free energies (logsumexp(-E))
protein_pool_prefix – used in title and filename
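The free energy of a pair can be computed from its energies with a numerically stable log-sum-exp. The parameter description writes it as logsumexp(-E); the sketch below assumes the usual sign convention F = -logsumexp(-E), under which F is never greater than the minimum energy, which fits comparing it against a negative decision threshold. The function name is hypothetical.

```python
import math


def free_energy(energies):
    """F = -log(sum(exp(-E))) over all poses (assumed sign convention).

    Subtracting the maximum of -E before exponentiating avoids overflow
    for strongly negative energies.
    """
    neg = [-e for e in energies]
    shift = max(neg)
    return -(shift + math.log(sum(math.exp(v - shift) for v in neg)))
```

For a single pose the free energy equals that pose's energy; additional low-energy poses lower it further.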
- run_generator()
Generates the training, validation, and testing sets for both docking (IP) and interaction (FI) from the current protein pool. Writes all datasets to .pkl files, saves all metrics to file, and prints IP and FI dataset stats. If self.plotting=True, plots and saves the dataset generation plots. Specify self.show=True to show each plot in a new window (does not affect saving).
Links to plotting methods:
plot_rotation_energysurface()
plot_deltaF_distribution()
plot_shapes_and_params()
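The train/validation split and the .pkl output can be sketched as below. This is an assumption-laden illustration rather than the method's actual code: it reads validation_set_cutoff = 0.8 as "the first 80% of examples go to training and the rest to validation", and the helper name and output filenames are hypothetical.

```python
import pickle
from pathlib import Path


def split_and_save(dataset, savepath, prefix, cutoff=0.8):
    """Hypothetical sketch: split a dataset at `cutoff` and pickle both parts.

    Assumption: examples before the cut index form the training set and
    the remainder forms the validation set.
    """
    cut = int(len(dataset) * cutoff)
    train, valid = dataset[:cut], dataset[cut:]

    out = Path(savepath)
    out.mkdir(parents=True, exist_ok=True)
    for name, subset in (("train", train), ("valid", valid)):
        # Filenames are illustrative only.
        with open(out / f"{prefix}_{name}.pkl", "wb") as fh:
            pickle.dump(subset, fh)
    return train, valid
```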