DatasetGenerator

Load or create shapes from the protein pool and compute interactions for the IP (docking pose prediction) and FI (fact-of-interaction) datasets. Interactions that score below the specified decision thresholds are added to their respective dataset. Dataset generation figures and statistics can also be produced and saved to file.

class Dock2D.DatasetGeneration.DatasetGenerator.DatasetGenerator

__init__()

Initialize and modify all dataset generation parameters here. Creates both training set and testing set protein pools if they do not already exist in the specified pool_savepath.

### Initializations START
self.plotting = True
self.plot_freq = 100
self.show = True
self.trainset_pool_stats = None
self.testset_pool_stats = None

## Paths
self.pool_savepath = 'PoolData/'
self.poolstats_savepath = 'PoolData/stats/'
self.data_savepath = '../Datasets/'
self.datastats_savepath = '../Datasets/stats/'
self.log_savepath = 'Log/losses/'

## initialize FFT
self.FFT = TorchDockingFFT()

## number of unique protein shapes to generate in pool
self.trainpool_num_proteins = 10
self.testpool_num_proteins = 10

## proportion of training set kept for validation
self.validation_set_cutoff = 0.8

## shape feature scoring coefficients
self.weight_bound, self.weight_crossterm, self.weight_bulk = 10, 20, 200

## energy cutoff for deciding if a shape interacts or not
self.docking_decision_threshold = -90
self.interaction_decision_threshold = -90

## string of scoring coefficients for plot titles and filenames
self.weight_string = str(self.weight_bound) + ',' + str(self.weight_crossterm) + ','\
                     + str(self.weight_bulk)

## training and testing set pool filenames
self.trainvalidset_protein_pool = 'trainvalidset_protein_pool' + str(self.trainpool_num_proteins) + '.pkl'
self.testset_protein_pool = 'testset_protein_pool' + str(self.testpool_num_proteins) + '.pkl'

### Generate training/validation set protein pool
## dataset parameters (value, relative frequency)
self.train_alpha = [(0.80, 1), (0.85, 2), (0.90, 1)] # concavity level [0-1)
self.train_num_points = [(60, 1), (80, 2), (100, 1)] # number of points for shape generation
self.train_params = ParamDistribution(alpha=self.train_alpha, num_points=self.train_num_points)

### Generate testing set protein pool
## dataset parameters (value, relative frequency)
self.test_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
self.test_num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]
self.test_params = ParamDistribution(alpha=self.test_alpha, num_points=self.test_num_points)
### Initializations END

generate_datasets(protein_pool, num_proteins=None)

Generate the docking and interaction datasets from a protein pool.

Parameters

protein_pool – protein pool .pkl filename
num_proteins – can specify size of protein pool to use in generating pairwise interactions, None uses the entire protein pool.

Returns

energies_list (used only for plotting); docking_set, the docking dataset (IP) as a list of [receptor, ligand, rot, trans] entries; and interaction_set, the interaction dataset (FI) as the list [protein_shapes, indices_list, labels_list]
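The thresholding described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `build_sets` and `score_pair` are hypothetical names, the pairing of every shape with itself and with every later shape is an assumption, and so is the use of label 1 for an interacting pair.

```python
from itertools import combinations_with_replacement


def build_sets(pool, score_pair, docking_threshold=-90, interaction_threshold=-90):
    """Build toy docking (IP) and interaction (FI) sets from a pool of shapes.

    score_pair is a hypothetical callable returning
    (rot, trans, minimum_energy, free_energy) for a receptor/ligand pair.
    """
    docking_set, indices_list, labels_list = [], [], []
    # Assumption: each unordered pair is scored once, self-pairs included.
    for i, j in combinations_with_replacement(range(len(pool)), 2):
        receptor, ligand = pool[i], pool[j]
        rot, trans, minimum_energy, free_energy = score_pair(receptor, ligand)
        # A pair enters the docking set only if its best pose scores below the threshold.
        if minimum_energy < docking_threshold:
            docking_set.append([receptor, ligand, rot, trans])
        # Every pair gets an interaction label (assumed: 1 = interacts, 0 = does not).
        indices_list.append((i, j))
        labels_list.append(1 if free_energy < interaction_threshold else 0)
    interaction_set = [pool, indices_list, labels_list]
    return docking_set, interaction_set
```

With the default thresholds of -90, a pair whose minimum energy is -100 is kept in the docking set, while a pair at -50 is dropped from it and labeled 0 in the interaction set.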

generate_interactions(receptor, ligand)

Generate pairwise interactions through FFT scoring of shape bulk and boundary.

Parameters

receptor – a shape assigned as receptor from protein pool
ligand – a shape assigned as ligand from protein pool

Returns

receptor, ligand, and their fft_score
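The idea behind FFT scoring can be illustrated with a deliberately simplified sketch: a single feature channel, cyclic translations only, and no rotation sweep. It uses NumPy rather than `TorchDockingFFT`, the function name `fft_translation_scores` is hypothetical, and the weighting of bulk and boundary features used by the actual scorer is not reproduced here.

```python
import numpy as np


def fft_translation_scores(receptor, ligand):
    """Correlate two equally sized 2D grids over all cyclic translations.

    scores[dy, dx] equals sum(receptor * np.roll(ligand, (dy, dx), axis=(0, 1))),
    computed for every shift at once via the cross-correlation theorem.
    """
    receptor_f = np.fft.fft2(receptor)
    ligand_f = np.fft.fft2(ligand)
    # Multiplying by the complex conjugate turns convolution into correlation.
    return np.real(np.fft.ifft2(receptor_f * np.conj(ligand_f)))
```

The shift with the best score can then be located with `np.unravel_index(scores.argmax(), scores.shape)`; whether the best pose is a maximum or a minimum depends on the sign convention of the scoring function.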

generate_pool(params, pool_savefile, num_proteins)

Generate the protein pool using parameters for concavity and number of points.

Parameters

params –

Pool generation parameters as a list of tuples for alpha and number of points as (value, relative freq).

shape_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]

pool_savefile – protein pool .pkl filename
num_proteins – number of unique protein shapes to make

Returns

stats – the observed sampling of alphas and num_points used in protein pool creation
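To show how (value, relative frequency) tuples translate into sampling probabilities, here is a small standard-library sketch. It does not use `ParamDistribution`, whose internals are not documented here; `normalized` and `sample_param` are hypothetical helpers.

```python
import random


def normalized(dist):
    """Convert [(value, relative_freq), ...] into {value: probability}."""
    total = sum(freq for _, freq in dist)
    return {value: freq / total for value, freq in dist}


def sample_param(dist, rng=random):
    """Draw one value with probability proportional to its relative frequency."""
    values = [value for value, _ in dist]
    weights = [freq for _, freq in dist]
    return rng.choices(values, weights=weights, k=1)[0]


train_alpha = [(0.80, 1), (0.85, 2), (0.90, 1)]
probabilities = normalized(train_alpha)  # 0.85 is twice as likely as 0.80 or 0.90
alpha = sample_param(train_alpha, random.Random(0))
```

Relative frequencies need not sum to one; only their ratios matter.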

plot_accepted_rejected_shapes(receptor, ligand, rot, trans, minimum_energy, free_energy, fft_score, protein_pool_prefix, plot_count)

Plot examples of accepted and rejected shape pairs, based on docking and interaction decision thresholds set.

Parameters

receptor – receptor shape image
ligand – ligand shape image
rot – rotation to apply to ligand
trans – translation to apply to ligand
minimum_energy – minimum energy from FFT score
free_energy – free energy of the pair, computed from the FFT score
fft_score – overall FFT score
protein_pool_prefix – filename prefix
plot_count – used as index in plotting

plot_energy_distributions(energies_list, free_energies, protein_pool_prefix)

Plot histograms of all pairwise energies and free energies within the training and testing sets.

Parameters

energies_list – all pairwise energies (E = fft_scores)
free_energies – all pairwise free energies (logsumexp(-E))
protein_pool_prefix – used in title and filename
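The two quantities being plotted can be computed from an array of energies as sketched below. The parameter description only gives logsumexp(-E); the leading minus sign, F = -log Σ exp(-E), is an assumption based on the usual free energy convention, and `minimum_and_free_energy` is a hypothetical helper name.

```python
import numpy as np


def minimum_and_free_energy(energies):
    """Return (minimum energy, free energy) for an array of energies E.

    Assumed convention: F = -log(sum(exp(-E))), evaluated with the
    max-shift trick so that large |E| values do not overflow.
    """
    E = np.asarray(energies, dtype=float).ravel()
    shift = (-E).max()
    log_sum_exp = shift + np.log(np.exp(-E - shift).sum())
    return E.min(), -log_sum_exp
```

Under this convention the free energy is never above the minimum energy, and the two are close when a single pose dominates the sum.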

run_generator()

Generates the training, validation, and testing sets for both docking (IP) and interaction (FI) from the current protein pool. Writes all datasets to .pkl files, saves all metrics to file, and prints IP and FI dataset stats. If self.plotting=True, dataset generation plots are created and saved. Set self.show=True to also display each plot in a new window (this does not affect saving).

Links to plotting methods:

plot_energy_distributions()

plot_rotation_energysurface()

plot_accepted_rejected_shapes()

plot_deltaF_distribution()

plot_shapes_and_params()
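The train/validation split performed by run_generator() can be sketched as follows. This assumes that validation_set_cutoff = 0.8 marks the index fraction at which the combined set is cut, with the first part used for training and the remainder for validation; `split_train_valid` is a hypothetical helper, not a method of the class.

```python
def split_train_valid(dataset, cutoff=0.8):
    """Split a list at int(len(dataset) * cutoff).

    Assumption: items before the cutoff index form the training set,
    items from the cutoff index onward form the validation set.
    """
    cutoff_index = int(len(dataset) * cutoff)
    return dataset[:cutoff_index], dataset[cutoff_index:]


train_part, valid_part = split_train_valid(list(range(10)))
```

No shuffling is done in this sketch, so the split depends on the order in which examples were generated.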