DatasetGenerator

Load or create shapes from the protein pool and compute pairwise interactions for the IP (docking pose prediction) and FI (fact-of-interaction) datasets. Interactions that score below the specified decision thresholds are added to their respective datasets. Dataset generation figures and statistics can be plotted and saved to file.

class Dock2D.DatasetGeneration.DatasetGenerator.DatasetGenerator
__init__()

Initialize and modify all dataset generation parameters here. Creates both training set and testing set protein pools if they do not already exist in the specified pool_savepath.

### Initializations START
self.plotting = True
self.plot_freq = 100
self.show = True
self.trainset_pool_stats = None
self.testset_pool_stats = None

## Paths
self.pool_savepath = 'PoolData/'
self.poolstats_savepath = 'PoolData/stats/'
self.data_savepath = '../Datasets/'
self.datastats_savepath = '../Datasets/stats/'
self.log_savepath = 'Log/losses/'

## initialize FFT
self.FFT = TorchDockingFFT()

## number of unique protein shapes to generate in pool
self.trainpool_num_proteins = 10
self.testpool_num_proteins = 10

## fractional cutoff used to split the training set into training and validation subsets
self.validation_set_cutoff = 0.8

## shape feature scoring coefficients
self.weight_bound, self.weight_crossterm, self.weight_bulk = 10, 20, 200

## energy cutoffs for deciding whether a pair of shapes docks or interacts
self.docking_decision_threshold = -90
self.interaction_decision_threshold = -90

## string of scoring coefficients for plot titles and filenames
self.weight_string = str(self.weight_bound) + ',' + str(self.weight_crossterm) + ',' + str(self.weight_bulk)

## training and testing set pool filenames
self.trainvalidset_protein_pool = 'trainvalidset_protein_pool' + str(self.trainpool_num_proteins) + '.pkl'
self.testset_protein_pool = 'testset_protein_pool' + str(self.testpool_num_proteins) + '.pkl'

### Generate training/validation set protein pool
## dataset parameters (value, relative frequency)
self.train_alpha = [(0.80, 1), (0.85, 2), (0.90, 1)] # concavity level [0-1)
self.train_num_points = [(60, 1), (80, 2), (100, 1)] # number of points for shape generation
self.train_params = ParamDistribution(alpha=self.train_alpha, num_points=self.train_num_points)

### Generate testing set protein pool
## dataset parameters (value, relative frequency)
self.test_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
self.test_num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]
self.test_params = ParamDistribution(alpha=self.test_alpha, num_points=self.test_num_points)
### Initializations END
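
The `(value, relative frequency)` pairs above can be read as unnormalized sampling weights. A minimal sketch of that reading follows; it assumes the relative frequencies are simply normalized into probabilities, which is an interpretation and not a statement about how `ParamDistribution` is implemented. The helper name `relative_frequencies_to_probabilities` is hypothetical.

```python
def relative_frequencies_to_probabilities(pairs):
    """Convert [(value, relative_frequency), ...] into [(value, probability), ...].

    Assumption: relative frequencies are normalized by their sum.
    """
    total = sum(freq for _, freq in pairs)
    return [(value, freq / total) for value, freq in pairs]


# Same values as self.train_alpha in the listing above
train_alpha = [(0.80, 1), (0.85, 2), (0.90, 1)]
train_alpha_probs = relative_frequencies_to_probabilities(train_alpha)
print(train_alpha_probs)  # [(0.8, 0.25), (0.85, 0.5), (0.9, 0.25)]
```

Under this reading, an alpha of 0.85 would be drawn twice as often as 0.80 or 0.90.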

generate_datasets(protein_pool, num_proteins=None)

Generate the docking and interaction datasets from the protein pool.

Parameters
  • protein_pool – protein pool .pkl filename

  • num_proteins – size of the protein pool to use when generating pairwise interactions; None uses the entire protein pool.

Returns

energies_list, docking_set, and interaction_set. energies_list is used only in plotting; docking_set is the docking dataset (IP) as a list of lists [receptor, ligand, rot, trans]; interaction_set is the interaction dataset (FI) as a list [protein_shapes, indices_list, labels_list].
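
To make the returned structures concrete, here is a small sketch of how thresholded pairwise results could be packed into the two datasets. It is illustrative only: the function name `assemble_datasets`, the `pair_results` dictionary, the choice of comparing minimum energy against the docking threshold and free energy against the interaction threshold, and the 1/0 label encoding are all assumptions, not the library's code.

```python
DOCKING_THRESHOLD = -90      # mirrors self.docking_decision_threshold
INTERACTION_THRESHOLD = -90  # mirrors self.interaction_decision_threshold


def assemble_datasets(protein_shapes, pair_results):
    """Pack pairwise results into docking_set and interaction_set.

    pair_results maps (receptor_index, ligand_index) to
    (minimum_energy, free_energy, rot, trans); this layout is hypothetical.
    """
    docking_set = []
    indices_list, labels_list = [], []
    for (i, j), (min_energy, free_energy, rot, trans) in pair_results.items():
        # Assumption: a pair enters the docking set when its minimum energy
        # falls below the docking decision threshold.
        if min_energy < DOCKING_THRESHOLD:
            docking_set.append([protein_shapes[i], protein_shapes[j], rot, trans])
        # Assumption: every pair gets an interaction label, 1 when its free
        # energy falls below the interaction decision threshold, else 0.
        indices_list.append((i, j))
        labels_list.append(1 if free_energy < INTERACTION_THRESHOLD else 0)
    interaction_set = [protein_shapes, indices_list, labels_list]
    return docking_set, interaction_set


# Placeholder strings stand in for shape images; the numbers are made up.
shapes = ["shape_A", "shape_B"]
results = {
    (0, 0): (-50.0, -52.0, 0.0, (0, 0)),
    (0, 1): (-120.0, -121.5, 1.57, (3, -2)),
    (1, 1): (-95.0, -88.0, 0.3, (1, 1)),
}
docking_set, interaction_set = assemble_datasets(shapes, results)
print(len(docking_set), interaction_set[2])
```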

generate_interactions(receptor, ligand)

Generate pairwise interactions through FFT scoring of shape bulk and boundary.

Parameters
  • receptor – a shape assigned as receptor from protein pool

  • ligand – a shape assigned as ligand from protein pool

Returns

receptor, ligand, and their fft_score
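
The idea of scoring all relative placements at once with an FFT can be illustrated with a plain NumPy cross-correlation. This is a simplified sketch, not `TorchDockingFFT`: it covers translations only (no rotations), counts raw bulk overlap instead of a weighted energy, and uses circular boundary conditions. The function name `fft_translation_scores` and the toy shapes are assumptions made for illustration.

```python
import numpy as np


def fft_translation_scores(receptor, ligand):
    """Circular cross-correlation of two 2D grids via FFT.

    scores[dy, dx] equals sum(receptor * np.roll(ligand, (dy, dx), axis=(0, 1))),
    i.e. the overlap when the ligand is shifted by (dy, dx).
    """
    receptor_ft = np.fft.fft2(receptor)
    ligand_ft = np.fft.fft2(ligand)
    return np.real(np.fft.ifft2(receptor_ft * np.conj(ligand_ft)))


# Toy shapes: a 2x2 block placed at different positions on an 8x8 grid
receptor = np.zeros((8, 8))
receptor[2:4, 3:5] = 1.0
ligand = np.zeros((8, 8))
ligand[0:2, 0:2] = 1.0

scores = fft_translation_scores(receptor, ligand)
best_shift = np.unravel_index(np.argmax(scores), scores.shape)

# Brute-force check of the same quantity, one shift at a time
direct = np.array([[np.sum(receptor * np.roll(ligand, (dy, dx), axis=(0, 1)))
                    for dx in range(8)] for dy in range(8)])
print(best_shift, np.allclose(scores, direct))
```

The FFT route computes all 64 shifts in one pass, while the brute-force loop evaluates each shift separately; the two agree up to floating-point error.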

generate_pool(params, pool_savefile, num_proteins)

Generate the protein pool using parameters for concavity and number of points.

Parameters
  • params

    Pool generation parameters as a list of tuples for alpha and number of points as (value, relative freq).

    shape_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
    num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]
    

  • pool_savefile – protein pool .pkl filename

  • num_proteins – number of unique protein shapes to make

Returns

stats – the observed sampling of alphas and num_points used in protein pool creation
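
A rough sketch of weighted parameter sampling with a tally of what was actually drawn is shown below. It uses only the standard library; the helper `sample_parameters`, the fixed seed, and the dictionary layout of `stats` are hypothetical and are not taken from the `generate_pool` implementation.

```python
import random
from collections import Counter


def sample_parameters(pairs, num_samples, rng):
    """Draw num_samples values from [(value, relative_freq), ...] pairs."""
    values = [value for value, _ in pairs]
    weights = [freq for _, freq in pairs]
    return rng.choices(values, weights=weights, k=num_samples)


shape_alpha = [(0.70, 1), (0.80, 4), (0.90, 6), (0.95, 4), (0.98, 1)]
num_points = [(40, 1), (60, 3), (80, 3), (100, 1)]

rng = random.Random(0)  # fixed seed so the draw is repeatable
num_proteins = 10
alphas = sample_parameters(shape_alpha, num_proteins, rng)
points = sample_parameters(num_points, num_proteins, rng)

# Observed sampling: how often each value was drawn for this pool
stats = {"alpha": Counter(alphas), "num_points": Counter(points)}
print(stats)
```

With a pool of only 10 shapes, the observed counts can differ noticeably from the requested relative frequencies, which is one reason to record them.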

plot_accepted_rejected_shapes(receptor, ligand, rot, trans, minimum_energy, free_energy, fft_score, protein_pool_prefix, plot_count)

Plot examples of accepted and rejected shape pairs, based on the docking and interaction decision thresholds.

Parameters
  • receptor – receptor shape image

  • ligand – ligand shape image

  • rot – rotation to apply to ligand

  • trans – translation to apply to ligand

  • minimum_energy – minimum energy from FFT score

  • free_energy – free energy computed from FFT score

  • fft_score – overall FFT score

  • protein_pool_prefix – filename prefix

  • plot_count – used as index in plotting

plot_energy_distributions(energies_list, free_energies, protein_pool_prefix)

Plot histograms of all pairwise energies and free energies within the training and testing sets.

Parameters
  • energies_list – all pairwise energies (E = fft_scores)

  • free_energies – all pairwise free energies (logsumexp(-E))

  • protein_pool_prefix – used in title and filename
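
The `logsumexp(-E)` quantity mentioned for free_energies can be evaluated in a numerically stable way as sketched below. This is a NumPy illustration, not the library's code; the function name `logsumexp_neg_energy` is hypothetical. Under the usual statistical-mechanics convention the free energy is the negative of this value, but the sign convention used for the plotted values is not stated here, so treat it as an assumption.

```python
import numpy as np


def logsumexp_neg_energy(energies):
    """Return log(sum(exp(-E))) over all entries, computed stably."""
    neg = -np.asarray(energies, dtype=float)
    shift = neg.max()  # subtract the largest exponent to avoid overflow
    return shift + np.log(np.sum(np.exp(neg - shift)))


# Two equally favorable poses at E = -100 give 100 + ln(2)
value = logsumexp_neg_energy([-100.0, -100.0])

# The naive np.log(np.sum(np.exp(-E))) would overflow here; the shifted form does not
large = logsumexp_neg_energy([-1000.0, -1000.0])
print(value, large)
```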

run_generator()

Generate the training, validation, and testing sets for both docking (IP) and interaction (FI) from the current protein pool. Write all datasets to .pkl files, save all metrics to file, and print the IP and FI dataset statistics. If self.plotting=True, plot and save the dataset generation plots. Set self.show=True to show each plot in a new window (this does not affect saving).

Links to plotting methods:

plot_rotation_energysurface()
plot_deltaF_distribution()
plot_shapes_and_params()
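
The split-and-save step that run_generator() describes can be sketched with the standard library as follows. This is an assumption-laden illustration: it assumes the first `validation_set_cutoff` fraction of the examples is used for training and the remainder for validation, and the helper name `split_train_valid`, the placeholder dataset, and the output filename are hypothetical.

```python
import os
import pickle
import tempfile


def split_train_valid(dataset, cutoff=0.8):
    """Split a list at the fractional cutoff index.

    Assumption: items before the cutoff index form the training set and
    items from the cutoff index onward form the validation set.
    """
    cutoff_index = int(len(dataset) * cutoff)
    return dataset[:cutoff_index], dataset[cutoff_index:]


# Placeholder examples standing in for [receptor, ligand, rot, trans] entries
dataset = [["receptor", "ligand", 0.1 * i, (i, i)] for i in range(10)]
train_set, valid_set = split_train_valid(dataset, cutoff=0.8)

# Write the training split to a .pkl file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docking_train_example.pkl")
    with open(path, "wb") as fout:
        pickle.dump(train_set, fout)
    with open(path, "rb") as fin:
        loaded = pickle.load(fin)

print(len(train_set), len(valid_set), loaded == train_set)
```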