Data

Functions for preprocessing and storing SMILES data.

Preprocessing

potencyscreen.data.standardize_smiles(dataframe, smiles_column)

Adds a column ‘standard_smiles’ to the input dataframe by sanitizing and standardizing the SMILES strings.

Parameters:
  • dataframe (DataFrame) – A pandas dataframe containing the SMILES strings.

  • smiles_column (str) – Name of the dataframe column containing the SMILES strings.

Return type:

DataFrame

Returns:

The input dataframe with standardized SMILES added in the column ‘standard_smiles’.

Dataloading

class potencyscreen.data.TorchDataset(smiles, y, featurizer)

PyTorch Dataset class to store and load SMILES data for deep learning in potencyscreen.trainers.PyTorchTrainer.

This class is adapted from the molfeat PyG tutorial.

featurizer

Featurizer function applied during dataloading to extract graph features from SMILES.

Type:

PYGGraphTransformer

y

Target property data.

Type:

torch.Tensor

transformed_mols

The extracted features for all molecules in the dataset.

Type:

List

__init__(smiles, y, featurizer)

Creates the dataset from SMILES strings, their target values, and a featurizer.

Parameters:
  • smiles (Series) – SMILES input data.

  • y (Series) – Corresponding target property for each SMILES.

  • featurizer (PYGGraphTransformer) – Molfeat featurizer to apply to SMILES during data loading.
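The relationship between the three constructor arguments and the stored attributes can be illustrated with a minimal, dependency-free sketch. This is not the potencyscreen implementation: SketchDataset and toy_featurizer are hypothetical names, plain lists stand in for pandas Series and torch.Tensor, and computing the features once in the constructor is an assumption based on the transformed_mols attribute described above.

```python
from typing import Callable, List, Sequence, Tuple


def toy_featurizer(smiles: str) -> List[int]:
    """Hypothetical stand-in featurizer: counts of the characters C, N, and O."""
    return [smiles.count(symbol) for symbol in ("C", "N", "O")]


class SketchDataset:
    """Map-style dataset sketch: supports len() and integer indexing."""

    def __init__(
        self,
        smiles: Sequence[str],
        y: Sequence[float],
        featurizer: Callable[[str], List[int]],
    ) -> None:
        if len(smiles) != len(y):
            raise ValueError("smiles and y must have the same length")
        self.featurizer = featurizer
        self.y = [float(value) for value in y]
        # Assumption: features are extracted once, up front, and cached.
        self.transformed_mols = [featurizer(s) for s in smiles]

    def __len__(self) -> int:
        return len(self.transformed_mols)

    def __getitem__(self, index: int) -> Tuple[List[int], float]:
        # Each sample pairs the cached features with the matching target.
        return self.transformed_mols[index], self.y[index]


dataset = SketchDataset(["CCO", "CC(=O)N"], [0.5, 1.2], toy_featurizer)
first_features, first_target = dataset[0]
```

A real featurizer would return graph objects rather than character counts; the sketch only shows how samples and targets stay aligned by index.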

collate_fn(**kwargs)

Collate function for use with PyTorch's torch.utils.data.DataLoader.
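For readers unfamiliar with collate functions, the general idea is that a data loader gathers a list of individual samples and hands it to the collate function, which merges them into one batch. The sketch below shows that idea with plain Python lists only. sketch_collate is a hypothetical name, and the behaviour of the actual method (which presumably merges graph objects rather than lists) is not documented here, so this is an illustration of the concept and not of this class.

```python
from typing import List, Sequence, Tuple

# One sample: (feature vector, target value).
Sample = Tuple[List[int], float]


def sketch_collate(batch: Sequence[Sample]) -> Tuple[List[List[int]], List[float]]:
    """Merge a list of (features, target) samples into batched lists."""
    features, targets = zip(*batch)
    return list(features), list(targets)


batch = [([2, 0, 1], 0.5), ([2, 1, 1], 1.2)]
batched_features, batched_targets = sketch_collate(batch)
```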

property degree

Returns the histogram of node in-degrees, for use in Principal Neighbourhood Aggregation (PNA).
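An in-degree histogram counts, for each degree d, how many nodes have exactly d incoming edges. The following is a minimal sketch of that computation on a single graph given as a directed edge list. in_degree_histogram is a hypothetical helper written for illustration, and how the property aggregates over all molecules in the dataset is not specified here.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def in_degree_histogram(num_nodes: int, edges: Sequence[Tuple[int, int]]) -> List[int]:
    """Return hist where hist[d] is the number of nodes with in-degree d."""
    in_degree = [0] * num_nodes
    for _source, target in edges:
        in_degree[target] += 1
    counts = Counter(in_degree)
    max_degree = max(in_degree, default=0)
    return [counts.get(d, 0) for d in range(max_degree + 1)]


# Ethanol heavy atoms (C-C-O) with each bond stored in both directions.
ethanol_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
histogram = in_degree_histogram(3, ethanol_edges)
# histogram == [0, 2, 1]: no isolated atoms, two atoms with one
# neighbour, and one atom with two neighbours.
```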

property num_atom_features

Returns the dimension of the atom features extracted by the featurizer.

property num_bond_features

Returns the dimension of the bond features extracted by the featurizer.

property num_output

Returns the dimension of the target property.