Utils

This page documents the utility functions used to manipulate combinatorial complexes, graphs, and molecules, make plots, create logs, and load configurations and datasets.

cc_utils.py: utility functions for combinatorial complex data (flag masking, conversions, etc.).

ccsd.src.utils.cc_utils.get_cells(N: int, d_min: int, d_max: int) → Tuple[List[FrozenSet[int]], Dict[FrozenSet[int], int], Dict[int, List[int]], List[FrozenSet[int]], Dict[FrozenSet[int], int], Dict[int, List[int]]]

Get all rank-2 cells of size d_min to d_max. Returns a list of all rank-2 cells, a dictionary mapping rank-2 cells to a column index in the incidence matrix, a dictionary mapping nodes to a list of column indices in the incidence matrix, a list of all edges, a dictionary mapping edges to a row index in the incidence matrix and a dictionary mapping nodes to a list of row indices in the incidence matrix.

Parameters:

N (int) – maximum number of nodes
d_min (int) – minimum size of rank-2 cells
d_max (int) – maximum size of rank-2 cells

Returns:

list of all rank-2 cells, dictionary mapping rank-2 cells to a column index in the incidence matrix, dictionary mapping nodes to a list of column indices in the incidence matrix, list of all edges, dictionary mapping edges to a row index in the incidence matrix and dictionary mapping nodes to a list of row indices in the incidence matrix

Return type:

Tuple[List[FrozenSet[int]], Dict[FrozenSet[int], int], Dict[int, List[int]], List[FrozenSet[int]], Dict[FrozenSet[int], int], Dict[int, List[int]]]
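
The enumeration described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name enumerate_cells and the column ordering (by increasing size, then lexicographic) are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def enumerate_cells(
    N: int, d_min: int, d_max: int
) -> Tuple[List[FrozenSet[int]], Dict[FrozenSet[int], int], Dict[int, List[int]]]:
    """Hypothetical helper: list every node subset of size d_min..d_max."""
    cells: List[FrozenSet[int]] = []
    cell_to_col: Dict[FrozenSet[int], int] = {}
    node_to_cols: Dict[int, List[int]] = {n: [] for n in range(N)}
    for size in range(d_min, d_max + 1):
        for combo in combinations(range(N), size):
            cell = frozenset(combo)
            col = len(cells)  # next free column index (assumed ordering)
            cells.append(cell)
            cell_to_col[cell] = col
            for node in combo:
                node_to_cols[node].append(col)
    return cells, cell_to_col, node_to_cols


cells, cell_to_col, node_to_cols = enumerate_cells(N=4, d_min=3, d_max=4)
print(len(cells))        # 5 candidate rank-2 cells on 4 nodes
print(node_to_cols[0])   # [0, 1, 2, 4]
```

Calling the same helper with d_min = d_max = 2 yields the analogous edge list and row mappings.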

ccsd.src.utils.cc_utils.create_incidence_1_2(N: int, A: ndarray | Tensor, d_min: int, d_max: int, two_rank_cells: Dict[FrozenSet[int], Dict[str, Any]]) → ndarray

Create the incidence matrix of rank-1 to rank-2 cells from an adjacency matrix and a list of the rank-2 cells of the CC.

Parameters:

N (int) – maximum number of nodes
A (Union[np.ndarray, torch.Tensor]) – adjacency matrix
d_min (int) – minimum size of rank-2 cells
d_max (int) – maximum size of rank-2 cells
two_rank_cells (Dict[FrozenSet[int], Dict[str, Any]]) – dictionary keyed by the rank-2 cells of the CC

Returns:

incidence matrix of rank-1 to rank-2 cells

Return type:

np.ndarray

ccsd.src.utils.cc_utils.cc_from_incidence(incidence_matrices_: List[ndarray | None] | List[Tensor | None] | None, d_min: int, d_max: int, is_molecule: bool = False) → CombinatorialComplex

Convert (pseudo)-incidence matrices to a combinatorial complex (CC).

Parameters:

incidence_matrices_ (Optional[Union[List[Optional[np.ndarray]], List[Optional[torch.Tensor]]]]) – list of incidence matrices [X, A, F]
d_min (int) – minimum size of rank-2 cells
d_max (int) – maximum size of rank-2 cells
is_molecule (bool, optional) – whether the CC is a molecule. Defaults to False.

Raises:

NotImplementedError – raise an error if the CC is of dimension greater than 2 (if len(incidence_matrices_) > 3)

Returns:

combinatorial complex (CC) object

Return type:

CombinatorialComplex

ccsd.src.utils.cc_utils.get_rank2_dim(N: int, d_min: int, d_max: int) → int

Get the dimension of the rank-2 incidence matrix of a combinatorial complex with the given parameters.

Parameters:

N (int) – maximum number of nodes
d_min (int) – minimum size of rank-2 cells
d_max (int) – maximum size of rank-2 cells

Returns:

dimension of the rank-2 incidence matrix

Return type:

int
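
As a rough sketch of the sizes involved, assume the rank-2 incidence matrix has one row per node pair, as the (NC2) x K notation used on this page suggests, and one column per candidate cell of size d_min to d_max. The helper name rank2_shape is hypothetical, and this page does not say which of the two numbers get_rank2_dim returns.

```python
from math import comb
from typing import Tuple


def rank2_shape(N: int, d_min: int, d_max: int) -> Tuple[int, int]:
    """Hypothetical helper: (rows, columns) of a dense rank-2 incidence matrix."""
    rows = comb(N, 2)  # one row per possible edge
    cols = sum(comb(N, d) for d in range(d_min, d_max + 1))  # one column per candidate cell
    return rows, cols


print(rank2_shape(4, 3, 4))   # (6, 5)
print(rank2_shape(10, 3, 6))  # (45, 792)
```

The column count grows combinatorially with N and d_max, which is why these two bounds matter in practice.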

ccsd.src.utils.cc_utils.get_mol_from_x_adj(x: Tensor, adj: Tensor, dataset: str = 'QM9') → Mol

Get a molecule from the node and adjacency matrices after being processed by get_transform_fn inside data_loader_mol.py.

Atoms - 0: C, 1: N, 2: O, 3: F, 4: P, 5: S, 6: Cl, 7: Br, 8: I

Bonds - 1: single, 2: double, 3: triple

Parameters:

x (torch.Tensor) – node matrix
adj (torch.Tensor) – adjacency matrix
dataset (str, optional) – name of the dataset. Defaults to 'QM9'.

Returns:

molecule (RDKIT mol)

Return type:

Chem.Mol

ccsd.src.utils.cc_utils.get_all_mol_rings(mol: Mol) → List[FrozenSet[int]]

Get all the rings of a molecule.

Parameters: mol (Chem.Mol) – molecule (RDKIT mol)
Returns: list of rings as frozensets of atom indices
Return type: List[FrozenSet[int]]

ccsd.src.utils.cc_utils.mols_to_cc(mols: List[Mol]) → List[CombinatorialComplex]

Convert a list of molecules to a list of combinatorial complexes where the rings are rank-2 cells. This is a lift operation.

This is a general function mostly used for testing. A more complete one is implemented in src/utils/data_loader_mol.py within the MolDataset class.

Parameters: mols (List[Chem.Mol]) – list of molecules (RDKIT mol)
Returns: molecules as combinatorial complexes where the cycles are rank-2 cells
Return type: List[CombinatorialComplex]

Example

>>> mols = [Chem.MolFromSmiles("Cc1ccccc1"), Chem.MolFromSmiles("c1cccc2c1CCCC2")]
>>> ccs = mols_to_cc(mols)

ccsd.src.utils.cc_utils.CC_to_incidence_matrices(CC: CombinatorialComplex, d_min: int | None, d_max: int | None, N: int | None = None) → List[ndarray]

Convert a combinatorial complex to a list of incidence matrices.

Parameters:

CC (CombinatorialComplex) – combinatorial complex
d_min (Optional[int]) – minimum size of rank-2 cells. If not provided, calculated from the CC
d_max (Optional[int]) – maximum size of rank-2 cells. If not provided, calculated from the CC
N (Optional[int], optional) – maximum number of nodes. If not provided, calculated from the CC. Defaults to None. This parameter is here just in case but it is better to not use it and to pad the matrices with the correct functions.

Returns:

list of incidence matrices [X, A, F]

Return type:

List[np.ndarray]

ccsd.src.utils.cc_utils.ccs_to_mol(ccs: List[CombinatorialComplex]) → List[Mol]

Convert a list of combinatorial complexes to a list of molecules.

Parameters:

ccs (List[CombinatorialComplex]) – list of combinatorial complexes that represent molecules to convert

Returns:

list of molecules

Return type:

List[Chem.Mol]

ccsd.src.utils.cc_utils.get_N_from_nb_edges(nb_edges: int) → int

Get number of nodes from number of edges

Parameters: nb_edges (int) – number of edges
Returns: number of nodes
Return type: int
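
A minimal sketch of this inversion, assuming nb_edges counts all N(N-1)/2 possible node pairs (the row count of a rank-2 incidence matrix) rather than the edges actually present. The helper name and the error handling are illustrative only.

```python
from math import isqrt


def n_from_nb_edges(nb_edges: int) -> int:
    """Hypothetical helper: solve nb_edges = N * (N - 1) / 2 for N."""
    disc = 1 + 8 * nb_edges  # discriminant of N^2 - N - 2 * nb_edges = 0
    root = isqrt(disc)
    if root * root != disc:
        raise ValueError(f"{nb_edges} is not of the form N * (N - 1) / 2")
    return (1 + root) // 2


print(n_from_nb_edges(6))   # 4
print(n_from_nb_edges(45))  # 10
```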

ccsd.src.utils.cc_utils.get_N_from_rank2(rank2: Tensor) → int

Get number of nodes from batch of rank2 incidence matrices

Parameters: rank2 (torch.Tensor) – rank2 incidence matrices (raw, batch, or batch and channel). (NC2) x K or B x (NC2) x K or B x C x (NC2) x K
Returns: number of nodes
Return type: int

ccsd.src.utils.cc_utils.get_rank2_flags(rank2: Tensor, N: int, d_min: int, d_max: int, flags: Tensor) → Tuple[Tensor, Tensor]

Get the left and right flags of rank2 incidence matrices. The left flag of an edge is 0 if the edge cannot be in the CC because one of its nodes is absent. The right flag of a rank-2 cell is 0 if the cell cannot be in the CC because one of its nodes is absent.

Parameters:

rank2 (torch.Tensor) – batch of rank2 incidence matrices. B x (NC2) x K or B x C x (NC2) x K
N (int) – number of nodes
d_min (int) – minimum dimension of rank2 cells
d_max (int) – maximum dimension of rank2 cells
flags (torch.Tensor) – 0-1 flags tensor. B x N

Returns:

flags for left and right nodes of rank2 cells

Return type:

Tuple[torch.Tensor, torch.Tensor]
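
The rule stated above can be sketched for a single, unbatched sample in plain Python. The helper name is hypothetical, and the edge and cell orderings (lexicographic, cells grouped by increasing size) are assumptions; the library works on batched tensors.

```python
from itertools import combinations
from typing import List, Tuple


def rank2_flags_sketch(
    flags: List[int], d_min: int, d_max: int
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: derive edge and cell flags from 0-1 node flags."""
    N = len(flags)
    # An edge is kept only if both of its endpoints are flagged 1.
    left = [flags[i] * flags[j] for i, j in combinations(range(N), 2)]
    # A rank-2 cell is kept only if every one of its nodes is flagged 1.
    right = [
        int(all(flags[n] for n in cell))
        for size in range(d_min, d_max + 1)
        for cell in combinations(range(N), size)
    ]
    return left, right


left, right = rank2_flags_sketch([1, 1, 1, 0], d_min=3, d_max=4)
print(left)   # [1, 1, 0, 1, 0, 0]
print(right)  # [1, 0, 0, 0, 0]
```

Multiplying each row of the incidence matrix by its left flag and each column by its right flag then zeroes every entry that touches a padded node.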

ccsd.src.utils.cc_utils.mask_rank2(rank2: Tensor, N: int, d_min: int, d_max: int, flags: Tensor | None = None) → Tensor

Mask batch of rank2 incidence matrices with 0-1 flags tensor

Parameters:

rank2 (torch.Tensor) – batch of rank2 incidence matrices. B x (NC2) x K or B x C x (NC2) x K
N (int) – number of nodes
d_min (int) – minimum dimension of rank2 cells
d_max (int) – maximum dimension of rank2 cells
flags (Optional[torch.Tensor], optional) – 0-1 flags tensor. Defaults to None. B x N

Returns:

Masked batch of rank2 incidence matrices

Return type:

torch.Tensor

ccsd.src.utils.cc_utils.gen_noise_rank2(x: Tensor, N: int, d_min: int, d_max: int, flags: Tensor | None = None) → Tensor

Generate noise for the rank-2 incidence matrix

Parameters:

x (torch.Tensor) – input tensor
N (int) – number of nodes
d_min (int) – minimum dimension of rank2 cells
d_max (int) – maximum dimension of rank2 cells
flags (Optional[torch.Tensor], optional) – optional flags. Defaults to None.

Returns:

generated noisy tensor

Return type:

torch.Tensor

ccsd.src.utils.cc_utils.pad_rank2(ori_rank2: ndarray, node_number: int, d_min: int, d_max: int) → ndarray

Create padded rank2 incidence matrices

Parameters:

ori_rank2 (np.ndarray) – original rank2 incidence matrix
node_number (int) – number of desired nodes
d_min (int) – minimum dimension of rank2 cells
d_max (int) – maximum dimension of rank2 cells

Raises:

ValueError – if the original rank2 incidence matrix has more nodes than the desired number of nodes (we can’t pad)

Returns:

Padded rank2 incidence matrix

Return type:

np.ndarray

ccsd.src.utils.cc_utils.get_global_cc_properties(ccs: List[CombinatorialComplex]) → Tuple[int, int, int]

Get the global properties of a list of combinatorial complexes: number of nodes, minimum dimension of rank2 cells and maximum dimension of rank2 cells

Parameters: ccs (List[CombinatorialComplex]) – list of combinatorial complexes
Returns: number of nodes, minimum dimension of rank2 cells and maximum dimension of rank2 cells
Return type: Tuple[int, int, int]

Example

>>> mols = [Chem.MolFromSmiles("Cc1ccccc1"), Chem.MolFromSmiles("c1cccc2c1CCCC2"), Chem.MolFromSmiles("C1CC1")]
>>> ccs = mols_to_cc(mols)
>>> get_global_cc_properties(ccs)
(10, 3, 6)
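
To see where (10, 3, 6) comes from without RDKit, the three molecules can be written out by hand as (number of atoms, list of rings). The ToyCC type and the global_properties helper below are hypothetical stand-ins for CombinatorialComplex objects, and the sketch assumes every complex has at least one rank-2 cell.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical stand-in for a CC: (number of nodes, rank-2 cells as node sets).
ToyCC = Tuple[int, List[FrozenSet[int]]]


def global_properties(ccs: List[ToyCC]) -> Tuple[int, int, int]:
    """Largest node count, then smallest and largest rank-2 cell size."""
    max_nodes = max(n for n, _ in ccs)
    sizes = [len(cell) for _, cells in ccs for cell in cells]
    return max_nodes, min(sizes), max(sizes)


toluene = (7, [frozenset({1, 2, 3, 4, 5, 6})])  # Cc1ccccc1: one 6-ring
tetralin = (10, [frozenset({0, 1, 2, 3, 4, 5}),
                 frozenset({4, 5, 6, 7, 8, 9})])  # c1cccc2c1CCCC2: two fused 6-rings
cyclopropane = (3, [frozenset({0, 1, 2})])        # C1CC1: one 3-ring

print(global_properties([toluene, tetralin, cyclopropane]))  # (10, 3, 6)
```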

ccsd.src.utils.cc_utils.ccs_to_tensors(cc_list: List[CombinatorialComplex], max_node_num: int | None = None, d_min: int | None = None, d_max: int | None = None) → Tuple[Tensor, Tensor]

Convert a list of combinatorial complexes to two tensors, one for the adjacency matrices and one for the rank2 incidence matrices. If the combinatorial complexes have different numbers of nodes, the adjacency matrices and incidence matrices are padded to the maximum number of nodes. If the maximum number of nodes is not provided, it is calculated from the combinatorial complexes. The same holds for the minimum and maximum dimension of rank2 cells.

Parameters:

cc_list (List[CombinatorialComplex]) – list of combinatorial complexes
max_node_num (Optional[int], optional) – max number of nodes in all the combinatorial complexes. Defaults to None.
d_min (Optional[int], optional) – minimum dimension of rank2 cells. Defaults to None.
d_max (Optional[int], optional) – maximum dimension of rank2 cells. Defaults to None.

Returns:

adjacency matrices and rank2 incidence matrices

Return type:

Tuple[torch.Tensor, torch.Tensor]

ccsd.src.utils.cc_utils.cc_to_tensor(cc: CombinatorialComplex, max_node_num: int | None = None, d_min: int | None = None, d_max: int | None = None) → Tuple[Tensor, Tensor]

Convert a single combinatorial complex to a tuple of tensors, one for the adjacency matrix and one for the rank2 incidence matrix. If the maximum number of nodes is not provided, it is calculated from the combinatorial complex. The same holds for the minimum and maximum dimension of rank2 cells. Incidence matrices (A, F) are padded to the maximum number of nodes.

Parameters:

cc (CombinatorialComplex) – combinatorial complex to convert
max_node_num (Optional[int], optional) – maximum number of nodes. Defaults to None.
d_min (Optional[int], optional) – minimum dimension of rank2 cells. Defaults to None.
d_max (Optional[int], optional) – maximum dimension of rank2 cells. Defaults to None.

Returns:

adjacency matrix and rank2 incidence matrix

Return type:

Tuple[torch.Tensor, torch.Tensor]

ccsd.src.utils.cc_utils.convert_CC_to_graphs(ccs: List[CombinatorialComplex], undirected: bool = True) → List[Graph]

Convert a list of combinatorial complexes to a list of graphs

Parameters:

ccs (List[CombinatorialComplex]) – list of combinatorial complexes
undirected (bool, optional) – whether to create an undirected graph. Defaults to True.

Returns:

list of graphs

Return type:

List[nx.Graph]

ccsd.src.utils.cc_utils.convert_graphs_to_CCs(graphs: List[Graph], is_molecule: bool = False, lifting_procedure: str | None = None, lifting_procedure_kwargs: str | Dict[Any, Any] | None = None, **kwargs) → List[CombinatorialComplex]

Convert a list of graphs to a list of combinatorial complexes (of dimension 1).

Parameters:

graphs (List[nx.Graph]) – list of graphs
is_molecule (bool, optional) – whether the graphs are molecules. Defaults to False.
lifting_procedure (Optional[str], optional) – lifting procedure to use. Defaults to None.
lifting_procedure_kwargs (Optional[Union[str, Dict[Any, Any]]], optional) – kwargs for the lifting procedure. Defaults to None.

Returns:

list of combinatorial complexes

Return type:

List[CombinatorialComplex]

ccsd.src.utils.cc_utils.init_flags(obj_list: List[Graph] | List[CombinatorialComplex], config: EasyDict, batch_size: int | None = None, is_cc: bool = False) → Tensor

Sample initial flags tensor from the training graph set

Parameters:

obj_list (Union[List[nx.Graph], List[CombinatorialComplex]]) – list of graphs or combinatorial complexes
config (EasyDict) – configuration
batch_size (Optional[int], optional) – batch size. Defaults to None.
is_cc (bool, optional) – whether the objects are combinatorial complexes. Defaults to False.

Returns:

flag tensors

Return type:

torch.Tensor

ccsd.src.utils.cc_utils.hodge_laplacian(rank2: Tensor) → Tensor

Compute the Hodge Laplacian of a batch of rank2 incidence matrices. H = F @ F.T where F is the rank-2 incidence matrix of a combinatorial complex.

Parameters:

rank2 (torch.Tensor) – batch of rank2 incidence matrices. B x (NC2) x K or B x C x (NC2) x K

Returns:

Hodge Laplacian: B x (NC2) x (NC2) or B x C x (NC2) x (NC2)

Return type:

torch.Tensor
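
The formula H = F @ F.T can be checked by hand on a toy matrix. The sketch below uses nested lists for a single unbatched matrix; the helper name and the toy values are illustrative only. With torch tensors, the batched equivalent would presumably be rank2 @ rank2.transpose(-1, -2).

```python
from typing import List

Matrix = List[List[float]]


def hodge_laplacian_sketch(F: Matrix) -> Matrix:
    """Hypothetical helper: H[i][j] is the dot product of rows i and j of F."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in F]
        for row_i in F
    ]


# Toy incidence matrix: 3 edges (rows) and 2 rank-2 cells (columns).
F = [
    [1.0, 0.0],  # edge 0 belongs to cell 0 only
    [1.0, 1.0],  # edge 1 is shared by both cells
    [0.0, 1.0],  # edge 2 belongs to cell 1 only
]
H = hodge_laplacian_sketch(F)
print(H)  # [[1.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 1.0]]
```

Entry H[i][j] counts the rank-2 cells that contain both edge i and edge j, so the diagonal holds the number of cells each edge belongs to.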

ccsd.src.utils.cc_utils.default_mask(n: int, device: str = 'cpu') → Tensor

Create default adjacency or Hodge Laplacian mask (no diagonal elements)

Parameters: n (int) – number of nodes or edges
Returns: default adjacency or Hodge Laplacian mask
Return type: torch.Tensor
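
A mask with "no diagonal elements" is read here as ones everywhere except zeros on the diagonal. The list-based helper below is a hypothetical illustration of that reading, not the tensor code itself.

```python
from typing import List


def default_mask_sketch(n: int) -> List[List[float]]:
    """Hypothetical helper: n x n mask with 1 off the diagonal and 0 on it."""
    return [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]


print(default_mask_sketch(3))
# [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
```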

ccsd.src.utils.cc_utils.pow_tensor_cc(x: Tensor, cnum: int, hodge_mask: Tensor | None = None) → Tensor

Create higher order rank-2 incidence matrices from a batch of rank-2 incidence matrices.

Parameters:

x (torch.Tensor) – input tensor of shape B x (NC2) x K or B x C * (NC2) x K
cnum (int) – number of higher order matrices to create (made with consecutive multiplication of the Hodge Laplacian matrix of x)
hodge_mask (Optional[torch.Tensor], optional) – optional mask to apply to the Hodge Laplacian. Defaults to None. If None, no mask is applied. shape (NC2) x (NC2) or B x (NC2) x (NC2)

Returns:

output higher order matrices of shape B x cnum x (NC2) x K

Return type:

torch.Tensor

ccsd.src.utils.cc_utils.is_empty_cc(cc: CombinatorialComplex) → bool

Check if a combinatorial complex is empty

Parameters: cc (CombinatorialComplex) – combinatorial complex
Returns: whether the combinatorial complex is empty
Return type: bool

ccsd.src.utils.cc_utils.hodge_laplacian_spectrum_worker(CC: CombinatorialComplex, d_min: int, d_max: int, N: int) → ndarray

Function for computing the histogram of the Hodge Laplacian eigenvalues (spectrum) of a combinatorial complex.

Parameters:

CC (CombinatorialComplex) – combinatorial complex
d_min (int) – minimum dimension of the rank-2 cells
d_max (int) – maximum dimension of the rank-2 cells
N (int) – maximum number of nodes

Returns:

histogram of the Hodge Laplacian eigenvalues

Return type:

np.ndarray

ccsd.src.utils.cc_utils.hodge_laplacian_spectrum_stats(cc_ref_list: List[CombinatorialComplex], cc_pred_list: List[CombinatorialComplex], worker_kwargs: Dict[str, Any], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, is_parallel: bool = True, debug_mode: bool = False) → float

Compute the MMD distance between the Hodge Laplacian eigenvalue distributions of two unordered sets of combinatorial complexes.

Parameters:

cc_ref_list (List[CombinatorialComplex]) – reference list of toponetx combinatorial complexes to be evaluated
cc_pred_list (List[CombinatorialComplex]) – target list of toponetx combinatorial complexes to be evaluated
worker_kwargs (Dict[str, Any]) – kwargs for the worker function
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
debug_mode (bool, optional) – if True, print debug information when is_parallel is set to True. Defaults to False.

Returns:

MMD distance

Return type:

float
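
For readers unfamiliar with MMD, the sketch below computes the (biased) squared MMD between two sets of histograms. It swaps in a plain Gaussian kernel on Euclidean distance in place of gaussian_emd, and all names are hypothetical; it illustrates the statistic, not this function's implementation.

```python
from math import exp
from typing import Callable, List, Sequence

Hist = Sequence[float]
Kernel = Callable[[Hist, Hist], float]


def gaussian_kernel(x: Hist, y: Hist, sigma: float = 1.0) -> float:
    """Stand-in kernel (assumption): Gaussian of the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-sq_dist / (2 * sigma ** 2))


def mean_kernel(xs: List[Hist], ys: List[Hist], kernel: Kernel) -> float:
    """Average kernel value over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(ref: List[Hist], pred: List[Hist], kernel: Kernel = gaussian_kernel) -> float:
    """Biased estimate: E[k(r, r')] + E[k(p, p')] - 2 E[k(r, p)]."""
    return (
        mean_kernel(ref, ref, kernel)
        + mean_kernel(pred, pred, kernel)
        - 2 * mean_kernel(ref, pred, kernel)
    )


same = mmd_squared([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
different = mmd_squared([[1.0, 0.0]], [[0.0, 1.0]])
print(same, different)
```

Identical sets give a value of zero, and the value grows as the two sets of histograms drift apart.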

ccsd.src.utils.cc_utils.rank0_distrib_worker(CC: CombinatorialComplex, min_node_val: int, max_node_val: int, node_label: str = 'label') → ndarray

Function for computing the rank-0 cell value histogram of a combinatorial complex. Values are converted to integers.

Parameters:

CC (CombinatorialComplex) – combinatorial complex
min_node_val (int) – minimum node value
max_node_val (int) – maximum node value
node_label (str, optional) – name of the node attribute under which the value is stored in the CC. Defaults to "label".

Returns:

rank-0 cell histogram

Return type:

np.ndarray
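
The histogram step can be pictured with a few lines of plain Python operating on a bare list of node values. The helper name is hypothetical, and silently dropping out-of-range values is an assumption made for the sketch.

```python
from typing import List, Sequence


def value_histogram(values: Sequence[float], min_val: int, max_val: int) -> List[int]:
    """Hypothetical helper: count integer-cast values in [min_val, max_val]."""
    hist = [0] * (max_val - min_val + 1)
    for value in values:
        key = int(value)  # values are converted to integers
        if min_val <= key <= max_val:  # assumption: ignore out-of-range values
            hist[key - min_val] += 1
    return hist


print(value_histogram([0, 0, 1, 2.0, 5], min_val=0, max_val=3))  # [2, 1, 1, 0]
```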

ccsd.src.utils.cc_utils.rank0_distrib_stats(cc_ref_list: List[CombinatorialComplex], cc_pred_list: List[CombinatorialComplex], worker_kwargs: Dict[str, Any], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, is_parallel: bool = True, debug_mode: bool = False) → float

Compute the MMD distance between the rank-0 cells’ values distributions of two unordered sets of combinatorial complexes.

Parameters:

cc_ref_list (List[CombinatorialComplex]) – reference list of toponetx combinatorial complexes to be evaluated
cc_pred_list (List[CombinatorialComplex]) – target list of toponetx combinatorial complexes to be evaluated
worker_kwargs (Dict[str, Any]) – kwargs for the worker function
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
debug_mode (bool, optional) – if True, print debug information when is_parallel is set to True. Defaults to False.

Returns:

MMD distance

Return type:

float

ccsd.src.utils.cc_utils.rank1_distrib_worker(CC: CombinatorialComplex, min_edge_val: int, max_edge_val: int, edge_label: str = 'label') → ndarray

Function for computing the rank-1 cell value histogram of a combinatorial complex. Values are converted to integers.

Parameters:

CC (CombinatorialComplex) – combinatorial complex
min_edge_val (int) – minimum edge value
max_edge_val (int) – maximum edge value
edge_label (str, optional) – name of the edge attribute under which the value is stored in the CC. Defaults to "label".

Returns:

rank-1 cell histogram

Return type:

np.ndarray

ccsd.src.utils.cc_utils.rank1_distrib_stats(cc_ref_list: List[CombinatorialComplex], cc_pred_list: List[CombinatorialComplex], worker_kwargs: Dict[str, Any], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, is_parallel: bool = True, debug_mode: bool = False) → float

Compute the MMD distance between the rank-1 cells’ values distributions of two unordered sets of combinatorial complexes.

Parameters:

cc_ref_list (List[CombinatorialComplex]) – reference list of toponetx combinatorial complexes to be evaluated
cc_pred_list (List[CombinatorialComplex]) – target list of toponetx combinatorial complexes to be evaluated
worker_kwargs (Dict[str, Any]) – kwargs for the worker function
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
debug_mode (bool, optional) – if True, print debug information when is_parallel is set to True. Defaults to False.

Returns:

MMD distance

Return type:

float

ccsd.src.utils.cc_utils.rank2_distrib_worker(CC: CombinatorialComplex, d_min: int, d_max: int) → ndarray

Function for computing the rank-2 cell histogram of a combinatorial complex.

Parameters:

CC (CombinatorialComplex) – combinatorial complex
d_min (int) – minimum dimension of the rank-2 cells
d_max (int) – maximum dimension of the rank-2 cells

Returns:

rank-2 cell histogram

Return type:

np.ndarray

ccsd.src.utils.cc_utils.rank2_distrib_stats(cc_ref_list: List[CombinatorialComplex], cc_pred_list: List[CombinatorialComplex], worker_kwargs: Dict[str, Any], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, is_parallel: bool = True, debug_mode: bool = False) → float

Compute the MMD distance between the distributions of the number of rank-2 cells of two unordered sets of combinatorial complexes.

Parameters:

cc_ref_list (List[CombinatorialComplex]) – reference list of toponetx combinatorial complexes to be evaluated
cc_pred_list (List[CombinatorialComplex]) – target list of toponetx combinatorial complexes to be evaluated
worker_kwargs (Dict[str, Any]) – kwargs for the worker function
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
debug_mode (bool, optional) – if True, print debug information when is_parallel is set to True. Defaults to False.

Returns:

MMD distance

Return type:

float

ccsd.src.utils.cc_utils.eval_CC_list(cc_ref_list: List[CombinatorialComplex], cc_pred_list: List[CombinatorialComplex], worker_kwargs: Dict[str, Any], methods: List[str] | None = None, kernels: Dict[str, Callable[[ndarray, ndarray], float]] | None = None, cc_nb_eval: int | None = 1000) → Dict[str, float]

Evaluate generated generic combinatorial complexes against a reference set of combinatorial complexes using a set of methods and their corresponding kernels.

Parameters:

cc_ref_list (List[CombinatorialComplex]) – reference list of toponetx combinatorial complexes to be evaluated
cc_pred_list (List[CombinatorialComplex]) – target list of toponetx combinatorial complexes to be evaluated
worker_kwargs (Dict[str, Any]) – kwargs for the worker functions
methods (Optional[List[str]], optional) – methods to be evaluated. Defaults to None.
kernels (Optional[Dict[str, Callable[[np.ndarray, np.ndarray], float]]], optional) – kernels to be used for each method. Defaults to None.
cc_nb_eval (Optional[int], optional) – number of reference and predicted combinatorial complexes to be evaluated. If set to None, evaluate on the entire dataset. Defaults to 1000.

Returns:

dictionary mapping method names to their corresponding scores

Return type:

Dict[str, float]

ccsd.src.utils.cc_utils.load_cc_eval_settings() → Tuple[List[str], Dict[str, Callable[[ndarray, ndarray], float]]]

Load the methods and kernels to be used for evaluating combinatorial complexes.

Returns: methods and kernels to be used for evaluating combinatorial complexes
Return type: Tuple[List[str], Dict[str, Callable[[np.ndarray, np.ndarray], float]]]

ccsd.src.utils.cc_utils.adj_to_hodgedual(adj: Tensor) → Tensor

Convert adjacency matrices to Hodge dual adjacency matrices. Matrices are assumed to be symmetric and can be batched and/or have channels.

Parameters: adj (torch.Tensor) – adjacency matrices (B x C x N x N) or (B x N x N) or (N x N)
Returns: Hodge dual adjacency matrices (B x C x (NC2) x (NC2)) or (B x (NC2) x (NC2)) or ((NC2) x (NC2))
Return type: torch.Tensor

ccsd.src.utils.cc_utils.hodgedual_to_adj(hodgedual: Tensor) → Tensor

Convert Hodge dual adjacency matrices to adjacency matrices. Matrices can be batched and/or have channels.

Parameters: hodgedual (torch.Tensor) – Hodge dual adjacency matrices (B x C x (NC2) x (NC2)) or (B x (NC2) x (NC2)) or ((NC2) x (NC2))
Returns: adjacency matrices (B x C x N x N) or (B x N x N) or (N x N)
Return type: torch.Tensor

ccsd.src.utils.cc_utils.get_hodge_adj_flags(hodge_adj: Tensor, flags: Tensor) → Tuple[Tensor, Tensor]

Get flags for the Hodge adjacency matrices. The flag of an edge is 0 if the edge cannot be in the CC because one of its nodes is absent.

Parameters:

hodge_adj (torch.Tensor) – batch of hodge adjacency matrices. B x (NC2) x (NC2) or B x C x (NC2) x (NC2)
flags (torch.Tensor) – 0-1 flags tensor. B x N

Returns:

flags for the adjacency matrices B x (NC2)

Return type:

Tuple[torch.Tensor, torch.Tensor]

ccsd.src.utils.cc_utils.mask_hodge_adjs(hodge_adjs: Tensor, flags: Tensor | None = None) → Tensor

Mask batch of hodge adjacency matrices with 0-1 flags tensor

Parameters:

hodge_adjs (torch.Tensor) – batch of hodge adjacency matrices. B x (NC2) x (NC2) or B x C x (NC2) x (NC2)
flags (Optional[torch.Tensor], optional) – 0-1 flags tensor. Defaults to None. B x N

Returns:

Masked batch of hodge adjacency matrices

Return type:

torch.Tensor

ccsd.src.utils.cc_utils.get_all_paths_from_single_node(n: int, g: Dict[int, List[int]], path_length: int) → Set[FrozenSet[int]]

Get all paths of a given length starting from a single node, given a graph stored as a dictionary of edges

Parameters:

n (int) – node to start the paths from
g (Dict[int, List[int]]) – graph
path_length (int) – length of the paths

Returns:

set of paths

Return type:

Set[FrozenSet[int]]
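
One way to picture this search is a depth-first enumeration of simple paths. The sketch below assumes path_length counts the nodes on a path and that a path is stored as the frozenset of its nodes; both points, and the helper name, are assumptions rather than documented behaviour.

```python
from typing import Dict, FrozenSet, List, Set


def paths_from_node(
    n: int, g: Dict[int, List[int]], path_length: int
) -> Set[FrozenSet[int]]:
    """Hypothetical helper: simple paths with path_length nodes starting at n."""
    found: Set[FrozenSet[int]] = set()

    def extend(path: List[int]) -> None:
        if len(path) == path_length:
            found.add(frozenset(path))  # node order is dropped, as in the return type
            return
        for neighbor in g.get(path[-1], []):
            if neighbor not in path:  # keep the path simple (no repeated node)
                extend(path + [neighbor])

    extend([n])
    return found


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(paths_from_node(0, graph, 3) == {frozenset({0, 1, 2}), frozenset({0, 2, 3})})  # True
```

Because paths are stored as sets, 0-1-2 and 0-2-1 collapse into a single entry.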

ccsd.src.utils.cc_utils.get_all_paths_from_nodes(nodes: List[int], g: Dict[int, List[int]], path_length: int) → Set[FrozenSet[int]]

Get all paths of a given length starting from a list of nodes, given a graph stored as a dictionary of edges

Parameters:

nodes (List[int]) – list of nodes to start the paths from
g (Dict[int, List[int]]) – graph
path_length (int) – length of the paths

Returns:

set of paths

Return type:

Set[FrozenSet[int]]

ccsd.src.utils.cc_utils.path_based_lift_CC(input_cc: CombinatorialComplex, sources_nodes: List[int], path_length: int) → CombinatorialComplex

Lift a 1-dimensional CC to a 2-dimensional CC by lifting the paths to rank-2 cells. Rank-2 cells must be edges.

Parameters:

input_cc (CombinatorialComplex) – original combinatorial complex
sources_nodes (List[int]) – list of source nodes to start the paths from
path_length (int) – length of the paths to lift

Returns:

lifted combinatorial complex

Return type:

CombinatorialComplex

ccsd.src.utils.cc_utils.cycles_lift_CC(input_cc: CombinatorialComplex) → CombinatorialComplex

Lift a 1-dimensional CC to a 2-dimensional CC by lifting the cycles to rank-2 cells.

Parameters: input_cc (CombinatorialComplex) – original combinatorial complex
Returns: lifted combinatorial complex
Return type: CombinatorialComplex

data_loader_mol.py: utility functions for loading the graph data (molecular ones).

Only dataloader_mol is left untouched from Jo, J. et al. (2022).

ccsd.src.utils.data_loader_mol.load_mol(filepath: str) → List[Tuple[Any, Any]]

Load molecular dataset from filepath.

Adapted from GraphEBM

Parameters: filepath (str) – filepath to the dataset
Raises: ValueError – raise an error if the filepath is invalid
Returns: list of tuples of (node features, adjacency matrix)
Return type: List[Tuple[Any, Any]]

class ccsd.src.utils.data_loader_mol.MolDataset(mols: List[Tuple[ndarray, ndarray]], transform: Callable[[Tuple[ndarray, ndarray]], Tuple[Tensor, Tensor]] | Callable[[Tuple[ndarray, ndarray]], Tuple[Tensor, Tensor, Tensor]])

Bases: Dataset

Dataset object for molecular dataset.

__init__(mols: List[Tuple[ndarray, ndarray]], transform: Callable[[Tuple[ndarray, ndarray]], Tuple[Tensor, Tensor]] | Callable[[Tuple[ndarray, ndarray]], Tuple[Tensor, Tensor, Tensor]]) → None

Initialize the dataset.

Parameters:

mols (List[Tuple[np.ndarray, np.ndarray]]) – list of tuples of (node features, adjacency matrix)
transform (Union[Callable[[Tuple[np.ndarray, np.ndarray]], Tuple[torch.Tensor, torch.Tensor]], Callable[[Tuple[np.ndarray, np.ndarray]], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]]) – transform function that transforms the data into tensors with some preprocessing. Two tensors for graph-based modelisation and three tensors for combinatorial complex-based modelisation.

ccsd.src.utils.data_loader_mol.get_transform_fn(dataset: str, is_cc: bool = False, **kwargs: Any) → Callable[[Tuple[ndarray, ndarray]], Tuple[Tensor, Tensor]] | Callable[[Tuple[ndarray, ndarray]], Tuple[Tensor, Tensor, Tensor]]

Get the transform function for the given dataset.

Parameters:

dataset (str) – name of the dataset
is_cc (bool, optional) – if True, the transform function returns three tensors for combinatorial complexes modelisation. Defaults to False.

Raises:

ValueError – raise an error if the dataset is invalid/unsupported

Returns:

transform function that transforms the data into tensors with some preprocessing. Two tensors for graph-based modelisation and three tensors for combinatorial complex-based modelisation.

Return type:

Union[Callable[[Tuple[np.ndarray, np.ndarray]], Tuple[torch.Tensor, torch.Tensor]], Callable[[Tuple[np.ndarray, np.ndarray]], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]]

ccsd.src.utils.data_loader_mol.dataloader_mol(config: EasyDict, get_graph_list: bool = False) → Tuple[DataLoader, DataLoader] | Tuple[List[Graph], List[Graph]]

Load the dataset and return the train and test dataloader for the given molecular dataset.

Parameters:

config (EasyDict) – configuration to use
get_graph_list (bool, optional) – if True, return lists of graphs instead of dataloaders. Defaults to False.

Returns:

train and test dataloader (tensors or lists of graphs)

Return type:

Union[Tuple[DataLoader, DataLoader], Tuple[List[nx.Graph], List[nx.Graph]]]

ccsd.src.utils.data_loader_mol.dataloader_mol_cc(config: EasyDict, get_cc_list: bool = False) → Tuple[DataLoader, DataLoader] | Tuple[List[CombinatorialComplex], List[CombinatorialComplex]]

Load the dataset and return the train and test dataloader for the given molecular dataset.

Parameters:

config (EasyDict) – configuration to use
get_cc_list (bool, optional) – if True, return lists of combinatorial complexes instead of dataloaders. Defaults to False.

Returns:

train and test dataloader (tensors or lists of combinatorial complexes)

Return type:

Union[Tuple[DataLoader, DataLoader], Tuple[List[CombinatorialComplex], List[CombinatorialComplex]]]

data_loader.py: utility functions for loading the graph data (not molecular ones).

Only dataloader is left untouched from Jo, J. et al. (2022).

ccsd.src.utils.data_loader.graphs_to_dataloader(config: EasyDict, graph_list: List[Graph]) → DataLoader

Convert a list of graphs to a dataloader.

Parameters:

config (EasyDict) – configuration to use
graph_list (List[nx.Graph]) – list of graphs to convert

Returns:

DataLoader object for the graphs

Return type:

DataLoader

ccsd.src.utils.data_loader.ccs_to_dataloader(config: EasyDict, cc_list: List[CombinatorialComplex]) → DataLoader

Convert a list of combinatorial complexes to a dataloader.

Parameters:

config (EasyDict) – configuration to use
cc_list (List[CombinatorialComplex]) – list of combinatorial complexes to convert

Returns:

DataLoader object for the combinatorial complexes

Return type:

DataLoader

ccsd.src.utils.data_loader.dataloader(config: EasyDict, get_graph_list: bool = False) → Tuple[DataLoader, DataLoader] | Tuple[List[Graph], List[Graph]]

Load the dataset and return the train and test dataloader for the given non-molecular dataset.

Parameters:

config (EasyDict) – configuration to use
get_graph_list (bool, optional) – if True, return lists of graphs instead of dataloaders. Defaults to False.

Returns:

train and test dataloader (tensors or lists of graphs)

Return type:

Union[Tuple[DataLoader, DataLoader], Tuple[List[nx.Graph], List[nx.Graph]]]

ccsd.src.utils.data_loader.dataloader_cc(config: EasyDict, get_cc_list: bool = False) → Tuple[DataLoader, DataLoader] | Tuple[List[CombinatorialComplex], List[CombinatorialComplex]][source]#

Load the dataset and return the train and test dataloader for the given non-molecular dataset.

Parameters:

config (EasyDict) – configuration to use
get_cc_list (bool, optional) – if True, return lists of combinatorial complexes instead of dataloaders. Defaults to False.

Returns:

train and test dataloader (tensors or lists of combinatorial complexes)

Return type:

Union[Tuple[DataLoader, DataLoader], Tuple[List[CombinatorialComplex], List[CombinatorialComplex]]]

ema.py: code for the exponential moving average class for the parameters.

Adapted from Jo, J. & al (2022), almost left untouched.

class ccsd.src.utils.ema.ExponentialMovingAverage(parameters: Parameter, decay: float, use_num_updates: bool = True)[source]#

Bases: object

Maintains (exponential) moving average of a set of parameters.

__init__(parameters: Parameter, decay: float, use_num_updates: bool = True) → None[source]#

Initialize the EMA class.

Parameters:

parameters (torch.nn.parameter.Parameter) – Iterable of torch.nn.Parameter, initial parameters to use for EMA.
decay (float) – Decay rate for exponential moving average.
use_num_updates (bool, optional) – if True, initialize the number of updates to 0. Defaults to True.

Raises:

ValueError – raise an error if decay is not between 0 and 1.

update(parameters: Parameter) → None[source]#

Update currently maintained parameters. Call this every time the parameters are updated, such as the result of the optimizer.step() call.

Parameters:

parameters (torch.nn.parameter.Parameter) – Iterable of torch.nn.Parameter; usually the same set of parameters used to initialize this object.

copy_to(parameters: Parameter) → None[source]#

Copy current parameters into given collection of parameters.

Parameters:

parameters (torch.nn.parameter.Parameter) – Iterable of torch.nn.Parameter; the parameters to be updated with the stored moving averages.

store(parameters: Parameter) → None[source]#

Save the current parameters for restoring later.

Parameters:

parameters (torch.nn.parameter.Parameter) – Iterable of torch.nn.Parameter; the parameters to be temporarily stored.

restore(parameters: Parameter) → None[source]#

Restore the parameters stored with the store method. Useful to validate the model with EMA parameters without affecting the original optimization process. Store the parameters before the copy_to method. After validation (or model saving), use this to restore the former parameters.

Parameters:

parameters (torch.nn.parameter.Parameter) – Iterable of torch.nn.Parameter; the parameters to be updated with the stored parameters.

state_dict() → Dict[str, Any][source]#

Returns a dictionary containing the state of the EMA.

Returns:: dictionary containing the state of the EMA.
Return type:: Dict[str, Any]

load_state_dict(state_dict: Dict[str, Any]) → None[source]#

Load the dictionary containing the state of the EMA.

Parameters:: state_dict (Dict[str, Any]) – dictionary containing the state of the EMA to load
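
The update / store / copy_to / restore cycle described above can be illustrated with a small pure-Python sketch. It is not the library's class: `TinyEMA` is a hypothetical stand-in that averages a plain list of floats, uses a fixed decay, and ignores the `use_num_updates` option.

```python
class TinyEMA:
    """Toy exponential moving average over a list of floats (illustrative only)."""

    def __init__(self, parameters, decay):
        if not 0.0 <= decay <= 1.0:
            raise ValueError("decay must be between 0 and 1")
        self.decay = decay
        self.shadow = list(parameters)  # running averages
        self.backup = []                # filled by store()

    def update(self, parameters):
        # shadow <- decay * shadow + (1 - decay) * parameter
        self.shadow = [
            self.decay * s + (1.0 - self.decay) * p
            for s, p in zip(self.shadow, parameters)
        ]

    def store(self, parameters):
        self.backup = list(parameters)

    def copy_to(self, parameters):
        parameters[:] = self.shadow

    def restore(self, parameters):
        parameters[:] = self.backup


weights = [1.0, 2.0]
ema = TinyEMA(weights, decay=0.5)

weights[:] = [3.0, 4.0]   # stands in for optimizer.step()
ema.update(weights)       # averages are now [2.0, 3.0]

ema.store(weights)        # keep the raw training weights
ema.copy_to(weights)      # validate with the averaged weights
averaged = list(weights)
ema.restore(weights)      # resume training with the raw weights
```

The order matters: `store` is called before `copy_to`, so that `restore` can bring back the raw training weights after validation.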

errors.py: contains custom exceptions.

exception ccsd.src.utils.errors.SymmetryError(message: str = '')[source]#

Bases: Exception

Exception raised when a matrix is not symmetric.

message -- more detailed explanation of the error

__init__(message: str = '') → None[source]#

Raises a SymmetryError.

Parameters:: message (str, optional) – more detailed explanation of the error. Defaults to “”.

graph_utils.py: utility functions for graph data (flag masking, quantization, etc.).

Adapted from Jo, J. & al (2022), almost left untouched.

ccsd.src.utils.graph_utils.mask_x(x: Tensor, flags: Tensor | None = None) → Tensor[source]#

Mask batch of node features with 0-1 flags tensor

Parameters:

x (torch.Tensor) – batch of node features
flags (Optional[torch.Tensor], optional) – 0-1 flags tensor. Defaults to None.

Returns:

Masked batch of node features

Return type:

torch.Tensor

ccsd.src.utils.graph_utils.mask_adjs(adjs: Tensor, flags: Tensor | None = None) → Tensor[source]#

Mask batch of adjacency matrices with 0-1 flags tensor

Parameters:

adjs (torch.Tensor) – batch of adjacency matrices. B x N x N or B x C x N x N
flags (Optional[torch.Tensor], optional) – 0-1 flags tensor of shape B x N. Defaults to None.

Returns:

Masked batch of adjacency matrices

Return type:

torch.Tensor
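
As a rough illustration of flag masking, the pure-Python sketch below handles a single N x N matrix instead of a batched tensor. It assumes that masking zeroes both the row and the column of every node whose flag is 0; `mask_adj_sketch` is a hypothetical helper, not the library function.

```python
def mask_adj_sketch(adj, flags):
    """Zero row i and column i of a square matrix wherever flags[i] == 0."""
    n = len(adj)
    return [[adj[i][j] * flags[i] * flags[j] for j in range(n)] for i in range(n)]


adj = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
flags = [1, 1, 0]  # the third node is padding
masked = mask_adj_sketch(adj, flags)
# masked == [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```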

ccsd.src.utils.graph_utils.node_flags(adj: Tensor, eps: float = 1e-05) → Tensor[source]#

Create flags tensor from graph dataset

Parameters:

adj (torch.Tensor) – adjacency matrix
eps (float, optional) – threshold. Defaults to 1e-5.

Returns:

flags tensor

Return type:

torch.Tensor
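
One plausible reading of the threshold is sketched below in pure Python for a single adjacency matrix: a node is flagged as present when the sum of the absolute values in its row exceeds `eps`. This rule and the helper name `node_flags_sketch` are assumptions made for illustration.

```python
def node_flags_sketch(adj, eps=1e-5):
    """Return 1.0 for nodes with at least one non-negligible entry, else 0.0."""
    return [1.0 if sum(abs(v) for v in row) > eps else 0.0 for row in adj]


adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],  # padded node: empty row
]
flags = node_flags_sketch(adj)
# flags == [1.0, 1.0, 0.0]
```

Under this reading, an isolated node is indistinguishable from a padded one, since both have an empty row.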

ccsd.src.utils.graph_utils.init_features(init: str, adjs: Tensor, nfeat: int = 10) → Tensor[source]#

Create initial node features by initializing the adjacency matrix, creating a node flag matrix based on the initialization, and masking the node features with the node flag matrix.

Parameters:

init (str) – node feature initialization method
adjs (torch.Tensor) – adjacency matrix.
nfeat (int, optional) – number of different features. Defaults to 10.

Raises:

ValueError – If number of features is larger than number of classes
NotImplementedError – initialization method not implemented

Returns:

node features tensor

Return type:

torch.Tensor

ccsd.src.utils.graph_utils.init_flags(graph_list: List[Graph], config: EasyDict, batch_size: int | None = None) → Tensor[source]#

Sample initial flags tensor from the training graph set

Parameters:

graph_list (List[nx.Graph]) – list of graphs
config (EasyDict) – configuration to use
batch_size (Optional[int], optional) – batch size. Defaults to None.

Returns:

flag tensors

Return type:

torch.Tensor

ccsd.src.utils.graph_utils.gen_noise(x: Tensor, flags: Tensor | None = None, sym: bool = True) → Tensor[source]#

Generate noise

Parameters:

x (torch.Tensor) – input tensor
flags (Optional[torch.Tensor], optional) – optional flags. Defaults to None.
sym (bool, optional) – symmetric noise (for adjacency matrix). Defaults to True.

Returns:

generated noisy tensor

Return type:

torch.Tensor

ccsd.src.utils.graph_utils.quantize(t: Tensor, thr: float = 0.5) → Tensor[source]#

Quantize (clip) generated graphs with respect to a threshold

Parameters:

t (torch.Tensor) – original adjacency or rank-2 incidence matrix
thr (float, optional) – threshold. Defaults to 0.5.

Returns:

quantized (clipped) adjacency or rank-2 incidence matrix

Return type:

torch.Tensor
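
A minimal sketch of thresholding, written in pure Python for a single matrix. It assumes that entries greater than or equal to `thr` become 1 and all others become 0; how the library treats a value exactly equal to the threshold is not stated here, so that tie-breaking rule is an assumption, as is the helper name.

```python
def quantize_sketch(t, thr=0.5):
    """Binarize a matrix: entries >= thr map to 1.0, the rest to 0.0."""
    return [[1.0 if v >= thr else 0.0 for v in row] for row in t]


scores = [
    [0.10, 0.70],
    [0.50, 0.49],
]
binary = quantize_sketch(scores)
# binary == [[0.0, 1.0], [1.0, 0.0]]
```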

ccsd.src.utils.graph_utils.quantize_mol(adjs: Tensor | ndarray) → ndarray[source]#

Quantize generated molecules

Parameters:: adjs (Union[torch.Tensor, np.ndarray]) – batch of adjacency matrices (e.g. of shape 32 x 9 x 9)
Returns:: quantized array for molecules
Return type:: np.ndarray

ccsd.src.utils.graph_utils.adjs_to_graphs(adjs: Tensor | List[Tensor] | List[ndarray] | List[List[List[int | float]]], is_cuda: bool = False) → List[Graph][source]#

Convert generated adjacency matrices to networkx graphs

Parameters:

adjs (Union[torch.Tensor, List[torch.Tensor], List[np.ndarray], List[List[List[Union[int, float]]]]]) – Adjacency matrices
is_cuda (bool, optional) – whether the tensors are on a CUDA device (rather than on the CPU). Defaults to False.

Returns:

list of graph representations

Return type:

List[nx.Graph]

ccsd.src.utils.graph_utils.check_sym(adjs: Tensor, print_val: bool = False, epsilon: float = 0.01) → None[source]#

Check if the adjacency matrices are symmetric

Parameters:

adjs (torch.Tensor) – adjacency matrices
print_val (bool, optional) – whether or not we print the symmetry error. Defaults to False.
epsilon (float, optional) – threshold for the sum of the absolute errors. Defaults to 1e-2.

Raises:

SymmetryError – If the sum of the absolute errors is greater than epsilon
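
The check can be pictured with the pure-Python sketch below, which works on a single matrix. It assumes the "absolute error" is the entrywise difference between the matrix and its transpose; the local `SymmetryError` class is only a stand-in for `ccsd.src.utils.errors.SymmetryError`, and `check_sym_sketch` is a hypothetical name.

```python
class SymmetryError(Exception):
    """Local stand-in for ccsd.src.utils.errors.SymmetryError."""


def check_sym_sketch(adj, epsilon=1e-2):
    """Raise SymmetryError if sum |A - A^T| exceeds epsilon."""
    n = len(adj)
    error = sum(abs(adj[i][j] - adj[j][i]) for i in range(n) for j in range(n))
    if error > epsilon:
        raise SymmetryError(f"symmetry error {error} exceeds {epsilon}")


# A symmetric matrix passes silently.
check_sym_sketch([[0.0, 1.0], [1.0, 0.0]])

# An asymmetric matrix raises.
try:
    check_sym_sketch([[0.0, 1.0], [0.5, 0.0]])
    raised = False
except SymmetryError:
    raised = True
```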

ccsd.src.utils.graph_utils.pow_tensor(x: Tensor, cnum: int) → Tensor[source]#

Create higher order adjacency matrices

Parameters:

x (torch.Tensor) – input tensor of shape B x N x N
cnum (int) – number of higher order matrices to create (made with powers of x)

Returns:

output higher order matrices of shape B x cnum x N x N

Return type:

torch.Tensor
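
To make "powers of x" concrete, here is a pure-Python sketch for a single N x N matrix (the batch dimension is dropped). It assumes the stack starts at the first power, so that `cnum` matrices x, x², ..., x^cnum are returned; both helper names are hypothetical.

```python
def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def powers_sketch(x, cnum):
    """Return [x, x^2, ..., x^cnum] as a list of matrices."""
    out = [x]
    for _ in range(cnum - 1):
        out.append(matmul(out[-1], x))
    return out


# Path graph on three nodes: 0 - 1 - 2
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
stack = powers_sketch(path, 3)
# stack[1] == [[1, 0, 1], [0, 2, 0], [1, 0, 1]]  (walks of length 2)
```

Entry (i, j) of the k-th power counts the walks of length k between nodes i and j, which is why these matrices are called higher-order adjacency matrices.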

ccsd.src.utils.graph_utils.pad_adjs(ori_adj: ndarray, node_number: int) → ndarray[source]#

Create padded adjacency matrices

Parameters:

ori_adj (np.ndarray) – original adjacency matrix
node_number (int) – number of desired nodes

Raises:

ValueError – if the original adjacency matrix is larger than the desired number of nodes (we can’t pad)

Returns:

Padded adjacency matrix

Return type:

np.ndarray
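
A small pure-Python sketch of padding, under the assumption that the extra rows and columns are filled with zeros and appended at the bottom and on the right. `pad_adj_sketch` is a hypothetical helper operating on nested lists, not on NumPy arrays.

```python
def pad_adj_sketch(ori_adj, node_number):
    """Zero-pad a square matrix to node_number x node_number."""
    n = len(ori_adj)
    if n > node_number:
        raise ValueError(f"matrix has {n} nodes, cannot pad to {node_number}")
    extra = node_number - n
    padded = [list(row) + [0] * extra for row in ori_adj]   # new columns
    padded += [[0] * node_number for _ in range(extra)]     # new rows
    return padded


edge = [
    [0, 1],
    [1, 0],
]
padded = pad_adj_sketch(edge, 3)
# padded == [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```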

ccsd.src.utils.graph_utils.graphs_to_tensor(graph_list: List[Graph], max_node_num: int) → Tensor[source]#

Convert a list of graphs to a tensor

Parameters:

graph_list (List[nx.Graph]) – List of graphs to convert to adjacency matrices tensors
max_node_num (int) – max number of nodes in all the graphs

Returns:

Tensor of adjacency matrices

Return type:

torch.Tensor

ccsd.src.utils.graph_utils.graphs_to_adj(graph: Graph, max_node_num: int) → Tensor[source]#

Convert a graph to an adjacency matrix

Parameters:

graph (nx.Graph) – graph to convert to an adjacency matrix tensor
max_node_num (int) – maximum number of nodes

Returns:

Adjacency matrix as a tensor

Return type:

torch.Tensor

ccsd.src.utils.graph_utils.node_feature_to_matrix(x: Tensor) → Tensor[source]#

Convert a node feature matrix to a node pair feature matrix. For each graph, the result is a square N x N array whose entry (i, j) is the concatenation of the feature vectors of node i and node j.

Parameters:: x (torch.Tensor) – B x N x F (F feature space)
Returns:: converted node feature matrix to node pair feature matrix with shape B x N x N x 2F
Return type:: torch.Tensor
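
The shape change from N x F to N x N x 2F can be seen in the following pure-Python sketch for a single graph. It assumes the features of node i come first in the concatenation, followed by those of node j; `node_pairs_sketch` is a hypothetical name.

```python
def node_pairs_sketch(x):
    """Map an N x F feature list to an N x N x 2F list of concatenated pairs."""
    return [[xi + xj for xj in x] for xi in x]  # list '+' concatenates


features = [[1], [2]]  # N = 2 nodes, F = 1 feature each
pairs = node_pairs_sketch(features)
# pairs == [[[1, 1], [1, 2]], [[2, 1], [2, 2]]]
```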

ccsd.src.utils.graph_utils.nxs_to_mols(graphs: List[Graph]) → List[Mol][source]#

Convert a list of nx graphs to a list of rdkit molecules

Parameters:: graphs (List[nx.Graph]) – list of nx graphs
Returns:: list of rdkit molecules
Return type:: List[Chem.Mol]

loader.py: code for loading the model, the optimizer, the scheduler, the loss function, etc.

Adapted from Jo, J. & al (2022)

ccsd.src.utils.loader.load_seed(seed: int) → int[source]#

Apply the random seed to all libraries (torch, numpy, random) and make sure that the results are reproducible.

Parameters:: seed (int) – seed to use
Returns:: return the seed
Return type:: int

ccsd.src.utils.loader.load_device() → str | List[int][source]#

Check if cuda is available and then return the device(s) to use

Returns:: device(s) to use
Return type:: Union[str, List[int]]

ccsd.src.utils.loader.load_model(params: Dict[str, Any]) → Module[source]#

Load the Score Network model from the parameters

Parameters:: params (dict) – parameters to use
Raises:: ValueError – raise an error if the model is unknown
Returns:: Score Network model to use
Return type:: torch.nn.Module

ccsd.src.utils.loader.load_model_optimizer(params: Dict[str, Any], config_train: EasyDict, device: str | List[str] | List[int]) → Tuple[Module | DataParallel, Optimizer, LRScheduler][source]#

Return the model, the optimizer and the scheduler according to the parameters

Parameters:

params (Dict[str, Any]) – model parameters
config_train (EasyDict) – configuration for training
device (Union[str, List[str], List[int]]) – device to use

Returns:

return the model, the optimizer and the scheduler

Return type:

Tuple[Union[torch.nn.Module, torch.nn.DataParallel], torch.optim.Optimizer, torch.optim.lr_scheduler.LRScheduler]

ccsd.src.utils.loader.load_ema(model: Module, decay: float = 0.999) → ExponentialMovingAverage[source]#

Create an exponential moving average object for the model’s parameters

Parameters:

model (torch.nn.Module) – model whose parameters are averaged
decay (float, optional) – decay parameter. Defaults to 0.999.

Returns:

exponential moving average object for the model’s parameters

Return type:

ExponentialMovingAverage

ccsd.src.utils.loader.load_ema_from_ckpt(model: Module, ema_state_dict: Dict[str, Any], decay: float = 0.999) → ExponentialMovingAverage[source]#

Load the exponential moving average object for the model’s parameters from a checkpoint

Parameters:

model (torch.nn.Module) – model whose parameters are averaged
ema_state_dict (Dict[str, Any]) – parameters of the exponential moving average
decay (float, optional) – decay parameter. Defaults to 0.999.

Returns:

exponential moving average object for the model’s parameters

Return type:

ExponentialMovingAverage

ccsd.src.utils.loader.load_data(config: EasyDict, get_list: bool = False, is_cc: bool = False) → Tuple[DataLoader, DataLoader] | Tuple[List[Graph], List[Graph]] | Tuple[List[CombinatorialComplex], List[CombinatorialComplex]][source]#

Return a DataLoader object for training based on the configuration

Parameters:

config (EasyDict) – configuration for training
get_list (bool, optional) – if True, returns lists of graphs or combinatorial complexes instead of dataloaders. Defaults to False.
is_cc (bool, optional) – if True, the dataset is made of combinatorial complexes. Defaults to False.

Returns:

DataLoader object or list of objects for training

Return type:

Union[Tuple[DataLoader, DataLoader], Union[Tuple[List[nx.Graph], List[nx.Graph]], Tuple[List[CombinatorialComplex], List[CombinatorialComplex]]]]

ccsd.src.utils.loader.load_batch(batch: List[Tensor], device: str | List[str], is_cc: bool = False) → Tuple[Tensor, Tensor] | Tuple[Tensor, Tensor, Tensor][source]#

Load the batch on the device

Parameters:

batch (List[torch.Tensor]) – input batch
device (Union[str, List[str]]) – device to use
is_cc (bool, optional) – if True, the elements of the input batch are combinatorial complexes. Defaults to False.

Returns:

input batch on the device

Return type:

Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]

ccsd.src.utils.loader.load_sde(config_sde: EasyDict) → SDE[source]#

Load the stochastic differential equation (SDE) from the configuration

Parameters:: config_sde (EasyDict) – configuration for the SDE
Raises:: NotImplementedError – raise an error if the SDE is unknown
Returns:: SDE to use
Return type:: SDE

ccsd.src.utils.loader.load_loss_fn(config: EasyDict, is_cc: bool = False) → Callable[[Module, Module, Tensor, Tensor], Tuple[Tensor, Tensor]] | Callable[[Module, Module, Module, Tensor, Tensor, Tensor], Tuple[Tensor, Tensor, Tensor]][source]#

Load the loss function from the configuration

Parameters:

config (EasyDict) – configuration to use
is_cc (bool, optional) – if True, loss function for combinatorial complexes. Defaults to False.

Returns:

loss function that returns two losses (for x and adj), or three losses (for x, adj and rank-2 cells) if is_cc is True

Return type:

Union[Callable[[torch.nn.Module, torch.nn.Module, torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor]], Callable[[torch.nn.Module, torch.nn.Module, torch.nn.Module, torch.Tensor, torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]]

ccsd.src.utils.loader.load_sampling_fn(config_train: EasyDict, config_module: EasyDict, config_sample: EasyDict, device: str | List[str], is_cc: bool = False, d_min: int | None = None, d_max: int | None = None, divide_batch: int | None = None) → Callable[[Module, Module, Tensor], Tuple[Tensor, Tensor, float]] | Callable[[Module, Module, Module, Tensor], Tuple[Tensor, Tensor, Tensor, float]][source]#

Load the sampling function from the configuration

Parameters:

config_train (EasyDict) – configuration for training
config_module (EasyDict) – configuration for the module
config_sample (EasyDict) – configuration for the sampling
device (Union[str, List[str]]) – device to use
is_cc (bool, optional) – if True, we sample combinatorial complexes. Defaults to False.
d_min (Optional[int], optional) – minimum size of rank2 cells (for cc). Defaults to None.
d_max (Optional[int], optional) – maximum size of rank2 cells (for cc). Defaults to None.
divide_batch (Optional[int], optional) – if not None, divide the samples by this number to bypass RAM saturation. Defaults to None.

Returns:

sampling function

Return type:

Union[Callable[[torch.nn.Module, torch.nn.Module, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, float]], Callable[[torch.nn.Module, torch.nn.Module, torch.nn.Module, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, float]]]

ccsd.src.utils.loader.load_model_params(config: EasyDict, is_cc: bool = False) → Tuple[Dict[str, Any], Dict[str, Any]] | Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]][source]#

Load the model parameters from the configuration

Parameters:

config (EasyDict) – configuration to use
is_cc (bool, optional) – whether to model using combinatorial complexes. Defaults to False.

Returns:

parameters for x, adj, and rank-2 cells if cc

Return type:

Union[Tuple[Dict[str, Any], Dict[str, Any]], Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]]

ccsd.src.utils.loader.load_ckpt(config: EasyDict, device: str | List[str], ts: str | None = None, return_ckpt: bool = False, is_cc: bool = False) → Dict[str, Any][source]#

Load the checkpoint from the configuration

Parameters:

config (EasyDict) – configuration to use
device (Union[str, List[str]]) – device to use
ts (Optional[str], optional) – timestamp (checkpoint name). Defaults to None.
return_ckpt (bool, optional) – if True, add the checkpoint in the resulting dictionary (key: “ckpt”). Defaults to False.
is_cc (bool, optional) – whether to model using combinatorial complexes. Defaults to False.

Returns:

loaded checkpoint parameters and configuration

Return type:

Dict[str, Any]

ccsd.src.utils.loader.load_model_from_ckpt(params: Dict[str, Any], state_dict: Dict[str, Any], device: str | List[device] | List[int]) → Module | DataParallel[source]#

Load the model from the checkpoint

Parameters:

params (Dict[str, Any]) – parameters of the model
state_dict (Dict[str, Any]) – state dictionary of the model
device (Union[str, List[str], List[int]]) – device to use

Returns:

loaded model

Return type:

Union[torch.nn.Module, torch.nn.DataParallel]

ccsd.src.utils.loader.load_eval_settings(data: str, orbit_on: bool = True) → Tuple[List[str], Dict[str, Callable[[ndarray, ndarray], float]]][source]#

Load the evaluation settings from the configuration

Parameters:

data (str) – dataset to use. UNUSED HERE.
orbit_on (bool, optional) – whether to use orbit distance. UNUSED HERE. Defaults to True.

Returns:

methods and kernels, used for generic graph generation

Return type:

Tuple[List[str], Dict[str, Callable[[np.ndarray, np.ndarray], float]]]

logger.py: utility functions for logging.

Adapted from Jo, J. & al (2022), almost left untouched.

class ccsd.src.utils.logger.Logger(filepath: str, mode: str, lock: Any | None = None)[source]#

Bases: object

Logger class for logging to a file.

__init__(filepath: str, mode: str, lock: Any | None = None) → None[source]#

Initialize the Logger class.

Parameters:

filepath (str) – the file where to write
mode (str) – can be ‘w’ or ‘a’
lock (Optional[Any], optional) – pass a shared lock for multi process write access. Defaults to None.

log(str: str, verbose: bool = True) → None[source]#

Log a string to the file and optionally print it

Parameters:

str (str) – string to log
verbose (bool, optional) – whether or not we print the message. Defaults to True.
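
The sketch below shows one way a file logger with an optional shared lock could behave, using only the standard library. It is not the library's implementation: `FileLoggerSketch` is hypothetical, and it assumes that mode "w" truncates the file once at construction, that every message is appended on its own line, and that a `threading.Lock` can stand in for a multi-process lock.

```python
import os
import tempfile
import threading


class FileLoggerSketch:
    """Append messages to a file, optionally guarded by a shared lock."""

    def __init__(self, filepath, mode, lock=None):
        if mode not in ("w", "a"):
            raise ValueError("mode must be 'w' or 'a'")
        self.filepath = filepath
        self.lock = lock
        if mode == "w":
            open(filepath, "w").close()  # start from an empty file

    def log(self, message, verbose=True):
        if self.lock is not None:
            self.lock.acquire()
        try:
            with open(self.filepath, "a") as f:
                f.write(message + "\n")
        finally:
            if self.lock is not None:
                self.lock.release()
        if verbose:
            print(message)


log_path = os.path.join(tempfile.mkdtemp(), "run.log")
logger = FileLoggerSketch(log_path, mode="w", lock=threading.Lock())
logger.log("first")                    # written and printed
logger.log("second", verbose=False)    # written only

with open(log_path) as f:
    contents = f.read()
```

Releasing the lock in a `finally` clause keeps a failed write from blocking the other writers.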

ccsd.src.utils.logger.set_log(config: EasyDict, is_train: bool = True, folder: str = './') → Tuple[str, str, str][source]#

Set the log folder name, log directory and checkpoint directory

Parameters:

config (EasyDict) – the config object
is_train (bool, optional) – True if we are training, False if we are sampling. Defaults to True.
folder (str, optional) – the general saving folder. Defaults to “./”.

Returns:

the name of the folder, the log directory and the checkpoint directory of the log

Return type:

Tuple[str, str, str]

ccsd.src.utils.logger.check_log(log_folder_name: str, log_name: str) → bool[source]#

Check if a log file exists

Parameters:

log_folder_name (str) – given log folder name
log_name (str) – given log name

Returns:

True if the log file exists, False otherwise

Return type:

bool

ccsd.src.utils.logger.data_log(logger: Logger, config: EasyDict) → None[source]#

Log the current configuration

Parameters:

logger (Logger) – Logger object
config (EasyDict) – current configuration used

ccsd.src.utils.logger.sde_log(logger: Logger, config_sde: EasyDict, is_cc: bool = False) → None[source]#

Log the current SDE configuration

Parameters:

logger (Logger) – Logger object
config_sde (EasyDict) – sde configuration
is_cc (bool, optional) – True if we are modelling with combinatorial complexes. Defaults to False.

ccsd.src.utils.logger.model_log(logger: Logger, config: EasyDict, is_cc: bool = False) → None[source]#

Log the current model configuration

Parameters:

logger (Logger) – Logger object
config (EasyDict) – current configuration used
is_cc (bool, optional) – True if we are modelling with combinatorial complexes. Defaults to False.

ccsd.src.utils.logger.device_log(logger: Logger, device: str | List[int] | List[str] | List[device]) → None[source]#

Log the device(s) that will be used as detected by PyTorch

Parameters:

logger (Logger) – Logger object
device (Union[str, List[int], List[str], List[torch.device]]) – device(s) used as detected

ccsd.src.utils.logger.start_log(logger: Logger, config: EasyDict) → None[source]#

Log initial message with the configuration

Parameters:

logger (Logger) – Logger object
config (EasyDict) – configuration used

ccsd.src.utils.logger.train_log(logger: Logger, config: EasyDict) → None[source]#

Log configuration used for training

Parameters:

logger (Logger) – Logger object
config (EasyDict) – configuration used

ccsd.src.utils.logger.sample_log(logger: Logger, config: EasyDict) → None[source]#

Log configuration used for sampling

Parameters:

logger (Logger) – Logger object
config (EasyDict) – configuration used

ccsd.src.utils.logger.model_parameters_log(logger: Logger, models: List[Module]) → None[source]#

Print the number of parameters of the models and the total number of parameters.

Parameters:

logger (Logger) – Logger object
models (List[torch.nn.Module]) – list of models.

ccsd.src.utils.logger.time_log(logger: Logger, time_type: str, elapsed_time: float) → None[source]#

Log the time elapsed since the start of the training/sampling

Parameters:

logger (Logger) – Logger object
time_type (str) – type of time. Must be in [“train”, “sample”].
elapsed_time (float) – elapsed time since the start of the training/sampling

Raises:

ValueError – raise an error if time_type is not in [“train”, “sample”]

models_utils.py: utility functions related to the models.

ccsd.src.utils.models_utils.get_model_device(model: Module | DataParallel) → str[source]#

Get the device on which the model is loaded (“cpu”, “cuda”, etc.)

Parameters:: model (Union[torch.nn.Module, torch.nn.DataParallel]) – Pytorch model
Returns:: device on which the model is loaded
Return type:: str

ccsd.src.utils.models_utils.get_nb_parameters(model: Module) → int[source]#

Get the number of parameters of the model.

Parameters:: model (torch.nn.Module) – model.
Returns:: number of parameters of the model.
Return type:: int

ccsd.src.utils.models_utils.get_ones_cache(shape: Sequence[int], device: str) → Tensor[source]#

Cached function to get a tensor of ones of the given shape and device.

Parameters:

shape (Sequence[int]) – shape of the tensor
device (str) – device on which the tensor should be allocated

Returns:

tensor of ones of the given shape and device

Return type:

torch.Tensor

ccsd.src.utils.models_utils.get_ones(shape: Sequence[int], device: str) → Tensor[source]#

Function to get a tensor of ones of the given shape and device. Call the cached version of the function and clone it.

Parameters:

shape (Sequence[int]) – shape of the tensor
device (str) – device on which the tensor should be allocated

Returns:

tensor of ones of the given shape and device

Return type:

torch.Tensor
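
The reason for cloning the cached tensor can be shown with a pure-Python analogue built on `functools.lru_cache`. Nested lists stand in for tensors and the `device` argument is left out; the names `ones_cached` and `ones` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ones_cached(shape):
    """Build (once per shape) a rows x cols grid of ones."""
    rows, cols = shape
    return [[1.0] * cols for _ in range(rows)]


def ones(shape):
    """Return a fresh copy so callers cannot corrupt the cached grid."""
    return [row[:] for row in ones_cached(tuple(shape))]  # tuple: hashable key


a = ones((2, 2))
a[0][0] = 5.0          # mutate the copy only
b = ones((2, 2))       # served from the cache, still all ones
hits = ones_cached.cache_info().hits
```

Without the copy, the in-place change to `a` would also appear in `b`, because both names would refer to the single cached object.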

mol_utils.py: utility functions for loading the molecular data, checking the validity of the molecules, converting them, saving them, etc.

Adapted from Jo, J. & al (2022), almost left untouched.

ccsd.src.utils.mol_utils.is_molecular_config(config: EasyDict) → bool[source]#

Checks if the config is for a molecular dataset. Right now, it only checks if the dataset is QM9 or ZINC250k.

Parameters:: config (EasyDict) – config to check
Returns:: whether or not the config is for a molecular dataset
Return type:: bool

ccsd.src.utils.mol_utils.mols_to_smiles(mols: List[Mol]) → List[str][source]#

Converts a list of RDKit molecules to a list of SMILES strings.

Parameters:: mols (List[Chem.Mol]) – molecules to convert
Returns:: SMILES strings
Return type:: List[str]

ccsd.src.utils.mol_utils.smiles_to_mols(smiles: List[str]) → List[Mol][source]#

Converts a list of SMILES strings to a list of RDKit molecules.

Parameters:: smiles (List[str]) – SMILES strings to convert
Returns:: molecules
Return type:: List[Chem.Mol]

ccsd.src.utils.mol_utils.canonicalize_smiles(smiles: List[str]) → List[str][source]#

Canonicalizes a list of SMILES strings.

Parameters:: smiles (List[str]) – SMILES strings to canonicalize
Returns:: canonicalized SMILES strings
Return type:: List[str]

ccsd.src.utils.mol_utils.load_smiles(dataset: str = 'QM9', folder: str = './') → Tuple[List[str], List[str]][source]#

Loads SMILES strings from a dataset and return train and test splits.

Parameters:

dataset (str, optional) – smiles dataset to load. Defaults to “QM9”.
folder (str, optional) – folder where the data folder is located. Defaults to “./”.

Raises:

ValueError – raise an error if dataset is not supported

Returns:

train and test splits

Return type:

Tuple[List[str], List[str]]

ccsd.src.utils.mol_utils.construct_mol(x: ndarray, adj: ndarray, atomic_num_list: List[int]) → Mol[source]#

Constructs molecule(s) from the model output.

Parameters:

x (np.ndarray) – node features
adj (np.ndarray) – adjacency matrix
atomic_num_list (List[int]) – atomic number list

Returns:

molecule

Return type:

Chem.Mol

ccsd.src.utils.mol_utils.gen_mol(x: Tensor, adj: Tensor, dataset: str, largest_connected_comp: bool = True) → Tuple[List[Mol], int][source]#

Generates molecules from the model output and returns valid molecules and the number of molecules that are not corrected.

Parameters:

x (torch.Tensor) – node features
adj (torch.Tensor) – adjacency matrix
dataset (str) – dataset name
largest_connected_comp (bool, optional) – whether or not we keep only the largest connected component. Defaults to True.

Returns:

valid molecules and the number of molecules that are not corrected

Return type:

Tuple[List[Chem.Mol], int]

ccsd.src.utils.mol_utils.check_valency(mol: Mol | RWMol) → Tuple[bool, List[int] | None][source]#

Checks the valency of the molecule.

Parameters:: mol (Union[Chem.Mol, Chem.RWMol]) – molecule
Returns:: whether or not the molecule is valid and the atom id and valency of the atom that is not valid
Return type:: Tuple[bool, Optional[List[int]]]

ccsd.src.utils.mol_utils.correct_mol(m: RWMol) → Tuple[RWMol, bool][source]#

Corrects the molecule.

Parameters:: m (Chem.RWMol) – molecule
Returns:: corrected molecule and whether or not the molecule is corrected
Return type:: Tuple[Chem.RWMol, bool]

ccsd.src.utils.mol_utils.valid_mol_can_with_seg(m: Mol | None, largest_connected_comp: bool = True) → Mol | None[source]#

Returns a valid molecule, optionally keeping only its largest connected component.

Parameters:

m (Optional[Chem.Mol]) – molecule
largest_connected_comp (bool, optional) – whether or not we keep only the largest connected component. Defaults to True.

Returns:

valid molecule

Return type:

Optional[Chem.Mol]

ccsd.src.utils.mol_utils.mols_to_nx(mols: List[Mol]) → List[Graph][source]#

Converts a list of molecules to a list of networkx graphs.

Parameters:: mols (List[Chem.Mol]) – list of molecules
Returns:: list of networkx graphs
Return type:: List[nx.Graph]

plot.py: utility functions for plotting.

ccsd.src.utils.plot.save_fig(config: EasyDict, save_dir: str | None = None, title: str = 'fig', dpi: int = 300, is_sample: bool = True) → None[source]#

Function to adjust the figure and save it.

Adapted from Jo, J. & al (2022)

Parameters:

config (EasyDict) – configuration file
save_dir (Optional[str], optional) – directory to save the figures. Defaults to None.
title (str, optional) – name of the file. Defaults to “fig”.
dpi (int, optional) – DPI (Dots per Inch). Defaults to 300.
is_sample (bool, optional) – whether the figure is generated during the sample phase. Defaults to True.

ccsd.src.utils.plot.plot_graphs_list(config: EasyDict, graphs: List[Graph | Dict[str, Any]], title: str = 'title', max_num: int = 16, save_dir: str | None = None, N: int = 0) → None[source]#

Plot a list of graphs.

Adapted from Jo, J. & al (2022)

Parameters:

config (EasyDict) – configuration file
graphs (List[Union[nx.Graph, Dict[str, Any]]]) – graphs to plot
title (str, optional) – title of the plot. Defaults to “title”.
max_num (int, optional) – number of graphs to plot (must be lower than or equal to the batch size). Defaults to 16.
save_dir (Optional[str], optional) – directory to save the figures. Defaults to None.
N (int, optional) – parameter to skip the first graphs of the list. Defaults to 0.

ccsd.src.utils.plot.save_graph_list(config: EasyDict, log_folder_name: str, exp_name: str, gen_graph_list: List[Graph]) → str[source]#

Save the generated graphs in a pickle file.

Adapted from Jo, J. & al (2022)

Parameters:

config (EasyDict) – configuration file
log_folder_name (str) – name of the folder where the pickle file will be saved
exp_name (str) – name of the experiment
gen_graph_list (List[nx.Graph]) – list of generated graphs

Returns:

path to the pickle file

Return type:

str

ccsd.src.utils.plot.plot_cc_list(config: EasyDict, ccs: List[CombinatorialComplex | Dict[str, Any]], title: str = 'title', max_num: int = 16, save_dir: str | None = None, N: int = 0) → None[source]#

Plot a list of combinatorial complexes (represented here as hypergraphs), using hypernetx, for complexes of dimension 2.

Parameters:

config (EasyDict) – configuration file
ccs (List[Union[CombinatorialComplex, Dict[str, Any]]]) – combinatorial complexes to plot
title (str, optional) – title of the plot. Defaults to “title”.
max_num (int, optional) – number of combinatorial complexes to plot (must be lower than or equal to the batch size). Defaults to 16.
save_dir (Optional[str], optional) – directory to save the figures. Defaults to None.
N (int, optional) – parameter to skip the first combinatorial complexes of the list. Defaults to 0.

ccsd.src.utils.plot.save_cc_list(config: EasyDict, log_folder_name: str, exp_name: str, gen_cc_list: List[CombinatorialComplex]) → str[source]#

Save the generated combinatorial complexes in a pickle file.

Parameters:

config (EasyDict) – configuration file
log_folder_name (str) – name of the folder where the pickle file will be saved
exp_name (str) – name of the experiment
gen_cc_list (List[CombinatorialComplex]) – list of generated ccs

Returns:

path to the pickle file

Return type:

str

ccsd.src.utils.plot.plot_molecule_list(config: EasyDict, mols: List[Mol], title: str = 'title', max_num: int = 16, save_dir: str | None = None, N: int = 0) → None[source]#

Plot a list of molecules, using rdkit.

Parameters:

config (EasyDict) – configuration file
mols (List[Chem.Mol]) – molecules to plot
title (str, optional) – title of the plot. Defaults to “title”.
max_num (int, optional) – number of molecules to plot (must be lower than or equal to the batch size). Defaults to 16.
save_dir (Optional[str], optional) – directory to save the figures. Defaults to None.
N (int, optional) – parameter to skip the first molecules of the list. Defaults to 0.

ccsd.src.utils.plot.save_molecule_list(config: EasyDict, log_folder_name: str, exp_name: str, gen_mol_list: List[Mol]) → str[source]#

Save the generated molecules in a pickle file.

Parameters:

config (EasyDict) – configuration file
log_folder_name (str) – name of the folder where the pickle file will be saved
exp_name (str) – name of the experiment
gen_mol_list (List[Chem.Mol]) – list of generated molecules

Returns:

path to the pickle file

Return type:

str

ccsd.src.utils.plot.plot_lc(config: EasyDict, learning_curves: Dict[str, List[float]], f_dir: str = './', filename: str = 'learning_curves', cols: int = 3) → None[source]#

Plot the learning curves.

Parameters:

config (EasyDict) – configuration file
learning_curves (Dict[str, List[float]]) – dictionary containing the learning curves
f_dir (str, optional) – directory to save the figure. Defaults to “./”.
filename (str, optional) – name of the figure. Defaults to “learning_curves”.
cols (int, optional) – number of columns in the figure. Defaults to 3.
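
To make the expected shape of `learning_curves` and the role of `cols` concrete, the sketch below assigns each curve to a cell of a grid with `cols` columns. The helper names and the curve names (`train_loss`, etc.) are hypothetical, and the row-major layout is an assumption about how such a figure could be arranged.

```python
import math
from typing import Dict, List, Tuple


def grid_rows(n_curves: int, cols: int = 3) -> int:
    """Number of rows needed to place n_curves subplots in `cols` columns."""
    return math.ceil(n_curves / cols)


def subplot_positions(
    learning_curves: Dict[str, List[float]], cols: int = 3
) -> Dict[str, Tuple[int, int]]:
    """Hypothetical helper: map each curve name to its (row, column) cell."""
    # Curves are placed in insertion order, filling each row left to right.
    return {name: divmod(i, cols) for i, name in enumerate(learning_curves)}


# One list of values per tracked quantity, one value per epoch (names are made up).
demo_curves = {
    "train_loss": [0.9, 0.6, 0.4],
    "test_loss": [1.0, 0.7, 0.5],
    "train_x": [0.5, 0.3, 0.2],
    "train_adj": [0.4, 0.3, 0.2],
}
n_rows = grid_rows(len(demo_curves), cols=3)
positions = subplot_positions(demo_curves, cols=3)
```

Four curves with three columns need two rows, and the fourth curve starts the second row.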

ccsd.src.utils.plot.plot_3D_molecule(molecule: Mol, atomic_radii: Dict[str, float] | None = None, cpk_colors: Dict[str, str] | None = None) → Figure[source]#

Creates a 3D plot of the molecule.

Parameters:

molecule (Chem.Mol) – The RDKit molecule to plot.
atomic_radii (Optional[Dict[str, float]], optional) – Dictionary mapping atomic symbols to atomic radii. Defaults to None.
cpk_colors (Optional[Dict[str, str]], optional) – Dictionary mapping atomic symbols to CPK colors. Defaults to None.

Returns:

The 3D plotly figure of the molecule.

Return type:

plotly.graph_objs.Figure
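
The optional dictionaries default to None, which suggests that built-in values are used when nothing is passed. The sketch below shows one way such a default could be resolved for colors. The helper `resolve_colors`, the small palette, the fallback color and the choice to replace (rather than merge with) the defaults are all assumptions for illustration.

```python
from typing import Dict, List, Optional

# A few conventional CPK colors (illustrative subset, not the library's table).
DEFAULT_CPK_COLORS: Dict[str, str] = {"H": "white", "C": "black", "N": "blue", "O": "red"}
FALLBACK_COLOR = "pink"  # hypothetical color for symbols missing from the palette


def resolve_colors(symbols: List[str], cpk_colors: Optional[Dict[str, str]] = None) -> List[str]:
    """Hypothetical helper: pick one color per atom symbol."""
    # Assumption: a user-supplied dictionary replaces the defaults entirely.
    palette = DEFAULT_CPK_COLORS if cpk_colors is None else cpk_colors
    return [palette.get(symbol, FALLBACK_COLOR) for symbol in symbols]


default_colors = resolve_colors(["C", "O", "Xx"])
custom_colors = resolve_colors(["C", "O"], cpk_colors={"C": "grey"})
```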

ccsd.src.utils.plot.rotate_molecule_animation(figure: Figure, filedir: str, filename: str, duration: float = 1.0, frames: int = 30, rotations_per_sec: float = 1.0, overwrite: bool = False, engine: str = 'kaleido') → None[source]#

Creates an animated GIF of the molecule rotating.

Parameters:

figure (plotly.graph_objs.Figure) – The 3D plotly figure of the molecule.
filedir (str) – The directory to save the animated GIF.
filename (str) – The filename of the output animated GIF.
duration (float, optional) – Duration of the animation in seconds. Defaults to 1.0.
frames (int, optional) – Number of frames in the animation. Defaults to 30.
rotations_per_sec (float, optional) – Number of rotations per second. Defaults to 1.0.
overwrite (bool, optional) – If True, overwrite the file if it already exists. Defaults to False.
engine (str, optional) – engine to use for the .write_image plotly method. Defaults to “kaleido”.
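
The interplay of `duration`, `frames` and `rotations_per_sec` can be sketched as a list of camera positions on a circle around the molecule, one per frame. The helper `rotation_camera_eyes` and its `radius` and `height` parameters are hypothetical; how the actual function moves the camera is not documented above.

```python
import math
from typing import Dict, List


def rotation_camera_eyes(
    duration: float = 1.0,
    frames: int = 30,
    rotations_per_sec: float = 1.0,
    radius: float = 2.0,  # hypothetical distance of the camera from the vertical axis
    height: float = 0.5,  # hypothetical camera height
) -> List[Dict[str, float]]:
    """Hypothetical helper: one camera position per frame, rotating about the z axis."""
    # Total angle swept over the whole animation, in radians.
    total_angle = 2.0 * math.pi * rotations_per_sec * duration
    eyes = []
    for i in range(frames):
        angle = total_angle * i / frames
        eyes.append(
            {"x": radius * math.cos(angle), "y": radius * math.sin(angle), "z": height}
        )
    return eyes


eyes = rotation_camera_eyes(duration=1.0, frames=4, rotations_per_sec=1.0)
```

Each dictionary has the `x`, `y`, `z` keys that plotly uses for `layout.scene.camera.eye`, so a frame could be rendered after something like `figure.update_layout(scene_camera=dict(eye=eyes[i]))`.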

ccsd.src.utils.plot.plot_diffusion_trajectory(gen_obj: List[Tensor], is_molecule: bool = False, dataset: str = 'QM9', largest_connected_comp: bool = True) → Figure | Figure[source]#

Return the figure of one generated object as part of a diffusion trajectory.

Parameters:

gen_obj (List[torch.Tensor]) – The generated object: node features (x), adjacency matrix (adj) and, when combinatorial complexes are generated, the rank-2 incidence matrix (rank2).
is_molecule (bool, optional) – if True, we plot a molecule, otherwise a graph. Defaults to False.
dataset (str, optional) – The dataset from which the object was generated. Defaults to “QM9” (only used if is_molecule=True).
largest_connected_comp (bool, optional) – whether or not we keep only the largest connected component. Defaults to True.

Returns:

The figure of the generated object.

Return type:

Union[plotly.graph_objs.Figure, matplotlib.figure.Figure]

ccsd.src.utils.plot.diffusion_animation(diff_traj: List[List[Tensor]], is_molecule: bool = False, filedir: str = './', filename: str = 'diffusion_animation', fps: int = 25, overwrite: bool = True, engine: str = 'kaleido', duration: float = 4.0, cropped: bool = False) → None[source]#

Creates an animated GIF of the diffusion trajectory.

Parameters:

diff_traj (List[List[torch.Tensor]]) – The diffusion trajectory: a list of generated objects, each made of node features (x), an adjacency matrix (adj) and, when combinatorial complexes are generated, a rank-2 incidence matrix (rank2).
is_molecule (bool, optional) – If True, the frames are molecules rather than graphs. Defaults to False.
filedir (str, optional) – The directory to save the animated GIF. Defaults to “./”.
filename (str, optional) – The filename of the output animated GIF. Defaults to “diffusion_animation”.
fps (int, optional) – Number of frames per second. Defaults to 25.
overwrite (bool, optional) – If True, overwrite the file if it already exists. Defaults to True.
engine (str, optional) – engine to use for the .write_image plotly method if plotly is used. Defaults to “kaleido”.
duration (float, optional) – duration of the animation (in seconds). Defaults to 4.0.
cropped (bool, optional) – if True, we select the first frames. Otherwise, we skip some frames to build the animation. Defaults to False.
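
One possible reading of how `fps`, `duration` and `cropped` interact is sketched below: the animation holds about `fps * duration` frames, taken either from the start of the trajectory or at evenly spaced positions along it. The helper `select_frames` and the even spacing are assumptions for illustration, not the documented behavior.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_frames(
    trajectory: Sequence[T], fps: int = 25, duration: float = 4.0, cropped: bool = False
) -> List[T]:
    """Hypothetical helper: choose which trajectory steps become animation frames."""
    n_target = max(1, int(fps * duration))
    # A short trajectory is used as is.
    if len(trajectory) <= n_target:
        return list(trajectory)
    if cropped:
        # Keep only the first frames.
        return list(trajectory[:n_target])
    if n_target == 1:
        return [trajectory[0]]
    # Otherwise skip frames: evenly spaced indices from the first to the last step.
    step = (len(trajectory) - 1) / (n_target - 1)
    return [trajectory[round(i * step)] for i in range(n_target)]


demo_trajectory = list(range(10))
first_frames = select_frames(demo_trajectory, fps=2, duration=2.0, cropped=True)
spread_frames = select_frames(demo_trajectory, fps=2, duration=2.0, cropped=False)
```

In this sketch, cropping shows only the beginning of the diffusion process, while skipping frames covers the whole trajectory at a coarser resolution.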

print.py: utility functions for printing to the console.

ccsd.src.utils.print.get_ascii_logo(ascii_logo_path: str = 'ascii_logo.txt') → str[source]#

Get the ascii logo.

Parameters:

ascii_logo_path (str, optional) – path of the logo. Defaults to “ascii_logo.txt”.

Returns:

the ascii logo.

Return type:

str

ccsd.src.utils.print.get_experiment_desc(args: Namespace | Dict[str, Any]) → str[source]#

Get the experiment description.

Parameters:

args (Union[argparse.Namespace, Dict[str, Any]]) – parsed arguments for the experiment.

Returns:

the experiment description.

Return type:

str

ccsd.src.utils.print.initial_print(args: Namespace | Dict[str, Any], ascii_logo_path: str = 'ascii_logo.txt') → None[source]#

Print the initial message to the console.

Parameters:

args (Union[argparse.Namespace, Dict[str, Any]]) – parsed arguments for the experiment.
ascii_logo_path (str, optional) – path of the logo. Defaults to “ascii_logo.txt”.
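
Since `args` may be either an `argparse.Namespace` or a plain dictionary, a function consuming it typically normalizes both to a dictionary first. The sketch below shows that pattern; the helper `describe_experiment` and its "key: value" output format are hypothetical and do not reproduce the description built by `get_experiment_desc`.

```python
import argparse
from typing import Any, Dict, Union


def describe_experiment(args: Union[argparse.Namespace, Dict[str, Any]]) -> str:
    """Hypothetical helper: render arguments as 'key: value' lines, sorted by key."""
    # vars() exposes the attributes of a Namespace as a dictionary.
    params = vars(args) if isinstance(args, argparse.Namespace) else dict(args)
    lines = [f"{key}: {value}" for key, value in sorted(params.items())]
    return "\n".join(lines)


desc_from_namespace = describe_experiment(argparse.Namespace(type="train", seed=42))
desc_from_dict = describe_experiment({"type": "train", "seed": 42})
```

Both calls produce the same text, so the calling code does not need to care whether the arguments came from the command line or from a dictionary built in a script.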

time_utils.py: utility functions for time operations.

ccsd.src.utils.time_utils.get_time(timezone: str = 'Europe/London') → str[source]#