Others#

This section covers all the other modules that are not part of the main program, such as the parsers, the evaluation functions, and the data generation functions.

Most of these functions have been adapted from MoFlow (https://github.com/calvin-zcx/moflow) and Jo, J. & al (2022) (https://github.com/harryjo97/GDSS).

parser.py: code for parsing the arguments of the main script (experiments).

Adapted from Jo, J. & al (2022)

Almost left untouched.

class ccsd.src.parsers.parser.Parser[source]#

Bases: object

Parser class to parse the arguments to run the experiments.

__init__() → None[source]#: Initialize the parser.

set_arguments() → None[source]#: Set the arguments for the parser.

parse() → Namespace[source]#

Parse the arguments and check for unknown arguments.

Raises:: SystemExit – raise an error if there are unknown arguments.
Returns:: parsed arguments.
Return type:: argparse.Namespace
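
The unknown-argument check described above can be sketched with the standard argparse module. This is a minimal illustration, not the actual Parser implementation: the --config and --seed flags and the helper names are hypothetical, and it assumes the check is built on parse_known_args().

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser with two hypothetical arguments."""
    parser = argparse.ArgumentParser(description="hypothetical experiment parser")
    parser.add_argument("--config", type=str, default="example")
    parser.add_argument("--seed", type=int, default=42)
    return parser


def parse_strict(parser: argparse.ArgumentParser, argv: List[str]) -> argparse.Namespace:
    """Parse argv and exit if any argument is not recognized."""
    args, unknown = parser.parse_known_args(argv)
    if unknown:
        # parser.error prints a usage message and raises SystemExit
        parser.error(f"unknown arguments: {unknown}")
    return args


args = parse_strict(build_parser(), ["--config", "demo", "--seed", "7"])
```

Passing the argument list explicitly, instead of reading sys.argv, keeps the sketch easy to test.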

parser_generator.py: code for parsing the arguments of the graph dataset generator.

class ccsd.src.parsers.parser_generator.ParserGenerator[source]#

Bases: object

ParserGenerator class to parse the arguments to create graph datasets.

__init__() → None[source]#: Initialize the parser.

set_arguments() → None[source]#: Set the arguments for the parser.

parse() → Namespace[source]#

Parse the arguments and check for unknown arguments.

Raises:: SystemExit – raise an error if there are unknown arguments.
Returns:: parsed arguments.
Return type:: argparse.Namespace

parser_preprocess.py: code for parsing the arguments of the scripts that preprocess the molecule datasets.

class ccsd.src.parsers.parser_preprocess.ParserPreprocess[source]#

Bases: object

ParserPreprocess class to parse the arguments of the scripts that preprocess the molecule datasets.

__init__() → None[source]#: Initialize the parser.

set_arguments() → None[source]#: Set the arguments for the parser.

parse() → Namespace[source]#

Parse the arguments and check for unknown arguments.

Raises:: SystemExit – raise an error if there are unknown arguments.
Returns:: parsed arguments.
Return type:: argparse.Namespace

config.py: code for loading the config file.

Adapted from Jo, J. & al (2022)

ccsd.src.parsers.config.get_config(config: str, seed: int, folder: str = './') → EasyDict[source]#

Load the config file.

Parameters:

config (str) – name of the config file.
seed (int) – random seed (to be added to the config object).
folder (str, optional) – folder where the config folder is located. Defaults to “./”.

Returns:

configuration object.

Return type:

EasyDict

ccsd.src.parsers.config.get_general_config(folder: str = './') → EasyDict[source]#

Get the general configuration.

Parameters:: folder (str, optional) – folder where the config folder is located. Defaults to “./”.
Returns:: general configuration.
Return type:: EasyDict

eden.py: Provides interface for vectorizer.

Code adapted from https://github.com/fabriziocosta/EDeN

Left untouched.

class ccsd.src.evaluation.eden.AbstractVectorizer[source]#

Bases: BaseEstimator, TransformerMixin

Interface declaration for the Vectorizer class.

annotate(graphs, estimator=None, reweight=1.0, relabel=False)[source]#

set_params(**args)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:: **params (dict) – Estimator parameters.
Returns:: self – Estimator instance.
Return type:: estimator instance

transform(graphs)[source]#

vertex_transform(graph)[source]#

set_transform_request(*, graphs: bool | None | str = '$UNCHANGED$') → AbstractVectorizer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:: graphs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for graphs parameter in transform.
Returns:: self – The updated object.
Return type:: object

ccsd.src.evaluation.eden.run_dill_encoded(what)[source]#: Use dill as replacement for pickle to enable multiprocessing on instance methods

ccsd.src.evaluation.eden.apply_async(pool, fun, args, callback=None)[source]#: Wrapper around apply_async() from multiprocessing, to use dill instead of pickle. This is a workaround to enable multiprocessing of classes.

ccsd.src.evaluation.eden.fast_hash_2(dat_1, dat_2, bitmask=4294967295)[source]#

ccsd.src.evaluation.eden.fast_hash_3(dat_1, dat_2, dat_3, bitmask=4294967295)[source]#

ccsd.src.evaluation.eden.fast_hash_4(dat_1, dat_2, dat_3, dat_4, bitmask=4294967295)[source]#

ccsd.src.evaluation.eden.fast_hash(vec, bitmask=4294967295)[source]#

ccsd.src.evaluation.eden.fast_hash_vec(vec, bitmask=4294967295)[source]#

ccsd.src.evaluation.eden.auto_label(graphs, n_clusters=16, **opts)[source]#: Label nodes with cluster id. Nodes are clustered using the output of vertex_vectorize as features.

ccsd.src.evaluation.eden.auto_relabel(graphs, n_clusters=16, **opts)[source]#: Label nodes with cluster id.

ccsd.src.evaluation.eden.vectorize(graphs, **opts)[source]#: Transform real vector labeled, weighted graphs into sparse vectors.

ccsd.src.evaluation.eden.vertex_vectorize(graphs, **opts)[source]#: Transform a list of networkx graphs into a list of sparse matrices.

ccsd.src.evaluation.eden.annotate(graphs, estimator=None, reweight=1.0, vertex_features=False, **opts)[source]#: Return graphs with extra node attributes: importance and features.

ccsd.src.evaluation.eden.kernel_matrix(graphs, **opts)[source]#: Return the kernel matrix.

class ccsd.src.evaluation.eden.Vectorizer(complexity=3, r=None, d=None, min_r=0, min_d=0, weights_dict=None, auto_weights=False, nbits=16, normalization=True, inner_normalization=True, positional=False, discrete=True, use_only_context=False, key_label='label', key_weight='weight', key_nesting='nesting', key_importance='importance', key_class='class', key_vec='vec', key_svec='svec')[source]#

Bases: AbstractVectorizer

Transform real vector labeled, weighted graphs into sparse vectors.

__init__(complexity=3, r=None, d=None, min_r=0, min_d=0, weights_dict=None, auto_weights=False, nbits=16, normalization=True, inner_normalization=True, positional=False, discrete=True, use_only_context=False, key_label='label', key_weight='weight', key_nesting='nesting', key_importance='importance', key_class='class', key_vec='vec', key_svec='svec')[source]#

Constructor.

Parameters:

complexity (int, optional) – The complexity of the features extracted. This is equivalent to setting r = complexity, d = complexity. Defaults to 3.
r (int) – The maximal radius size.
d (int) – The maximal distance size.
min_r (int) – The minimal radius size.
min_d (int) – The minimal distance size.
weights_dict (Dict[Tuple[float, float], float]) – Dictionary with keys = pairs (radius, distance) and value = weights.
auto_weights (bool, optional) – Flag to set to 1 the weight of the kernels for r=i, d=i for i in range(complexity). Defaults to False.
nbits (int, optional) – The number of bits that defines the feature space size: size(feature space)=2^nbits. Defaults to 16.
normalization (bool, optional) – Flag to set the resulting feature vector to have unit euclidean norm. Defaults to True.
inner_normalization (bool, optional) – Flag to set the feature vector for a specific combination of the radius and distance size to have unit euclidean norm. When used together with the ‘normalization’ flag it will be applied first and then the resulting feature vector will be normalized. Defaults to True.
positional (bool, optional) – Flag to make the relative position be sorted by the node ID value. This is useful for ensuring isomorphism for sequences. Defaults to False.
discrete (bool, optional) – Flag to activate more efficient computation of vectorization considering only discrete labels and ignoring vector attributes. Defaults to True.
use_only_context (bool, optional) – Flag to deactivate the central part of the information and retain only the context. Defaults to False.
key_label (string, optional) – The key used to indicate the label information in nodes. Defaults to “label”.
key_weight (string, optional) – The key used to indicate the weight information in nodes. Defaults to “weight”.
key_nesting (string, optional) – The key used to indicate the nesting type in edges. Defaults to “nesting”.
key_importance (string, optional) – The key used to indicate the importance information in nodes. Defaults to “importance”.
key_class (string, optional) – The key used to indicate the predicted class associated to the node. Defaults to “class”.
key_vec (string, optional) – The key used to indicate the vector label information in nodes. Defaults to “vec”.
key_svec (string, optional) – The key used to indicate the sparse vector label information in nodes. Defaults to “svec”.
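
As a rough illustration of the nbits parameter, the sketch below maps an arbitrary integer hash code into a feature space of size 2^nbits with a bitmask. It assumes the masking is done as code & (2^nbits - 1); the function name is hypothetical and this is not the EDeN implementation. Note that the default bitmask=4294967295 of the fast_hash helpers equals 2^32 - 1.

```python
def feature_index(code: int, nbits: int = 16) -> int:
    """Map an integer hash code to an index in [0, 2**nbits)."""
    bitmask = (1 << nbits) - 1  # 65535 for nbits=16
    return code & bitmask


feature_space_size = 1 << 16          # number of distinct indices for nbits=16
index = feature_index(70000, nbits=16)  # 70000 wraps around past 65535
```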

set_params(**args)[source]#: Set the parameters of the vectorizer.

get_params()[source]#

Get parameters for the vectorizer.

Returns:: params – Parameter names mapped to their values.
Return type:: mapping of string to any

save(model_name)[source]#: save.

load(obj)[source]#: load.

transform(graphs: List[Graph])[source]#

Transform a list of networkx graphs into a sparse matrix.

Parameters:: graphs (List[nx.Graph]) – The input list of networkx graphs.
Returns:: shape = [n_samples, n_features] Vector representation of input graphs.
Return type:: data_matrix (array-like)

Examples

python

>>> # transforming the same graph
>>> import networkx as nx
>>> def get_path_graph(length=4):
...     g = nx.path_graph(length)
...     for n,d in g.nodes(data=True):
...         d['label'] = 'C'
...     for a,b,d in g.edges(data=True):
...         d['label'] = '1'
...     return g
>>> g = get_path_graph(4)
>>> g2 = get_path_graph(5)
>>> g2.remove_node(4)
>>> v = Vectorizer()
>>> def vec_to_hash(vec):
...     return hash(tuple(vec.data + vec.indices))
>>> vec_to_hash(v.transform([g])) == vec_to_hash(v.transform([g2]))
True

vertex_transform(graphs: List[Graph])[source]#

Transform a list of networkx graphs into a list of sparse matrices. Each matrix has dimension n_nodes x n_features, i.e. each vertex is associated to a sparse vector that encodes the neighborhood of the vertex up to radius + distance.

Parameters:: graphs (List[nx.Graph]) – The input list of networkx graphs.
Returns:: shape = [n_samples, [n_nodes, n_features]] Vector representation of each vertex in the input graphs.
Return type:: matrix_list (array-like)

annotate(graphs, estimator=None, reweight=1.0, threshold=None, scale=1, vertex_features=False)[source]#

Return graphs with extra attributes: importance and features. Given a list of networkx graphs, if the given estimator is not None and is fitted, return a list of networkx graphs where each vertex has additional attributes with key ‘importance’ and ‘weight’. The importance value of a vertex corresponds to the part of the score that is attributable to the neighborhood of radius r+d of the vertex. The weight value is the absolute value of importance. If vertex_features is True then each vertex has additional attributes with key ‘features’ and ‘vector’.

Parameters:

estimator (scikit-learn estimator) – Scikit-learn predictor trained on data sampled from the same distribution. If None, the vertex weights are set to 1 by default.
reweight (float, optional) – The coefficient used to weight the linear combination of the current weight and the absolute value of the score computed by the estimator. If reweight = 0, the weight is not updated. If reweight = 1, the current weight information is discarded and only abs(score) is used. If reweight = 0.5, the weight is updated with the arithmetic mean of the current weight and abs(score). Defaults to 1.0.
threshold (Optional[float], optional) – If not None, threshold the importance value before computing the weight. Defaults to None.
scale (float, optional) – Multiplicative factor to rescale all weights. Defaults to 1.
vertex_features (bool, optional) – Flag to compute the sparse vector encoding of all features that have that vertex as root. An attribute with key ‘features’ is created for each node that contains a CRS scipy sparse vector, and an attribute with key ‘vector’ is created that contains a python dictionary to store the key-value pairs. Defaults to False.
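
The three reweight cases listed above are consistent with a convex combination of the current weight and abs(score). The sketch below writes that combination out; the function name is hypothetical and the exact formula used by the vectorizer is an assumption inferred from the parameter description.

```python
def reweight_vertex(current_weight: float, score: float, reweight: float = 1.0) -> float:
    """Blend the current weight with abs(score) using the reweight coefficient."""
    return reweight * abs(score) + (1.0 - reweight) * current_weight


kept = reweight_vertex(2.0, -4.0, reweight=0.0)      # weight is not updated
replaced = reweight_vertex(2.0, -4.0, reweight=1.0)  # only abs(score) is used
averaged = reweight_vertex(2.0, -4.0, reweight=0.5)  # arithmetic mean of both
```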

set_transform_request(*, graphs: bool | None | str = '$UNCHANGED$') → Vectorizer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:: graphs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for graphs parameter in transform.
Returns:: self – The updated object.
Return type:: object

ccsd.src.evaluation.eden.serialize_dict(the_dict, full=True, offset='small')[source]#: serialize_dict.

mmd.py: code for computing the MMD (Maximum Mean Discrepancy), a kernel-based statistical test used to determine whether two given distributions are the same. Also contains functions to calculate the EMD (Earth Mover’s Distance) and the L2 distance between two histograms, in addition to Gaussian kernels with these distances.

Adapted from Jo, J. & al (2022)

ccsd.src.evaluation.mmd.emd(x: ndarray, y: ndarray, distance_scaling: float = 1.0) → float[source]#

Calculate the earth mover’s distance (EMD) between two histograms. It corresponds to the Wasserstein metric (see optimal transport theory), whose formula is

W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}.

Adapted from Niu et al. (2020)

Parameters:

x (np.ndarray) – histogram of first distribution
y (np.ndarray) – histogram of second distribution
distance_scaling (float, optional) – distance scaling factor. Defaults to 1.0.

Returns:

EMD value

Return type:

float
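
For intuition, when both histograms have the same total mass and the ground distance between bins i and j is |i - j|, the EMD reduces to the L1 distance between the cumulative sums. The sketch below uses this closed form with the standard library only. It is not the library's implementation: the function name is hypothetical, and both the zero-padding to a common support and the division by distance_scaling are assumptions.

```python
from itertools import accumulate
from typing import Sequence


def emd_1d(x: Sequence[float], y: Sequence[float], distance_scaling: float = 1.0) -> float:
    """EMD between two 1D histograms of equal total mass (ground distance |i - j|)."""
    size = max(len(x), len(y))
    # Zero-pad the shorter histogram so both share the same support
    x_pad = list(x) + [0.0] * (size - len(x))
    y_pad = list(y) + [0.0] * (size - len(y))
    # Sum of absolute differences between the two cumulative distributions
    cum_diff = (abs(a - b) for a, b in zip(accumulate(x_pad), accumulate(y_pad)))
    return sum(cum_diff) / distance_scaling


value = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])  # all mass moves two bins
```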

ccsd.src.evaluation.mmd.l2(x: ndarray, y: ndarray) → float[source]#

Calculate the L2 distance between two histograms

Parameters:

x (np.ndarray) – histogram of first distribution
y (np.ndarray) – histogram of second distribution

Returns:

L2 distance

Return type:

float

ccsd.src.evaluation.mmd.gaussian_emd(x: ndarray, y: ndarray, sigma: float = 1.0, distance_scaling: float = 1.0) → float[source]#

Gaussian kernel with squared distance in exponential term replaced by EMD. The inputs are PMF (Probability mass function). The Gaussian kernel is defined as k(x,y) = exp(-f(x,y)^2/(2*sigma^2)) where f(.,.) is the EMD function.

Parameters:

x (np.ndarray) – 1D pmf of the first distribution with the same support
y (np.ndarray) – 1D pmf of the second distribution with the same support
sigma (float, optional) – standard deviation. Defaults to 1.0.
distance_scaling (float, optional) – distance scaling factor. Defaults to 1.0.

Returns:

Gaussian kernel value

Return type:

float

ccsd.src.evaluation.mmd.gaussian(x: ndarray, y: ndarray, sigma: float = 1.0) → float[source]#

Gaussian kernel with squared distance in exponential term replaced by L2 distance. The inputs are PMF (Probability mass function). The Gaussian kernel is defined as k(x,y) = exp(-N(x, y)^2/(2*sigma^2)) where N(.,.) is the L2 distance function.

Parameters:

x (np.ndarray) – 1D pmf of the first distribution with the same support
y (np.ndarray) – 1D pmf of the second distribution with the same support
sigma (float, optional) – standard deviation. Defaults to 1.0.

Returns:

Gaussian kernel value

Return type:

float

ccsd.src.evaluation.mmd.gaussian_tv(x: ndarray, y: ndarray, sigma: float = 1.0) → float[source]#

Gaussian kernel with squared distance in exponential term replaced by total variation distance (half L1 distance, used in transportation theory). The inputs are PMF (Probability mass function). The Gaussian kernel is defined as k(x,y) = exp(-f(x, y)^2/(2*sigma^2)) where f(x, y) = 0.5 * N(x, y) is the total variation distance (half L1 distance N).

Parameters:

x (np.ndarray) – 1D pmf of the first distribution with the same support
y (np.ndarray) – 1D pmf of the second distribution with the same support
sigma (float, optional) – standard deviation. Defaults to 1.0.

Returns:

Gaussian kernel value

Return type:

float
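
The Gaussian kernels above share the form k(x,y) = exp(-f(x,y)^2/(2*sigma^2)) and differ only in the distance f. The sketch below makes that explicit with a pluggable distance, using the standard library only. The helper names are hypothetical, and the inputs are assumed to already have the same length (the same support).

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l2_distance(x: Vector, y: Vector) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def tv_distance(x: Vector, y: Vector) -> float:
    """Total variation distance, i.e. half of the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))


def gaussian_kernel(
    x: Vector, y: Vector, distance: Callable[[Vector, Vector], float], sigma: float = 1.0
) -> float:
    """Gaussian kernel exp(-f(x, y)^2 / (2 * sigma^2)) for a given distance f."""
    return math.exp(-distance(x, y) ** 2 / (2.0 * sigma**2))


k_l2 = gaussian_kernel([1.0, 0.0], [0.0, 1.0], l2_distance)  # squared distance is 2
k_tv = gaussian_kernel([1.0, 0.0], [0.0, 1.0], tv_distance)  # distance is 1
```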

ccsd.src.evaluation.mmd.kernel_parallel_unpacked(x: ndarray, samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float]) → float[source]#

Calculate the sum of the kernel values between x and all the samples in samples2

Parameters:

x (np.ndarray) – “true sample”
samples2 (Iterator[np.ndarray]) – samples from the generator
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function

Returns:

sum of kernel values

Return type:

float

ccsd.src.evaluation.mmd.kernel_parallel_worker(t: Tuple[ndarray, Iterator[ndarray], Callable[[ndarray, ndarray], float]]) → float[source]#

Wrapper for kernel_parallel_unpacked

Parameters:: t (Tuple[np.ndarray, Iterator[np.ndarray], Callable[[np.ndarray, np.ndarray], float]]) – tuple of arguments
Returns:: sum of kernel values
Return type:: float

ccsd.src.evaluation.mmd.disc(samples1: Iterator[ndarray], samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float], is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False, progress_bar: bool = False, *args, **kwargs) → float[source]#

Calculate the discrepancy between 2 samples

Parameters:

samples1 (Iterator[np.ndarray]) – samples 1
samples2 (Iterator[np.ndarray]) – samples 2
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
is_parallel (bool, optional) – whether or not we use parallel processing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.
progress_bar (bool, optional) – whether or not we print progress bar if is_parallel is set to False. Defaults to False.

Returns:

discrepancy

Return type:

float

ccsd.src.evaluation.mmd.compute_mmd(samples1: Iterator[ndarray], samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float], is_hist: bool = True, *args, **kwargs) → float[source]#

Calculate the MMD (Maximum Mean Discrepancy) between two samples

Parameters:

samples1 (Iterator[np.ndarray]) – samples 1
samples2 (Iterator[np.ndarray]) – samples 2
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
is_hist (bool, optional) – whether or not we normalize the input to transform it into histograms. Defaults to True.

Returns:

MMD

Return type:

float
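
A minimal sketch of an MMD computation is given below, using the standard biased estimator of the squared MMD: the mean kernel value within each sample set minus twice the mean kernel value across the two sets. Whether compute_mmd uses exactly this normalization is an assumption; the helper names and the small Gaussian kernel are hypothetical and only serve the example.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def normalize(v: Vector) -> List[float]:
    """Rescale a vector so that it sums to 1 (left unchanged if it sums to 0)."""
    total = sum(v)
    return [a / total for a in v] if total else list(v)


def mean_kernel(s1: List[Vector], s2: List[Vector], kernel: Kernel) -> float:
    """Average kernel value over all pairs (a, b) with a in s1 and b in s2."""
    return sum(kernel(a, b) for a in s1 for b in s2) / (len(s1) * len(s2))


def mmd_squared(s1: List[Vector], s2: List[Vector], kernel: Kernel, is_hist: bool = True) -> float:
    """Biased estimator of the squared MMD between two sample sets."""
    if is_hist:
        s1 = [normalize(v) for v in s1]
        s2 = [normalize(v) for v in s2]
    return mean_kernel(s1, s1, kernel) + mean_kernel(s2, s2, kernel) - 2.0 * mean_kernel(s1, s2, kernel)


def gaussian_l2(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel on the L2 distance (inputs of equal length)."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * sigma**2))


same = mmd_squared([[2.0, 0.0]], [[4.0, 0.0]], gaussian_l2)       # equal after normalization
different = mmd_squared([[1.0, 0.0]], [[0.0, 1.0]], gaussian_l2)  # disjoint histograms
```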

ccsd.src.evaluation.mmd.compute_emd(samples1: Iterator[ndarray], samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float], is_hist: bool = True, *args, **kwargs) → Tuple[float, List[ndarray]][source]#

Calculate the EMD (Earth Mover’s Distance) between the average of two samples

Parameters:

samples1 (Iterator[np.ndarray]) – samples 1
samples2 (Iterator[np.ndarray]) – samples 2
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
is_hist (bool, optional) – whether or not we normalize the input to transform it into histograms. Defaults to True.

Returns:

EMD and the average of the two samples

Return type:

Tuple[float, List[np.ndarray]]

ccsd.src.evaluation.mmd.preprocess(X: ndarray, max_len: int, is_hist: bool) → ndarray[source]#

Preprocess function for the kernel_compute function below

Parameters:

X (np.ndarray) – input array
max_len (int) – max row length of the new array
is_hist (bool) – if the input array is a histogram

Returns:

preprocessed output array

Return type:

np.ndarray

ccsd.src.evaluation.mmd.kernel_compute(X: List[Graph], Y: List[Graph] | None = None, is_hist: bool = True, metric: str = 'linear', n_jobs: int | None = None) → ndarray[source]#

Function to compute the kernel matrix with list of graphs as inputs and a custom metric. Adapted from https://github.com/idea-iitd/graphgen/blob/master/metrics/mmd.py

Parameters:

X (List[nx.Graph]) – samples 1 (list of graphs)
Y (Optional[List[nx.Graph]], optional) – samples 2 (list of graphs). Defaults to None.
is_hist (bool, optional) – whether or not the input should be histograms (NOT IMPLEMENTED). Defaults to True.
metric (str, optional) – metric. Defaults to “linear”.
n_jobs (Optional[int], optional) – number of jobs for parallel computing. Defaults to None.

Returns:

kernel matrix

Return type:

np.ndarray

ccsd.src.evaluation.mmd.compute_nspdk_mmd(samples1: List[Graph], samples2: List[Graph], metric: str, is_hist: bool = True, n_jobs: int | None = None) → float[source]#

Compute the MMD between two samples of graphs using the NSPDK kernel. Adapted from https://github.com/idea-iitd/graphgen/blob/master/metrics/mmd.py

Parameters:

samples1 (List[nx.Graph]) – samples 1 (list of graphs)
samples2 (List[nx.Graph]) – samples 2 (list of graphs)
metric (str) – metric
is_hist (bool, optional) – whether or not the input should be histograms (NOT IMPLEMENTED). Defaults to True.
n_jobs (Optional[int], optional) – number of jobs for parallel computing. Defaults to None.

ccsd.src.evaluation.mmd.process_tensor(x: ndarray, y: ndarray) → Tuple[ndarray, ndarray][source]#

Process two tensors (vectors) to have the same size (support)

Parameters:

x (np.ndarray) – vector 1
y (np.ndarray) – vector 2

Returns:

processed vectors

Return type:

Tuple[np.ndarray, np.ndarray]
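
One simple way to give two vectors the same support is to zero-pad the shorter one, as sketched below with plain lists. That process_tensor pads with trailing zeros is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def pad_to_same_support(x: Sequence[float], y: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Zero-pad the shorter vector so that both have the same length."""
    size = max(len(x), len(y))
    x_pad = list(x) + [0.0] * (size - len(x))
    y_pad = list(y) + [0.0] * (size - len(y))
    return x_pad, y_pad


x_pad, y_pad = pad_to_same_support([0.2, 0.8], [0.1, 0.3, 0.6])
```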

stats.py: code for computing statistics of graphs.

Adapted from Jo, J. & al (2022)

ccsd.src.evaluation.stats.degree_worker(G: Graph) → ndarray[source]#

Function for computing the degree histogram of a graph.

Returns:: degree histogram
Return type:: np.ndarray
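
To make the notion of a degree histogram concrete, the sketch below counts how many nodes have each degree. It works on a plain edge list to stay dependency-free, whereas degree_worker takes a networkx graph; the function name and input format are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def degree_histogram(num_nodes: int, edges: Iterable[Tuple[int, int]]) -> List[int]:
    """Return hist where hist[d] is the number of nodes of degree d."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree)
    return [counts.get(d, 0) for d in range(max(degree, default=0) + 1)]


# Path graph 0 - 1 - 2 - 3: two end nodes of degree 1, two inner nodes of degree 2
hist = degree_histogram(4, [(0, 1), (1, 2), (2, 3)])
```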

ccsd.src.evaluation.stats.add_tensor(x: ndarray, y: ndarray) → ndarray[source]#

Function for extending the dimension of two tensors so that they have the same support, then adding them together.

Parameters:

x (np.ndarray) – vector 1
y (np.ndarray) – vector 2

Returns:

sum of vector 1 and vector 2

Return type:

np.ndarray

ccsd.src.evaluation.stats.degree_stats(graph_ref_list: List[Graph], graph_pred_list: List[Graph], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False) → float[source]#

Compute the MMD distance between the degree distributions of two unordered sets of graphs.

Parameters:

graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.

Returns:

MMD distance

Return type:

float

ccsd.src.evaluation.stats.spectral_worker(G: Graph) → ndarray[source]#

Function for computing the spectral density of a graph.

Parameters:: G (nx.Graph) – input graph
Returns:: spectral density
Return type:: np.ndarray

ccsd.src.evaluation.stats.spectral_stats(graph_ref_list: List[Graph], graph_pred_list: List[Graph], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False) → ndarray[source]#

Compute the MMD distance between the spectral densities of two unordered sets of graphs.

Parameters:

graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.

Returns:

spectral distance

Return type:

np.ndarray

ccsd.src.evaluation.stats.clustering_worker(param: Tuple[Graph, int]) → ndarray[source]#

Function for computing the histogram of clustering coefficient of a graph.

Parameters:: param (Tuple[nx.Graph, int]) – input graph and number of bins
Returns:: histogram of clustering coefficient
Return type:: np.ndarray

ccsd.src.evaluation.stats.clustering_stats(graph_ref_list: List[Graph], graph_pred_list: List[Graph], kernel: Callable[[ndarray, ndarray], float] = gaussian_emd, bins: int = 100, is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False) → ndarray[source]#

Compute the MMD distance between the clustering coefficients of two unordered sets of graphs. For unweighted graphs, the clustering coefficient of a node u is the fraction of possible triangles through that node that exist.

Parameters:

graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
bins (int, optional) – number of bins for the histogram. Defaults to 100.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.

Returns:

mmd distance

Return type:

float
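
The clustering coefficient definition above (the fraction of possible triangles through a node that exist) can be sketched without networkx, together with a histogram over [0, 1]. The adjacency-dictionary input, the function names, and the binning rule are assumptions made for this example, not the library's implementation.

```python
from itertools import combinations
from typing import Dict, List, Set


def local_clustering(adj: Dict[int, Set[int]], u: int) -> float:
    """Fraction of pairs of neighbors of u that are themselves connected."""
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no pair of neighbors, hence no possible triangle
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def clustering_histogram(adj: Dict[int, Set[int]], bins: int = 100) -> List[int]:
    """Histogram of local clustering coefficients with equal-width bins on [0, 1]."""
    hist = [0] * bins
    for u in adj:
        c = local_clustering(adj, u)
        hist[min(int(c * bins), bins - 1)] += 1  # a value of 1.0 goes to the last bin
    return hist


# Triangle 0-1-2 with a pendant node 3 attached to node 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficients = [local_clustering(adj, u) for u in sorted(adj)]
hist = clustering_histogram(adj, bins=4)
```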

ccsd.src.evaluation.stats.edge_list_reindexed(G: Graph) → List[Tuple[int, int]][source]#

Reindex the nodes of a graph to be contiguous integers starting from 0.

Parameters:: G (nx.Graph) – input graph
Returns:: list of edges (index_u, index_v)
Return type:: List[Tuple[int, int]]
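
Reindexing amounts to mapping each node label to its position in the node list and rewriting the edges accordingly. The sketch below shows this on plain Python lists; the function name and input format are hypothetical, and taking the index from the node iteration order is an assumption.

```python
from typing import Hashable, List, Sequence, Tuple


def reindex_edges(
    nodes: Sequence[Hashable], edges: Sequence[Tuple[Hashable, Hashable]]
) -> List[Tuple[int, int]]:
    """Rewrite edges so that nodes are contiguous integers starting from 0."""
    index = {node: i for i, node in enumerate(nodes)}
    return [(index[u], index[v]) for u, v in edges]


reindexed = reindex_edges(["a", "c", "x"], [("a", "x"), ("c", "x")])
```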

ccsd.src.evaluation.stats.orca(graph: Graph, orca_dir: str) → ndarray[source]#

Compute the orbit counts of a graph using orca.

Parameters:

graph (nx.Graph) – input graph
orca_dir (str) – path to the orca directory where the executables are

Returns:

orbit counts

Return type:

np.ndarray

ccsd.src.evaluation.stats.orbit_stats_all(graph_ref_list: List[Graph], graph_pred_list: List[Graph], kernel: Callable[[ndarray, ndarray], float] = gaussian, folder: str = './') → float[source]#

Compute the MMD distance between the orbits of two unordered sets of graphs.

Parameters:

graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian.
folder (str, optional) – path to the main folder where the ccsd/src/evaluation folders are to locate the orca executable. Defaults to “./”.

Returns:

mmd distance

Return type:

float

ccsd.src.evaluation.stats.nspdk_stats(graph_ref_list: List[Graph], graph_pred_list: List[Graph]) → float[source]#

Compute the MMD distance between the NSPDK kernel of two unordered sets of graphs.

Adapted from https://github.com/idea-iitd/graphgen/blob/master/metrics/stats.py

Parameters:

graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated

Returns:

mmd distance

Return type:

float

ccsd.src.evaluation.stats.eval_graph_list(graph_ref_list: List[Graph], graph_pred_list: List[Graph], methods: List[str] | None = None, kernels: Dict[str, Callable[[ndarray, ndarray], float]] | None = None, folder: str = './') → Dict[str, float][source]#

Evaluate generated generic graphs against a reference set of graphs using a set of methods and their corresponding kernels.

Parameters:

graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
methods (Optional[List[str]], optional) – methods to be evaluated. Defaults to None.
kernels (Optional[Dict[str, Callable[[np.ndarray, np.ndarray], float]]], optional) – kernels to be used for each method. Defaults to None.
folder (str, optional) – path to the main folder where the ccsd/src/evaluation folders are to locate the orca executable. Defaults to “./”.

Returns:

dictionary mapping method names to their corresponding scores

Return type:

Dict[str, float]

ccsd.src.evaluation.stats.eval_torch_batch(ref_batch: Tensor, pred_batch: Tensor, methods: List[str] | None = None, folder: str = './') → Dict[str, float][source]#

Evaluate generated generic graphs against a reference set of graphs using a set of methods and their corresponding kernels, with the input graphs in torch.Tensor format (adjacency matrices).

Parameters:

ref_batch (torch.Tensor) – reference batch of adjacency matrices
pred_batch (torch.Tensor) – target batch of adjacency matrices
methods (Optional[List[str]], optional) – methods to be evaluated. Defaults to None.
folder (str, optional) – path to the main folder where the ccsd/src/evaluation folders are to locate the orca executable. Defaults to “./”.

Returns:

dictionary mapping method names to their corresponding scores

Return type:

Dict[str, float]
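
Since the batches hold adjacency matrices, evaluating them with graph-based statistics requires turning each matrix into a graph first. The sketch below shows one such conversion on nested lists, producing an undirected edge list from the upper triangle. How eval_torch_batch performs this conversion is not documented here, so the function name and the approach are assumptions.

```python
from typing import List, Sequence, Tuple


def adjacency_to_edges(adj: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """List the undirected edges (i, j), i < j, with a nonzero adjacency entry."""
    n = len(adj)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j] != 0]


# Adjacency matrix of the path graph 0 - 1 - 2
edges = adjacency_to_edges([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```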

data_generators.py: functions and GraphGenerator class for generating graphs and graph/combinatorial complex datasets with given properties. Run this script with the -h flag to see how to generate graph and combinatorial complex datasets. The arguments are (see ccsd/src/parsers/parser_generator.py for more details):

--data-dir: directory to save generated graphs. Default: “data”.
--dataset: name of dataset to generate (default “grid”), choices are [“ego_small”, “community_small”, “ENZYMES”, “ENZYMES_small”, “grid”].
--is_cc: if you want to generate combinatorial complexes instead of graphs.
--folder: directory to save the results, load checkpoints, load config, etc. Default: “./”.

Adapted from Jo, J. & al (2022) for the graph generation part.

ccsd.data.data_generators.n_community(num_communities: int, max_nodes: int, p_inter: float = 0.05) → Graph[source]#

Generate a graph with num_communities communities, each of size max_nodes and with inter-community edge probability p_inter. From Niu et al. (2020)

Parameters:

num_communities (int) – number of communities
max_nodes (int) – maximum number of nodes in each community
p_inter (float, optional) – inter-community edge probability. Defaults to 0.05.

Returns:

generated graph

Return type:

nx.Graph
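A minimal stdlib sketch of the idea described above: dense communities joined by sparse random inter-community edges. It follows the wording of this page (each community has max_nodes nodes) and returns a node list and an edge set rather than an nx.Graph. The parameters `p_intra` and `seed` and the function name are hypothetical, and the real implementation may size or connect communities differently.

```python
import random
from itertools import combinations
from typing import List, Optional, Set, Tuple


def n_community_sketch(
    num_communities: int,
    max_nodes: int,
    p_inter: float = 0.05,
    p_intra: float = 0.7,  # assumption: intra-community edge probability
    seed: Optional[int] = None,
) -> Tuple[List[int], Set[Tuple[int, int]]]:
    """Return (nodes, edges) of a random community graph."""
    rng = random.Random(seed)
    # Community c owns the consecutive node ids [c * max_nodes, (c + 1) * max_nodes).
    communities = [
        list(range(c * max_nodes, (c + 1) * max_nodes))
        for c in range(num_communities)
    ]
    edges: Set[Tuple[int, int]] = set()
    # Edges inside each community.
    for members in communities:
        for u, v in combinations(members, 2):
            if rng.random() < p_intra:
                edges.add((u, v))
    # Edges between every pair of distinct communities.
    for a, b in combinations(range(num_communities), 2):
        for u in communities[a]:
            for v in communities[b]:
                if rng.random() < p_inter:
                    edges.add((u, v))
    nodes = [n for members in communities for n in members]
    return nodes, edges


nodes, edges = n_community_sketch(2, 5, p_inter=0.0, p_intra=1.0, seed=0)
```

With p_intra=1.0 and p_inter=0.0 the result is two disjoint complete graphs, which makes the community structure easy to check by hand.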

class ccsd.data.data_generators.GraphGenerator(graph_type: str = 'grid', possible_params_dict: Dict[str, int | ndarray] | None = None, corrupt_func: Callable[[Any], Graph] | None = None)[source]#

Bases: object

Graph generator class.

__init__(graph_type: str = 'grid', possible_params_dict: Dict[str, int | ndarray] | None = None, corrupt_func: Callable[[Any], Graph] | None = None) → None[source]#

Initialize graph generator.

Parameters:

graph_type (str, optional) – type of graphs to generate. Defaults to “grid”.
possible_params_dict (Optional[Dict[str, Union[int, np.ndarray]]], optional) – set of parameters to randomly select. Defaults to None.
corrupt_func (Optional[Callable[[Any], nx.Graph]], optional) – optional function that generates a constant graph (for debugging for example). Defaults to None.

ccsd.data.data_generators.gen_graph_list(graph_type: str = 'grid', possible_params_dict: Dict[str, int | ndarray] | None = None, corrupt_func: Callable[[Any], Graph] | None = None, length: int = 1024, save_dir: str | None = None, file_name: str | None = None, max_node: int | None = None, min_node: int | None = None) → List[Graph][source]#

Generate a list of synthetic graphs.

Parameters:

graph_type (str, optional) – type of graphs to generate. Defaults to “grid”.
possible_params_dict (Optional[Dict[str, Union[int, np.ndarray]]], optional) – set of parameters to randomly select. Defaults to None.
corrupt_func (Optional[Callable[[Any], nx.Graph]], optional) – optional function that generates a constant graph (for debugging for example). Defaults to None.
length (int, optional) – number of graphs to generate. Defaults to 1024.
save_dir (Optional[str], optional) – where to save the generated list of graphs. Defaults to None.
file_name (Optional[str], optional) – name of the file. Defaults to None.
max_node (Optional[int], optional) – maximum number of nodes. Defaults to None.
min_node (Optional[int], optional) – minimum number of nodes. Defaults to None.

Returns:

list of generated graphs

Return type:

List[nx.Graph]
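The parameters above suggest a loop that repeatedly draws generator parameters at random and keeps graphs whose size falls within the node bounds. The sketch below illustrates this for grid graphs using edge lists. The parameter keys "m" and "n", the pre-filtering of valid sizes, and the function names are all assumptions made for illustration; the page does not say how ccsd handles graphs outside the bounds.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

EdgeList = List[Tuple[int, int]]


def grid_edges(m: int, n: int) -> EdgeList:
    """Edges of an m x n grid; node (r, c) has id r * n + c."""
    edges = []
    for r in range(m):
        for c in range(n):
            node = r * n + c
            if c + 1 < n:
                edges.append((node, node + 1))  # right neighbour
            if r + 1 < m:
                edges.append((node, node + n))  # neighbour below
    return edges


def gen_grid_list_sketch(
    possible_params: Dict[str, Sequence[int]],
    length: int = 4,
    max_node: Optional[int] = None,
    min_node: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[Tuple[int, EdgeList]]:
    """Return `length` grids as (num_nodes, edges), sizes drawn at random."""
    rng = random.Random(seed)
    # Keep only the (m, n) pairs whose node count respects the bounds.
    valid = [
        (m, n)
        for m in possible_params["m"]
        for n in possible_params["n"]
        if (max_node is None or m * n <= max_node)
        and (min_node is None or m * n >= min_node)
    ]
    if not valid:
        raise ValueError("no parameter combination satisfies the node bounds")
    graphs = []
    for _ in range(length):
        m, n = rng.choice(valid)
        graphs.append((m * n, grid_edges(m, n)))
    return graphs


graphs = gen_grid_list_sketch(
    {"m": [2, 3], "n": [2, 3]}, length=5, max_node=6, min_node=4, seed=0
)
```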

ccsd.data.data_generators.load_dataset(data_dir: str = 'data', file_name: str | None = None) → List[Graph] | List[CombinatorialComplex][source]#

Load an existing dataset as a list of graphs or list of combinatorial complexes from a file.

Parameters:

data_dir (str, optional) – directory of the dataset. Defaults to “data”.
file_name (Optional[str], optional) – name of the file. Defaults to None.

Returns:

list of graphs or list of combinatorial complexes

Return type:

Union[List[nx.Graph], List[CombinatorialComplex]]

ccsd.data.data_generators.graph_load_batch(min_num_nodes: int = 20, max_num_nodes: int = 1000, name: str = 'ENZYMES', node_attributes: bool = True, graph_labels: bool = True, folder: str = './') → List[Graph][source]#

Load a graph dataset, for ENZYMES, PROTEIN and DD.

Parameters:

min_num_nodes (int, optional) – minimum number of nodes. Defaults to 20.
max_num_nodes (int, optional) – maximum number of nodes. Defaults to 1000.
name (str, optional) – name of the dataset to load. Defaults to “ENZYMES”.
node_attributes (bool, optional) – if True, also load the node attributes. Defaults to True.
graph_labels (bool, optional) – if True, also load the graph labels. Defaults to True.
folder (str, optional) – directory of the data/dataset/ folders. Defaults to “./”.

Returns:

list of graphs

Return type:

List[nx.Graph]

ccsd.data.data_generators.parse_index_file(filename: str) → List[int][source]#

Parse an index file (list of integers).

Parameters:: filename (str) – name of the file
Returns:: list of indices as integers
Return type:: List[int]
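A minimal sketch of such a parser, assuming the file holds one integer per line (the file layout is not stated on this page). The function name `parse_index_file_sketch` and the example file name are hypothetical.

```python
import os
import tempfile
from typing import List


def parse_index_file_sketch(filename: str) -> List[int]:
    """Read one integer per line, skipping blank lines."""
    with open(filename) as handle:
        return [int(line.strip()) for line in handle if line.strip()]


# Usage: write a small index file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.index")
    with open(path, "w") as handle:
        handle.write("3\n1\n\n42\n")
    indices = parse_index_file_sketch(path)
```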

ccsd.data.data_generators.graph_load(dataset: str = 'cora', folder: str = './') → Tuple[spmatrix, List[Graph]][source]#

Load the citation datasets: cora, citeseer or pubmed.

Parameters:

dataset (str, optional) – name of the dataset to load. Defaults to “cora”.
folder (str, optional) – directory of the data/dataset/ folders. Defaults to “./”.

Returns:

tuple of features and the graph

Return type:

Tuple[sp.spmatrix, List[nx.Graph]]

ccsd.data.data_generators.citeseer_ego(radius: int = 3, node_min: int = 50, node_max: int = 400, folder: str = './') → List[Graph][source]#

Load the citeseer dataset, keep the largest connected component, and extract the ego graphs (graphs of nodes within a certain radius) with a number of nodes within our range.

Parameters:

radius (int, optional) – radius. Defaults to 3.
node_min (int, optional) – minimum number of nodes in our dataset. Defaults to 50.
node_max (int, optional) – maximum number of nodes in our dataset. Defaults to 400.
folder (str, optional) – directory of the data/dataset/ folders. Defaults to “./”.

Returns:

list of (ego) graphs

Return type:

List[nx.Graph]
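The ego-graph extraction step can be sketched with a breadth-first search over a plain adjacency dictionary. This illustration omits dataset loading and the largest-connected-component step, treats the node-count bounds as inclusive (an assumption), and uses hypothetical names (`ego_nodes`, `ego_graphs_sketch`); the documented function returns networkx graphs.

```python
from collections import deque
from typing import Dict, List, Set

Adjacency = Dict[int, Set[int]]


def ego_nodes(adj: Adjacency, center: int, radius: int) -> Set[int]:
    """Nodes within `radius` hops of `center`, found by breadth-first search."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the radius
        for neighbour in adj[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


def ego_graphs_sketch(
    adj: Adjacency, radius: int = 3, node_min: int = 50, node_max: int = 400
) -> List[Adjacency]:
    """One induced ego graph per node, kept if its size is within bounds."""
    result = []
    for center in sorted(adj):
        nodes = ego_nodes(adj, center, radius)
        if node_min <= len(nodes) <= node_max:
            # Induced subgraph: keep only neighbours inside the ego set.
            result.append({n: adj[n] & nodes for n in nodes})
    return result


# Path graph 0 - 1 - 2 - 3 - 4.
path_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
egos = ego_graphs_sketch(path_adj, radius=1, node_min=3, node_max=3)
```

On the 5-node path with radius 1, only the three interior nodes have ego graphs of exactly three nodes, so the end nodes are filtered out.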

ccsd.data.data_generators.save_dataset(data_dir: str, obj: List[Graph] | List[CombinatorialComplex], save_name: str, save_txt: bool = True) → None[source]#

Save the dataset (objects) in the specified directory.

Parameters:

data_dir (str) – directory to save the dataset
obj (Union[List[nx.Graph], List[CombinatorialComplex]]) – list of objects to save
save_name (str) – name of the dataset
save_txt (bool, optional) – whether to save a txt file with the name and the number of objects (or size of DataLoader). Defaults to True.
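A save and load round trip can be sketched as follows. This assumes a pickle-based file format, a ".pkl" extension, and a one-line text summary; none of these details are confirmed on this page, and the function names are hypothetical stand-ins for save_dataset and load_dataset.

```python
import os
import pickle
import tempfile
from typing import Any, List


def save_dataset_sketch(
    data_dir: str, obj: List[Any], save_name: str, save_txt: bool = True
) -> None:
    """Pickle `obj` and optionally write a short text summary next to it."""
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, save_name + ".pkl"), "wb") as handle:
        pickle.dump(obj, handle)
    if save_txt:
        # Assumed summary format: dataset name and number of objects.
        with open(os.path.join(data_dir, save_name + ".txt"), "w") as handle:
            handle.write(f"{save_name}: {len(obj)} objects\n")


def load_dataset_sketch(data_dir: str, file_name: str) -> List[Any]:
    """Load a list of objects previously saved with save_dataset_sketch."""
    with open(os.path.join(data_dir, file_name + ".pkl"), "rb") as handle:
        return pickle.load(handle)


# Usage: two toy graphs stored as edge lists, saved and reloaded.
toy_graphs = [[(0, 1), (1, 2)], [(0, 1)]]
with tempfile.TemporaryDirectory() as tmp_dir:
    save_dataset_sketch(tmp_dir, toy_graphs, "toy")
    loaded = load_dataset_sketch(tmp_dir, "toy")
    with open(os.path.join(tmp_dir, "toy.txt")) as handle:
        summary = handle.read()
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.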

ccsd.data.data_generators.generate_dataset(args: Namespace) → None[source]#

Generate a graph/combinatorial complex dataset and save it in the specified directory.

Parameters:: args (argparse.Namespace) – arguments
Raises:: NotImplementedError – raise an error if the specified dataset is not implemented

preprocess_for_nspdk.py: preprocess the test molecules for NSPDK.

Adapted from Jo, J. & al (2022)

ccsd.data.preprocess_for_nspdk.preprocess_nspdk(args: Namespace, print_elapsed_time: bool = True) → None[source]#

Preprocess the test molecules for NSPDK

Parameters:

args (argparse.Namespace) – arguments
print_elapsed_time (bool, optional) – if True, print the elapsed time to preprocess the test molecules. Defaults to True.

Raises:

ValueError – raise an error if the dataset is not supported. Molecule dataset supported: QM9, ZINC250k

preprocess.py: preprocess the molecule datasets (not for NSPDK).

Adapted from Jo, J. & al (2022)

ccsd.data.preprocess.preprocess(args: Namespace, print_elapsed_time: bool = True) → None[source]#

Preprocess the molecules (not for NSPDK)

Adapted from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

Parameters:

args (argparse.Namespace) – arguments
print_elapsed_time (bool, optional) – if True, print the elapsed time to preprocess the molecules. Defaults to True.

Raises:

ValueError – raise an error if the dataset is not supported. Molecule dataset supported: QM9, ZINC250k

data_frame_parser.py: preprocess the molecule datasets (not for NSPDK). Only used in data/preprocess.py.

Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

Adapted from chainer_chemistry/dataset/parsers/data_frame_parser.py

Code from Jo, J. & al (2022)

Left untouched.

class ccsd.data.utils.data_frame_parser.DataFrameParser(preprocessor: GGNNPreprocessor, labels: List[str] | None = None, smiles_col: str = 'smiles', postprocess_label: Callable[[List[str]], List[str]] | None = None, postprocess_fn: Callable[[List[ndarray] | Tuple[ndarray]], List[ndarray] | Tuple[ndarray]] | None = None, logger: Logger | None = None)[source]#

Bases: object

DataFrame parser class. Only used in data/preprocess.py.

Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

Adapted from chainer_chemistry/dataset/parsers/data_frame_parser.py

__init__(preprocessor: GGNNPreprocessor, labels: List[str] | None = None, smiles_col: str = 'smiles', postprocess_label: Callable[[List[str]], List[str]] | None = None, postprocess_fn: Callable[[List[ndarray] | Tuple[ndarray]], List[ndarray] | Tuple[ndarray]] | None = None, logger: Logger | None = None)[source]#

parse(df, return_smiles=False, target_index=None, return_is_successful=False)[source]#

extract_total_num(df)[source]#

numpytupledataset.py: NumpyTupleDataset class. Only used in data/preprocess.py.

Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

Code from Jo, J. & al (2022)

Left untouched.

class ccsd.data.utils.numpytupledataset.NumpyTupleDataset(datasets, transform=None)[source]#

Bases: Dataset

NumpyTupleDataset class. Only used in data/preprocess.py.

Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

__init__(datasets, transform=None)[source]#

get_datasets()[source]#

classmethod save(filepath, numpy_tuple_dataset)[source]#

classmethod load(filepath, transform=None)[source]#

smile_to_graph.py: GGNNPreprocessor class and related helper functions. Only used in data/preprocess.py.

Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

Adapted from chainer_chemistry/dataset/preprocessors/common

Code from Jo, J. & al (2022)

Left untouched.

class ccsd.data.utils.smile_to_graph.GGNNPreprocessor(max_atoms=-1, out_size=-1, add_Hs=False, kekulize=True)[source]#

Bases: object

GGNN Preprocessor. Only used in data/preprocess.py.

Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow

Adapted from chainer_chemistry/dataset/preprocessors/common

__init__(max_atoms=-1, out_size=-1, add_Hs=False, kekulize=True)[source]#

get_input_features(mol)[source]#

prepare_smiles_and_mol(mol)[source]#

get_label(mol, label_names=None)[source]#

exception ccsd.data.utils.smile_to_graph.MolFeatureExtractionError[source]#: Bases: Exception

ccsd.data.utils.smile_to_graph.type_check_num_atoms(mol, num_max_atoms=-1)[source]#

ccsd.data.utils.smile_to_graph.construct_atomic_number_array(mol, out_size=-1)[source]#

ccsd.data.utils.smile_to_graph.construct_adj_matrix(mol, out_size=-1, self_connection=True)[source]#

ccsd.data.utils.smile_to_graph.construct_discrete_edge_matrix(mol, out_size=-1)[source]#