Others#
This section covers all the other modules that are not part of the main program, such as the parsers, the evaluation functions, and the data generation functions.
Most of these functions have been adapted from MoFlow (https://github.com/calvin-zcx/moflow) and Jo, J. & al (2022) (https://github.com/harryjo97/GDSS).
parser.py: code for parsing the arguments of the main script (experiments).
Adapted from Jo, J. & al (2022)
Left almost untouched.
- class ccsd.src.parsers.parser.Parser[source]#
Bases:
object
Parser class to parse the arguments to run the experiments.
parser_generator.py: code for parsing the arguments of the graph dataset generator.
- class ccsd.src.parsers.parser_generator.ParserGenerator[source]#
Bases:
object
ParserGenerator class to parse the arguments to create graph datasets.
parser_preprocess.py: code for parsing the arguments of the scripts that preprocess the molecule datasets.
- class ccsd.src.parsers.parser_preprocess.ParserPreprocess[source]#
Bases:
object
ParserPreprocess class to parse the arguments of the scripts that preprocess the molecule datasets.
config.py: code for loading the config file.
Adapted from Jo, J. & al (2022)
- ccsd.src.parsers.config.get_config(config: str, seed: int, folder: str = './') EasyDict [source]#
Load the config file.
- Parameters:
config (str) – name of the config file.
seed (int) – random seed (to be added to the config object).
folder (str, optional) – folder where the config folder is located. Defaults to “./”.
- Returns:
configuration object.
- Return type:
EasyDict
- ccsd.src.parsers.config.get_general_config(folder: str = './') EasyDict [source]#
Get the general configuration.
- Parameters:
folder (str, optional) – folder where the config folder is located. Defaults to “./”.
- Returns:
general configuration.
- Return type:
EasyDict
eden.py: Provides interface for vectorizer.
Code adapted from https://github.com/fabriziocosta/EDeN
Left untouched.
- class ccsd.src.evaluation.eden.AbstractVectorizer[source]#
Bases:
BaseEstimator
,TransformerMixin
Interface declaration for the Vectorizer class.
- set_params(**args)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_transform_request(*, graphs: bool | None | str = '$UNCHANGED$') AbstractVectorizer #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
graphs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for graphs parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- ccsd.src.evaluation.eden.run_dill_encoded(what)[source]#
Use dill as replacement for pickle to enable multiprocessing on instance methods
- ccsd.src.evaluation.eden.apply_async(pool, fun, args, callback=None)[source]#
Wrapper around apply_async() from multiprocessing, to use dill instead of pickle. This is a workaround to enable multiprocessing of classes.
- ccsd.src.evaluation.eden.auto_label(graphs, n_clusters=16, **opts)[source]#
Label nodes with cluster id. Cluster nodes using as features the output of vertex_vectorize.
- ccsd.src.evaluation.eden.auto_relabel(graphs, n_clusters=16, **opts)[source]#
Label nodes with cluster id.
- ccsd.src.evaluation.eden.vectorize(graphs, **opts)[source]#
Transform real vector labeled, weighted graphs in sparse vectors.
- ccsd.src.evaluation.eden.vertex_vectorize(graphs, **opts)[source]#
Transform a list of networkx graphs into a list of sparse matrices.
- ccsd.src.evaluation.eden.annotate(graphs, estimator=None, reweight=1.0, vertex_features=False, **opts)[source]#
Return graphs with extra node attributes: importance and features.
- class ccsd.src.evaluation.eden.Vectorizer(complexity=3, r=None, d=None, min_r=0, min_d=0, weights_dict=None, auto_weights=False, nbits=16, normalization=True, inner_normalization=True, positional=False, discrete=True, use_only_context=False, key_label='label', key_weight='weight', key_nesting='nesting', key_importance='importance', key_class='class', key_vec='vec', key_svec='svec')[source]#
Bases:
AbstractVectorizer
Transform real vector labeled, weighted graphs in sparse vectors.
- __init__(complexity=3, r=None, d=None, min_r=0, min_d=0, weights_dict=None, auto_weights=False, nbits=16, normalization=True, inner_normalization=True, positional=False, discrete=True, use_only_context=False, key_label='label', key_weight='weight', key_nesting='nesting', key_importance='importance', key_class='class', key_vec='vec', key_svec='svec')[source]#
Constructor.
- Parameters:
complexity (int, optional) – The complexity of the features extracted. This is equivalent to setting r = complexity, d = complexity. Defaults to 3.
r (int) – The maximal radius size.
d (int) – The maximal distance size.
min_r (int) – The minimal radius size.
min_d (int) – The minimal distance size.
weights_dict (Dict[Tuple[float, float], float]) – Dictionary with keys = pairs (radius, distance) and value = weights.
auto_weights (bool, optional) – Flag to set to 1 the weight of the kernels for r=i, d=i for i in range(complexity). Defaults to False.
nbits (int, optional) – The number of bits that defines the feature space size: size(feature space)=2^nbits. Defaults to 16.
normalization (bool, optional) – Flag to set the resulting feature vector to have unit euclidean norm. Defaults to True.
inner_normalization (bool, optional) – Flag to set the feature vector for a specific combination of the radius and distance size to have unit euclidean norm. When used together with the ‘normalization’ flag it will be applied first and then the resulting feature vector will be normalized. Defaults to True.
positional (bool, optional) – Flag to make the relative position be sorted by the node ID value. This is useful for ensuring isomorphism for sequences. Defaults to False.
discrete (bool, optional) – Flag to activate more efficient computation of vectorization considering only discrete labels and ignoring vector attributes. Defaults to True.
use_only_context (bool, optional) – Flag to deactivate the central part of the information and retain only the context. Defaults to False.
key_label (string, optional) – The key used to indicate the label information in nodes. Defaults to “label”.
key_weight (string, optional) – The key used to indicate the weight information in nodes. Defaults to “weight”.
key_nesting (string, optional) – The key used to indicate the nesting type in edges. Defaults to “nesting”.
key_importance (string, optional) – The key used to indicate the importance information in nodes. Defaults to “importance”.
key_class (string, optional) – The key used to indicate the predicted class associated to the node. Defaults to “class”.
key_vec (string, optional) – The key used to indicate the vector label information in nodes. Defaults to “vec”.
key_svec (string, optional) – The key used to indicate the sparse vector label information in nodes. Defaults to “svec”.
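As a rough illustration of how complexity, r, d, and nbits relate, the sketch below resolves the radius and distance from complexity when they are not given and computes the feature space size. The helper name resolve_vectorizer_sizes is hypothetical, and the fallback rule is an assumption drawn from the parameter descriptions above, not the actual Vectorizer code.

```python
from typing import Optional, Tuple


def resolve_vectorizer_sizes(
    complexity: int = 3,
    r: Optional[int] = None,
    d: Optional[int] = None,
    nbits: int = 16,
) -> Tuple[int, int, int]:
    """Return (radius, distance, feature_space_size) for the given settings."""
    # Assumption: an unset r or d falls back to `complexity`.
    radius = complexity if r is None else r
    distance = complexity if d is None else d
    # The documented feature space size is 2^nbits.
    feature_space_size = 2 ** nbits
    return radius, distance, feature_space_size
```

With the defaults, this gives a radius and distance of 3 and a feature space of 65536 dimensions.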
- get_params()[source]#
Get parameters for the vectorizer.
- Returns:
params – Parameter names mapped to their values.
- Return type:
mapping of string to any
- transform(graphs: List[Graph])[source]#
Transform a list of networkx graphs into a sparse matrix.
- Parameters:
graphs (List[nx.Graph]) – The input list of networkx graphs.
- Returns:
shape = [n_samples, n_features] Vector representation of input graphs.
- Return type:
data_matrix (array-like)
Examples
>>> # transforming the same graph
>>> import networkx as nx
>>> def get_path_graph(length=4):
...     g = nx.path_graph(length)
...     for n,d in g.nodes(data=True):
...         d['label'] = 'C'
...     for a,b,d in g.edges(data=True):
...         d['label'] = '1'
...     return g
>>> g = get_path_graph(4)
>>> g2 = get_path_graph(5)
>>> g2.remove_node(4)
>>> v = Vectorizer()
>>> def vec_to_hash(vec):
...     return hash(tuple(vec.data + vec.indices))
>>> vec_to_hash(v.transform([g])) == vec_to_hash(v.transform([g2]))
True
- vertex_transform(graphs: List[Graph])[source]#
Transform a list of networkx graphs into a list of sparse matrices. Each matrix has dimension n_nodes x n_features, i.e. each vertex is associated to a sparse vector that encodes the neighborhood of the vertex up to radius + distance.
- Parameters:
graphs (List[nx.Graph]) – The input list of networkx graphs.
- Returns:
shape = [n_samples, [n_nodes, n_features]] Vector representation of each vertex in the input graphs.
- Return type:
matrix_list (array-like)
- annotate(graphs, estimator=None, reweight=1.0, threshold=None, scale=1, vertex_features=False)[source]#
Return graphs with extra attributes: importance and features. Given a list of networkx graphs, if the given estimator is not None and is fitted, return a list of networkx graphs where each vertex has additional attributes with key ‘importance’ and ‘weight’. The importance value of a vertex corresponds to the part of the score that is imputable to the neighborhood of radius r+d of the vertex. The weight value is the absolute value of importance. If vertex_features is True then each vertex has additional attributes with key ‘features’ and ‘vector’.
- Parameters:
estimator (scikit-learn estimator) – Scikit-learn predictor trained on data sampled from the same distribution. If None the vertex weights are set by default 1.
reweight (float, optional) – The coefficient used to weight the linear combination of the current weight and the absolute value of the score computed by the estimator. If reweight = 0, the weight is not updated. If reweight = 1, the current weight information is discarded and only abs(score) is used. If reweight = 0.5, the weight is updated with the arithmetic mean of the current weight information and abs(score). Defaults to 1.0.
threshold (Optional[float], optional) – If not None, threshold the importance value before computing the weight. Defaults to None.
scale (float, optional) – Multiplicative factor to rescale all weights. Defaults to 1.
vertex_features (bool, optional) – Flag to compute the sparse vector encoding of all features that have that vertex as root. An attribute with key ‘features’ is created for each node that contains a CRS scipy sparse vector, and an attribute with key ‘vector’ is created that contains a python dictionary to store the key, values pairs. Defaults to False.
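The three reweight cases described above are consistent with a convex combination of the current weight and abs(score). The following is a minimal sketch of that rule; the function name reweight_vertex is hypothetical, and the exact update used inside annotate is assumed rather than confirmed.

```python
def reweight_vertex(current_weight: float, score: float, reweight: float = 1.0) -> float:
    """Blend the current vertex weight with the absolute estimator score."""
    # reweight = 0   -> keep the current weight unchanged
    # reweight = 1   -> use only abs(score)
    # reweight = 0.5 -> arithmetic mean of the two
    return reweight * abs(score) + (1.0 - reweight) * current_weight
```

For example, a vertex with weight 2.0 and score -4.0 ends up with weight 3.0 when reweight is 0.5.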
- set_transform_request(*, graphs: bool | None | str = '$UNCHANGED$') Vectorizer #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
graphs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for graphs parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- ccsd.src.evaluation.eden.serialize_dict(the_dict, full=True, offset='small')[source]#
serialize_dict.
mmd.py: code for computing the MMD (Maximum Mean Discrepancy), a kernel-based statistical test used to determine whether two given distributions are the same. It also contains functions to calculate the EMD (Earth Mover’s Distance) and the L2 distance between two histograms, as well as Gaussian kernels built on these distances.
Adapted from Jo, J. & al (2022)
- ccsd.src.evaluation.mmd.emd(x: ndarray, y: ndarray, distance_scaling: float = 1.0) float [source]#
Calculate the earth mover’s distance (EMD) between two histograms.
It corresponds to the Wasserstein metric (see optimal transport theory). The formula is
\left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}.
Adapted from Niu et al. (2020)
- Parameters:
x (np.ndarray) – histogram of first distribution
y (np.ndarray) – histogram of second distribution
distance_scaling (float, optional) – distance scaling factor. Defaults to 1.0.
- Returns:
EMD value
- Return type:
float
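For intuition, the 1D case with p = 1 has a closed form: the EMD equals the sum of the absolute values of the running difference between the two histograms. The sketch below uses that shortcut in pure Python. It assumes unit-spaced bins, equal total mass, and a ground distance of |i - j| between bins; the name emd_1d is hypothetical, and the library function may rely on a general optimal transport solver instead.

```python
from itertools import accumulate
from typing import Sequence


def emd_1d(x: Sequence[float], y: Sequence[float]) -> float:
    """Earth mover's distance between two 1D histograms on the same bins."""
    if len(x) != len(y):
        raise ValueError("histograms must share the same support")
    # Mass surplus (or deficit) in each bin.
    diffs = [a - b for a, b in zip(x, y)]
    # The running surplus is the mass that must cross each bin boundary.
    return sum(abs(flow) for flow in accumulate(diffs))
```

Moving all the mass from the first bin to the third bin of a three-bin histogram costs 2.0, since the mass travels two bins.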
- ccsd.src.evaluation.mmd.l2(x: ndarray, y: ndarray) float [source]#
Calculate the L2 distance between two histograms
- Parameters:
x (np.ndarray) – histogram of first distribution
y (np.ndarray) – histogram of second distribution
- Returns:
L2 distance
- Return type:
float
- ccsd.src.evaluation.mmd.gaussian_emd(x: ndarray, y: ndarray, sigma: float = 1.0, distance_scaling: float = 1.0) float [source]#
Gaussian kernel with the squared distance in the exponential term replaced by the EMD. The inputs are PMFs (probability mass functions). The Gaussian kernel is defined as k(x,y) = exp(-f(x,y)^2/(2*sigma^2)) where f(.,.) is the EMD function.
- Parameters:
x (np.ndarray) – 1D pmf of the first distribution with the same support
y (np.ndarray) – 1D pmf of the second distribution with the same support
sigma (float, optional) – standard deviation. Defaults to 1.0.
distance_scaling (float, optional) – distance scaling factor. Defaults to 1.0.
- Returns:
Gaussian kernel value
- Return type:
float
- ccsd.src.evaluation.mmd.gaussian(x: ndarray, y: ndarray, sigma: float = 1.0) float [source]#
Gaussian kernel with the squared distance in the exponential term replaced by the L2 distance. The inputs are PMFs (probability mass functions). The Gaussian kernel is defined as k(x,y) = exp(-N(x, y)^2/(2*sigma^2)) where N(.,.) is the L2 distance function.
- Parameters:
x (np.ndarray) – 1D pmf of the first distribution with the same support
y (np.ndarray) – 1D pmf of the second distribution with the same support
sigma (float, optional) – standard deviation. Defaults to 1.0.
- Returns:
Gaussian kernel value
- Return type:
float
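The kernel formula above can be written in a few lines. This pure-Python sketch assumes both PMFs are already on the same support; the name gaussian_l2 is hypothetical and the code is an illustration rather than the library implementation.

```python
import math
from typing import Sequence


def gaussian_l2(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-N(x, y)^2 / (2 * sigma^2)) with N the L2 distance."""
    if len(x) != len(y):
        raise ValueError("inputs must share the same support")
    # Squared L2 distance between the two PMFs.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma * sigma))
```

Identical inputs give a kernel value of 1.0, and the value decays toward 0 as the PMFs move apart.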
- ccsd.src.evaluation.mmd.gaussian_tv(x: ndarray, y: ndarray, sigma: float = 1.0) float [source]#
Gaussian kernel with the squared distance in the exponential term replaced by the total variation distance (half the L1 distance, used in transportation theory). The inputs are PMFs (probability mass functions). The Gaussian kernel is defined as k(x,y) = exp(-f(x, y)^2/(2*sigma^2)) where f(x, y) = 0.5 * N(x, y) is the total variation distance (half the L1 distance N).
- Parameters:
x (np.ndarray) – 1D pmf of the first distribution with the same support
y (np.ndarray) – 1D pmf of the second distribution with the same support
sigma (float, optional) – standard deviation. Defaults to 1.0.
- Returns:
Gaussian kernel value
- Return type:
float
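A direct transcription of this definition might look as follows. The name gaussian_tv_sketch is hypothetical, and the inputs are assumed to be PMFs already padded to the same support.

```python
import math
from typing import Sequence


def gaussian_tv_sketch(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """Gaussian kernel built on the total variation distance (half the L1 distance)."""
    if len(x) != len(y):
        raise ValueError("inputs must share the same support")
    # Total variation distance: f(x, y) = 0.5 * sum_i |x_i - y_i|.
    total_variation = 0.5 * sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-(total_variation ** 2) / (2.0 * sigma * sigma))
```

Two PMFs with disjoint support have a total variation distance of 1, which gives exp(-0.5) at sigma = 1.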
- ccsd.src.evaluation.mmd.kernel_parallel_unpacked(x: ndarray, samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float]) float [source]#
Calculate the sum of the kernel values between x and all the samples in samples2
- Parameters:
x (np.ndarray) – “true sample”
samples2 (Iterator[np.ndarray]) – samples from the generator
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
- Returns:
sum of kernel values
- Return type:
float
- ccsd.src.evaluation.mmd.kernel_parallel_worker(t: Tuple[ndarray, Iterator[ndarray], Callable[[ndarray, ndarray], float]]) float [source]#
Wrapper for kernel_parallel_unpacked
- Parameters:
t (Tuple[np.ndarray, Iterator[np.ndarray], Callable[[np.ndarray, np.ndarray], float]]) – tuple of arguments
- Returns:
sum of kernel values
- Return type:
float
- ccsd.src.evaluation.mmd.disc(samples1: Iterator[ndarray], samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float], is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False, progress_bar: bool = False, *args, **kwargs) float [source]#
Calculate the discrepancy between 2 samples
- Parameters:
samples1 (Iterator[np.ndarray]) – samples 1
samples2 (Iterator[np.ndarray]) – samples 2
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
is_parallel (bool, optional) – whether or not we use parallel processing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.
progress_bar (bool, optional) – whether or not we print progress bar if is_parallel is set to False. Defaults to False.
- Returns:
discrepancy
- Return type:
float
- ccsd.src.evaluation.mmd.compute_mmd(samples1: Iterator[ndarray], samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float], is_hist: bool = True, *args, **kwargs) float [source]#
Calculate the MMD (Maximum Mean Discrepancy) between two samples
- Parameters:
samples1 (Iterator[np.ndarray]) – samples 1
samples2 (Iterator[np.ndarray]) – samples 2
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
is_hist (bool, optional) – whether or not we normalize the input to transform it into histograms. Defaults to True.
- Returns:
MMD
- Return type:
float
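A common estimator of the squared MMD combines three average kernel values: within the first sample, within the second sample, and across the two. The sketch below shows that estimator in pure Python, with an optional normalization step standing in for is_hist. The helper names (mean_kernel, mmd_squared, gaussian_l2) are hypothetical, and whether compute_mmd matches this estimator in every detail is an assumption.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def gaussian_l2(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel on the L2 distance (inputs share the same support)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma * sigma))


def mean_kernel(samples1: List[Vector], samples2: List[Vector], kernel: Kernel) -> float:
    """Average kernel value over all pairs (a, b) with a in samples1, b in samples2."""
    total = sum(kernel(a, b) for a in samples1 for b in samples2)
    return total / (len(samples1) * len(samples2))


def normalize(hist: Vector) -> List[float]:
    """Scale a histogram so that it sums to 1 (left unchanged if it is empty)."""
    mass = sum(hist)
    return [v / mass for v in hist] if mass > 0 else list(hist)


def mmd_squared(samples1: List[Vector], samples2: List[Vector],
                kernel: Kernel, is_hist: bool = True) -> float:
    """Biased estimator: E[k(a, a')] + E[k(b, b')] - 2 E[k(a, b)]."""
    if is_hist:
        samples1 = [normalize(s) for s in samples1]
        samples2 = [normalize(s) for s in samples2]
    return (mean_kernel(samples1, samples1, kernel)
            + mean_kernel(samples2, samples2, kernel)
            - 2.0 * mean_kernel(samples1, samples2, kernel))
```

Comparing a sample set with itself yields 0, while sets concentrated on different bins yield a strictly positive value.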
- ccsd.src.evaluation.mmd.compute_emd(samples1: Iterator[ndarray], samples2: Iterator[ndarray], kernel: Callable[[ndarray, ndarray], float], is_hist: bool = True, *args, **kwargs) Tuple[float, List[ndarray]] [source]#
Calculate the EMD (Earth Mover Distance) between the average of two samples
- Parameters:
samples1 (Iterator[np.ndarray]) – samples 1
samples2 (Iterator[np.ndarray]) – samples 2
kernel (Callable[[np.ndarray, np.ndarray], float]) – kernel function
is_hist (bool, optional) – whether or not we normalize the input to transform it into histograms. Defaults to True.
- Returns:
EMD and the average of the two samples
- Return type:
Tuple[float, List[np.ndarray]]
- ccsd.src.evaluation.mmd.preprocess(X: ndarray, max_len: int, is_hist: bool) ndarray [source]#
Preprocess function for the kernel_compute function below
- Parameters:
X (np.ndarray) – input array
max_len (int) – max row length of the new array
is_hist (bool) – if the input array is a histogram
- Returns:
preprocessed output array
- Return type:
np.ndarray
- ccsd.src.evaluation.mmd.kernel_compute(X: List[Graph], Y: List[Graph] | None = None, is_hist: bool = True, metric: str = 'linear', n_jobs: int | None = None) ndarray [source]#
Function to compute the kernel matrix with lists of graphs as inputs and a custom metric. Adapted from https://github.com/idea-iitd/graphgen/blob/master/metrics/mmd.py
- Parameters:
X (List[nx.Graph]) – samples 1 (list of graphs)
Y (Optional[List[nx.Graph]], optional) – samples 2 (list of graphs). Defaults to None.
is_hist (bool, optional) – whether or not the input should be histograms (NOT IMPLEMENTED). Defaults to True.
metric (str, optional) – metric. Defaults to “linear”.
n_jobs (Optional[int], optional) – number of jobs for parallel computing. Defaults to None.
- Returns:
kernel matrix
- Return type:
np.ndarray
- ccsd.src.evaluation.mmd.compute_nspdk_mmd(samples1: List[Graph], samples2: List[Graph], metric: str, is_hist: bool = True, n_jobs: int | None = None) float [source]#
Compute the MMD between two samples of graphs using the NSPDK kernel. Adapted from https://github.com/idea-iitd/graphgen/blob/master/metrics/mmd.py
- Parameters:
samples1 (List[nx.Graph]) – samples 1 (list of graphs)
samples2 (List[nx.Graph]) – samples 2 (list of graphs)
metric (str) – metric
is_hist (bool, optional) – whether or not the input should be histograms (NOT IMPLEMENTED). Defaults to True.
n_jobs (Optional[int], optional) – number of jobs for parallel computing. Defaults to None.
- ccsd.src.evaluation.mmd.process_tensor(x: ndarray, y: ndarray) Tuple[ndarray, ndarray] [source]#
Process two tensors (vectors) to have the same size (support)
- Parameters:
x (np.ndarray) – vector 1
y (np.ndarray) – vector 2
- Returns:
processed vectors
- Return type:
Tuple[np.ndarray, np.ndarray]
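One simple way to give two vectors the same support is to zero-pad the shorter one. The sketch below does this with plain lists; the name pad_to_same_support is hypothetical, and padding at the end with zeros is an assumption about how the supports are aligned.

```python
from typing import List, Sequence, Tuple


def pad_to_same_support(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Zero-pad the shorter vector so both have the same length."""
    size = max(len(x), len(y))
    # Appending zeros leaves the existing bins untouched.
    padded_x = list(x) + [0.0] * (size - len(x))
    padded_y = list(y) + [0.0] * (size - len(y))
    return padded_x, padded_y
```

This matters for histograms such as degree counts, where graphs with different maximum degrees produce vectors of different lengths.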
stats.py: code for computing statistics of graphs.
Adapted from Jo, J. & al (2022)
- ccsd.src.evaluation.stats.degree_worker(G: Graph) ndarray [source]#
Function for computing the degree histogram of a graph.
- Returns:
degree histogram
- Return type:
np.ndarray
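A degree histogram stores, at index k, the number of nodes with degree k. The following sketch computes one from a node list and an edge list without networkx; the function name and the edge-list input format are illustrative assumptions, not the signature of degree_worker.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def degree_histogram(
    nodes: Iterable[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> List[int]:
    """Return a list whose k-th entry is the number of nodes of degree k."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        # Each undirected edge contributes one to both endpoints.
        degree[u] += 1
        degree[v] += 1
    if not degree:
        return []
    counts = Counter(degree.values())
    return [counts.get(k, 0) for k in range(max(counts) + 1)]
```

For the path graph 0-1-2, the two end nodes have degree 1 and the middle node has degree 2, so the histogram is [0, 2, 1].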
- ccsd.src.evaluation.stats.add_tensor(x: ndarray, y: ndarray) ndarray [source]#
Function for extending two tensors to the same dimension, so that they have the same support, and adding them together.
- Parameters:
x (np.ndarray) – vector 1
y (np.ndarray) – vector 2
- Returns:
sum of vector 1 and vector 2
- Return type:
np.ndarray
- ccsd.src.evaluation.stats.degree_stats(graph_ref_list: ~typing.List[~networkx.classes.graph.Graph], graph_pred_list: ~typing.List[~networkx.classes.graph.Graph], kernel: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray], float] = <function gaussian_emd>, is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False) float [source]#
Compute the MMD distance between the degree distributions of two unordered sets of graphs.
- Parameters:
graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.
- Returns:
MMD distance
- Return type:
float
- ccsd.src.evaluation.stats.spectral_worker(G: Graph) ndarray [source]#
Function for computing the spectral density of a graph.
- Parameters:
G (nx.Graph) – input graph
- Returns:
spectral density
- Return type:
np.ndarray
- ccsd.src.evaluation.stats.spectral_stats(graph_ref_list: ~typing.List[~networkx.classes.graph.Graph], graph_pred_list: ~typing.List[~networkx.classes.graph.Graph], kernel: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray], float] = <function gaussian_emd>, is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False) ndarray [source]#
Compute the MMD distance between the spectral densities of two unordered sets of graphs.
- Parameters:
graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.
- Returns:
spectral distance
- Return type:
np.ndarray
- ccsd.src.evaluation.stats.clustering_worker(param: Tuple[Graph, int]) ndarray [source]#
Function for computing the histogram of clustering coefficient of a graph.
- Parameters:
param (Tuple[nx.Graph, int]) – input graph and number of bins
- Returns:
histogram of clustering coefficient
- Return type:
np.ndarray
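To make the histogram concrete, the sketch below computes the local clustering coefficient of each node of an unweighted graph and bins the values over [0, 1]. It is a pure-Python illustration: the function name, the edge-list input, and the binning range are assumptions, and nodes with fewer than two neighbors are given a coefficient of 0.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def clustering_histogram(
    nodes: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable]],
    bins: int = 100,
) -> List[int]:
    """Histogram of local clustering coefficients over `bins` equal bins of [0, 1]."""
    neighbors: Dict[Hashable, Set[Hashable]] = {node: set() for node in nodes}
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    hist = [0] * bins
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            coeff = 0.0
        else:
            # Each edge between two neighbors is counted twice in this sum.
            twice_links = sum(len(neighbors[a] & nbrs) for a in nbrs)
            coeff = twice_links / (k * (k - 1))
        # A coefficient of exactly 1.0 falls in the last bin.
        hist[min(int(coeff * bins), bins - 1)] += 1
    return hist
```

In a triangle every node has coefficient 1, so all three nodes land in the last bin; in a path of three nodes they all land in the first.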
- ccsd.src.evaluation.stats.clustering_stats(graph_ref_list: ~typing.List[~networkx.classes.graph.Graph], graph_pred_list: ~typing.List[~networkx.classes.graph.Graph], kernel: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray], float] = <function gaussian_emd>, bins: int = 100, is_parallel: bool = True, max_workers: int | None = None, debug_mode: bool = False) ndarray [source]#
Compute the MMD distance between the clustering coefficients of two unordered sets of graphs. For unweighted graphs, the clustering coefficient of a node u is the fraction of possible triangles through that node that exist.
- Parameters:
graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian_emd.
bins (int, optional) – number of bins for the histogram. Defaults to 100.
is_parallel (bool, optional) – if True, do parallel computing. Defaults to True.
max_workers (Optional[int], optional) – number of workers (if is_parallel). Defaults to None.
debug_mode (bool, optional) – whether or not we print debug info for parallel computing. Defaults to False.
- Returns:
mmd distance
- Return type:
float
- ccsd.src.evaluation.stats.edge_list_reindexed(G: Graph) List[Tuple[int, int]] [source]#
Reindex the nodes of a graph to be contiguous integers starting from 0.
- Parameters:
G (nx.Graph) – input graph
- Returns:
list of edges (index_u, index_v)
- Return type:
List[Tuple[int, int]]
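Reindexing amounts to mapping each node to its position in the node ordering and rewriting the edges with those positions. A minimal sketch, assuming the node order of the input defines the new indices (the name reindex_edges and the list-based inputs are hypothetical):

```python
from typing import Hashable, List, Sequence, Tuple


def reindex_edges(
    nodes: Sequence[Hashable], edges: Sequence[Tuple[Hashable, Hashable]]
) -> List[Tuple[int, int]]:
    """Rewrite edges using contiguous integer node indices starting from 0."""
    # Position of each node in the given ordering becomes its new index.
    index = {node: i for i, node in enumerate(nodes)}
    return [(index[u], index[v]) for u, v in edges]
```

Contiguous indices are useful when an external tool, such as the orca executable mentioned below, expects nodes numbered from 0.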
- ccsd.src.evaluation.stats.orca(graph: Graph, orca_dir: str) ndarray [source]#
Compute the orbit counts of a graph using orca.
- Parameters:
graph (nx.Graph) – input graph
orca_dir (str) – path to the orca directory where the executables are located
- Returns:
orbit counts
- Return type:
np.ndarray
- ccsd.src.evaluation.stats.orbit_stats_all(graph_ref_list: ~typing.List[~networkx.classes.graph.Graph], graph_pred_list: ~typing.List[~networkx.classes.graph.Graph], kernel: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray], float] = <function gaussian>, folder: str = './') float [source]#
Compute the MMD distance between the orbits of two unordered sets of graphs.
- Parameters:
graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
kernel (Callable[[np.ndarray, np.ndarray], float], optional) – kernel function. Defaults to gaussian.
folder (str, optional) – path to the main folder where the ccsd/src/evaluation folders are to locate the orca executable. Defaults to “./”.
- Returns:
mmd distance
- Return type:
float
- ccsd.src.evaluation.stats.nspdk_stats(graph_ref_list: List[Graph], graph_pred_list: List[Graph]) float [source]#
Compute the MMD distance between the NSPDK kernel of two unordered sets of graphs.
Adapted from https://github.com/idea-iitd/graphgen/blob/master/metrics/stats.py
- Parameters:
graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
- Returns:
mmd distance
- Return type:
float
- ccsd.src.evaluation.stats.eval_graph_list(graph_ref_list: List[Graph], graph_pred_list: List[Graph], methods: List[str] | None = None, kernels: Dict[str, Callable[[ndarray, ndarray], float]] | None = None, folder: str = './') Dict[str, float] [source]#
Evaluate generated generic graphs against a reference set of graphs using a set of methods and their corresponding kernels.
- Parameters:
graph_ref_list (List[nx.Graph]) – reference list of networkx graphs to be evaluated
graph_pred_list (List[nx.Graph]) – target list of networkx graphs to be evaluated
methods (Optional[List[str]], optional) – methods to be evaluated. Defaults to None.
kernels (Optional[Dict[str, Callable[[np.ndarray, np.ndarray], float]]], optional) – kernels to be used for each method. Defaults to None.
folder (str, optional) – path to the main folder where the ccsd/src/evaluation folders are to locate the orca executable. Defaults to “./”.
- Returns:
dictionary mapping method names to their corresponding scores
- Return type:
Dict[str, float]
- ccsd.src.evaluation.stats.eval_torch_batch(ref_batch: Tensor, pred_batch: Tensor, methods: List[str] | None = None, folder: str = './') Dict[str, float] [source]#
Evaluate generated generic graphs against a reference set of graphs using a set of methods and their corresponding kernels, with the input graphs in torch.Tensor format (adjacency matrices).
- Parameters:
ref_batch (torch.Tensor) – reference batch of adjacency matrices
pred_batch (torch.Tensor) – target batch of adjacency matrices
methods (Optional[List[str]], optional) – methods to be evaluated. Defaults to None.
folder (str, optional) – path to the main folder where the ccsd/src/evaluation folders are to locate the orca executable. Defaults to “./”.
- Returns:
dictionary mapping method names to their corresponding scores
- Return type:
Dict[str, float]
data_generators.py: functions and GraphGenerator class for generating graphs and graph/combinatorial complex datasets with given properties. Run this script with the -h flag to see how to generate graph and combinatorial complex datasets. The arguments are (see ccsd/src/parsers/parser_generator.py for more details):
--data-dir: directory to save generated graphs. Default: “data”.
--dataset: name of dataset to generate. Default: “grid”. Choices are [“ego_small”, “community_small”, “ENZYMES”, “ENZYMES_small”, “grid”].
--is_cc: if you want to generate combinatorial complexes instead of graphs.
--folder: directory to save the results, load checkpoints, load config, etc. Default: “./”.
Adapted from Jo, J. & al (2022) for the graph generation part.
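The command-line arguments listed above could be declared with argparse roughly as follows. This is a sketch, not the content of parser_generator.py: the function name build_generator_parser is hypothetical, and treating the is_cc option as a boolean switch is an assumption.

```python
import argparse


def build_generator_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented dataset generator arguments."""
    parser = argparse.ArgumentParser(
        description="Generate graph or combinatorial complex datasets"
    )
    parser.add_argument("--data-dir", type=str, default="data",
                        help="directory to save generated graphs")
    parser.add_argument("--dataset", type=str, default="grid",
                        choices=["ego_small", "community_small", "ENZYMES",
                                 "ENZYMES_small", "grid"],
                        help="name of dataset to generate")
    # Assumption: the option is a switch that is False unless given.
    parser.add_argument("--is_cc", action="store_true",
                        help="generate combinatorial complexes instead of graphs")
    parser.add_argument("--folder", type=str, default="./",
                        help="directory for results, checkpoints, and config")
    return parser
```

Calling build_generator_parser().parse_args(["--dataset", "ENZYMES", "--is_cc"]) would select the ENZYMES dataset and request combinatorial complexes, leaving the other options at their defaults.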
- ccsd.data.data_generators.n_community(num_communities: int, max_nodes: int, p_inter: float = 0.05) Graph [source]#
Generate a graph with num_communities communities, each of size max_nodes and with inter-community edge probability p_inter. From Niu et al. (2020)
- Parameters:
num_communities (int) – number of communities
max_nodes (int) – maximum number of nodes in each community
p_inter (float, optional) – inter-community edge probability. Defaults to 0.05.
- Returns:
generated graph
- Return type:
nx.Graph
- class ccsd.data.data_generators.GraphGenerator(graph_type: str = 'grid', possible_params_dict: Dict[str, int | ndarray] | None = None, corrupt_func: Callable[[Any], Graph] | None = None)[source]#
Bases:
object
Graph generator class.
- __init__(graph_type: str = 'grid', possible_params_dict: Dict[str, int | ndarray] | None = None, corrupt_func: Callable[[Any], Graph] | None = None) None [source]#
Initialize graph generator.
- Parameters:
graph_type (str, optional) – type of graphs to generate. Defaults to “grid”.
possible_params_dict (Optional[Dict[str, Union[int, np.ndarray]]], optional) – set of parameters to randomly select. Defaults to None.
corrupt_func (Optional[Callable[[Any], nx.Graph]], optional) – optional function that generates a constant graph (for debugging for example). Defaults to None.
- ccsd.data.data_generators.gen_graph_list(graph_type: str = 'grid', possible_params_dict: Dict[str, int | ndarray] | None = None, corrupt_func: Callable[[Any], Graph] | None = None, length: int = 1024, save_dir: str | None = None, file_name: str | None = None, max_node: int | None = None, min_node: int | None = None) List[Graph] [source]#
Generate a list of synthetic graphs.
- Parameters:
graph_type (str, optional) – type of graphs to generate. Defaults to “grid”.
possible_params_dict (Optional[Dict[str, Union[int, np.ndarray]]], optional) – set of parameters to randomly select. Defaults to None.
corrupt_func (Optional[Callable[[Any], nx.Graph]], optional) – optional function that generates a constant graph (for debugging for example). Defaults to None.
length (int, optional) – number of graphs to generate. Defaults to 1024.
save_dir (Optional[str], optional) – where to save the generated list of graphs. Defaults to None.
file_name (Optional[str], optional) – name of the file. Defaults to None.
max_node (Optional[int], optional) – maximum number of nodes. Defaults to None.
min_node (Optional[int], optional) – minimum number of nodes. Defaults to None.
- Returns:
list of generated graphs
- Return type:
List[nx.Graph]
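One way to read the parameters above: each graph is built from values picked at random out of possible_params_dict, and graphs outside the [min_node, max_node] range are discarded until length graphs remain. The sketch below follows that reading for grid graphs only. It is an assumption-laden illustration, not the library's code: gen_graph_list_sketch, grid_graph, the seed and max_tries arguments, and the adjacency-dict representation are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Set, Tuple, Union

Node = Tuple[int, int]
Adjacency = Dict[Node, Set[Node]]


def grid_graph(m: int, n: int) -> Adjacency:
    """Build an m x n grid as an adjacency dict (stand-in for a networkx graph)."""
    adj: Adjacency = {(i, j): set() for i in range(m) for j in range(n)}
    for i, j in list(adj):
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if (ni, nj) in adj:
                adj[(i, j)].add((ni, nj))
                adj[(ni, nj)].add((i, j))
    return adj


def gen_graph_list_sketch(
    possible_params_dict: Dict[str, Union[int, Sequence[int]]],
    length: int = 8,
    min_node: Optional[int] = None,
    max_node: Optional[int] = None,
    seed: int = 0,
    max_tries: int = 10_000,
) -> List[Adjacency]:
    rng = random.Random(seed)
    graphs: List[Adjacency] = []
    for _ in range(max_tries):
        if len(graphs) == length:
            break
        # A fixed int is used as-is; a sequence means "pick one value at random".
        params = {
            name: value if isinstance(value, int) else rng.choice(list(value))
            for name, value in possible_params_dict.items()
        }
        graph = grid_graph(**params)
        # Reject graphs outside the requested size range.
        if min_node is not None and len(graph) < min_node:
            continue
        if max_node is not None and len(graph) > max_node:
            continue
        graphs.append(graph)
    if len(graphs) < length:
        raise RuntimeError("could not generate enough graphs within max_tries")
    return graphs


graphs = gen_graph_list_sketch({"m": [2, 3, 4], "n": [2, 3, 4]},
                               length=5, min_node=6, max_node=12)
print([len(g) for g in graphs])
```

The max_tries guard is there so that an unsatisfiable size range fails loudly instead of looping forever.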
- ccsd.data.data_generators.load_dataset(data_dir: str = 'data', file_name: str | None = None) List[Graph] | List[CombinatorialComplex] [source]#
Load an existing dataset as a list of graphs or list of combinatorial complexes from a file.
- Parameters:
data_dir (str, optional) – directory of the dataset. Defaults to “data”.
file_name (Optional[str], optional) – name of the file. Defaults to None.
- Returns:
list of graphs or list of combinatorial complexes
- Return type:
Union[List[nx.Graph], List[CombinatorialComplex]]
- ccsd.data.data_generators.graph_load_batch(min_num_nodes: int = 20, max_num_nodes: int = 1000, name: str = 'ENZYMES', node_attributes: bool = True, graph_labels: bool = True, folder: str = './') List[Graph] [source]#
Load a graph dataset, for ENZYMES, PROTEIN and DD.
- Parameters:
min_num_nodes (int, optional) – minimum number of nodes. Defaults to 20.
max_num_nodes (int, optional) – maximum number of nodes. Defaults to 1000.
name (str, optional) – name of the dataset to load. Defaults to “ENZYMES”.
node_attributes (bool, optional) – if True, also load the node attributes. Defaults to True.
graph_labels (bool, optional) – if True, also load the graph labels. Defaults to True.
folder (str, optional) – directory of the data/dataset/ folders. Defaults to “./”.
- Returns:
list of graphs
- Return type:
List[nx.Graph]
- ccsd.data.data_generators.parse_index_file(filename: str) List[int] [source]#
Parse an index file (list of integers).
- Parameters:
filename (str) – name of the file
- Returns:
list of indices as integers
- Return type:
List[int]
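A minimal sketch of such a parser, assuming the index file holds one integer per line (the file layout is an assumption, and parse_index_file_sketch is a hypothetical name rather than the documented function):

```python
import os
import tempfile
from typing import List


def parse_index_file_sketch(filename: str) -> List[int]:
    """Read one integer per line, skipping blank lines."""
    with open(filename) as f:
        return [int(line.strip()) for line in f if line.strip()]


# Write a tiny index file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.index")
    with open(path, "w") as f:
        f.write("3\n1\n2\n")
    indices = parse_index_file_sketch(path)

print(indices)  # [3, 1, 2]
```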
- ccsd.data.data_generators.graph_load(dataset: str = 'cora', folder: str = './') Tuple[spmatrix, List[Graph]] [source]#
Load the citation datasets: cora, citeseer or pubmed.
- Parameters:
dataset (str, optional) – name of the dataset to load. Defaults to “cora”.
folder (str, optional) – directory of the data/dataset/ folders. Defaults to “./”.
- Returns:
tuple of features and the graph
- Return type:
Tuple[sp.spmatrix, List[nx.Graph]]
- ccsd.data.data_generators.citeseer_ego(radius: int = 3, node_min: int = 50, node_max: int = 400, folder: str = './') List[Graph] [source]#
Load the citeseer dataset, keep the largest connected component, and extract the ego graphs (graphs of nodes within a certain radius) with a number of nodes within our range.
- Parameters:
radius (int, optional) – radius. Defaults to 3.
node_min (int, optional) – minimum number of nodes in our dataset. Defaults to 50.
node_max (int, optional) – maximum number of nodes in our dataset. Defaults to 400.
folder (str, optional) – directory of the data/dataset/ folders. Defaults to “./”.
- Returns:
list of (ego) graphs
- Return type:
List[nx.Graph]
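The ego-graph extraction described above can be sketched with a breadth-first search bounded by the radius. The code below is an illustration using plain adjacency dicts rather than networkx, with hypothetical names (ego_nodes, ego_graphs_in_range). It omits the dataset loading and largest-connected-component steps, and it assumes the node-count bounds are inclusive.

```python
from collections import deque
from typing import Dict, List, Set

Adjacency = Dict[int, Set[int]]


def ego_nodes(adj: Adjacency, center: int, radius: int) -> Set[int]:
    """Nodes within `radius` hops of `center`, found by breadth-first search."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def ego_graphs_in_range(adj: Adjacency, radius: int,
                        node_min: int, node_max: int) -> List[Adjacency]:
    """Keep the ego graph of every node whose size lies in [node_min, node_max]."""
    result = []
    for center in adj:
        nodes = ego_nodes(adj, center, radius)
        if node_min <= len(nodes) <= node_max:
            # Induced subgraph: keep only edges between retained nodes.
            result.append({u: adj[u] & nodes for u in nodes})
    return result


# Path graph 0-1-2-3-4-5: end nodes have 2-node ego graphs at radius 1, inner nodes 3.
path = {i: set() for i in range(6)}
for i in range(5):
    path[i].add(i + 1)
    path[i + 1].add(i)

egos = ego_graphs_in_range(path, radius=1, node_min=3, node_max=3)
print(len(egos))  # 4
```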
- ccsd.data.data_generators.save_dataset(data_dir: str, obj: List[Graph] | List[CombinatorialComplex], save_name: str, save_txt: bool = True) None [source]#
Save the dataset (objects) in the specified directory.
- Parameters:
data_dir (str) – directory to save the dataset
obj (Union[List[nx.Graph], List[CombinatorialComplex]]) – list of objects to save
save_name (str) – name of the dataset
save_txt (bool, optional) – whether to save a txt file with the name and the number of objects (or size of DataLoader). Defaults to True.
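A save/load round trip in the spirit of save_dataset and load_dataset might look like the sketch below. The serialization format (pickle), the ".pkl" and ".txt" file names and the function names are assumptions made for illustration. The documentation only states that a dataset is saved to a directory, optionally with a txt file holding its name and size.

```python
import os
import pickle
import tempfile
from typing import Any, List


def save_dataset_sketch(data_dir: str, obj: List[Any], save_name: str,
                        save_txt: bool = True) -> None:
    os.makedirs(data_dir, exist_ok=True)
    # Assumption: the object list is pickled to <save_name>.pkl.
    with open(os.path.join(data_dir, save_name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)
    if save_txt:
        # Companion text file with the dataset name and the number of objects.
        with open(os.path.join(data_dir, save_name + ".txt"), "w") as f:
            f.write(f"{save_name}\n{len(obj)}\n")


def load_dataset_sketch(data_dir: str, file_name: str) -> List[Any]:
    with open(os.path.join(data_dir, file_name + ".pkl"), "rb") as f:
        return pickle.load(f)


# Toy "graphs" stored as edge lists, saved and reloaded from a temporary directory.
toy_graphs = [[(0, 1), (1, 2)], [(0, 1)]]
with tempfile.TemporaryDirectory() as tmp_dir:
    save_dataset_sketch(tmp_dir, toy_graphs, "toy")
    reloaded = load_dataset_sketch(tmp_dir, "toy")
    with open(os.path.join(tmp_dir, "toy.txt")) as f:
        summary = f.read().splitlines()

print(reloaded == toy_graphs, summary)
```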
- ccsd.data.data_generators.generate_dataset(args: Namespace) None [source]#
Generate a graph/combinatorial complex dataset and save it in the specified directory.
- Parameters:
args (argparse.Namespace) – arguments
- Raises:
NotImplementedError – raise an error if the specified dataset is not implemented
preprocess_for_nspdk.py: preprocess the test molecules for NSPDK.
Adapted from Jo, J. & al (2022)
- ccsd.data.preprocess_for_nspdk.preprocess_nspdk(args: Namespace, print_elapsed_time: bool = True) None [source]#
Preprocess the test molecules for NSPDK
- Parameters:
args (argparse.Namespace) – arguments
print_elapsed_time (bool, optional) – if True, print the elapsed time to preprocess the test molecules. Defaults to True.
- Raises:
ValueError – raise an error if the dataset is not supported. Supported molecule datasets: QM9, ZINC250k
preprocess.py: preprocess the molecule datasets (not for NSPDK).
Adapted from Jo, J. & al (2022)
- ccsd.data.preprocess.preprocess(args: Namespace, print_elapsed_time: bool = True) None [source]#
Preprocess the molecules (not for NSPDK)
Adapted from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
- Parameters:
args (argparse.Namespace) – arguments
print_elapsed_time (bool, optional) – if True, print the elapsed time to preprocess the molecules. Defaults to True.
- Raises:
ValueError – raise an error if the dataset is not supported. Supported molecule datasets: QM9, ZINC250k
data_frame_parser.py: preprocess the molecule datasets (not for NSPDK). Only used in data/preprocess.py.
Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
Adapted from chainer_chemistry/dataset/parsers/data_frame_parser.py
Code from Jo, J. & al (2022)
Left untouched.
- class ccsd.data.utils.data_frame_parser.DataFrameParser(preprocessor: GGNNPreprocessor, labels: List[str] | None = None, smiles_col: str = 'smiles', postprocess_label: Callable[[List[str]], List[str]] | None = None, postprocess_fn: Callable[[List[ndarray] | Tuple[ndarray]], List[ndarray] | Tuple[ndarray]] | None = None, logger: Logger | None = None)[source]#
Bases:
object
DataFrame parser class. Only used in data/preprocess.py.
Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
Adapted from chainer_chemistry/dataset/parsers/data_frame_parser.py
- __init__(preprocessor: GGNNPreprocessor, labels: List[str] | None = None, smiles_col: str = 'smiles', postprocess_label: Callable[[List[str]], List[str]] | None = None, postprocess_fn: Callable[[List[ndarray] | Tuple[ndarray]], List[ndarray] | Tuple[ndarray]] | None = None, logger: Logger | None = None)[source]#
numpytupledataset.py: NumpyTupleDataset class. Only used in data/preprocess.py.
Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
Code from Jo, J. & al (2022)
Left untouched.
- class ccsd.data.utils.numpytupledataset.NumpyTupleDataset(datasets, transform=None)[source]#
Bases:
Dataset
NumpyTupleDataset class. Only used in data/preprocess.py.
Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
smile_to_graph.py: GGNNPreprocessor class. Only used in data/preprocess.py.
Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
Adapted from chainer_chemistry/dataset/preprocessors/common
Code from Jo, J. & al (2022)
Left untouched.
- class ccsd.data.utils.smile_to_graph.GGNNPreprocessor(max_atoms=-1, out_size=-1, add_Hs=False, kekulize=True)[source]#
Bases:
object
GGNN Preprocessor. Only used in data/preprocess.py.
Original code from MoFlow (under MIT License) https://github.com/calvin-zcx/moflow
Adapted from chainer_chemistry/dataset/preprocessors/common