Module bussilab.clustering

Module with some clustering tools

Functions

def daura(adj, weights=None, *, min_size=0, max_clusters=None)
Source code
def daura(adj,weights=None,*,min_size=0,max_clusters=None):
    """Clustering algorithm introduced in Daura et al, Angew. Chemie (1999).

       WARNING: the version with weights received an important fix in v0.0.39

       Parameters
       ----------

       adj : array_like, square matrix

           adj[i,j] contains 1 (or True) if frames i and j are adjacent, 0 (or False) otherwise.

       weights : array_like, optional

           weights[i] contains the weight of the i-th frame.

       min_size : number

           Minimum cluster size. Clusters smaller than this size are not reported.
           When using weights, the cluster size is defined as the sum of the weights of
           the members of the cluster.

       max_clusters : int

           Maximum number of clusters.

       Example
       -------

       ```
       import scipy.spatial.distance as distance
       dist=distance.squareform(distance.pdist(trajectory))
       clustering.daura(dist<0.7)
       ```
    """
    adj=adj.copy()  # take a copy (adj is modified while clustering)
    indexes=np.arange(len(adj))
    clusters=[]
    ww=[]
    if weights is not None:
        weights = weights.copy()  # take a copy (weights is modified while clustering)
    while len(indexes)>0:
        if weights is not None:
            d=np.sum(adj*weights[:,np.newaxis],axis=0)
        else:
            d=np.sum(adj,axis=0)
        n=np.argmax(d)
        if d[n]<min_size:
            break
        ww.append(d[n])
        ii=np.where(adj[n]>0)[0]
        clusters.append(indexes[ii])
        if max_clusters:
            if len(clusters) >= max_clusters:
                break
        adj=np.delete(adj,ii,axis=0)
        adj=np.delete(adj,ii,axis=1)
        if weights is not None:
            weights=np.delete(weights,ii)
        indexes=np.delete(indexes,ii)
    return ClusteringResult(method="daura",clusters=clusters, weights=ww)

Clustering algorithm introduced in Daura et al, Angew. Chemie (1999).

WARNING: the version with weights received an important fix in v0.0.39

Parameters

adj : array_like, square matrix
    adj[i,j] contains 1 (or True) if frames i and j are adjacent, 0 (or False) otherwise.
weights : array_like, optional
    weights[i] contains the weight of the i-th frame.
min_size : number
    Minimum cluster size. Clusters smaller than this size are not reported. When using weights,
    the cluster size is defined as the sum of the weights of the members of the cluster.
max_clusters : int
    Maximum number of clusters.

Example

import scipy.spatial.distance as distance
dist=distance.squareform(distance.pdist(trajectory))
clustering.daura(dist<0.7)
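
To make the snippet above concrete, here is a minimal, self-contained sketch of a daura() run; the random 2D trajectory, the 0.3 cutoff, and min_size=5 are placeholder choices for illustration, not values taken from the package documentation.

```
import numpy as np
import scipy.spatial.distance as distance
from bussilab import clustering

# placeholder data: 100 "frames" described by 2 coordinates each
rng = np.random.default_rng(0)
trajectory = rng.random((100, 2))

# boolean adjacency matrix: frames closer than 0.3 are considered neighbors
dist = distance.squareform(distance.pdist(trajectory))
result = clustering.daura(dist < 0.3, min_size=5)

# result.clusters lists the frame indexes of each cluster,
# result.weights contains the corresponding cluster sizes
for members, size in zip(result.clusters, result.weights):
    print(size, members)
```
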
def max_clique(adj, weights=None, *, min_size=0, max_clusters=None, use_networkit=False)
Source code
def max_clique(adj,weights=None,*,min_size=0,max_clusters=None,use_networkit=False):
    """Clustering algorithm used in [Reisser et al, NAR (2020)](https://doi.org/10.1093/nar/gkz1184).

       Parameters
       ----------

       adj : array_like, square matrix

           adj[i,j] contains 1 (or True) if frames i and j are adjacent, 0 (or False) otherwise.

       weights : array_like, optional

           weights[i] contains the weight of the i-th frame.

       min_size : number

           Minimum cluster size. Clusters smaller than this size are not reported.
           When using weights, the cluster size is defined as the sum of the weights of
           the members of the cluster.

       max_clusters : int

           Maximum number of clusters.

       use_networkit : bool, optional

           if True, use a networkit implementation that seems to be faster.
           It requires the python package networkit to be installed in advance.

       Example
       -------

       ```
       import scipy.spatial.distance as distance
       dist=distance.squareform(distance.pdist(trajectory))
       clustering.max_clique(dist<0.7)
       ```
    """
    # weights: optional weights
    # if adj is a graph, it will be copied
    cliques=[]
    ww=[]
    if use_networkit:
        import networkit # pylint: disable=import-error
        graph=networkit.nxadapter.nx2nk(networkx.Graph(adj))
        graph.removeSelfLoops()
        while graph.numberOfNodes()>0:
            if weights is None:
                cl=networkit.clique.MaximalCliques(graph,maximumOnly=True)
                cl.run()
                maxi=cl.getCliques()[0]
                maxw=len(maxi)
            else:
                cl=networkit.clique.MaximalCliques(graph)
                cl.run()
                maxw=0.0
                for i in cl.getCliques():
                    w=np.sum(weights[i])
                    if w > maxw:
                        maxi=i
                        maxw=w
            if maxw<min_size:
                break
            cliques.append(maxi)
            ww.append(maxw)
            if max_clusters is not None:
                if len(cliques)>=max_clusters:
                    break
            for i in maxi:
                graph.removeNode(i)
    else:
        graph=networkx.Graph(adj)
        while graph.number_of_nodes()>0:
            maxw=0.0
            for i in networkx.algorithms.clique.find_cliques(graph):
                if weights is not None:
                    w=np.sum(weights[i])
                else:
                    w=len(i)
                if w > maxw:
                    maxi=i
                    maxw=w
            if maxw<min_size:
                break
            cliques.append(maxi)
            ww.append(maxw)
            if max_clusters is not None:
                if len(cliques)>=max_clusters:
                    break
            graph.remove_nodes_from(maxi)
    return ClusteringResult(method="max_clique",clusters=cliques, weights=ww)

Clustering algorithm used in Reisser et al, NAR (2020) (https://doi.org/10.1093/nar/gkz1184).

Parameters

adj : array_like, square matrix
    adj[i,j] contains 1 (or True) if frames i and j are adjacent, 0 (or False) otherwise.
weights : array_like, optional
    weights[i] contains the weight of the i-th frame.
min_size : number
    Minimum cluster size. Clusters smaller than this size are not reported. When using weights,
    the cluster size is defined as the sum of the weights of the members of the cluster.
max_clusters : int
    Maximum number of clusters.
use_networkit : bool, optional
    if True, use a networkit implementation that seems to be faster. It requires the python
    package networkit to be installed in advance.

Example

import scipy.spatial.distance as distance
dist=distance.squareform(distance.pdist(trajectory))
clustering.max_clique(dist<0.7)
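
As a hedged illustration of the weighted variant, the following sketch clusters a placeholder trajectory with per-frame weights (standing in, e.g., for reweighting factors); the random data, the 0.5 cutoff, and max_clusters=3 are assumptions made only for this example.

```
import numpy as np
import scipy.spatial.distance as distance
from bussilab import clustering

# placeholder data and per-frame weights (both arbitrary for this example)
rng = np.random.default_rng(0)
trajectory = rng.random((50, 3))
weights = rng.random(50)

dist = distance.squareform(distance.pdist(trajectory))
result = clustering.max_clique(dist < 0.5, weights, max_clusters=3)

# with weights, each reported cluster weight is the sum of its members' weights
print(result.method)   # "max_clique"
print(result.weights)
```
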
def qt(distances, cutoff, weights=None, *, min_size=0, max_clusters=None)
Source code
def qt(distances,cutoff,weights=None,*,min_size=0,max_clusters=None):
    """Quality threshold clustering.

       The method is explained in the [original paper](https://doi.org/10.1101/gr.9.11.1106).
       The implementation has been adapted from [this one](https://github.com/rglez/QT),
       which is also released under a GPL licence.
       Thus, if you use this algorithm please cite [this article](https://doi.org/10.1021/acs.jcim.9b00558),
       which also discusses the important differences between this algorithm and the Daura et al algorithm
       in the context of analysing molecular dynamics simulations.
       Additionally, mention which exact version of the bussilab package you used.

       The implementation included here, at variance with the original one, allows passing weights
       and can be used with arbitrary metrics. In addition, it also reports clusters of size 1
       (unless one passes min_size>1). The code is optimized when compared with the original one,
       and speed can be further increased by passing `np.array(distances,dtype='float32')` to distances.

       WARNING: the version with weights received an important fix in v0.0.39

       As of version v0.0.40, clusters with the same number of members are prioritized based on their
       diameter (smaller diameter gets the priority). This is expected to make non-weighted calculations
       more reproducible, but might change some results when compared with previous versions.
       In addition, when growing a single candidate cluster, if two points are at the same
       distance from the growing cluster, the one with higher weight is chosen. With these
       priorities, any choice where either (a) weights are float or (b) distances are float
       should lead to deterministic clustering irrespective of roundoff errors, assuming
       floats are never identical.


       Parameters
       ----------

       distances : array_like, square matrix

           distances[i,j] contains the distance between frames i and j.

       cutoff : number

           maximum distance for two frames to be included in the same cluster

       weights : array_like, optional

           weights[i] contains the weight of the i-th frame.

       min_size : number

           Minimum cluster size. Clusters smaller than this size are not reported.
           When using weights, the cluster size is defined as the sum of the weights of
           the members of the cluster.

       max_clusters : int

           Maximum number of clusters.

       Example
       -------

       ```
       import scipy.spatial.distance as distance
       dist=distance.squareform(distance.pdist(trajectory))
       clustering.qt(dist,0.7)

       clustering.qt(np.array(dist,dtype='float32'),0.7) # should be slightly faster
       ```
    """
    clusters=[]
    ww=[]
    N=len(distances)
    if weights is None:
        weights=np.ones(N,dtype="int")
    else:
        weights=weights.copy()
    distances=distances.copy()
    np.fill_diagonal(distances,0.0)
    indexes=np.arange(N)

    ncluster=0
    while len(weights)>0:
        degrees=np.sum((distances < cutoff)*weights[:,np.newaxis],axis=0)
        sorted_indexes=np.argsort(-degrees)

        cluster_size=0
        diameter=0.0
        cluster=[]
        i_degrees=0
        for i_degrees in range(len(sorted_indexes)):

            if degrees[sorted_indexes[i_degrees]] < cluster_size: # optimization
                break

            next_=sorted_indexes[i_degrees]
            precluster = [next_]
            new_cluster_diameter=0.0
            candidates=_qt_outer(distances,next_,cutoff)
            dist_from_cluster=distances[next_][candidates]
            while True:
                (next_,minval)=_qt_inner(distances,dist_from_cluster,candidates,cutoff,weights)
                if(next_<0):
                    break
                precluster.append(next_)
                if minval > new_cluster_diameter:
                    new_cluster_diameter = minval
            new_cluster_size=np.sum(weights[precluster])
            # pick largest cluster (sum of weights)
            # if same size, pick the most compact one (smaller diameter)
            if new_cluster_size > cluster_size or (new_cluster_size == cluster_size and new_cluster_diameter < diameter):
                cluster_size = new_cluster_size
                cluster = precluster
                diameter = new_cluster_diameter

        if cluster_size < min_size:
            break

        ncluster += 1

        clusters.append(indexes[cluster])
        ww.append(cluster_size)

        if max_clusters is not None:
            if ncluster >= max_clusters:
                break

        # see https://stackoverflow.com/questions/63563151/how-to-remove-columns-and-rows-at-certain-indexes-in-numpy-at-the-same-time
        idx = list(set(range(len(distances))).difference(cluster))
        distances=distances[np.ix_(idx,idx)]
        weights=weights[idx]
        indexes=indexes[idx]
    return ClusteringResult(method="qt",clusters=clusters, weights=ww)

Quality threshold clustering.

The method is explained in the original paper (https://doi.org/10.1101/gr.9.11.1106). The implementation has been adapted from the one at https://github.com/rglez/QT, which is also released under a GPL licence. Thus, if you use this algorithm please cite the article at https://doi.org/10.1021/acs.jcim.9b00558, which also discusses the important differences between this algorithm and the Daura et al algorithm in the context of analysing molecular dynamics simulations. Additionally, mention which exact version of the bussilab package you used.

The implementation included here, at variance with the original one, allows passing weights and can be used with arbitrary metrics. In addition, it also reports clusters of size 1 (unless one passes min_size>1). The code is optimized when compared with the original one, and speed can be further increased by passing np.array(distances,dtype='float32') to distances.

WARNING: the version with weights received an important fix in v0.0.39

As of version v0.0.40, clusters with the same number of members are prioritized based on their diameter (smaller diameter gets the priority). This is expected to make non-weighted calculations more reproducible, but might change some results when compared with previous versions. In addition, when growing a single candidate cluster, if two points are at the same distance from the growing cluster, the one with higher weight is chosen. With these priorities, any choice where either (a) weights are float or (b) distances are float should lead to deterministic clustering irrespective of roundoff errors, assuming floats are never identical.

Parameters

distances : array_like, square matrix
    distances[i,j] contains the distance between frames i and j.
cutoff : number
    maximum distance for two frames to be included in the same cluster.
weights : array_like, optional
    weights[i] contains the weight of the i-th frame.
min_size : number
    Minimum cluster size. Clusters smaller than this size are not reported. When using weights,
    the cluster size is defined as the sum of the weights of the members of the cluster.
max_clusters : int
    Maximum number of clusters.

Example

import scipy.spatial.distance as distance
dist=distance.squareform(distance.pdist(trajectory))
clustering.qt(dist,0.7)

clustering.qt(np.array(dist,dtype='float32'),0.7) # should be slightly faster
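
As a hedged, self-contained variant of the snippet above, the sketch below runs qt() on a reduced-precision distance matrix; the random trajectory, the 0.2 cutoff, and the min_size=5 and max_clusters=10 settings are placeholders chosen for illustration.

```
import numpy as np
import scipy.spatial.distance as distance
from bussilab import clustering

# placeholder data: 200 "frames" described by 2 coordinates each
rng = np.random.default_rng(0)
trajectory = rng.random((200, 2))

dist = distance.squareform(distance.pdist(trajectory))
dist32 = np.array(dist, dtype='float32')  # float32 distances can speed up the clustering

result = clustering.qt(dist32, 0.2, min_size=5, max_clusters=10)
for members, size in zip(result.clusters, result.weights):
    print(size, members)
```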

Classes

class ClusteringResult (*, method: str, clusters: list, weights: list | None)
Source code
class ClusteringResult(Result):
    """Result of a `bussilab.clustering` calculation."""
    def __init__(self,
                *,
                method: str,
                clusters: list,
                weights: Optional[list]
                ):
        self.method = method
        """`str` containing the name of the method used."""
        self.clusters = clusters
        """`list of lists` containing the members of each cluster."""
        self.weights = weights
        """`list` containing the weights of the clusters."""

Result of a bussilab.clustering calculation.

Ancestors

Result

Instance variables

var clusters

list of lists containing the members of each cluster.

var method

str containing the name of the method used.

var weights

list containing the weights of the clusters.
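
To show how a ClusteringResult can be consumed downstream, here is an illustrative sketch that converts result.clusters into a per-frame label array; the cluster_labels helper is hypothetical and not part of the package, and frames left unassigned (for instance because of min_size or max_clusters) are labeled -1.

```
import numpy as np

def cluster_labels(result, n_frames):
    # hypothetical helper: -1 marks frames not assigned to any reported cluster
    labels = np.full(n_frames, -1, dtype=int)
    for k, members in enumerate(result.clusters):
        labels[np.asarray(members)] = k
    return labels

# e.g.: labels = cluster_labels(clustering.daura(dist < 0.3), len(dist))
```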