
API Reference

from multicons import MultiCons

multicons.MultiCons

Bases: BaseEstimator

MultiCons (Multiple Consensus) algorithm.

MultiCons is a consensus clustering method that uses the frequent closed itemset mining technique to find similarities in the base clustering solutions.

Parameters:

consensus_function (str or function)

Specifies a consensus function to generate clusters from the available instance sets at each iteration. Currently the following consensus functions are available:

  • consensus_function_10: The simplest approach, used by default. Removes instance sets with the inclusion property and groups together intersecting instance sets.
  • consensus_function_12: Similar to consensus_function_10. Uses a merging_threshold to decide whether to merge the intersecting instance sets or to split them (removing the intersection from the bigger set).
  • consensus_function_13: A stricter version of consensus_function_12. Compares the maximal average intersection ratio with the merging_threshold to decide whether to merge the intersecting instance sets or to split them.
  • consensus_function_14: A version of consensus_function_13 that first searches for the maximal intersection ratio among all possible intersections prior to applying a merge or split decision.
  • consensus_function_15: A graph-based approach. Builds an adjacency matrix from the intersection matrix using the merging_threshold. Merges all connected nodes and then splits overlapping instance sets.

To use a custom consensus function, pass a function instead of a string value. The function should accept two arguments, a list of sets and an optional merging_threshold, and should update the list of sets in place. See consensus_function_10 for an example, and the sketch below.

Default: 'consensus_function_10'
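
As a sketch of the expected interface, here is a hypothetical custom consensus function (not part of multicons) that merges any two intersecting instance sets in place, a simplified variant of consensus_function_10:

from multicons import MultiCons

def merge_intersecting(bi_clust, merging_threshold=None):
    """Hypothetical consensus function: merges intersecting sets in place."""
    merged = True
    while merged:
        merged = False
        for i in range(len(bi_clust) - 1):
            for j in range(i + 1, len(bi_clust)):
                if bi_clust[i] & bi_clust[j]:
                    # Merge the two sets and restart the scan.
                    bi_clust[i] |= bi_clust[j]
                    del bi_clust[j]
                    merged = True
                    break
            if merged:
                break

consensus = MultiCons(consensus_function=merge_intersecting)
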
similarity_measure (str or function)

Specifies how to compute the similarity between two clustering solutions. Currently the following similarity measures are available:

  • JaccardSimilarity: Indicates that the pair-wise Jaccard similarity measure should be used. This measure is computed with the formula yy / (yy + ny + yn), where yy is the number of times two points belong to the same cluster in both clusterings, and ny and yn are the numbers of times two points belong to the same cluster in one clustering but not in the other.
  • JaccardIndex: Indicates that the set-wise Jaccard similarity coefficient should be used. This measure is computed with the formula |X ∩ Y| / (|X| + |Y| - |X ∩ Y|), where X and Y are the clustering solutions.

To use a custom similarity measure, pass a function instead of a string value. The function should accept two arguments, two numeric numpy arrays representing the two clustering solutions, and should return a numeric score indicating how similar the clustering solutions are (see the sketch below).

Default: 'JaccardSimilarity'
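
For example, a custom similarity measure could wrap scikit-learn's adjusted Rand index (a sketch; scikit-learn is an assumption here, not a multicons requirement):

from sklearn.metrics import adjusted_rand_score

from multicons import MultiCons

def ari_similarity(left, right):
    """Hypothetical similarity measure: adjusted Rand index of two labelings."""
    return adjusted_rand_score(left, right)

consensus = MultiCons(similarity_measure=ari_similarity)
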
merging_threshold (float)

Specifies the minimum required ratio (the size of the intersection between two sets divided by the size of the smaller set) for the consensus_function to merge two sets. Applies to consensus_function_12, consensus_function_13, consensus_function_14 and consensus_function_15.

Default: 0.5
optimize_label_names (bool)

Indicates whether the label assignment of the clustering partitions should be optimized to maximize the similarity measure score (using the Hungarian algorithm). Set to False by default, as the default similarity_measure ('JaccardSimilarity') does not depend on which labels are assigned to which cluster.

Default: False

Attributes:

consensus_function (function)

The consensus function used to generate clusters from the available instance sets at each iteration.

consensus_vectors (list of numpy arrays)

The list of proposed consensus clustering candidates.

decision_thresholds (list of int)

The list of decision threshold values, corresponding to the consensus vectors (in the same order). A decision threshold indicates how many base clustering solutions (at least) were required to agree to form sub-clusters.

ensemble_similarity (list of float)

The list of ensemble similarity measures corresponding to the consensus vectors.

labels_ (numpy array)

The recommended consensus candidate.

optimize_label_names (bool)

Indicates whether the label assignment of clustering partitions should be optimized or not.

recommended (int)

The index of the recommended consensus vector.

similarity_measure (function)

The similarity function used to measure the similarity between two clustering solutions.

stability (list of int)

The list of stability values, corresponding to the consensus vectors (in the same order). A stability value indicates how many times the same consensus is generated for different decision thresholds.

tree_quality (float)

The tree quality measure (between 0 and 1). Higher is better.

Raises:

ValueError

If consensus_function or similarity_measure is not a function and not one of the allowed string values (mentioned above).
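
A minimal usage sketch (the toy base clusterings below are made up for illustration):

import numpy as np

from multicons import MultiCons

# Three hypothetical base clusterings of six instances.
base_clusterings = [
    np.array([0, 0, 0, 1, 1, 1]),
    np.array([0, 0, 1, 1, 2, 2]),
    np.array([0, 0, 0, 1, 1, 2]),
]
consensus = MultiCons().fit(base_clusterings)
print(consensus.labels_)       # the recommended consensus partition
print(consensus.recommended)   # index of the recommended consensus vector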

Source code in multicons/core.py
class MultiCons(BaseEstimator):
    """MultiCons (Multiple Consensus) algorithm.

    MultiCons is a consensus clustering method that uses the frequent closed itemset
    mining technique to find similarities in the base clustering solutions.

    Args:
        consensus_function (str or function): Specifies a
            consensus function to generate clusters from the available instance sets at
            each iteration.
            Currently the following consensus functions are available:

            - `consensus_function_10`: The simplest approach, *used by default*.
                Removes instance sets with the inclusion property and groups together
                intersecting instance sets.
            - `consensus_function_12`: Similar to `consensus_function_10`. Uses a
                `merging_threshold` to decide whether to merge the intersecting instance
                sets or to split them (removing the intersection from the
                bigger set).
            - `consensus_function_13`: A stricter version of `consensus_function_12`.
                Compares the maximal average intersection ratio with the
                `merging_threshold` to decide whether to merge the intersecting instance
                sets or to split them.
            - `consensus_function_14`: A version of `consensus_function_13` that first
                searches for the maximal intersection ratio among all possible
                intersections prior to applying a merge or split decision.
            - `consensus_function_15`: A graph-based approach. Builds an adjacency
                matrix from the intersection matrix using the `merging_threshold`.
                Merges all connected nodes and then splits overlapping instance sets.

            To use another consensus function it is possible to pass a function instead
            of a string value. The function should accept two arguments - a list of sets
            and an optional `merging_threshold`, and should update the list of sets in
            place. See `consensus_function_10` for an example.
        similarity_measure (str or function): Specifies how to compute the similarity
            between two clustering solutions.
            Currently the following similarity measures are available:

            - `JaccardSimilarity`: Indicates that the pair-wise Jaccard similarity
                measure should be used. This measure is computed with the formula
                `yy / (yy + ny + yn)`
                Where: `yy` is the number of times two points belong to the same
                cluster in both clusterings and `ny` and `yn` are the numbers of
                times two points belong to the same cluster in one clustering but
                not in the other.
            - `JaccardIndex`: Indicates that the set-wise Jaccard similarity
                coefficient should be used. This measure is computed with the
                formula `|X ∩ Y| / (|X| + |Y| - |X ∩ Y|)`
                Where: X and Y are the clustering solutions.

            To use another similarity measure it is possible to pass a function instead
            of a string value. The function should accept two arguments - two numeric
            numpy arrays (representing the two clustering solutions) and should return a
            numeric score (indicating how similar the clustering solutions are).
        merging_threshold (float): Specifies the minimum required ratio (calculated from
            the intersection between two sets over the size of the smaller set) for
            which the `consensus_function` should merge two sets. Applies to
            `consensus_function_12`, `consensus_function_13`, `consensus_function_14`
            and `consensus_function_15`.
        optimize_label_names (bool): Indicates whether the label assignment of
            the clustering partitions should be optimized to maximize the similarity
            measure score (using the Hungarian algorithm). By default set to `False` as
            the default `similarity_measure` score ("JaccardSimilarity") does not depend
            on which labels are assigned to which cluster.

    Attributes:
        consensus_function (function): The consensus function used to generate clusters
            from the available instance sets at each iteration.
        consensus_vectors (list of numpy arrays): The list of proposed consensus
            clustering candidates.
        decision_thresholds (list of int): The list of decision threshold values,
            corresponding to the consensus vectors (in the same order). A decision
            threshold indicates how many base clustering solutions were required to
            agree (at least) to form sub-clusters.
        ensemble_similarity (list of float): The list of ensemble similarity measures
            corresponding to the consensus vectors.
        labels_ (numpy array): The recommended consensus candidate.
        optimize_label_names (bool): Indicates whether the label assignment of
            clustering partitions should be optimized or not.
        recommended (int): The index of the recommended consensus vector.
        similarity_measure (function): The similarity function used to measure the
            similarity between two clustering solutions.
        stability (list of int): The list of stability values, corresponding to the
            consensus vectors (in the same order). A stability value indicates how many
            times the same consensus is generated for different decision thresholds.
        tree_quality (float): The tree quality measure (between 0 and 1). Higher is
            better.

    Raises:
        ValueError: If `consensus_function` or `similarity_measure` is not a function
            and not one of the allowed string values (mentioned above).
    """

    # pylint: disable=too-many-instance-attributes

    _consensus_functions = {
        "consensus_function_10": consensus_function_10,
        "consensus_function_12": consensus_function_12,
        "consensus_function_13": consensus_function_13,
        "consensus_function_14": consensus_function_14,
        "consensus_function_15": consensus_function_15,
    }
    _similarity_measures = {
        "JaccardSimilarity": jaccard_similarity,
        "JaccardIndex": jaccard_index,
    }

    def __init__(
        self,
        consensus_function: Union[
            Literal[
                "consensus_function_10",
                "consensus_function_12",
                "consensus_function_13",
                "consensus_function_14",
                "consensus_function_15",
            ],
            Callable[[list[np.ndarray]], None],
        ] = "consensus_function_10",
        merging_threshold: float = 0.5,
        similarity_measure: Union[
            Literal["JaccardSimilarity", "JaccardIndex"],
            Callable[[np.ndarray, np.ndarray], float],
        ] = "JaccardSimilarity",
        optimize_label_names: bool = False,
    ):
        """Initializes MultiCons."""

        self.consensus_function = self._parse_argument(
            "consensus_function", self._consensus_functions, consensus_function
        )
        self.similarity_measure = self._parse_argument(
            "similarity_measure", self._similarity_measures, similarity_measure
        )
        self.merging_threshold = merging_threshold
        self.optimize_label_names = optimize_label_names
        self.consensus_vectors = None
        self.decision_thresholds = None
        self.ensemble_similarity = None
        self.labels_ = None
        self.recommended = None
        self.stability = None
        self.tree_quality = None

    def fit(self, X, y=None, sample_weight=None):  # pylint: disable=unused-argument
        """Computes the MultiCons consensus.

        Args:
            X (list of numeric numpy arrays or a pandas DataFrame): Either a list of
                arrays where each array represents one clustering solution
                (base clusterings), or a DataFrame representing a binary membership
                matrix.
            y (any): Ignored. Not used, present here for API consistency by convention.
            sample_weight (any): Ignored. Not used, present here for API consistency by
                convention.

        Returns:
            self (MultiCons): Returns the (fitted) instance itself.
        """

        if isinstance(X, pd.DataFrame):
            membership_matrix = pd.DataFrame(X, dtype=bool)
            X = build_base_clusterings(X)
        else:
            X = np.array(X, dtype=int)
            # 2 Calculate in-ensemble similarity
            # similarity = in_ensemble_similarity(X)
            # 3 Build the cluster membership matrix M
            membership_matrix = build_membership_matrix(X)

        # 4 Generate FCPs from M for minsupport = 0
        # 5 Sort the FCPs in ascending order according to the size of the instance sets
        frequent_closed_itemsets = linear_closed_itemsets_miner(membership_matrix)
        # 6 MaxDT ← length(BaseClusterings)
        max_d_t = len(X)
        # 7 BiClust ← {instance sets of FCPs built from MaxDT base clusters}
        bi_clust = build_bi_clust(membership_matrix, frequent_closed_itemsets, max_d_t)
        # 8 Assign a label to each set in BiClust to build the first consensus vector
        #   and store it in a list of vectors ConsVctrs
        self.consensus_vectors = [0] * max_d_t
        self.consensus_vectors[max_d_t - 1] = self._assign_labels(bi_clust, X)

        # 9 Build the remaining consensuses
        # 10 for DT = (MaxDT−1) to 1 do
        for d_t in range(max_d_t - 1, 0, -1):
            # 11 BiClust ← BiClust ∪ {instance sets of FCPs built from DT base clusters}
            bi_clust += build_bi_clust(membership_matrix, frequent_closed_itemsets, d_t)
            # 12 Call the consensus function (Algo. 10)
            self.consensus_function(bi_clust, self.merging_threshold)
            # 13 Assign a label to each set in BiClust to build a consensus vector
            #    and add it to ConsVctrs
            self.consensus_vectors[d_t - 1] = self._assign_labels(bi_clust, X)
        # 14 end

        # 15 Remove similar consensuses
        # 16 ST ← Vector of ‘1’s of length MaxDT
        self.decision_thresholds = list(range(1, max_d_t + 1))
        self.stability = [1] * max_d_t
        # 17 for i = MaxDT to 2 do
        i = max_d_t - 1
        while i > 0:
            # 18 Vi ← ith consensus in ConsVctrs
            consensus_i = self.consensus_vectors[i]
            # 19 for j = (i−1) to 1 do
            j = i - 1
            while j >= 0:
                # 20 Vj ← jth consensus in ConsVctrs
                consensus_j = self.consensus_vectors[j]
                # 21 if Jaccard(Vi , Vj ) = 1 then
                if self.similarity_measure(consensus_i, consensus_j) == 1:
                    # 22 ST [i] ← ST [i] + 1
                    self.stability[i] += 1
                    # 23 Remove ST [j]
                    del self.stability[j]
                    del self.decision_thresholds[j]
                    # 24 Remove Vj from ConsVctrs
                    del self.consensus_vectors[j]
                    i -= 1
                j -= 1
            i -= 1
            # 25 end
        # 26 end

        # 27 Find the consensus the most similar to the ensemble
        # 28 L ← length(ConsVctrs)
        consensus_count = len(self.consensus_vectors)
        # 29 TSim ← Vector of ‘0’s of length L
        t_sim = np.zeros(consensus_count)
        # 30 for i = 1 to L do
        for i in range(consensus_count):
            # 31 Ci ← ith consensus in ConsVctrs
            consensus_i = self.consensus_vectors[i]
            # 32 for j = 1 to MaxDT do
            for j in range(max_d_t):
                # 33 Cj ← jth clustering in BaseClusterings
                consensus_j = X[j]
                # 34 TSim[i] ← TSim[i] + Jaccard(Ci,Cj)
                t_sim[i] += self.similarity_measure(consensus_i, consensus_j)
            # 35 end
            # 36 Sim[i] ← TSim[i] / MaxDT
            t_sim[i] /= max_d_t
        # 37 end
        self.recommended = np.where(t_sim == np.amax(t_sim))[0][0]
        self.labels_ = self.consensus_vectors[self.recommended]

        self.tree_quality = 1
        if len(np.unique(self.consensus_vectors[0])) == 1:
            self.tree_quality -= (self.stability[0] - 1) / max(self.decision_thresholds)

        self.ensemble_similarity = t_sim
        return self

    def cons_tree(self) -> graphviz.Digraph:
        """Returns a ConsTree graph. Requires the `fit` method to be called first."""

        graph = graphviz.Digraph()
        graph.attr(
            "graph", label=f"ConsTree\nTree Quality = {self.tree_quality}", labelloc="t"
        )
        unique_count = [
            np.unique(vec, return_counts=True) for vec in self.consensus_vectors
        ]
        max_size = len(self.consensus_vectors[0])

        previous = []
        for i, nodes_count in enumerate(unique_count):
            attributes = {
                "fillcolor": "slategray2",
                "shape": "ellipse",
                "style": "filled",
            }
            if i == self.recommended:
                attributes.update({"fillcolor": "darkseagreen", "shape": "box"})
            for j in range(len(nodes_count[0])):
                node_id = f"{i}{nodes_count[0][j]}"
                attributes["width"] = str(int(9 * nodes_count[1][j] / max_size))
                graph.attr("node", **attributes)
                graph.node(node_id, str(nodes_count[1][j]))
                if i == 0:
                    continue
                for node in np.unique(
                    previous[self.consensus_vectors[i] == nodes_count[0][j]]
                ):
                    graph.edge(f"{i - 1}{node}", node_id)

            previous = self.consensus_vectors[i]
            with graph.subgraph(name="cluster") as sub_graph:
                sub_graph.attr("graph", label="Legend")
                sub_graph.attr("node", shape="box", width="")
                values = [
                    f"DT={self.decision_thresholds[i]}",
                    f"ST={self.stability[i]}",
                    f"Similarity={round(self.ensemble_similarity[i], 2)}",
                ]
                sub_graph.node(f"legend_{i}", " ".join(values))
                if i > 0:
                    sub_graph.edge(f"legend_{i-1}", f"legend_{i}")
        return graph

    def _assign_labels(self, bi_clust: list[set], base_clusterings: np.ndarray):
        """Returns a consensus vector with labels for each instance set in bi_clust."""

        result = np.zeros(len(base_clusterings[0]), dtype=int)
        if not self.optimize_label_names:
            for i, itemset in enumerate(bi_clust):
                result[list(itemset)] = i
            return result
        unique_labels = np.unique(base_clusterings.flatten()).tolist()
        max_label = max(unique_labels)
        for i in range(len(bi_clust) - len(unique_labels)):
            unique_labels.append(max_label + i + 1)

        cost_matrix = pd.DataFrame(
            0.0, index=range(len(bi_clust)), columns=unique_labels
        )
        for i, itemset in enumerate(map(list, bi_clust)):
            for j, label in enumerate(unique_labels):
                labels = np.ones(len(itemset)) * label
                score = np.array(
                    [
                        self.similarity_measure(clustering[itemset], labels)
                        for clustering in base_clusterings
                    ]
                )
                cost_matrix.loc[i, j] = len(itemset) * (1 + score).sum()

        col_ind = linear_sum_assignment(
            cost_matrix.apply(lambda x: x.max() - x, axis=1)
        )[1]

        for i, itemset in enumerate(bi_clust):
            result[list(itemset)] = col_ind[i]
        return result

    @staticmethod
    def _parse_argument(name, arguments, argument) -> Callable:
        """Returns the function that corresponds to the argument."""

        if callable(argument):
            return argument
        value = arguments.get(argument, None)
        if not value:
            raise ValueError(
                f"Invalid value for `{name}` argument. "
                f"Should be one of ({', '.join(arguments.keys())}) or a function. "
                f"But received `{argument}` instead."
            )
        return value

__init__(consensus_function='consensus_function_10', merging_threshold=0.5, similarity_measure='JaccardSimilarity', optimize_label_names=False)

Initializes MultiCons.

Source code in multicons/core.py
def __init__(
    self,
    consensus_function: Union[
        Literal[
            "consensus_function_10",
            "consensus_function_12",
            "consensus_function_13",
            "consensus_function_14",
            "consensus_function_15",
        ],
        Callable[[list[np.ndarray]], None],
    ] = "consensus_function_10",
    merging_threshold: float = 0.5,
    similarity_measure: Union[
        Literal["JaccardSimilarity", "JaccardIndex"],
        Callable[[np.ndarray, np.ndarray], float],
    ] = "JaccardSimilarity",
    optimize_label_names: bool = False,
):
    """Initializes MultiCons."""

    self.consensus_function = self._parse_argument(
        "consensus_function", self._consensus_functions, consensus_function
    )
    self.similarity_measure = self._parse_argument(
        "similarity_measure", self._similarity_measures, similarity_measure
    )
    self.merging_threshold = merging_threshold
    self.optimize_label_names = optimize_label_names
    self.consensus_vectors = None
    self.decision_thresholds = None
    self.ensemble_similarity = None
    self.labels_ = None
    self.recommended = None
    self.stability = None
    self.tree_quality = None

cons_tree()

Returns a ConsTree graph. Requires the fit method to be called first.
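
A usage sketch (the toy clusterings are made up; rendering requires a local Graphviz installation):

import numpy as np

from multicons import MultiCons

base_clusterings = [
    np.array([0, 0, 0, 1, 1, 1]),
    np.array([0, 0, 1, 1, 2, 2]),
]
consensus = MultiCons().fit(base_clusterings)
graph = consensus.cons_tree()
graph.render("constree", format="png")  # writes constree.png; in a notebook, display `graph` directly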

Source code in multicons/core.py
def cons_tree(self) -> graphviz.Digraph:
    """Returns a ConsTree graph. Requires the `fit` method to be called first."""

    graph = graphviz.Digraph()
    graph.attr(
        "graph", label=f"ConsTree\nTree Quality = {self.tree_quality}", labelloc="t"
    )
    unique_count = [
        np.unique(vec, return_counts=True) for vec in self.consensus_vectors
    ]
    max_size = len(self.consensus_vectors[0])

    previous = []
    for i, nodes_count in enumerate(unique_count):
        attributes = {
            "fillcolor": "slategray2",
            "shape": "ellipse",
            "style": "filled",
        }
        if i == self.recommended:
            attributes.update({"fillcolor": "darkseagreen", "shape": "box"})
        for j in range(len(nodes_count[0])):
            node_id = f"{i}{nodes_count[0][j]}"
            attributes["width"] = str(int(9 * nodes_count[1][j] / max_size))
            graph.attr("node", **attributes)
            graph.node(node_id, str(nodes_count[1][j]))
            if i == 0:
                continue
            for node in np.unique(
                previous[self.consensus_vectors[i] == nodes_count[0][j]]
            ):
                graph.edge(f"{i - 1}{node}", node_id)

        previous = self.consensus_vectors[i]
        with graph.subgraph(name="cluster") as sub_graph:
            sub_graph.attr("graph", label="Legend")
            sub_graph.attr("node", shape="box", width="")
            values = [
                f"DT={self.decision_thresholds[i]}",
                f"ST={self.stability[i]}",
                f"Similarity={round(self.ensemble_similarity[i], 2)}",
            ]
            sub_graph.node(f"legend_{i}", " ".join(values))
            if i > 0:
                sub_graph.edge(f"legend_{i-1}", f"legend_{i}")
    return graph

fit(X, y=None, sample_weight=None)

Computes the MultiCons consensus.

Parameters:

X (list of numeric numpy arrays or a pandas DataFrame)

Either a list of arrays where each array represents one clustering solution (base clusterings), or a DataFrame representing a binary membership matrix.

Required.
y (any)

Ignored. Not used, present here for API consistency by convention.

Default: None
sample_weight (any)

Ignored. Not used, present here for API consistency by convention.

Default: None

Returns:

self (MultiCons)

Returns the (fitted) instance itself.
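
Besides a list of base clusterings, fit also accepts a binary membership matrix. A sketch using build_membership_matrix to produce one (the toy clusterings are made up):

import numpy as np

from multicons import MultiCons, build_membership_matrix

base_clusterings = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
membership_matrix = build_membership_matrix(base_clusterings)  # boolean DataFrame
consensus = MultiCons().fit(membership_matrix)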

Source code in multicons/core.py
def fit(self, X, y=None, sample_weight=None):  # pylint: disable=unused-argument
    """Computes the MultiCons consensus.

    Args:
        X (list of numeric numpy arrays or a pandas DataFrame): Either a list of
            arrays where each array represents one clustering solution
            (base clusterings), or a DataFrame representing a binary membership
            matrix.
        y (any): Ignored. Not used, present here for API consistency by convention.
        sample_weight (any): Ignored. Not used, present here for API consistency by
            convention.

    Returns:
        self (MultiCons): Returns the (fitted) instance itself.
    """

    if isinstance(X, pd.DataFrame):
        membership_matrix = pd.DataFrame(X, dtype=bool)
        X = build_base_clusterings(X)
    else:
        X = np.array(X, dtype=int)
        # 2 Calculate in-ensemble similarity
        # similarity = in_ensemble_similarity(X)
        # 3 Build the cluster membership matrix M
        membership_matrix = build_membership_matrix(X)

    # 4 Generate FCPs from M for minsupport = 0
    # 5 Sort the FCPs in ascending order according to the size of the instance sets
    frequent_closed_itemsets = linear_closed_itemsets_miner(membership_matrix)
    # 6 MaxDT ← length(BaseClusterings)
    max_d_t = len(X)
    # 7 BiClust ← {instance sets of FCPs built from MaxDT base clusters}
    bi_clust = build_bi_clust(membership_matrix, frequent_closed_itemsets, max_d_t)
    # 8 Assign a label to each set in BiClust to build the first consensus vector
    #   and store it in a list of vectors ConsVctrs
    self.consensus_vectors = [0] * max_d_t
    self.consensus_vectors[max_d_t - 1] = self._assign_labels(bi_clust, X)

    # 9 Build the remaining consensuses
    # 10 for DT = (MaxDT−1) to 1 do
    for d_t in range(max_d_t - 1, 0, -1):
        # 11 BiClust ← BiClust ∪ {instance sets of FCPs built from DT base clusters}
        bi_clust += build_bi_clust(membership_matrix, frequent_closed_itemsets, d_t)
        # 12 Call the consensus function (Algo. 10)
        self.consensus_function(bi_clust, self.merging_threshold)
        # 13 Assign a label to each set in BiClust to build a consensus vector
        #    and add it to ConsVctrs
        self.consensus_vectors[d_t - 1] = self._assign_labels(bi_clust, X)
    # 14 end

    # 15 Remove similar consensuses
    # 16 ST ← Vector of ‘1’s of length MaxDT
    self.decision_thresholds = list(range(1, max_d_t + 1))
    self.stability = [1] * max_d_t
    # 17 for i = MaxDT to 2 do
    i = max_d_t - 1
    while i > 0:
        # 18 Vi ← ith consensus in ConsVctrs
        consensus_i = self.consensus_vectors[i]
        # 19 for j = (i−1) to 1 do
        j = i - 1
        while j >= 0:
            # 20 Vj ← jth consensus in ConsVctrs
            consensus_j = self.consensus_vectors[j]
            # 21 if Jaccard(Vi , Vj ) = 1 then
            if self.similarity_measure(consensus_i, consensus_j) == 1:
                # 22 ST [i] ← ST [i] + 1
                self.stability[i] += 1
                # 23 Remove ST [j]
                del self.stability[j]
                del self.decision_thresholds[j]
                # 24 Remove Vj from ConsVctrs
                del self.consensus_vectors[j]
                i -= 1
            j -= 1
        i -= 1
        # 25 end
    # 26 end

    # 27 Find the consensus the most similar to the ensemble
    # 28 L ← length(ConsVctrs)
    consensus_count = len(self.consensus_vectors)
    # 29 TSim ← Vector of ‘0’s of length L
    t_sim = np.zeros(consensus_count)
    # 30 for i = 1 to L do
    for i in range(consensus_count):
        # 31 Ci ← ith consensus in ConsVctrs
        consensus_i = self.consensus_vectors[i]
        # 32 for j = 1 to MaxDT do
        for j in range(max_d_t):
            # 33 Cj ← jth clustering in BaseClusterings
            consensus_j = X[j]
            # 34 TSim[i] ← TSim[i] + Jaccard(Ci,Cj)
            t_sim[i] += self.similarity_measure(consensus_i, consensus_j)
        # 35 end
        # 36 Sim[i] ← TSim[i] / MaxDT
        t_sim[i] /= max_d_t
    # 37 end
    self.recommended = np.where(t_sim == np.amax(t_sim))[0][0]
    self.labels_ = self.consensus_vectors[self.recommended]

    self.tree_quality = 1
    if len(np.unique(self.consensus_vectors[0])) == 1:
        self.tree_quality -= (self.stability[0] - 1) / max(self.decision_thresholds)

    self.ensemble_similarity = t_sim
    return self
from multicons import (
    build_membership_matrix, in_ensemble_similarity, linear_closed_itemsets_miner
)

multicons.utils

Utility functions

build_membership_matrix(base_clusterings)

Computes and returns the membership matrix.

Source code in multicons/utils.py
def build_membership_matrix(base_clusterings: np.ndarray) -> pd.DataFrame:
    """Computes and returns the membership matrix."""

    if len(base_clusterings) == 0 or not isinstance(base_clusterings[0], np.ndarray):
        raise IndexError("base_clusterings should contain at least one np.ndarray.")

    res = []
    for clusters in base_clusterings:
        res += [clusters == x for x in np.unique(clusters)]
    return pd.DataFrame(np.transpose(res), dtype=bool)
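
A sketch of the resulting matrix for two toy base clusterings, one boolean column per (clustering, cluster) pair and one row per instance:

import numpy as np

from multicons import build_membership_matrix

base_clusterings = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
print(build_membership_matrix(base_clusterings))
#        0      1      2      3
# 0   True  False   True  False
# 1   True  False  False   True
# 2  False   True  False   True
# 3  False   True  False   True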

in_ensemble_similarity(base_clusterings)

Returns the average similarity among the base clusterings using the Jaccard score.

Source code in multicons/utils.py
def in_ensemble_similarity(base_clusterings: list[np.ndarray]) -> float:
    """Returns the average similarity among the base clusters using Jaccard score."""

    if not base_clusterings or len(base_clusterings) < 2:
        raise IndexError("base_clusterings should contain at least two np.ndarrays.")

    count = len(base_clusterings)
    index = np.arange(count)
    similarity = pd.DataFrame(0.0, index=index, columns=index)
    average_similarity = np.zeros(count)
    for i in range(count - 1):
        cluster_i = base_clusterings[i]
        for j in range(i + 1, count):
            cluster_j = base_clusterings[j]
            score = jaccard_index(cluster_i, cluster_j)
            similarity.iloc[i, j] = similarity.iloc[j, i] = score
        average_similarity[i] = similarity.iloc[i].sum() / (count - 1)

    average_similarity[count - 1] = similarity.iloc[count - 1].sum() / (count - 1)
    return np.mean(average_similarity)
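
A usage sketch (toy base clusterings; the function requires at least two arrays):

import numpy as np

from multicons import in_ensemble_similarity

base_clusterings = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
print(in_ensemble_similarity(base_clusterings))  # average pairwise Jaccard score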

linear_closed_itemsets_miner(membership_matrix)

Returns a list of frequent closed itemsets using the LCM algorithm.

Source code in multicons/utils.py
def linear_closed_itemsets_miner(membership_matrix: pd.DataFrame):
    """Returns a list of frequent closed itemsets using the LCM algorithm."""

    transactions = []
    for i in membership_matrix.index:
        transactions.append(np.nonzero(membership_matrix.iloc[i].values)[0].tolist())
    frequent_closed_itemsets = eclat(transactions, target="c", supp=0, algo="o", conf=0)
    return sorted(map(lambda x: frozenset(x[0]), frequent_closed_itemsets), key=len)
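
A usage sketch (assumes the pyfim package, which provides eclat, is installed):

import numpy as np

from multicons import build_membership_matrix, linear_closed_itemsets_miner

base_clusterings = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
membership_matrix = build_membership_matrix(base_clusterings)
itemsets = linear_closed_itemsets_miner(membership_matrix)
print(itemsets)  # frozensets of membership-matrix column indices, sorted by ascending size
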
from multicons import consensus_function_10

multicons.consensus

Consensus function definitions.

consensus_function_10(bi_clust, merging_threshold=None)

Updates bi_clust (a list of unique instance sets) in place.

Source code in multicons/consensus.py
def consensus_function_10(bi_clust: list[set], merging_threshold=None):
    """Returns a modified bi_clust (set of unique instance sets)."""

    i = 0
    count = len(bi_clust)
    while i < count - 1:
        bi_clust_i = bi_clust[i]
        j = i + 1
        while j < count:
            bi_clust_j = bi_clust[j]
            intersection_size = len(bi_clust_i.intersection(bi_clust_j))
            if intersection_size == 0:
                j += 1
                continue
            if intersection_size == len(bi_clust_i):
                # Bi⊂Bj
                del bi_clust[i]
                count -= 1
                i -= 1
                break
            if intersection_size == len(bi_clust_j):
                # Bj⊂Bi
                del bi_clust[j]
                count -= 1
                continue
            bi_clust[j] = bi_clust_i.union(bi_clust_j)
            del bi_clust[i]
            count -= 1
            i -= 1
            break
        i += 1

consensus_function_12(bi_clust, merging_threshold=0.5)

Updates bi_clust (a list of unique instance sets) in place.

Source code in multicons/consensus.py
def consensus_function_12(bi_clust: list[set], merging_threshold: float = 0.5):
    """Returns a modified bi_clust (set of unique instance sets)."""

    i = 0
    count = len(bi_clust)
    while i < count - 1:
        bi_clust_i = bi_clust[i]
        bi_clust_i_size = len(bi_clust_i)
        j = i + 1
        while j < count:
            bi_clust_j = bi_clust[j]
            bi_clust_j_size = len(bi_clust_j)
            bi_clust_intersection = bi_clust_i.intersection(bi_clust_j)
            intersection_size = len(bi_clust_intersection)
            if intersection_size == 0:
                j += 1
                continue
            if intersection_size == bi_clust_i_size:
                # Bi⊂Bj
                del bi_clust[i]
                count -= 1
                i -= 1
                break
            if intersection_size == bi_clust_j_size:
                # Bj⊂Bi
                del bi_clust[j]
                count -= 1
                continue
            if (
                intersection_size >= bi_clust_i_size * merging_threshold
                or intersection_size >= bi_clust_j_size * merging_threshold
            ):
                # Merge intersecting sets (Bj∩Bi / |Bi| > MT or Bj∩Bi / |Bj| > MT)
                bi_clust[j] = bi_clust_i.union(bi_clust_j)
                del bi_clust[i]
                count -= 1
                i -= 1
                break
            # Split intersecting sets (remove intersection from bigger set)
            if bi_clust_i_size <= bi_clust_j_size:
                bi_clust[j] = bi_clust_j - bi_clust_intersection
                continue
            bi_clust[i] = bi_clust_i - bi_clust_intersection
            i -= 1
            break
        i += 1

consensus_function_13(bi_clust, merging_threshold=0.5)

Updates bi_clust (a list of unique instance sets) in place.

Source code in multicons/consensus.py
def consensus_function_13(bi_clust: list[set], merging_threshold: float = 0.5):
    """Returns a modified bi_clust (set of unique instance sets)."""

    i = 0
    count = len(bi_clust)
    merging_threshold *= 2
    while i < count - 1:
        bi_clust_i = bi_clust[i]
        bi_clust_i_size = len(bi_clust_i)
        j = i + 1
        best_intersection_ratio = 0
        best_intersection_ratio_j = 0
        broken = False
        while j < count:
            bi_clust_j = bi_clust[j]
            bi_clust_j_size = len(bi_clust_j)
            bi_clust_intersection = bi_clust_i.intersection(bi_clust_j)
            intersection_size = len(bi_clust_intersection)
            if intersection_size == 0:
                j += 1
                continue
            if intersection_size == bi_clust_i_size:
                # Bi⊂Bj
                del bi_clust[i]
                count -= 1
                i -= 1
                broken = True
                break
            if intersection_size == bi_clust_j_size:
                # Bj⊂Bi
                del bi_clust[j]
                count -= 1
                continue
            average_intersection_ratio = (
                intersection_size
                * (bi_clust_j_size + bi_clust_i_size)
                / (bi_clust_j_size * bi_clust_i_size)
            )
            if average_intersection_ratio > best_intersection_ratio:
                best_intersection_ratio = average_intersection_ratio
                best_intersection_ratio_j = j
            j += 1

        if not broken and best_intersection_ratio > 0:
            if best_intersection_ratio >= merging_threshold:
                # Merge
                bi_clust[best_intersection_ratio_j] = bi_clust_i.union(
                    bi_clust[best_intersection_ratio_j]
                )
                del bi_clust[i]
                count -= 1
                continue
            # Split
            if bi_clust_i_size <= bi_clust_j_size:
                bi_clust[best_intersection_ratio_j] = (
                    bi_clust[best_intersection_ratio_j] - bi_clust_i
                )
                continue
            bi_clust[i] = bi_clust_i - bi_clust[best_intersection_ratio_j]
            continue
        i += 1

consensus_function_14(bi_clust, merging_threshold=0.5)

Updates bi_clust (a list of unique instance sets) in place.

Source code in multicons/consensus.py
def consensus_function_14(bi_clust: list[set], merging_threshold: float = 0.5):
    """Returns a modified bi_clust (set of unique instance sets)."""

    while True:
        _remove_subsets(bi_clust)
        bi_clust_size = len(bi_clust)
        if bi_clust_size == 1:
            return
        intersection_matrix = pd.DataFrame(
            columns=range(bi_clust_size), index=range(bi_clust_size), dtype=float
        )

        for i in range(bi_clust_size - 1):
            bi_clust_i = bi_clust[i]
            bi_clust_i_size = len(bi_clust_i)
            for j in range(i + 1, bi_clust_size):
                bi_clust_j = bi_clust[j]
                bi_clust_j_size = len(bi_clust_j)
                intersection_size = len(bi_clust_i.intersection(bi_clust_j))
                if intersection_size == 0:
                    continue
                intersection_matrix.iloc[i, j] = intersection_size / bi_clust_i_size
                intersection_matrix.iloc[j, i] = intersection_size / bi_clust_j_size

        if intersection_matrix.isna().values.all():
            break

        pointer = pd.DataFrame(columns=range(3), index=range(bi_clust_size))
        for i in range(bi_clust_size):
            if not intersection_matrix.iloc[i, :].isna().all():
                pointer.iloc[i, 0] = i
                pointer.iloc[i, 1] = intersection_matrix.iloc[i, :].argmax()
                pointer.iloc[i, 2] = intersection_matrix.iloc[i, pointer.iloc[i, 1]]

        pointer.sort_values(2, inplace=True, ascending=False)
        pointer = pointer[pointer.iloc[:, 2] > 0.0]

        for k in range(pointer.shape[0]):
            i = pointer.iloc[k, 0]
            j = pointer.iloc[k, 1]
            value = intersection_matrix.iloc[i, j]
            if np.isnan(value):
                continue
            if value >= merging_threshold:
                bi_clust[i] = bi_clust[i].union(bi_clust[j])
                bi_clust[j] = set()
                intersection_matrix.iloc[i, :] = np.nan
                intersection_matrix.iloc[:, j] = np.nan
                continue
            if len(bi_clust[i]) <= len(bi_clust[j]):
                bi_clust[j] = bi_clust[j] - bi_clust[i]
                intersection_matrix.iloc[j, :] = np.nan
                intersection_matrix.iloc[:, j] = np.nan
                continue
            bi_clust[i] = bi_clust[i] - bi_clust[j]
            intersection_matrix.iloc[i, :] = np.nan
            intersection_matrix.iloc[:, i] = np.nan

consensus_function_15(bi_clust, merging_threshold=0.5)

Updates bi_clust (a list of unique instance sets) in place.

Source code in multicons/consensus.py
def consensus_function_15(bi_clust: list[set], merging_threshold: float = 0.5):
    """Returns a modified bi_clust (set of unique instance sets)."""

    _remove_subsets(bi_clust)
    bi_clust_size = len(bi_clust)
    if bi_clust_size == 1:
        return
    intersection_matrix = pd.DataFrame(
        0, columns=range(bi_clust_size), index=range(bi_clust_size), dtype=float
    )
    for i in range(bi_clust_size - 1):
        bi_clust_i = bi_clust[i]
        bi_clust_i_size = len(bi_clust_i)
        for j in range(i + 1, bi_clust_size):
            bi_clust_j = bi_clust[j]
            bi_clust_j_size = len(bi_clust_j)
            intersection_size = len(bi_clust_i.intersection(bi_clust_j))
            if intersection_size == 0:
                continue
            intersection_matrix.iloc[i, j] = intersection_size / bi_clust_i_size
            intersection_matrix.iloc[j, i] = intersection_size / bi_clust_j_size

    cluster_indexes = sp.coo_matrix(intersection_matrix >= merging_threshold).nonzero()
    for index, i in enumerate(cluster_indexes[0]):
        j = cluster_indexes[1][index]
        bi_clust[i] = bi_clust[j] = bi_clust[i].union(bi_clust[j])

    _remove_subsets(bi_clust)
    bi_clust_size = len(bi_clust)
    if bi_clust_size == 1:
        return

    for i in range(bi_clust_size - 1):
        bi_clust_i = bi_clust[i]
        bi_clust_i_size = len(bi_clust_i)
        for j in range(i + 1, bi_clust_size):
            bi_clust_j = bi_clust[j]
            bi_clust_j_size = len(bi_clust_j)
            bi_clust_intersection = bi_clust_i.intersection(bi_clust_j)
            if len(bi_clust_intersection) == 0:
                continue
            if bi_clust_i_size <= bi_clust_j_size:
                bi_clust[j] = bi_clust_j - bi_clust_intersection
                continue
            bi_clust[i] = bi_clust_i - bi_clust_intersection

    _remove_subsets(bi_clust)