1 Introduction

In today’s era of rapidly developing computer technology, large volumes of complex data are being generated. To discover the information hidden in these data, data mining techniques have been studied in statistics, databases, machine learning and other fields [1]. Cluster analysis is an important tool in data mining. From the data mining perspective, its task is to analyze the characteristics of data sets and to provide technical methods for extracting meaningful and practically valuable knowledge [2, 3]. The main idea of a clustering algorithm is to divide a given population into groups or clusters according to the similarity between objects: related objects are placed in the same group as far as possible, while objects in different groups are kept as dissimilar as possible [4]. Cluster analysis is often used as a preparatory step for solving other problems, so it has been widely applied in many fields, such as medical diagnosis, biological data analysis, image processing and database management [5,6,7].

Nowadays, a variety of clustering algorithms have been proposed, each with its own strengths and weaknesses. No single clustering algorithm can deal with all data types or find clusters of arbitrary shape [8]. According to their characteristics, traditional clustering methods can be roughly divided into five categories: partitioning clustering, density-based clustering, grid-based clustering, model-based clustering and hierarchical clustering [9]. K-Means [10] is a classical partition-based clustering algorithm that works well on spherical clusters. However, it is sensitive to the initially selected cluster centers and prone to local optima, so it has difficulty detecting non-spherical clusters. DBSCAN [11] is a density-based clustering algorithm that can find clusters of any shape and handle outliers properly. However, it is very sensitive to its two parameters, the neighborhood radius and the core-point threshold, both of which must be specified by the user on the basis of experience. STING [12] is a grid-based clustering algorithm. The grid is organized into a hierarchical structure, which makes processing efficient but may reduce the quality and accuracy of the clusters. SOM [13] is a clustering algorithm based on a neural network model. It is conceptually simple and supports visualization, but it is hard to visualize data with more than two dimensions. Hierarchical clustering proceeds in one of two directions. One is to aggregate single points into clusters covering the whole data set, i.e., bottom-up agglomerative hierarchical clustering; the other is to take the whole data set as an initial cluster and continuously divide it down to atomic data objects, i.e., top-down divisive hierarchical clustering [14, 15]. In either case, the disadvantage is that merge and split operations are irreversible, so choosing which clusters to merge or split is critical. Chameleon, proposed by Karypis et al. [16], is a typical agglomerative hierarchical clustering algorithm based on dynamic modeling. It overcomes the limitations of traditional clustering and can find high-quality clusters of any shape, size and density.

The Chameleon algorithm consists of three key steps: sparse graph construction, graph partitioning and agglomerative hierarchical clustering, as shown in Fig. 1. The algorithm combines the interconnectivity and closeness of objects to form new clusters dynamically, keeping the relationships within a cluster tight and those between clusters sparse. Furthermore, the algorithm takes the internal characteristics of the clusters into account: it does not rely on a static model provided by the user, and it automatically adapts to the internal characteristics of the merged clusters. Therefore, Chameleon can detect clusters of different shapes, sizes and densities. Although Chameleon performs well on many data sets, it still has some shortcomings. Firstly, the value of k used to construct the k-nearest neighbors graph is sensitive and has a wide admissible range, making it difficult to determine [17]. Secondly, the number of sub-graphs must be specified by the user in the graph-partitioning stage. The partitioning technique used is the hMETIS algorithm [18], a multi-level partitioning method that divides the graph by minimizing the weight of the cut edges. The result of each partition is uncertain, which affects the final clustering result [19]. Finally, Chameleon merges clusters by agglomerative hierarchical clustering, and the stopping parameter, i.e., the final number of clusters, must also be specified by the user, which is highly subjective and directly affects the quality of the clustering result.

Fig. 1 The main process of Chameleon algorithm

In recent years, various scholars have improved the Chameleon algorithm to address the above shortcomings. Xue et al. [17] proposed a new Chameleon algorithm that introduces a network modularity evaluation function to solve the minimum bisection in graph partitioning. However, its graph partitioning is still essentially the hMETIS algorithm, so the uncertainty problem remains. Guo et al. [19] proposed the AChameleon algorithm, which uses simple hierarchical clustering instead of the traditional hMETIS algorithm to generate sub-clusters; unfortunately, compared with the original Chameleon algorithm, it takes much longer to execute. Zhang et al. [20] proposed the E_CFSFDP algorithm, which combines the novel density-peak clustering algorithm CFSFDP with Chameleon: CFSFDP generates the initial sub-clusters, and Chameleon is used to merge them. However, the algorithm has many initialization parameters, which are difficult to determine without prior knowledge. Barton et al. [21] proposed the Chameleon2 algorithm, an automatic cut-off selection method for finding high-quality clusterings; it needs neither expert interaction nor parameter setting and handles various data sets efficiently, but there is still room for improvement in its recursive graph-partitioning stage. These improved algorithms start from reducing parameters or replacing the graph partitioning, but they still have problems.

To overcome the shortcomings of the Chameleon algorithm, this paper proposes a Chameleon algorithm based on mutual k-nearest neighbors (MChameleon). Firstly, the mutual k-nearest neighbors algorithm is used to generate sub-clusters directly, which avoids the uncertainty caused by the partitioning technique. Moreover, MC modularity is introduced to determine the final clustering result automatically, so the user does not need to specify the number of clusters, which avoids subjectivity. Experiments show that the proposed improved algorithm greatly reduces the number of input parameters while preserving accuracy. To summarize, the major contributions of the MChameleon algorithm are as follows:

  1. We propose to construct a mutual k-nearest neighbors graph to directly generate sub-clusters, avoiding the uncertainty caused by graph partitioning.

  2. We introduce MC modularity to automatically identify the final clustering result, without requiring users to specify the number of clusters, which avoids subjectivity.

  3. Our experimental results show that the MChameleon algorithm can not only retain the ability of the original Chameleon algorithm to find clusters of any shape and size, but also improve the clustering quality on many data sets while reducing the user-specified parameters.

The rest of this paper is organized as follows. In Section 2, we introduce the principle of the original Chameleon algorithm, the idea of the mutual k-nearest neighbors algorithm and the concept of MC modularity. In Section 3, we describe the MChameleon algorithm in detail. In Section 4, we present the experimental analysis on artificial data sets and UCI data sets and analyze the performance of the proposed algorithm. Finally, we conclude the paper and outline future work in the last section.

2 Related work

2.1 Original Chameleon algorithm

The advantage of the Chameleon algorithm is that it combines relative interconnectivity and relative closeness to measure the similarity between clusters and thereby identify the most similar pair of clusters. Similar ideas appear in other fields. Wang et al. [22] use the linking density and the cohesion of communities to determine the relationships between communities: the linking density reflects the closeness within a community, while the cohesion describes how independent a community is from the other communities in the network. Combining the linking density and the cohesion yields a modularity used to identify different communities. Garruzzo et al. [23] proposed a clustering technique based on the HISENE semantic negotiation protocol, which uses similarity values with lexical, structural and semantic components. Hence, measuring the similarity between objects with multiple quantities is effective in many fields.

Chameleon [16] initializes the data set as a graph, and the process of cluster discovery is combined with a novel hierarchical clustering. Chameleon is a framework based on dynamic modeling and consists of three phases, which proceed as follows. First, the data set is constructed into a k-nearest neighbors sparse graph GKNN; the construction of the k-nearest neighbors graph is described in Section 2.2. Second, the k-nearest neighbors graph is partitioned into a large number of relatively small sub-graphs, where the number of sub-graphs must be specified. The hMETIS partitioning algorithm is used for this step; its idea is to divide a graph into sub-graphs of approximately equal size while minimizing the cut edges [21]. Finally, the resulting sub-graphs are regarded as the initial sub-clusters, and the most similar pair of sub-clusters is repeatedly merged until the specified number of clusters is reached. The similarity between clusters is determined by their relative interconnectivity and relative closeness: two clusters are merged preferentially only when the interconnectivity and closeness of the objects in the merged cluster are similar to those of each original cluster. Because the algorithm considers the inherent characteristics of the clusters themselves, it can cluster spatial data efficiently.

The relative interconnectivity between clusters Ci and Cj is defined as:

$$ RI\left({C}_i,{C}_j\right)=\frac{\left| EC\left({C}_i,{C}_j\right)\right|}{\frac{\left| EC\left({C}_i\right)\right|+\left| EC\left({C}_j\right)\right|}{2}} $$
(1)

where ∣EC(Ci, Cj)∣ is the sum of the weights of the edges that straddle clusters Ci and Cj, and ∣EC(Ci)∣ and ∣EC(Cj)∣ are the sums of the weights of the edges in the min-cut bisectors of clusters Ci and Cj, respectively.

The relative closeness between clusters Ci and Cj is shown in the following formula (2):

$$ RC\left({C}_i,{C}_j\right)=\frac{\overline{S} EC\left({C}_i,{C}_j\right)}{\frac{\left|{C}_i\right|}{\left|{C}_i\right|+\left|{C}_j\right|}\overline{S} EC\left({C}_i\right)+\frac{\left|{C}_j\right|}{\left|{C}_i\right|+\left|{C}_j\right|}\overline{S} EC\left({C}_j\right)} $$
(2)

where \( \overline{S} EC\left({C}_i,{C}_j\right) \) is the average weight of the edges that connect clusters Ci and Cj, \( \overline{S} EC\left({C}_i\right) \) and \( \overline{S} EC\left({C}_j\right) \) are the average weights of the edges in the min-cut bisectors of clusters Ci and Cj, respectively, and |Ci| and |Cj| are the numbers of nodes in each cluster.

The similarity between clusters is measured by the above two indexes. The higher the relative interconnectivity and relative closeness between two clusters, the higher their similarity, and the more preferentially the two clusters will be merged. The similarity function of clusters Ci and Cj is defined as follows:

$$ RI\left({C}_i,{C}_j\right)\ast RC{\left({C}_i,{C}_j\right)}^{\alpha } $$
(3)

where α is a user-specified parameter that adjusts the two indexes. If α > 1, higher preference is given to the relative closeness, and if α < 1, more attention is paid to the relative interconnectivity.

The overall process of the Chameleon algorithm is shown in Algorithm 1.


2.2 Mutual k-nearest neighbors

Due to the limitations of the hMETIS partitioning algorithm, this paper introduces the mutual k-nearest neighbors algorithm to construct a mutual k-nearest neighbors graph [24, 25], which directly generates sub-graphs that serve as the initial sub-clusters in the subsequent steps. This section describes in detail how the k-nearest neighbors graph is constructed and how the mutual k-nearest neighbors algorithm is used to form sub-graphs.

To find the k-nearest neighbors of an object, the distances between that object and all other objects in the data set are first computed and sorted in ascending order; the first k objects are its k-nearest neighbors. Suppose there are n data points X = {x1, x2, …, xn} drawn from an M-dimensional space. The next step is to construct the k-nearest neighbors graph GKNN = (V, E), where V is the set of vertices, each vertex representing a data object, and E is the set of edges, an edge between two vertices representing the relationship between the corresponding data objects. If xj is among the k-nearest neighbors of xi, an edge is added between the two vertices, and the weight of the edge is the reciprocal of the distance between them. In this paper, the distance between points is the Euclidean distance. Next, we describe how the mutual k-nearest neighbors graph is derived from the k-nearest neighbors graph.

As the name suggests, two points are mutual k-nearest neighbors if and only if each is among the k-nearest neighbors of the other. Thus, the mutual k-nearest neighbors graph GMKNN = (V, E') is obtained by modifying the k-nearest neighbors graph GKNN: an edge is added between two vertices only if xj is among the k-nearest neighbors of xi and xi is among the k-nearest neighbors of xj. Here E' is the set of edges in the mutual k-nearest neighbors graph, and the weight of an edge is still the reciprocal of the distance between the two points; in our algorithm, it is not actually necessary to compute the edge weights in GMKNN. Obviously, the mutual k-nearest neighbors graph has fewer edges and more easily splits into unconnected sub-graphs. Figure 2 shows the original 2D data (Fig. 2a), the 3-nearest neighbors graph (Fig. 2b) and the mutual 3-nearest neighbors graph (Fig. 2c). It can be seen that the k-nearest neighbors graph is very complicated, while the mutual k-nearest neighbors graph is concise and easily forms several unconnected sub-graphs. In the algorithm proposed in this paper, each unconnected sub-graph is regarded as an initial sub-cluster.
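To make the construction above concrete, the following is a minimal sketch of building the k-nearest neighbors graph GKNN and the mutual k-nearest neighbors graph GMKNN, and of taking the unconnected components of GMKNN as initial sub-clusters. The function names and the use of the numpy and networkx libraries are illustrative choices of ours, not the authors' implementation; the O(n²) pairwise-distance computation mirrors the complexity analysis in Section 3.

import numpy as np
import networkx as nx

def knn_indices(X, k):
    # For every point, return the indices of its k nearest neighbors (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k], dists

def build_knn_graph(X, k):
    # k-NN graph: an edge (i, j) whenever j is among the k nearest neighbors of i;
    # the edge weight is the reciprocal of the Euclidean distance.
    neigh, dists = knn_indices(X, k)
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i, row in enumerate(neigh):
        for j in row:
            G.add_edge(i, int(j), weight=1.0 / dists[i, j])
    return G

def build_mutual_knn_graph(X, k):
    # Mutual k-NN graph: keep an edge (i, j) only if i and j are each other's k-nearest neighbors.
    neigh, dists = knn_indices(X, k)
    neigh_sets = [set(map(int, row)) for row in neigh]
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for j in neigh_sets[i]:
            if i in neigh_sets[j]:
                G.add_edge(i, j, weight=1.0 / dists[i, j])
    return G

def initial_subclusters(X, k):
    # Each unconnected component of the mutual k-NN graph is an initial sub-cluster.
    return [sorted(c) for c in nx.connected_components(build_mutual_knn_graph(X, k))]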

Fig. 2 Original data in 2D and nearest neighbors graphs: a original data in 2D, b 3-nearest neighbors graph, c mutual 3-nearest neighbors graph

2.3 MC modularity

In [26], the concept of modularity was first proposed; it is mainly used to measure the quality of community detection algorithms: the greater the modularity, the more reasonable the community division. As more and more scholars have studied modularity, the concept has also been used during community division to determine the division result [27]. Modularity represents the difference between the actual fraction of edges within communities and the expected fraction of random edges between their nodes; the greater the difference, the more tightly connected the nodes in a community. Therefore, the higher the modularity of a division, the better the community structure. In the merging stage, the algorithm proposed in this paper adopts this idea to determine the final clustering result: each initial sub-graph is regarded as a cluster, analogous to a community, the modularity is computed after each merge, and the merge step with the largest modularity corresponds to the final clustering result.

However, when the sizes of communities differ greatly, the traditional modularity is inconsistent with the actual natural communities. Our algorithm therefore uses an improved modularity based on coupling degree (called MC modularity) [28] to determine the clustering results. As the name implies, MC modularity is a modularity based on the coupling degree between clusters. It evaluates clustering results according to the property that nodes inside a cluster are closely related while connections between clusters are relatively sparse. MC modularity is introduced below (a small computational sketch is given after the definitions):

  • Definition 1 number of edges within a cluster. Suppose a data set has m sample points, C = {C1, C2, …, Cn} denotes the n sub-graphs of the data set, i.e., n clusters, and ni is the number of vertices in the ith cluster. For two vertices v and w, Avw = 1 if there is an edge between them and Avw = 0 otherwise. The number of edges within cluster Ci is defined as:

$$ e\left({C}_i\right)=\sum \limits_{v,\mathrm{w}\in {C}_i}{A}_{vw} $$
(4)
  • Definition 2 number of edges between clusters. Using the notation of Definition 1, the number of edges between clusters Ci and Cj is defined as:

$$ e\left({C}_i,{C}_j\right)=\sum \limits_{v\in {C}_i,w\in {C}_j}{A}_{vw} $$
(5)
  • Definition 3 connection density in the cluster. The connection density D(Ci) in cluster Ci is defined as:

$$ D\left({C}_i\right)=\frac{2e\left({C}_i\right)}{n_i\ast \left({n}_i-1\right)} $$
(6)
  • Definition 4 connection density between clusters. The connection density L(Ci, Cj) between clusters Ci and Cj is defined as:

$$ L\left({C}_i,{C}_j\right)=\frac{e\left({C}_i,{C}_j\right)}{n_i\ast {n}_j} $$
(7)
  • Definition 5 coupling degree between clusters. The coupling degree C(Ci, Cj) between clusters Ci and Cj is defined as:

$$ C\left({C}_i,{C}_j\right)=\frac{L\left({C}_i,{C}_j\right)}{D\left({C}_i\right)+D\left({C}_j\right)+1} $$
(8)
  • Definition 6 MC modularity [28]. The MC modularity MC(C1, C2, …, Cn) of clusters C1, C2, …, Cn is defined as:

$$ MC\left({C}_1,{C}_2,\dots, {C}_n\right)=\left\{\begin{array}{ll}1-\frac{2}{n\ast \left(n-1\right)}\sum \limits_i\sum \limits_{j>i}C\left({C}_i,{C}_j\right), & n>1\\ {}1, & n=1\end{array}\right. $$
(9)

where \( \frac{2}{n\ast \left(n-1\right)}\sum \limits_i\sum \limits_{j>i}C\left({C}_i,{C}_j\right) \) is the average coupling degree over all pairs of clusters. The lower the average coupling degree, the better the clustering effect. Therefore, the larger the MC modularity, the better the clustering.
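As a concrete reading of Definitions 1-6, the sketch below computes the MC modularity of a partition from a 0/1 adjacency matrix A (Avw = 1 iff an edge exists). The function and variable names are ours, for illustration only; e(Ci) counts each intra-cluster edge once, as Definition 1 intends.

import numpy as np
from itertools import combinations

def intra_edges(A, Ci):
    # Eq. (4): number of edges inside cluster Ci (each undirected edge counted once).
    return A[np.ix_(Ci, Ci)].sum() / 2.0

def inter_edges(A, Ci, Cj):
    # Eq. (5): number of edges between clusters Ci and Cj.
    return A[np.ix_(Ci, Cj)].sum()

def density(A, Ci):
    # Eq. (6): connection density inside Ci.
    n_i = len(Ci)
    return 0.0 if n_i < 2 else 2.0 * intra_edges(A, Ci) / (n_i * (n_i - 1))

def link_density(A, Ci, Cj):
    # Eq. (7): connection density between Ci and Cj.
    return inter_edges(A, Ci, Cj) / (len(Ci) * len(Cj))

def coupling(A, Ci, Cj):
    # Eq. (8): coupling degree between Ci and Cj.
    return link_density(A, Ci, Cj) / (density(A, Ci) + density(A, Cj) + 1.0)

def mc_modularity(A, clusters):
    # Eq. (9): 1 minus the average coupling degree over all pairs of clusters.
    n = len(clusters)
    if n == 1:
        return 1.0
    total = sum(coupling(A, Ci, Cj) for Ci, Cj in combinations(clusters, 2))
    return 1.0 - 2.0 / (n * (n - 1)) * total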

3 MChameleon algorithm

The Chameleon algorithm based on mutual k-nearest neighbors (MChameleon) skips the graph-partitioning step: it does not use the hMETIS algorithm and does not need the number of sub-clusters to be specified, which improves the accuracy of the algorithm to a certain extent. Our algorithm also introduces the concept of MC modularity to identify the clustering result objectively, which overcomes the need of the Chameleon algorithm for a user-specified number of clusters. As can be seen from Algorithm 1, the original Chameleon algorithm requires up to four user-specified parameters, all of which are sensitive and easily affect the final clustering result. Generally speaking, the greatest advantage of our algorithm is that it reduces the number of user-specified parameters without decreasing the accuracy.

MChameleon is also divided into three parts. Firstly, the data set is constructed into a 2k-nearest neighbors graph G2KNN, where 2k is defined relative to the k used for the mutual k-nearest neighbors. In a k-nearest neighbors graph, a larger k connects more neighboring points, which strengthens the relationships between an object and the other objects; since 2k > k, the 2k-nearest neighbors graph captures stronger relationships than the k-nearest neighbors graph. To generate as many sub-graphs as possible, k is kept small when constructing the mutual k-nearest neighbors graph, and in that case a k-nearest neighbors graph alone represents the relationships between data rather weakly. Therefore, when constructing the nearest neighbors graph we build the 2k-nearest neighbors graph (a 3k- or 4k-nearest neighbors graph could be built in the same way); in this paper the 2k-nearest neighbors graph G2KNN is used. The mutual k-nearest neighbors graph is then constructed; the construction of G2KNN and GMKNN is similar to that described in Section 2.2. Secondly, the small sub-graphs of the mutual k-nearest neighbors graph are used as the initial sub-clusters. If a sub-cluster contains only one node, we handle it first in order to improve the efficiency of the algorithm: the single node is assigned to the cluster to which its nearest node belongs (an end-to-end sketch of these stages follows Algorithm 2 below). Finally, the similarity between clusters is computed from the relative interconnectivity and relative closeness, and the two clusters with the maximum similarity, i.e., with both high relative interconnectivity and high closeness, are selected for merging. At the same time, the MC modularity value is computed from formula (9) to find the best clustering result automatically. Because the original relative interconnectivity and relative closeness rely on variables obtained from the minimum bisection computed by the hMETIS algorithm, these two indexes are also unstable; the MChameleon algorithm therefore uses modified formulas for relative interconnectivity and relative closeness, which are as follows:

The relative interconnectivity is shown in the formula (10):

$$ RI{\left({C}_i,{C}_j\right)}^{\prime }=\frac{E\left({C}_i,{C}_j\right)}{\frac{\left(\overline{S}E\left({C}_i\right)+\overline{S}E\left({C}_j\right)\right)\ast \left|E\left({C}_i,{C}_j\right)\right|}{2}} $$
(10)

where E(Ci, Cj) is the sum of the weights of the edges that straddle clusters Ci and Cj, ∣E(Ci, Cj)∣ is the number of edges that connect clusters Ci and Cj, and \( \overline{S}E\left({C}_i\right) \) and \( \overline{S}E\left({C}_j\right) \) are the average weights of the edges within each cluster, respectively.

The relative closeness is shown in the formula (11):

$$ RC{\left({C}_i,{C}_j\right)}^{\prime }=\frac{\overline{S}E\left({C}_i,{C}_j\right)}{\frac{\mid {C}_i\mid }{\mid {C}_i\mid +\mid {C}_j\mid}\ast \overline{S}E\left({C}_i\right)+\frac{\mid {C}_j\mid }{\mid {C}_i\mid +\mid {C}_j\mid}\ast \overline{S}E\left({C}_j\right)} $$
(11)

where \( \overline{S}E\left({C}_i,{C}_j\right) \) is the average weight of the edges that connect clusters Ci and Cj, ∣Ci∣ and ∣Cj∣ are the numbers of nodes in clusters Ci and Cj, and \( \overline{S}E\left({C}_i\right) \) and \( \overline{S}E\left({C}_j\right) \) are again the average weights of the edges within each cluster, respectively.

The similarity formula is shown in (12):

$$ Sim\left({C}_i,{C}_j\right)= RI{\left({C}_i,{C}_j\right)}^{\prime}\ast {\left( RC{\left({C}_i,{C}_j\right)}^{\prime}\right)}^{\alpha } $$
(12)

where α is a user-specified parameter to balance the importance of relative interconnectivity and relative closeness.
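The following sketch implements Eqs. (10)-(12) and the merging stage that tracks MC modularity after every merge. We assume, as our reading of the text rather than an explicit statement in it, that the similarity is evaluated on the weighted 2k-nearest neighbors graph and that MC modularity is evaluated on its unweighted adjacency matrix A; mc_modularity is the sketch from Section 2.3, all other names are illustrative, and the tiny 1e-12 terms only guard against zero denominators in this sketch.

import numpy as np

def cross_edges(G, Ci, Cj):
    # Weights of the edges of G whose endpoints lie in Ci and Cj respectively.
    Cj_set = set(Cj)
    return [G[u][v]["weight"] for u in Ci for v in G[u] if v in Cj_set]

def mean_internal_weight(G, Ci):
    # Average weight of the edges with both endpoints in Ci (each edge is seen twice; the mean is unaffected).
    w = cross_edges(G, Ci, Ci)
    return float(np.mean(w)) if w else 0.0

def similarity(G, Ci, Cj, alpha=2.0):
    # Sim(Ci, Cj) of Eq. (12), built from RI' (Eq. 10) and RC' (Eq. 11).
    w = cross_edges(G, Ci, Cj)
    if not w:
        return 0.0                                   # no connecting edges: never merged first
    sbar_ij = float(np.mean(w))
    sbar_i, sbar_j = mean_internal_weight(G, Ci), mean_internal_weight(G, Cj)
    ri = sum(w) / (((sbar_i + sbar_j) * len(w)) / 2.0 + 1e-12)
    ni, nj = len(Ci), len(Cj)
    rc = sbar_ij / ((ni / (ni + nj)) * sbar_i + (nj / (ni + nj)) * sbar_j + 1e-12)
    return ri * rc ** alpha

def merge_with_mc(G, A, clusters, alpha=2.0):
    # Greedily merge the most similar pair and keep the partition with the largest MC modularity (Eq. 9).
    best = (mc_modularity(A, clusters), [list(c) for c in clusters])
    while len(clusters) > 1:
        pairs = [(similarity(G, clusters[i], clusters[j], alpha), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = max(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        mc = mc_modularity(A, clusters)
        if mc > best[0]:
            best = (mc, [list(c) for c in clusters])
    return best[1]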

In order to understand the whole algorithm process more concisely, the flow chart of MChameleon algorithm is shown as follows (Fig. 3):

Fig. 3 Flowchart of MChameleon algorithm

The overall process of the Chameleon algorithm based on mutual k-nearest neighbors (MChameleon) is shown in Algorithm 2.
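Since Algorithm 2 is presented as a figure, the sketch below assembles the pieces sketched in Sections 2.2 and 2.3 and above into a single end-to-end routine. It is our illustrative reading of the three stages, not the authors' code; the function name mchameleon is hypothetical, and the singleton step assumes at least one non-singleton sub-cluster exists.

import numpy as np
import networkx as nx

def mchameleon(X, k=3, alpha=2.0):
    # Stage 1: weighted 2k-NN graph and initial sub-clusters from the mutual k-NN graph.
    G2k = build_knn_graph(X, 2 * k)
    clusters = [list(c) for c in initial_subclusters(X, k)]

    # Stage 2: attach every singleton sub-cluster to the cluster of its nearest point.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    singles = [c[0] for c in clusters if len(c) == 1]
    clusters = [c for c in clusters if len(c) > 1]
    for p in singles:
        order = np.argsort(dists[p])
        target = next(c for q in order for c in clusters if int(q) in c)
        target.append(p)

    # Stage 3: agglomerative merging guided by Sim (Eq. 12), stopped by MC modularity (Eq. 9).
    A = nx.to_numpy_array(G2k, weight=None)      # 0/1 adjacency of the 2k-NN graph
    final_clusters = merge_with_mc(G2k, A, clusters, alpha)

    labels = np.empty(len(X), dtype=int)
    for cid, c in enumerate(final_clusters):
        labels[c] = cid
    return labels

# usage: labels = mchameleon(X, k=3), with X an (n, M) numpy array of samples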


Analysis of computational complexity: The computational complexity of the MChameleon algorithm mainly comes from constructing the 2k-nearest neighbors graph, constructing the mutual k-nearest neighbors graph, calculating the similarity between clusters, and merging clusters. Suppose the sample size of a data set is n. When constructing the k-nearest neighbors graph, the Euclidean distances between the samples must be calculated first; computing all pairwise distances costs O(n²). We use quick sort to sort the distances, with an average complexity of O(n log n). Therefore, the complexity of constructing the k-nearest neighbors graph is O(n²) + O(n log n). Constructing the mutual k-nearest neighbors graph is a comparison process based on the k-nearest neighbors graph; the comparison takes O(n), so the total complexity of constructing the mutual k-nearest neighbors graph is O(n²) + O(n). Suppose the number of initial sub-clusters is m (m ≪ n); the complexity of calculating the similarity values between all pairs of clusters is O(m²). The merging stage is a typical agglomerative hierarchical clustering, with complexity O(m²).

Therefore, the main computational complexity of the entire Algorithm 2 is O(n²) + O(n log n) + O(n²) + O(n log n) + O(n) + O(m²) + O(m²) ≈ O(n²) + O(m²). Since m ≪ n, O(m²) can be ignored, so the final computational complexity is O(n²), the same as that of the Chameleon algorithm.

4 Experiments and analysis

This section verifies the practicability and effectiveness of the proposed MChameleon algorithm. Our experiments consist of four parts. Section 4.1 describes the data sets used in the experiments and the experimental method. Section 4.2 shows the numbers of categories predicted by MC modularity and gives some comparison and analysis. Section 4.3 presents the experimental results and analysis on low-dimensional artificial data sets. Section 4.4 evaluates the experimental results on UCI data sets of different dimensions. Together, the experiments test the performance of the algorithm.

4.1 Experimental description

In order to show the clustering performance of MChameleon algorithm, the data sets used in this paper include six classic artificial data sets and eight UCI data sets. The experimental data characteristics are shown in Tables 1 and 2.

Table 1 Characteristics of artificial data sets
Table 2 Characteristics of UCI data sets

Each of the six artificial data sets has its own characteristic shape. The dimensions include not only the commonly used two-dimensional data sets but also two three-dimensional ones, which increases the diversity of the artificial data sets. Figure 4 shows the visualization of the artificial data sets. Among the eight UCI data sets, Iris, Wine, Seeds, Balance Scale and Wireless Indoor Localization are low-dimensional data sets, while Soybean, Sonar and Ionosphere are relatively high-dimensional, small-scale data sets. The clustering ability of the MChameleon algorithm is thus tested on both high- and low-dimensional data sets.

Fig. 4 Visualization of artificial data sets

There are various performance metrics for clustering results, roughly divided into two categories: internal indexes and external indexes. Choosing a clearly defined indicator is crucial for evaluating a clustering result. The evaluation index used in the experiments is the clustering accuracy [29], which measures the proportion of the clustering result that agrees with the actual result. It is defined as follows:

$$ Accuracy=\frac{1}{n_{sample}}\sum \limits_{i=1}^{n_{sample}}\sigma \left({y}_i,{\hat{y}}_i\right) $$
(13)

where nsample is the number of samples in the data set, yi is the true cluster label of the ith sample, and \( {\hat{y}}_i \) is the predicted cluster label, i.e., the label obtained by the clustering algorithm. After label reordering, \( \sigma \left({y}_i,{\hat{y}}_i\right)=1 \) if yi and \( {\hat{y}}_i \) are equal, and \( \sigma \left({y}_i,{\hat{y}}_i\right)=0 \) otherwise. The accuracy ranges over [0,1]; the closer it is to 1, the more correct the result and the better the clustering effect.
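Below is a small sketch of Eq. (13). The label reordering is implemented here with the Hungarian algorithm (scipy's linear_sum_assignment) applied to the contingency matrix; the paper does not state which matching procedure was used, so this particular step is our assumption.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # Map both label sets to 0..K-1, build the contingency matrix, and find the
    # label permutation that maximizes the number of agreements.
    y_true = np.unique(np.asarray(y_true), return_inverse=True)[1]
    y_pred = np.unique(np.asarray(y_pred), return_inverse=True)[1]
    contingency = np.zeros((y_pred.max() + 1, y_true.max() + 1), dtype=int)
    for p, t in zip(y_pred, y_true):
        contingency[p, t] += 1
    row, col = linear_sum_assignment(-contingency)   # maximize total agreement
    return contingency[row, col].sum() / len(y_true)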

For the parameter setting, the experiments need only one input parameter, namely k in the mutual k-nearest neighbors. The value of k generally needs to be small, such as 2, 3 or 4; the smaller k is, the easier it is to form small clusters. α in formula (12) defaults to 2. When running Algorithm 2, the partition with the highest MC modularity is taken by default as the final clustering result for the data set, so the final number of clusters does not need to be specified.

We demonstrate the feasibility of the MChameleon algorithm through comparative experiments. MChameleon is compared with the original Chameleon algorithm [16], the improved AChameleon [19] and some other classic clustering algorithms (K-Means [30], DBSCAN [11], BIRCH [31]) to obtain the accuracy of each algorithm in the best case and compare their differences. Each algorithm was run more than 10 times on each data set, and the run with the best accuracy was taken as the final result.

The experiments are run in PyCharm 2019 on a computer with a Core i5 3.40 GHz processor, 8 GB of RAM and the Windows 10 operating system.

4.2 Categories predicted by MC modularity

This section shows the clustering results objectively identified by MC modularity. Since the clustering result is found automatically by MC modularity, subjective judgment is avoided. Taking the artificial data set Jain as an example, MC modularity is used to find the number of clusters into which Jain should be divided. The result is shown in Fig. 5.

Fig. 5 MC modularity broken line

It can be seen from the relationship between the number of clusters and the MC modularity in the broken line of Fig. 5 that the maximum MC modularity is reached when the number of clusters is 2. Therefore, MC modularity determines that the Jain data set contains 2 clusters. The actual number of classes in this data set is also 2, so the result is consistent, and a preliminary judgment is that determining the clustering result with MC modularity is feasible. For further analysis, the same method is used to find the number of clusters determined by MC modularity for all data sets in this paper. The specific results are shown in Table 3.

Table 3 Categories of MC modularity prediction

Table 3 shows that the difference between the number of categories predicted by MC modularity and the actual number of categories is very small. Among the artificial data sets, the numbers of clusters obtained through MC modularity for 3-Spiral, Jain, Target, Atom and Chainlink match the actual numbers exactly. Among the UCI data sets, the numbers of clusters predicted by MC modularity for Soybean, Iris, Wine, Sonar and Ionosphere also match the actual numbers. Only the predictions for the Compound and Seeds data sets differ from the actual categories, and even then only slightly: Compound actually has six categories while MC modularity predicts five, and the prediction for Seeds also differs from the actual number of clusters by only one. Therefore, MC modularity can be used to identify clustering results. In the later analysis of the algorithm results, the accuracies obtained by this algorithm correspond to the clustering results in Table 3.

4.3 Experimental analysis on artificial data sets

To show that MChameleon is also effective on diverse data sets, we experiment on six artificial data sets with different characteristics. The first data set, 3-Spiral, is composed of three spirals in different directions. The second, Jain, consists of two arcs of different densities curving in opposite directions. The third, Compound, contains six clusters of different shapes, densities and sizes with very small gaps between them. The fourth, Target, is a large hollow circle enclosing a solid circle, with small clusters at the four corners. The fifth, Atom, is composed of two sphere-like spatial clusters with different variances that are not linearly separable. The last, Chainlink, consists of two rings interlocked in space. Figure 6 shows the clustering results on the artificial data sets, with different colors representing different clusters. Table 4 shows the accuracy values obtained by MChameleon and the other clustering algorithms on the artificial data sets.

Fig. 6 MChameleon clustering on six artificial data sets: a MChameleon clustering results of “3-Spiral” data, b MChameleon clustering results of “Jain” data, c MChameleon clustering results of “Compound” data, d MChameleon clustering results of “Target” data, e MChameleon clustering results of “Atom” data, f MChameleon clustering results of “Chainlink” data

Table 4 Clustering accuracy (%) of algorithms on different artificial data sets

The dimensions of these six artificial data sets are relatively low (2D or 3D). The results in Table 4 show that the accuracy of our algorithm is high and much better than that of the Chameleon algorithm, the improved AChameleon algorithm and the other classic algorithms. On the five data sets 3-Spiral, Jain, Target, Atom and Chainlink, the accuracy reaches 1, i.e., the clustering is completely correct. The corresponding clustering results are shown in Fig. 6a, b, d, e, and f, where the color classification is clearly separated. The other algorithms perform well only on individual data sets. On the Compound data set, the clustering accuracy is not as good as on the other five, because the clustering result predicted by MC modularity differs from the actual one; it is worth mentioning that, compared with the improved AChameleon and the K-Means algorithm, MChameleon is still the best on Compound. In general, the proposed algorithm clusters arbitrarily shaped clusters well on lower-dimensional data sets, which demonstrates its superiority.

4.4 Experimental analysis on UCI data sets

Since the artificial data sets used in Section 4.3 are all low-dimensional, the UCI data sets we choose cover both low and high dimensions, which is more illustrative. The UCI data sets used in the experiments all come from the UCI Machine Learning Repository. The higher-dimensional UCI data sets are Soybean, Sonar and Ionosphere. The Soybean data set contains 47 points with 35 attributes and 4 clusters. The Sonar data set contains 208 points with 60 attributes and 2 actual categories. The Ionosphere data set consists of 351 points with 34 attributes and 2 categories. The other five data sets are lower-dimensional: Iris has 4 attributes (i.e., it is four-dimensional), Wine has 13, Seeds has 7, Balance Scale has 4 and Wireless Indoor Localization has 7. The accuracies obtained by MChameleon and the other clustering algorithms on the UCI data sets are shown in Table 5.

Table 5 Clustering accuracy (%) of algorithms on different UCI data sets

As can be seen from Table 5, on the high-dimensional data sets such as Soybean, Sonar and Ionosphere, although the accuracy is not always the best, it is higher than that of the original Chameleon algorithm and the improved AChameleon algorithm; in particular, on the Sonar data set the clustering accuracy of MChameleon is the best. For the Iris, Wine and Balance Scale data sets, the accuracy of MChameleon is the highest and much higher than that of the density-based DBSCAN algorithm. On Wireless Indoor Localization, a data set with a large number of samples, the accuracy is mediocre but still superior to the original Chameleon and AChameleon algorithms. On the Seeds data set, the clustering effect of our algorithm is similar to that of the original Chameleon algorithm, but MChameleon has the advantages of not requiring the partitioning technique and of determining the number of clusters automatically, so MChameleon is the better choice.

In general, the experimental results show that the proposed algorithm not only finds clusters of arbitrary shape and achieves good clustering results on low-dimensional data sets, but also has advantages over the other algorithms on high-dimensional data sets, with satisfactory results on many data sets.

5 Conclusions

The Chameleon algorithm has too many user-specified parameters, and the partitioning algorithm it uses introduces uncertainty, both of which affect the final clustering result. In this paper, a Chameleon algorithm based on mutual k-nearest neighbors, namely the MChameleon algorithm, is proposed to address the problem of too many parameters to a great extent. MChameleon uses the idea of mutual k-nearest neighbors to omit the partitioning step entirely, which avoids the uncertainty of the partitioning algorithm and improves the accuracy to a certain extent. Experiments on classic artificial data sets and real UCI data sets show that this algorithm is superior to the traditional Chameleon algorithm and its improved variant AChameleon: it reduces the input parameters of the original algorithm without decreasing accuracy, and even improves the clustering accuracy on many data sets.

However, the time cost of this algorithm is high when processing large-scale, high-dimensional data, which is an inherent defect of hierarchical clustering. In addition, because MChameleon does not include any handling of noisy data, its clustering effect on noisy data sets is only average, which is also a common problem of hierarchical clustering, so its accuracy needs to be improved further.

Therefore, how to handle noisy data and how to apply the algorithm to massive high-dimensional data require further research.