Abstract
Identifying clusters of arbitrary shape and continuously processing newly arrived data points are two critical challenges in clustering research. This paper proposes a dynamic weight and density peaks clustering algorithm that addresses both issues simultaneously. An online–offline framework is used: micro-clusters are created and maintained in the online phase, and treated as pseudo-points to form the final clusters in the offline phase. In the online phase, when a new data point is merged into the corresponding micro-cluster, a dynamic weight method updates the weight of the micro-cluster according to the distance between the point and the center of the micro-cluster, so as to describe the micro-cluster more accurately. In the offline phase, the density peaks clustering algorithm is improved: natural neighbors are introduced to adaptively obtain the local density of each data point, and the allocation process is revised to reduce the probability of allocation errors. The algorithm is evaluated on several synthetic and real-world datasets using different quality metrics. The experimental results show that the proposed algorithm improves clustering quality in both static and streaming environments.
1 Introduction
The advancement of computer hardware and the continuous development of sensor technology have filled networks with large numbers of high-speed data streams. Owing to the low value density of big data, labeling an entire dataset is an expensive, low-profit task. Therefore, using unsupervised methods to mine valuable information from data streams has gradually become a hotspot in data mining [1]. As an unsupervised data mining technique, clustering aims to partition data points into different clusters according to the similarity between data samples: data in the same cluster should be as similar as possible, and data in different clusters should be as different as possible [2]. Clustering techniques are generally divided into five categories: partition-based clustering [3, 4], density-based clustering [5, 6], hierarchy-based clustering [7, 8], grid-based clustering [9, 10] and model-based clustering [11, 12]. Density-based clustering can handle clusters of arbitrary shape and identify outliers, and it has gradually become a research hotspot [13, 14]. However, these algorithms are static and cannot handle newly arrived data points. As a new branch of clustering research, data stream clustering can incrementally process new data in a streaming environment, and it has gradually been applied in several domains such as medical care and transportation [15,16,17]. In general, data stream clustering algorithms consist of two phases: online and offline. In the online phase, summary information of the data is incrementally maintained, while in the offline phase, the final clusters are generated from the data summary stored in the online phase. Follow-up research has likewise improved these two phases.
(1) In studies on the online phase, there are two main data structures for saving data summary information: grids and micro-clusters. The D-stream algorithm is a representative grid-based data stream clustering algorithm. It maps data to the corresponding grids, updates information such as grid density, and connects adjacent high-density grids to form the final clusters [18]. Amini et al. [19] propose a new grid similarity calculation method to improve the ability of grid clustering to deal with multi-density datasets. However, grid-based algorithms are sensitive to parameters such as the grid granularity, and may consume a lot of space on high-dimensional datasets. Micro-clusters are formed by groups of data points close to each other; therefore, the space complexity of the micro-cluster structure is lower than that of the grid. CluStream first applied micro-clusters to store data summary information [20]. Denstream then further subdivided micro-clusters into potential core micro-clusters (PMC) and outlier micro-clusters (OMC) according to their weights. The data maintained by PMCs are used for macro-clustering in the offline phase, while OMCs temporarily store outliers, which enhances the adaptability to the data stream environment [21]. In addition, when a new data point arrives, it must be merged into a specific micro-cluster and the micro-cluster information adjusted. Some researchers introduce the idea of fuzzy clustering into the data stream: several micro-clusters closest to the new point are selected to calculate membership values, which makes the clustering result more reliable [22, 23]. Other improvements to Denstream raise clustering performance by adding prior knowledge [24] or a review stage [25]. However, the main focus of current algorithms is to which micro-cluster a new data point should be assigned. Once the assignment is completed, every data point carries the same weight, no matter where it falls in the micro-cluster. In other words, the weight of a micro-cluster is related only to the number of data points it contains, so two micro-clusters with different distributions but the same number of points have the same weight. This cannot accurately describe the micro-clusters and can even affect the final clustering results.
(2) In studies on the offline phase, most algorithms use variants of the K-means or DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithms. However, partition-based algorithms cannot accurately handle non-spherical datasets. DBSCAN can cluster arbitrary shapes, but it may not obtain the correct number of clusters and runs slowly on large-scale datasets. The density peaks clustering (DPC) algorithm can quickly find the cluster centers and complete the clustering task, which overcomes these disadvantages [26]. However, DPC also has limitations. First, it does not consider the neighborhood information of the data when calculating the local density. In addition, a "domino effect" may occur during the allocation phase, where one allocation error leads to more errors. Some improved algorithms introduce the k-nearest neighbors (KNN) to consider the neighborhood information of the data. FKNN-DPC first uses a breadth-first search to assign points, then uses the KNN method to assign the remaining points [27]. The shared-nearest-neighbor-based clustering algorithm (SNN-DPC) uses shared neighbors to describe the data [28]. The method based on the density backbone and fuzzy neighborhood (DBFN) uses KNN to assign boundary points to the related density backbone area [29]. However, choosing the optimal value of k in the KNN method is challenging. When k is too small, neighborhood relationships are ignored and data points with the same properties are misclassified into different clusters; when k is too large, data points with different properties are misclassified into the same cluster.
To solve these problems, a dynamic weight and density peaks clustering algorithm for data streams (DWDP-Stream) is proposed. The algorithm improves both the online and offline phases to further raise clustering performance. The main contributions of this paper are summarized as follows:
1. In the online phase, a dynamic weight method is proposed to update the micro-cluster information. The initial weight of a new data point is calculated from the distance between the data point and the center of the corresponding micro-cluster: the closer the distance, the heavier the weight. This method describes the center, weight, radius and other parameters of the micro-cluster more accurately.
2. In the offline phase, an improved density peaks clustering algorithm is proposed to complete the final clustering task. The algorithm first calculates the local density and other corresponding values of each micro-cluster based on the natural neighbor relationship. Since natural neighbors can be obtained from the dataset adaptively, no parameter needs to be set in advance. A two-step allocation strategy is then proposed. The first step spreads each label from the center to its dense natural neighbors to form the cluster prototypes. In the second step, each remaining point joins the cluster of its superior point. Since the cluster prototypes have already been formed, few lookups of superior points are needed, which reduces the probability of propagating errors.
3. The algorithm can recognize clusters of arbitrary shape and continuously process newly arrived data points. It can operate in both static and streaming environments.
The remainder of this paper is organized as follows. Section 2 introduces the relevant concepts of the method. Section 3 details the proposed DWDP-Stream method. In Sect. 4, the experimental results are presented and analyzed. Finally, the conclusions are drawn in Sect. 5.
2 Related Work
This section first introduces the micro-cluster structure, the data structure used to maintain data summary information in the online phase. Since this study proposes an improved DPC algorithm based on natural neighbors for the final clustering task in the offline phase, the DPC algorithm is then briefly reviewed and related work on natural neighbors is introduced.
2.1 Micro-clusters
Micro-clusters are used to store summary information of data in the online phase. The number of micro-clusters is much smaller than the amount of data, which reduces the processing time of the data. The micro-cluster structure proposed by Cao et al. [21] is used. It further divides the micro-clusters into potential core micro-clusters (PMC) and outlier micro-clusters (OMC). At the same time, recent data objects reflect the current data distribution more accurately than historical data. Therefore, the decay function \(f\left(t\right)= {2}^{-\lambda t}\) is used to capture the evolution of the data stream, where t is the current time and λ is the decay factor (\(\lambda >0\)).
Definition 1
(Potential Core Micro-cluster (PMC)) A potential core micro-cluster (PMC) at time t, for a group of close points \({p}_{i1},\dots ,{p}_{in}\) with arrival timestamps \({T}_{i1},\dots ,{T}_{in}\), is defined as \(\mathrm{PMC}\{\overrightarrow{{\mathrm{CF}}_{1}},\overrightarrow{{\mathrm{CF}}_{2}},\omega \}\), where \(\overrightarrow{{\mathrm{CF}}_{1}}={\sum }_{i=1}^{n}f\left(t-{T}_{i}\right){p}_{i}\) represents the decayed linear sum of all the points in the micro-cluster, \(\overrightarrow{{\mathrm{CF}}_{2}}={\sum }_{i=1}^{n}f(t-{T}_{i}){{p}_{i}}^{2}\) represents the decayed sum of squares of all the points, and \(\omega ={\sum }_{i=1}^{n}f(t-{T}_{i}),\ \omega \ge \beta \mu\), represents the weight of the micro-cluster. Note that the weight of a PMC must be greater than or equal to \(\beta \mu\), where \(\beta\) is the outlier threshold and \(\mu\) is the minimum weight of a core micro-cluster; both parameters are defined by the user. The center c of the PMC is expressed as:
$$c = \frac{\overrightarrow{{\mathrm{CF}}_{1}}}{\omega }$$(1)
The radius r of the PMC is given by:
$$r = \sqrt{\frac{\left|\overrightarrow{{\mathrm{CF}}_{2}}\right|}{\omega }-{\left(\frac{\left|\overrightarrow{{\mathrm{CF}}_{1}}\right|}{\omega }\right)}^{2}}$$(2)
where parameter ϵ is the radius threshold, and the radius of the micro-cluster should be less than or equal to ϵ.
Definition 2
(Outlier Micro-cluster (OMC)) An outlier micro-cluster (OMC) at time t, for a group of close points \({p}_{i1},\dots ,{p}_{in}\) with arrival timestamps \({T}_{i1},\dots ,{T}_{in}\), is defined as \(\mathrm{OMC}\left\{\overrightarrow{{\mathrm{CF}}_{1}},\overrightarrow{{\mathrm{CF}}_{2}},\omega \right\}\), where \(\overrightarrow{{\mathrm{CF}}_{1}},\overrightarrow{{\mathrm{CF}}_{2}},\omega ,c,r\) are defined as for a PMC. The difference is that the weight of an OMC is lower than βμ.
As time progresses and new data points arrive, the role of a micro-cluster may change. An OMC can grow into a PMC if its weight exceeds the threshold after a new point is added. Conversely, if a PMC receives no new points for a long time and its weight drops below the specified threshold, it degenerates into an OMC.
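To make the maintenance of these statistics concrete, below is a minimal Python sketch of a decayed micro-cluster; it is an illustration under the definitions above, not the authors' implementation. It assumes numpy and derives the center and radius of Eqs. (1) and (2) from the stored \(\overrightarrow{{\mathrm{CF}}_{1}}\), \(\overrightarrow{{\mathrm{CF}}_{2}}\) and ω.

```python
import numpy as np

class MicroCluster:
    """Minimal sketch of a decayed micro-cluster {CF1, CF2, omega}."""

    def __init__(self, point, t, lam):
        self.lam = lam                                 # decay factor lambda > 0
        self.t_last = t                                # time of the last update
        p = np.asarray(point, dtype=float)
        self.cf1 = p.copy()                            # decayed linear sum
        self.cf2 = p ** 2                              # decayed sum of squares
        self.weight = 1.0                              # decayed weight omega

    def decay(self, t):
        """Apply f(t) = 2^(-lambda * dt) to all statistics since the last update."""
        factor = 2.0 ** (-self.lam * (t - self.t_last))
        self.cf1 *= factor
        self.cf2 *= factor
        self.weight *= factor
        self.t_last = t

    def insert(self, point, t, w=1.0):
        """Decay to time t, then absorb a point carrying weight w."""
        self.decay(t)
        p = np.asarray(point, dtype=float)
        self.cf1 += w * p
        self.cf2 += w * p ** 2
        self.weight += w

    def center(self):
        return self.cf1 / self.weight                  # Eq. (1)

    def radius(self):
        # Eq. (2); clip to zero to guard against floating-point noise
        var = self.cf2 / self.weight - (self.cf1 / self.weight) ** 2
        return float(np.sqrt(np.clip(var, 0.0, None).sum()))
```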
2.2 Density Peak Clustering (DPC)
DPC is based on two premises: (1) the local density of a center point is greater than that of its surrounding points, and (2) different center points are relatively far from each other. DPC describes each point i using its local density ρ and the distance δ to its nearest point of higher density:
$$\rho_{i} = \sum\limits_{j = 1, j \ne i}^{n} \chi (d_{ij} - d_{c}),\quad \chi (x) = \left\{\begin{array}{ll} 1, & x < 0 \\ 0, & x \ge 0 \end{array}\right.$$(3)
where n is the data size, \({d}_{ij}\) is the Euclidean distance between points i and j, and \({d}_{c}\) is the cutoff distance set by the user; therefore, the local density of point i is the number of points whose distance from it is less than \({d}_{c}\). The distance δ is computed as:
$$\delta_{i} = \left\{\begin{array}{ll} \min\limits_{j:\rho_{j} > \rho_{i}} d_{ij}, & \text{if } \exists j \text{ with } \rho_{j} > \rho_{i} \\ \max\limits_{j} d_{ij}, & \text{otherwise} \end{array}\right.$$(4)
In other words, the algorithm finds the data point with greater local density that is closest to data point i; this point is referred to as the superior point. The δ value of the densest point is its distance from the farthest point in space. DPC draws a decision graph with ρ on the X-axis and δ on the Y-axis to rapidly identify the centers: center points with large ρ and δ values lie in the upper-right part of the decision graph. DPC also uses the γ-decision graph, where γ is defined as:
$$\gamma_{i} = \rho_{i} \times \delta_{i}$$(5)
When sorted in decreasing order, the γ values decay roughly exponentially, which makes it easier to select the centers.
After the selection, the non-central point is assigned to its superior point’s cluster. If the superior point does not yet belong to a cluster, the algorithm finds the superior of the superior point until the clustering is completed.
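As an illustration of these quantities, the sketch below computes ρ, δ, γ and the superior point of every point with the cutoff kernel of Eqs. (3)–(5). The function name and the brute-force pairwise distances are illustrative choices, not the original implementation; the sketch assumes a small (n, d) numpy array.

```python
import numpy as np

def dpc_quantities(X, d_c):
    """Compute cutoff density rho, distance delta to the nearest
    higher-density point, gamma = rho * delta, and the superior point."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    rho = (d < d_c).sum(axis=1) - 1          # Eq. (3); "-1" removes the point itself
    delta = np.empty(n)
    superior = np.full(n, -1)                # superior point index, -1 for the densest
    order = np.argsort(-rho)                 # decreasing density (ties broken arbitrarily)
    delta[order[0]] = d[order[0]].max()      # densest point: farthest point in space
    for pos in range(1, n):
        i = order[pos]
        higher = order[:pos]                 # points treated as having higher density
        j = higher[np.argmin(d[i, higher])]  # Eq. (4): nearest higher-density point
        delta[i], superior[i] = d[i, j], j
    return rho, delta, rho * delta, superior # gamma of Eq. (5)
```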
2.3 Natural Neighbor
Natural neighbors are a kind of neighbor relationship obtained adaptively by continuously expanding the neighborhood range, which effectively solves the problem of neighborhood selection [30]. Natural neighbor relationships are mutual: if every point in a dataset, except for noise points, has at least one natural neighbor, the dataset is in a stable state. Some related concepts of natural neighbors are explained below.
Definition 3
(K Nearest Neighbor) Assuming that the dataset is X and the point to be processed is p, the K nearest neighbors (KNN) of p form the set of objects q defined as:
$$KNN(p) = \{ q \in X \mid dist(p,q) \le dist(p,u)\}$$(6)
where \(dist(p,u)\) represents the distance between point p and its Kth nearest point u, i.e., the farthest point within the set.
Definition 4
(Reverse K Nearest Neighbor) The reverse K nearest neighbors (RKNN) of p form the set of objects q that treat p as one of their KNNs:
$$RKNN(p) = \{ q \in X \mid p \in KNN(q)\}$$(7)
Definition 5
(Natural Neighbor) Points p and q are natural neighbors of each other if each belongs to the other's K nearest neighbors:
$$q \in NaN(p) \Leftrightarrow (q \in KNN(p)) \wedge (p \in KNN(q))$$(8)
Definition 6
(Steady State and Natural Eigenvalue) A natural neighbor search is a process of gradually expanding the search range until the stable state is reached. The steady state is defined by:
$$\forall p\ \exists q\ (q \ne p):\ q \in KNN_{k}(p) \wedge p \in KNN_{k}(q)$$(9)
where p ranges over the data points of dataset X excluding noise, and the search range of KNN is \(k\in [1,n]\). When the dataset reaches the stable state, the corresponding value of k is referred to as the natural eigenvalue, denoted by \(\kappa\):
$$\kappa = \min \{ k \mid \forall p\ \exists q\ (q \ne p):\ q \in KNN_{k}(p) \wedge p \in KNN_{k}(q)\}$$(10)
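The search described in this section can be sketched as follows. This is one plausible rendering of the procedure, not necessarily identical to the paper's Algorithm 3; it relies on scikit-learn's NearestNeighbors, and the parameter max_repeat stands in for the "specified number" of unchanged rounds.

```python
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X, max_repeat=2):
    """Expand k until the count of points without natural neighbors
    stops changing; return the neighbor sets and the eigenvalue kappa."""
    n = len(X)
    index = NearestNeighbors().fit(X)
    nan_sets = [set() for _ in range(n)]
    prev_cnt, repeat, k = n, 0, 1
    while True:
        # k + 1 neighbors because each point's first neighbor is itself
        _, knn = index.kneighbors(X, n_neighbors=k + 1)
        for i in range(n):
            for j in knn[i][1:]:
                j = int(j)
                if i in knn[j][1:]:               # mutual KNN: natural neighbor pair
                    nan_sets[i].add(j)
                    nan_sets[j].add(i)
        cnt = sum(1 for s in nan_sets if not s)   # points still without neighbors
        repeat = repeat + 1 if cnt == prev_cnt else 0
        if cnt == 0 or repeat >= max_repeat:
            return nan_sets, k                    # k is the natural eigenvalue kappa
        prev_cnt, k = cnt, k + 1
```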
3 The Proposed Method
To incrementally process data and cluster arbitrarily shaped datasets, a dynamic weight and density peaks clustering algorithm for data streams (DWDP-Stream) is proposed. The algorithm builds on the two-phase framework; Fig. 1 shows its structure. The online phase is divided into a merge component and a prune component. A dynamic weight method is proposed to update the information of the micro-clusters in the merge component, and the pruning strategy is also improved. In the offline phase, an improved density peaks clustering algorithm is proposed to accurately complete the clustering task. Table 1 lists the notations used in the proposed algorithm.
3.1 Proposed Algorithms for the Online Phase
The online phase consists of two components: merge and prune. The merge component merges newly arrived data points into micro-clusters and updates their information. The prune component periodically removes micro-clusters with low weights to bound memory usage.
3.1.1 Merge Component
In the online phase, PMCs and OMCs store the summary information of the stream data. When a new data point arrives, the algorithm first tries to add it to the nearest PMC. If the radius of the micro-cluster after insertion does not exceed the specified threshold (ϵ), the micro-cluster information is updated according to the distance between the data point and the center of the corresponding micro-cluster. If this insertion fails, the algorithm tries to merge the point into the nearest existing OMC and update that OMC's information. Otherwise, a new OMC is created for the data point. Algorithm 1 explains the flow of the merge component in detail; a short sketch of this merge logic follows the step list below. When a new data object p arrives, the following steps are performed:
(1) The PMC closest to the new data point p is found first, where closeness is measured by the Euclidean distance between p and the center of the PMC. The nearest PMC is denoted by PMCp.
(2) If the radius of PMCp after insertion is less than or equal to the specified radius threshold (ϵ), p is merged into it. The initial weight of point p is calculated from the distance between the point and the center of the micro-cluster: the closer the distance, the larger the weight. The maximum weight is 1 and the minimum weight is \({e}^{(\frac{1}{\epsilon +1}-1)}\). Because this calculation considers the location of the data point, it describes the micro-cluster more accurately:
$$weight_{p} = e^{{(\frac{1}{{dist\left( {p,center} \right) + 1}}\; - \;1)}}$$(11)
Since micro-clusters process data incrementally, the weight of the micro-cluster is increased by the weight of the inserted point:
$$\omega = \omega + weight_{p}$$(12)
At the same time, the center and radius of the micro-cluster are updated using Eqs. (1) and (2).
(3) If the insertion is unsuccessful, the OMC closest to the new data point p is found and denoted by OMCp.
(4) If the radius of OMCp after insertion is less than or equal to the specified radius threshold (ϵ), p is merged into it, and the weight, radius and center of OMCp are updated with the same operations as for a PMC insertion.
(5) The new weight of the merged OMCp is then checked. If the weight is greater than βμ, OMCp is removed from the outlier list and becomes a new PMC.
(6) Otherwise, a new OMC is created for data point p.
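The following sketch ties these steps together with the dynamic weight of Eq. (11). It reuses the MicroCluster sketch from Sect. 2.1; the container names pmcs and omcs and the trial-insertion strategy are illustrative assumptions, not the original implementation.

```python
import copy
import math
import numpy as np

def merge_point(p, t, pmcs, omcs, eps, beta_mu, lam):
    """Steps (1)-(6): try the nearest PMC, then the nearest OMC,
    otherwise open a new OMC for the point."""
    p = np.asarray(p, dtype=float)

    def nearest(clusters):
        return min(clusters, key=lambda mc: np.linalg.norm(p - mc.center()),
                   default=None)

    def try_insert(mc):
        w = math.exp(1.0 / (np.linalg.norm(p - mc.center()) + 1.0) - 1.0)  # Eq. (11)
        trial = copy.deepcopy(mc)
        trial.insert(p, t, w)
        if trial.radius() <= eps:            # radius check after insertion
            mc.insert(p, t, w)               # commit, updating omega, c and r
            return True
        return False

    pmc = nearest(pmcs)
    if pmc is not None and try_insert(pmc):  # steps (1)-(2)
        return
    omc = nearest(omcs)
    if omc is not None and try_insert(omc):  # steps (3)-(4)
        if omc.weight > beta_mu:             # step (5): promote OMC to PMC
            omcs.remove(omc)
            pmcs.append(omc)
        return
    omcs.append(MicroCluster(p, t, lam))     # step (6): new OMC
```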
Taking the Aggregation dataset as an example, the micro-cluster distributions and final clustering results obtained with static weighting (where the initial weight of every point is 1) and with dynamic weighting are compared. Figure 2a, b shows the micro-cluster distribution and the final clustering result obtained using the static weight method. The static weight method does not separate some intersecting clusters accurately, and there are allocation errors in the final clustering results; that is, static weights cannot accurately describe the micro-cluster information, which affects the final clustering results. Figure 2c, d shows the results obtained using dynamic weights. The comparison shows that micro-clusters formed with dynamic weights better summarize the original dataset and achieve accurate clustering.
3.1.2 Prune Component
The difference between streaming data and static data is that concept drift may occur in the stream: over time, original data may turn into noise, and some noise may develop into clusters. For an existing PMC, if no new points are merged into it, its weight gradually decays over time and may even fall below βµ, so that the micro-cluster is no longer a PMC. For an OMC, as new data points are merged, it has the potential to develop into a PMC. Thus, the weight of each micro-cluster should be checked periodically, every \({T}_{p}\) time units, where \({T}_{p}\) is defined as the shortest time for a PMC to degenerate into an OMC:
$$T_{p} = \left\lceil \frac{1}{\lambda }{\mathrm{log}}_{2}\left(\frac{\beta \mu }{\beta \mu - {e}^{(\frac{1}{\epsilon +1}-1)}}\right)\right\rceil$$(13)
This is determined by \({2}^{-\lambda {T}_{p}}\beta \mu +{e}^{(\frac{1}{\epsilon +1}-1)}=\beta \mu\), and this definition allows any degradation of a PMC to be detected directly. Denstream's pruning operation is also improved. In Denstream, OMCs and PMCs below their corresponding thresholds are removed directly. Directly deleting OMCs is reasonable: a lower weight limit \(\xi\) is specified, and any OMC below this limit is deleted:
$$\xi (t,{t}_{0}) = \frac{{2}^{-\lambda (t-{t}_{0}+{T}_{p})}-1}{{2}^{-\lambda {T}_{p}}-1}$$(14)
This is a function of \(t\) (the current time) and \({t}_{0}\) (the creation time of the OMC). When \(t = {t}_{0}\), i.e., at the creation time of the OMC, \(\xi = 1\); as time elapses, \(\xi\) increases. That is, the longer an OMC exists, the larger its weight is expected to be. However, a PMC that has received no data for a period of time has its weight only temporarily lowered below the threshold, and it should not be deleted directly. Such a PMC is temporarily degraded to an OMC, and it is deleted only if it is still an OMC at the next check. This ensures that newly added data can re-upgrade it to a PMC at any time during the subsequent period. Algorithm 2 gives an overall description of the online phase, including the prune component.
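The two quantities of Eqs. (13) and (14) can be sketched as small helpers; both follow directly from the stated conditions, with the function names chosen here purely for illustration.

```python
import math

def prune_interval(lam, beta, mu, eps):
    """T_p of Eq. (13), solved from 2^(-lam*T_p)*beta*mu + w_min = beta*mu,
    where w_min = e^(1/(eps+1) - 1) is the smallest initial point weight."""
    w_min = math.exp(1.0 / (eps + 1.0) - 1.0)
    return math.ceil(math.log2(beta * mu / (beta * mu - w_min)) / lam)

def omc_weight_limit(t, t0, lam, T_p):
    """Lower weight limit xi of Eq. (14) for an OMC created at time t0;
    it equals 1 at t = t0 and grows as the OMC ages."""
    return (2.0 ** (-lam * (t - t0 + T_p)) - 1.0) / (2.0 ** (-lam * T_p) - 1.0)
```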
3.2 Proposed Algorithms for the Offline Phase
In the offline phase, the PMC information maintained in the online phase is processed to obtain the final clustering result. Before processing, each PMC is treated as a pseudo-point whose coordinate is the center of the micro-cluster and whose weight is the weight of the micro-cluster. An improved density peaks clustering algorithm is proposed to obtain the final clustering results. The method introduces natural neighbors so that the local density of data points can be obtained adaptively, thereby reducing the number of input parameters. At the same time, a two-step allocation strategy is proposed to deal with the "domino effect" that often occurs in DPC-style algorithms, where one incorrectly assigned point may lead to a series of errors. The first step assigns labels to the dense points with larger local densities, forming the prototype structure of each cluster. The second step assigns each remaining point to the cluster of its superior point. Algorithm 4 introduces the steps of the offline phase; a sketch of the allocation steps follows the list below.
(1) Run the natural neighbor search algorithm on the PMCs. Algorithm 3 describes the search process, where \({NaN}_{Edge}\) contains all natural neighbor pairs in the dataset; if A and B are natural neighbors, they are stored as {A, B}. The count function counts the points that have no natural neighbors and assigns this value to cnt. Once all points except noise have found their natural neighbors, the number of points without neighbors remains unchanged even when the search scope is extended. The repeat function counts how many times cnt remains unchanged; when this count exceeds the specified number, the algorithm ends, the natural neighbors of each point have been obtained, and the final search range is taken as the natural eigenvalue.
(2) Calculate the local density. The local density of each point is the weighted sum of its own weight and the weights of its natural neighbors:
$$\rho_{i} = \omega_{i} + \sum\limits_{j \in nn(i)} \frac{1}{(d(i,j) + 1)^{2}} \times \omega_{j} \quad (\omega_{j} < \omega_{i})$$(15)
where nn(i) represents the natural neighbors of point i; only neighbors whose weights are lower than that of point i enter the sum. The more neighbors a point has and the closer they are, the larger its local density, and vice versa.
(3) For each point, find the nearest point with higher local density. The distance between these two points is the δ value of the point, calculated using Eq. (4).
(4) After obtaining ρ and δ for each data point, Eq. (5) is used to calculate γ. The c points with the highest γ values are then taken as the centers.
(5) In the first step of the allocation strategy, the dense points are selected. The global natural eigenvalue \(\kappa\) is obtained using Algorithm 3, and the points whose number of natural neighbors is at least a given fraction of the natural eigenvalue are considered dense:
$$Dense = \{ x_{i} \in X|nn(i) \ge \alpha \times \kappa \}$$(16)
These dense points are assigned to form the cluster prototypes. The label of each center point is first assigned to the dense points among its natural neighbors; labels are then propagated to the dense points among those points' natural neighbors, and the process repeats until all dense points are allocated.
(6) In the second step of the allocation strategy, the idea of DPC is used: each remaining point receives the label of its superior point. Since the cluster prototypes have already been formed, the superior point can be found without many iterations, which efficiently reduces the probability of the "domino effect".
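A sketch of the two allocation steps (5) and (6) is given below. It assumes the outputs of the earlier DPC and natural neighbor sketches (rho, superior, nan_sets, kappa), and centers_idx holds the indices of the c points with the largest γ; it is an illustration of the strategy, not the paper's Algorithm 4.

```python
import collections
import numpy as np

def two_step_allocation(centers_idx, nan_sets, kappa, rho, superior, alpha):
    """Step 1: spread center labels over dense natural neighbors (BFS).
    Step 2: remaining points inherit the label of their superior point."""
    n = len(rho)
    labels = np.full(n, -1)
    dense = {i for i in range(n) if len(nan_sets[i]) >= alpha * kappa}  # Eq. (16)
    queue = collections.deque()
    for c_label, c in enumerate(centers_idx):
        labels[c] = c_label
        queue.append(c)
    while queue:                              # breadth-first label spreading
        i = queue.popleft()
        for j in nan_sets[i]:
            if j in dense and labels[j] == -1:
                labels[j] = labels[i]
                queue.append(j)
    # decreasing-density order guarantees the superior point is labeled first
    for i in np.argsort(-rho):
        if labels[i] == -1 and superior[i] >= 0:
            labels[i] = labels[superior[i]]
    return labels
```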
4 Experimental Results and Analysis
4.1 Dataset and Experiment Description
Several synthetic datasets and real UCI datasets are used to compare the algorithms in static and streaming environments. Table 2 presents the datasets used in the static environment. In the static environment, the entire dataset is loaded directly into memory without decay, and the results are compared with the original K-Means, DBSCAN and DPC algorithms and with improved DPC algorithms (FKNN-DPC [27], SNN-DPC [28], DBFN [29]). In the streaming environment, the KDDCUP99 network intrusion dataset and the Covertype dataset, which are often used to test stream clustering algorithms, serve as benchmarks. For the KDDCUP99 dataset, which contains 41 attributes, the 34 numerical attributes are processed. Each record in the Covertype dataset describes a defined forest area, and all attributes are numeric. The Denstream [21], D-stream [18] and MuDi-stream [19] algorithms are used for comparison; the stream is processed in groups of 1000 data points, and the clustering results are evaluated on each group. All experiments are performed on an Intel(R) i5-8500 machine with 16 GB RAM running 64-bit Windows 10.
4.2 Evaluation Metrics
The evaluation metrics used in the static environment are the clustering accuracy (Acc), normalized mutual information (NMI) and adjusted Rand index (ARI). In the streaming environment, the Rand index (RI) and purity are used.
The clustering accuracy is expressed as:
$$Acc = \frac{\sum_{i=1}^{n}\delta (s_{i}, map(r_{i}))}{n}$$(17)
where n represents the data size, \(r_{i}\) and \(s_{i}\), respectively, represent the resulting label and the ground-truth label of point i, and \(\delta (x,y)\) equals 1 if \(x=y\) and 0 otherwise. The resulting labels are mapped onto the ground truth by the map function, which is computed with the Hungarian algorithm.
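As a concrete illustration of this mapping, the hypothetical helper below computes Acc with SciPy's linear_sum_assignment (an implementation of the Hungarian algorithm); the contingency-table construction is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Acc: best one-to-one mapping of predicted labels onto ground-truth
    labels (Hungarian algorithm), then the fraction of matched points."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    # cont[i, j] = number of points with predicted label i and true label j
    cont = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for p, s in zip(y_pred, y_true):
        cont[np.searchsorted(pred_ids, p), np.searchsorted(true_ids, s)] += 1
    row, col = linear_sum_assignment(-cont)   # maximize the matched count
    return cont[row, col].sum() / len(y_true)
```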
Normalized mutual information (NMI) is an information-theoretic measure used to evaluate the similarity of two clusterings. The normalized mutual information \(\mathrm{NMI}\left(A,B\right)\) is computed as:
$$NMI(A,B) = \frac{-2\sum_{i=1}^{C_{A}}\sum_{j=1}^{C_{B}} C_{ij}\log \left(\frac{C_{ij}\cdot n}{C_{i}C_{j}}\right)}{\sum_{i=1}^{C_{A}} C_{i}\log \frac{C_{i}}{n} + \sum_{j=1}^{C_{B}} C_{j}\log \frac{C_{j}}{n}}$$(18)
where \({C}_{A}\) and \({C}_{B}\), respectively, denote the numbers of clusters in clusterings A and B, \({C}_{i}\) is the number of data points in cluster i of A, \({C}_{j}\) is the number of data points in cluster j of B, and \({C}_{ij}\) is the number of data points common to clusters i and j.
The Rand index indicates the ratio of correct allocations:
$$RI = \frac{TP + TN}{n(n-1)/2}$$(19)
where TP is the number of data pairs that belong to the same cluster in both clusterings A and B, and TN is the number of data pairs that belong to different clusters in both A and B.
The adjusted Rand index (ARI) is an improvement of RI:
$$ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i}}{2}\sum_{j}\binom{n_{j}}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i}}{2}+\sum_{j}\binom{n_{j}}{2}\right] - \left[\sum_{i}\binom{n_{i}}{2}\sum_{j}\binom{n_{j}}{2}\right]/\binom{n}{2}}$$(20)
where \({n}_{i}\) is the number of data points in cluster i of A, \({n}_{j}\) is the number of data points in cluster j of B, and \({n}_{ij}\) is the number of data points common to clusters i and j.
The purity of a cluster is the count of its most frequent class divided by the number of data points in that cluster, averaged over all clusters:
$$Purity = \frac{1}{k}\sum_{i=1}^{k}\frac{\max_{j}(n_{ij})}{n_{i}}$$(21)
where k represents the number of clusters, \({n}_{i}\) is the number of points in cluster i, and \({n}_{ij}\) is the number of points belonging to both cluster i and class j.
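A short sketch of this per-cluster purity, averaged over the k clusters as in Eq. (21); the helper name is an illustrative assumption.

```python
import numpy as np

def purity(y_true, y_pred):
    """Average over clusters of the share of each cluster's most
    frequent ground-truth class, following Eq. (21)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        _, counts = np.unique(members, return_counts=True)
        scores.append(counts.max() / len(members))   # max_j n_ij / n_i
    return float(np.mean(scores))
```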
4.3 Experimental Results
4.3.1 Clustering Results in the Static Environment
Five synthetic datasets and four UCI datasets are used to test the performance of the proposed algorithm in the static environment. These datasets have different distributions and sizes, so different situations can be simulated. The Acc, NMI and ARI of the proposed and comparison algorithms are shown in Tables 3, 4 and 5, respectively. The proposed algorithm outperforms the other methods on the Aggregation, R15, D31 and Iris datasets. On Aggregation there is no allocation error at all, and all metrics reach 1. On the 4k2-far dataset, every algorithm except DPC completes the clustering task accurately. The proposed algorithm has the highest Acc on the Libras Movement dataset and the highest ARI on the E-coli dataset. On the remaining datasets, although some of its metrics are not the highest, the proposed algorithm is not far behind. The mean and variance of Acc are computed for each algorithm, and the results are presented in Table 6: the proposed algorithm has the highest average accuracy and the smallest variance, making it the most stable method.
The Aggregation dataset is chosen to visually illustrate the clustering results of the different algorithms; the final results are shown in Fig. 3. As Fig. 3a shows, the proposed algorithm assigns the correct label to every point and achieves the best performance. Figure 3b–g shows the results of the comparison algorithms. K-Means incorrectly divides one large cluster into three. DBSCAN cannot predict the correct number of clusters. The DPC, FKNN-DPC, SNN-DPC and DBFN algorithms accomplish the basic clustering task, but some allocation errors still affect their performance; for example, in Fig. 3f, the three clusters in the lower-left corner contain assignment errors.
4.3.2 Clustering Results in the Streaming Environment
The KDDCUP99 network intrusion dataset and the Covertype dataset are used to test the performance of the algorithm in the streaming environment. The first 1000 data points of each dataset are used to form the initial micro-clusters without performing a clustering task. Starting from the 1001st data point, 1000 samples are read from the stream at a time, the current cluster set is updated with these new points, and this process is repeated continuously.
81,000 data points are selected from the KDDCUP99 dataset for evaluation. Figures 4 and 5, respectively, show the changes in clustering purity and RI. As Fig. 4 shows, the purity of the Denstream, D-stream and MuDi-stream algorithms fluctuates repeatedly as data keep arriving; in particular, when processing the 52nd group of data, the purity of each algorithm decreases to varying degrees, with Denstream dropping as low as 0.81. By comparison, the purity of the proposed algorithm is almost always 1, dipping only on groups 6 and 38, to 0.98 and 0.996, respectively, far exceeding the clustering performance of the other algorithms. Figure 5 shows that the proposed method also outperforms the other algorithms in most processing units. Even at its lowest, its RI of 0.4787 exceeds the 0.4780 of the MuDi-stream algorithm, and is much higher than the 0.2682 of the Denstream algorithm and the 0.3873 of the D-stream algorithm. The average purity and average RI of each algorithm are also computed. Table 7 and Fig. 6 show that the proposed algorithm improves both purity and RI: its average purity is very close to 1, its RI is 8% higher than that of the D-stream algorithm, and the largest improvement, over the MuDi-stream algorithm, is 26%.
31,000 data points are selected from the Covertype dataset for evaluation. Figures 7 and 8, respectively, show the purity and RI of the different algorithms. As Fig. 7 shows, the purity of the proposed algorithm stays at 1 after the 17,000th data point. In the early stage of the stream, the purity of the proposed method is lower than that of the MuDi-stream algorithm but higher than those of the Denstream and D-stream algorithms; none of the comparison algorithms performs well in this early stage either, and the same phenomenon appears in Fig. 8. Analysis of the dataset suggests that the cause is that all seven classes appear early in the dataset and are widely scattered in space and time, which hampers the algorithms' judgment of the number of clusters and the clustering results. The average purity and average RI are also computed; Table 8 and Fig. 9 show that the proposed algorithm has the best overall performance.
5 Conclusions
This paper proposes a dynamic weight and density peaks clustering algorithm for data streams. In the online phase, a dynamic weight calculation method is proposed: when a new data point is merged into the corresponding micro-cluster, the algorithm updates the micro-cluster weight based on the distance between the point and the center of the micro-cluster, which helps describe the micro-cluster more accurately. In the offline phase, to deal with clusters of arbitrary shape and quickly find the cluster centers, a novel density peaks clustering algorithm based on natural neighbors is proposed. It adaptively obtains information such as the local density of the data, which reduces the number of input parameters. A two-step allocation method is then proposed to reduce the probability of the "domino effect". The clustering performance of the algorithm is tested in static and streaming environments, and the experimental results show that it efficiently improves clustering quality.
Data Availability
All datasets are available, and the UCI repository provides real datasets generated and/or analyzed during the current study (D. Dua, C. Graff, UCI machine learning repository. http://archive.ics.uci.edu/ml).
Abbreviations
PMC: Potential core micro-cluster
OMC: Outlier micro-cluster
DBSCAN: Density-based spatial clustering of applications with noise
DPC: Density peaks clustering
KNN: K nearest neighbor
DWDP-Stream: Dynamic weight and density peaks clustering algorithm for data streams
UCI: University of California, Irvine
Acc: Clustering accuracy
NMI: Normalized mutual information
RI: Rand index
ARI: Adjusted Rand index
References
Wang, Y., Li, J., Yang, B., et al.: Stream-data-clustering based adaptive alarm threshold setting approaches for industrial processes with multiple operating conditions. ISA Trans. 2, 2 (2022)
Stephanie, V., Chamikara, M.A.P., Khalil, I., et al.: Privacy-preserving location data stream clustering on mobile edge computing and cloud. Inf. Syst. 2, 101728 (2021)
Zhu, E., Ma, R.: An effective partitional clustering algorithm based on new clustering validity index. Appl. Soft Comput. 71, 608–621 (2018)
Lu, Y., Cheung, Y.M., Tang, Y.Y.: Self-adaptive multiprototype-based competitive learning approach: a k-means-type algorithm for imbalanced data clustering. IEEE Trans. Cybern. 51(3), 1598–1612 (2019)
Hu, L., Liu, H., Zhang, J., et al.: KR-DBSCAN: a density-based clustering algorithm based on reverse nearest neighbor and influence space. Expert Syst. Appl. 186, 115763 (2021)
Neto, A.C.A., Sander, J., Campello, R., et al.: Efficient computation and visualization of multiple density-based clustering hierarchies. IEEE Trans. Knowl. Data Eng. 33(8), 3075–3089 (2021)
Komaru, Y., Yoshida, T., Hamasaki, Y., et al.: Hierarchical clustering analysis for predicting 1-year mortality after starting hemodialysis. Kidney Int. Rep. 5(8), 1188–1195 (2020)
Cai, Q., Liu, J.: Hierarchical clustering of bipartite networks based on multi-objective optimization. IEEE Trans. Netw. Sci. Eng. 7(1), 421–434 (2018)
Cheng, M., Ma, T., Ma, L., et al.: Adaptive grid-based forest-like clustering algorithm. Neurocomputing 481, 168–181 (2022)
Tareq, M., Sundararajan, E.A., Mohd, M., et al.: Online clustering of evolving data streams using a density grid-based method. IEEE Access 8, 166472–166490 (2020)
Ma, Q., Li, S., Zhuang, W., et al.: Self-supervised time series clustering with model-based dynamics. IEEE Trans. Neural Netw. Learn. Syst. 32(9), 3942–3955 (2020)
Zhu, J., Guo, R., Li, Z., et al.: Registration of multi-view point sets under the perspective of expectation-maximization. IEEE Trans. Image Process. 29, 9176–9189 (2020)
Chen, Y., Zhou, L., Pei, S., et al.: KNN-BLOCK DBSCAN: fast clustering for large-scale data. IEEE Trans. Syst. Man Cybern. Syst. 51(6), 3939–3953 (2019)
Zhang, R., Du, T., Qu, S., et al.: Adaptive density-based clustering algorithm with shared KNN conflict game. Inf. Sci. 565, 344–369 (2021)
Bagozi, A., Bianchini, D., De Antonellis, V.: Multi-level and relevance-based parallel clustering of massive data streams in smart manufacturing. Inf. Sci. 577, 805–823 (2021)
Al-Shammari, A., Zhou, R., Naseriparsaa, M., et al.: An effective density-based clustering and dynamic maintenance framework for evolving medical data streams. Int. J. Med. Inf. 126, 176–186 (2019)
Chaolong, J., Hanning, W., Lili, W.: Study of smart transportation data center virtualization based on vmware vsphere and parallel continuous query algorithm over massive data streams. Proc. Eng. 137, 719–728 (2016)
Chen, Y., Tu, L.: Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 133–142 (2007)
Amini, A., Saboohi, H., Herawan, T., et al.: MuDi-stream: a multi density clustering algorithm for evolving data stream. J. Netw. Comput. Appl. 59, 370–385 (2016)
Aggarwal, C.C., Philip, S.Y., Han, J., et al.: A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference. Morgan Kaufmann, pp 81–92 (2003)
Cao, F., Estert, M., Qian, W., et al.: Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining. Society for industrial and applied mathematics, pp 328–339 (2006)
Gao, T., Li, A., Meng, F.: Research on data stream clustering based on FCM algorithm1. Proc. Comput. Sci. 122, 595–602 (2017)
Laohakiat, S., Sa-Ing, V.: An incremental density-based clustering framework using fuzzy local clustering. Inf. Sci. 547, 404–426 (2021)
Ruiz, C., Menasalvas, E., Spiliopoulou, M.: C-denstream: using domain knowledge on a data stream. In: International Conference on Discovery Science. Springer, Berlin, Heidelberg, pp 287–301 (2009)
Liu, L., Guo, Y., Kang, J., et al.: A three-step clustering algorithm over an evolving data stream. In: 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems. IEEE, 1: 160–164 (2009)
Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)
Xie, J., Gao, H., Xie, W., et al.: Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 354, 19–40 (2016)
Liu, R., Wang, H., Yu, X.: Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 450, 200–226 (2018)
Lotfi, A., Moradi, P., Beigy, H.: Density peaks clustering based on density backbone and fuzzy neighborhood. Pattern Recogn. 107, 107449 (2020)
Zhu, Q., Feng, J., Huang, J.: Natural neighbor: a self-adaptive neighborhood method without parameter K. Pattern Recogn. Lett. 80, 30–36 (2016)
Acknowledgements
We thank openbiox community and Hiplot team (https://hiplot.com.cn) for providing technical assistance and valuable tools for data analysis and visualization.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants No. 62273164, No. 61873324 and No. 61903156, the Natural Science Foundation of Shandong Province under Grant ZR2019MF040, and the Higher Educational Science and Technology Program of Jinan City under Grant No. 2020GXRC057.
Author information
Authors and Affiliations
Contributions
Conceptualization: [CD], [DT], [ZJ]; methodology: [CD]; formal analysis and investigation: [WY], [WX]; writing—original draft preparation: [CD], [DT]; writing—review and editing: [ZJ]; funding acquisition: [ZJ].
Corresponding authors
Ethics declarations
Conflict of Interest
The content of this article has not been published in any other publication or media in any other language. It is original research work. We declare that there is no conflict of interest regarding the publication of this paper.
Ethics Approval and Consent to Participate
Not applicable.
Consent for Publication
All the authors approved the final manuscript and the submission to this journal.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chen, D., Du, T., Zhou, J. et al. DWDP-Stream: A Dynamic Weight and Density Peaks Clustering Algorithm for Data Stream. Int J Comput Intell Syst 15, 96 (2022). https://doi.org/10.1007/s44196-022-00157-7