Abstract
Mean shift is a popular and powerful clustering method for implementing density attractor clustering (DAC). However, DAC is underdeveloped in terms of modeling definitions and methods for incomplete data. Due to DAC’s importance, solving this common issue is crucial. This work makes DAC more versatile by making it applicable to incomplete data: First, using formal modeling definitions, we propose a unifying framework for DAC. Second, we propose new methods that implement the definitions and perform DAC for incomplete data more efficiently and stably than others. We discuss and compare our methods and the closest competitor using theoretical analyses. We quantify the performance of our methods using synthetic datasets with known structures and real-life business data for three missing value types. Finally, we analyze Stack Overflow’s 2021 survey to extract clusters of programmers from India and the USA. The experiments verify our methods’ superiority to six alternatives. Code, Data: https://bit.ly/genDAC
Notes
In addition to this acronym, warp kernels get their name in analogy to celestial bodies in astronomy: they describe how their objects’ (information) mass “warps” the target space (see Fig. 1), causing objects to “gravitate” toward the modes, which are mass accumulations.
An affine space projection (a vector-valued function) of a point onto the affine space is the unique point of that space such that the difference between the point and its projection is an element of the complementary subspace. In other words, every vector can be decomposed uniquely as the sum of an element of the affine space’s direction subspace and an element of the complementary subspace. Here, the implementation is the orthogonal projection.
These formulas are taken verbatim from the original publication. Their symbols do not adhere to our notation; to avoid confusion, we set their symbols and notation in typewriter font. Symbols denoting their mathematical objects may overlap with ours in terms of letters, but remain distinguishable from ours via the font.
See footnote 3.
See footnote 3.
References
Abdallah L, Shimshoni I (2014) Mean shift clustering algorithm for data with missing values. In: International Conference on Data Warehousing and Knowledge Discovery, vol 8646. Springer, pp 426–438
Agamennoni G (2013) Bayesian clustering with outliers and missing values. Report ACFR-TR-2013-001, Australian Centre for Field Robotics
Bacher J, Pöge A, Wenzig K (2000) Clusteranalyse 3.A.: Anwendungsorientierte Einführung in Klassifikationsverfahren. Oldenbourg Wissenschaftsverlag
Banerjee A, Dhillon I, Ghosh J et al (2007) A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986
Biessmann F, Rukat T, Schmidt P et al (2019) Datawig: missing value imputation for tables. J Mach Learn Res 20(175):1–6
van Buuren S, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 18(6):681–694
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining. Springer, pp 160–172
Carreira-Perpiñán MÁ (2015) A review of mean-shift algorithms for clustering. In: CRC Handbook of cluster analysis. CRC Press, Boca Raton, Florida
Chacón JE, Duong T (2020) Multivariate kernel smoothing and its applications, Monogr. Stat. Appl. Probab., vol 160. Chapman and Hall/CRC
Chau VTN, Loc PH, Tran VTN (2015) A robust mean shift-based approach to effectively clustering incomplete educational data. In: International conference on advanced computing and applications (ACOMP), pp 12–19
Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
Fashing M, Tomasi C (2005) Mean shift is a bound optimization. IEEE Trans Pattern Anal Mach Intell 27(3):471–474
Fukunaga K, Hostetler LD (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 21(1):32–40
Günnemann S, Müller E, Raubach S et al (2011) Flexible fault tolerant subspace clustering for data with missing values. In: 11th IEEE International Conference on Data Mining, pp 231–240
Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Cybern 31(5):735–744
Helm MS, Dankovich TM, Mandad S et al (2021) A large-scale nanoscopy and biochemistry analysis of postsynaptic dendritic spines. Nat Neurosci 24:1151–1162
Himmelspach L, Conrad S (2010) Clustering approaches for data with missing values: comparison and evaluation. In: 5th International conference on digital information management (ICDIM)
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Jadhav A, Pramod D, Ramanathan K (2019) Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell 33(1):913–933
Jäger S, Allhorn A, Bießmann F (2021) A benchmark for data imputation methods. Frontiers in Big Data 4
Leibrandt K, Lorenz T, Nierhoff T et al (2013) Modelling human gameplay at pool and countering it with an anthropomorphic robot. In: Social robotics. Springer, pp 30–39
Leibrandt R, Günnemann S (2018) Making kernel density estimation robust towards missing values in highly incomplete multivariate data without imputation. In: SIAM International Conference on Data Mining
Leibrandt R, Günnemann S (2020) Gauss shift: Density attractor clustering faster than mean shift. In: Eur. Conf. Princ. Pract. Knowl. Discov. Databases
Liao L, Li K, Li K et al (2018) A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics. BMC Syst Biol 12(111)
Loader CR (1999) Bandwidth selection: classical or plug-in? Ann Stat 27(2):415–438
Muzellec B, Josse J, Boyer C et al (2020) Missing data imputation using optimal transport. In: International Conference on Machine Learning. PMLR, pp 7130–7140
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Poulos J, Valle R (2018) Missing data imputation for supervised learning. Appl Artif Intell 32(2):186–196
Romano S, Bailey J, Nguyen V et al (2014) Standardized mutual information for clustering comparisons: one step further in adjustment for chance. In: International Conference on Machine Learning, pp 1143–1151
Romano S, Vinh NX, Bailey J et al (2016) Adjusting for chance clustering comparison measures. J Mach Learn Res 17(134):1–32
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Schelter S, Rukat T, Biessmann F (2020) Learning to validate the predictions of black box classifiers on unseen data. In: ACM SIGMOD International Conference on Management of Data, pp 1289–1299
Schnupp P, Leibrandt U (1988) Expertensysteme: Nicht nur für Informatiker. Springer, Springer Compass
Shortliffe EH, Buchanan BG (1975) A model of inexact reasoning in medicine. Math Biosci 23(3–4):351–379
Stack Overflow (2021) Stack Overflow developer survey 2021. https://insights.stackoverflow.com/survey
Steinley D, Brusco MJ, Hubert L (2016) The variance of the adjusted Rand index. Psychol Methods 21(2):261–272
Timm H, Döring C, Kruse R (2002) Fuzzy cluster analysis of partially missing datasets. In: 2nd Int. W. on Hybr. Meth. for Adap. Sys. I, pp 426–431
Wagstaff KL (2004) Clustering with missing values: no imputation required. In: Meet. Int. Fed. Classif. Soc., pp 649–658
Wand M, Jones MC (1995) Kernel Smoothing. Chapman and Hall/CRC
Xue Z, Wang H (2021) Effective density-based clustering algorithms for incomplete data. Big Data Min Anal 4(3):183–194
Yang L, Hou K (2018) A method of incomplete data three-way clustering based on density peaks. In: International conference on computer-aided design, manufacturing, Modeling and Simulation, p 020008
Additional information
Responsible editor: Albrecht Zimmermann and Peggy Cellier.
Appendices
Appendix A: In-depth discussion of msMED and proof that Eqs. (7) to (13) are equal to the original formulation
A.1 Terminology
In Appendix A, we distinguish between a warp kernel that is used to construct a density (the density warp kernel) and a warp kernel that is used during the mean shift search (the search warp kernel). Likewise, we distinguish the classic standard kernels used in the construction of the density warp kernel and of the search warp kernel, respectively.
A.2 Derivation of the mean Euclidean distance of Eq. (7)
We combine Eqs. (1) to (3),Footnote 3 namely
from Abdallah and Shimshoni (2014), into Eq. (7)
\(\chi _i\) in Eq. (2) denotes the distribution of the i-th coordinate according to Abdallah and Shimshoni (2014). Their Eq. (2) deviates in notation from their Eq. (3); our Eq. (7) harmonizes this discrepancy. The reason for taking the square root of the original distances (i.e., for raising the left-hand side of Eq. (7) to the power of two) is discussed in Sect. A.3.
In addition, using Eq. (21), we define
A.3 Discussion of errors and issues in previous work
After inspecting the equations and wording in Abdallah and Shimshoni (2014) and considering their code, we suspect two sets of typographical errors and issues:
(i) On the one hand, Abdallah and Shimshoni (2014) define their distances – not the square roots thereof – to be the respective right-hand sides in Eq. (7). In the case of complete data, their distance is therefore the squared Euclidean distance, which does not fulfill the triangle inequality. Comaniciu and Meer (2002) (on which Abdallah and Shimshoni (2014) seem to build) use the squared Euclidean distance as well.
In practice, both publications start with original standard kernels that square the input and redefine them by removing the square; in the following, the kernels we refer to are of this redefined kind. For example, the Gaussian kernel \(\exp (-d^2/2)\) (up to normalization) would be redefined as \(\exp (-d/2)\). This results in the same algorithm as if they had used the original standard kernel (e.g., a true Gaussian kernel) together with the Euclidean distance, since plugging the squared distance into the redefined kernel yields the original kernel evaluated at the distance.
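As a quick numerical illustration of this equivalence (a sketch with an illustrative Gaussian profile, not code from the paper): applying the de-squared kernel to the squared Euclidean distance produces the same weight as applying the original kernel to the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
d = np.linalg.norm(x - y)              # Euclidean distance

k_orig = lambda t: np.exp(-t**2 / 2)   # original kernel: squares its input (normalization omitted)
k_desq = lambda t: np.exp(-t / 2)      # redefined kernel: square removed

# Feeding the squared distance into the de-squared kernel gives the same
# weight as feeding the plain distance into the original kernel.
assert np.isclose(k_desq(d**2), k_orig(d))
```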
On the other hand, in their formulas (11) and (13), revisited as Eqs. (A4) and (A8) in Sects. A.4, A.5.1 and A.5.2, Abdallah and Shimshoni (2014) use a non-squared version of the distances, which is the Euclidean distance if no values are missing.
To correct this inconsistency, one should consistently choose either the squared or the non-squared distances. We believe it is beneficial to keep the theoretical foundations simple; in this case, it makes more sense to ensure that the defined distance equals the Euclidean distance when no values are missing. In other words, one should use the square roots of the right-hand sides of Eqs. (1), (2), and (3). This is what our reformulations remedy. Accordingly, the non-squared distance applies in Sect. 1; it would also apply for Eqs. (A4) and (A8) in Sects. A.4, A.5.1 and A.5.2, were it not for a second issue in the case of Eq. (A4):
(ii) In the formulas between (12) and (13) of Abdallah and Shimshoni (2014), the bandwidth \( \texttt{h} \) is missing. There are two possible ways to rectify this. One way is to change their equation (Abdallah and Shimshoni 2014, Section 3) accordingly. However, to stay consistent with the rest of their work and with Comaniciu and Meer (2002), this does not seem to be the right approach. Instead, we reintroduce the bandwidth by defining it appropriately for their Eq. (13), revisited as Eq. (A4) in Sect. A.4, and use it there.
A.4 Proof of Eqs. (8) and (12)
In their equation (13),Footnote 4 Abdallah and Shimshoni (2014) define the mean shift vector for the \( \texttt{l} \)-th coordinate at the position \( \texttt{x} \) as
where the counts refer to the objects for which the \( \texttt{l} \)-th coordinate is known, the objects for which the \( \texttt{l} \)-th coordinate is missing, and the number of objects in total. The superscript refers to the respective coordinate; e.g., \(\mu ^{1}\) refers to the corresponding quantity for the first coordinate in our notation. To fit better with the rest of our notation, we replace the symbols accordingly and rewrite Eq. (A4) to
With
and with the constant cancelling out of the numerator and denominator, we get
The constant makes no practical difference yet, but is required for theoretical reasons, as seen in the following sections.
A.5 Proof of Eqs. (9) and (13)
A.5.1 Original formulation of the densities in Abdallah and Shimshoni (2014)
We first consider the density as defined by Abdallah and Shimshoni (2014) in their equation (6),Footnote 5 namely
for complete datasets, and in their equation (11), namely
for incomplete datasets, where each object is a tuple of feature entries, \( \texttt{h} \) is the bandwidth, and a normalization constant accompanies the kernel.
A.5.2 Reformulating the density and density warp kernel
As discussed in Sect. A.3, we decide that
Definition 5
(Profile kernel) We want the profile kernels to be formulated such that their input generalizes the Euclidean distance – a distance that equals the Euclidean distance if no values are missing. Formally, we will use a profile kernel
applies and, to fit better with the rest of our notation, we rewrite Eq. (A7) to
Similarly, using the same notation, we rewrite the density for incomplete data from Eq. (A8) to
Considering the definition of the density in Definition 3, the density warp kernel of Abdallah and Shimshoni (2014) is
A.5.3 Reformulating the search warp kernel
To reformulate the search warp kernel in common terminology, we build on the definitions
from which follows
with which we reformulate Eq. (A5) to
Next, we discuss the constant. For that, we require a condition that links \({}_{{\textsf {MED}}} {\textit{w}}_{\omega }\) and the search warp kernel. Keeping Eq. (5) in mind, we compare \({}_{{\textsf {MED}}} {\textit{w}}_{\omega }\) and the search warp kernel with
and
We recognize that the constant can be chosen arbitrarily, as long as (i) it is indeed a constant (among other things, it must not depend on \({i} \)), (ii) it satisfies the corresponding condition, and (iii) Eq. (A14) holds. This means that the relationship between \({}_{{\textsf {MED}}} {\textit{w}}_{\omega }\) and the search warp kernel depends on the choice of the constant – and vice versa. Following this, we first make a reasonable decision on the relationship and determine the constant afterwards. Similar to Eq. (A14), the derivative of the tube warp kernel for the density estimation (Eq. (18)), namely
is proportional to the tube warp kernel for mean shift search, namely \(\hat{{\textit{v}}}_{i}\) (the two tube warp kernels – for the density and for the mean shift search – are identical). Eqs. (A14) and (19) motivate us to define the relationship as
Definition 6
(Relationship warp kernels) Similar to Eq. (5), namely
we define
which provides
and thus, using this relationship, a choice of the constant
under which our condition from Definition 5 holds true.
A.6 Proof of Eqs. (10) and (11)
Eqs. (10) and (11) follow directly from Eqs. (A6), (7) and (A12): instead of the mean Euclidean distance, we use the total mean Euclidean distance. This results in the corresponding term in Eq. (11) instead of the one in Eq. (A12); likewise, this results in Eq. (10) instead of Eq. (A6).
Appendix B: Proof that Eqs. (7) and (20) are equal
The binary operation of convolution \(*\) is defined as \((f * g)(x) = \int f(\tau )\, g(x - \tau )\, \mathrm {d}\tau \), where each of f and g can be a function or a Schwartz distribution (also known as a generalized function). The Dirac delta \(\delta \) (a Schwartz distribution) is the neutral element of convolution, meaning \(f * \delta = f\); in addition, convolving with a shifted delta translates the function, \((f * \delta _{y})(x) = f(x - y)\).
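As a discrete analogue of the neutral-element property (a sketch only; the continuous statement concerns Schwartz distributions, which have no direct NumPy counterpart):

```python
import numpy as np

f = np.array([0.2, 0.5, 0.3])    # a discrete "function" (e.g., a histogram)
delta = np.array([1.0])          # discrete unit impulse

# Convolving with the unit impulse returns f unchanged: f * delta = f.
assert np.allclose(np.convolve(f, delta), f)

# Convolving with a shifted impulse shifts f, mirroring (f * delta_y)(x) = f(x - y).
print(np.convolve(f, np.array([0.0, 0.0, 1.0])))   # [0.  0.  0.2 0.5 0.3]
```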
Abdallah and Shimshoni (2014) define (pp. 4–6) the squared mean Euclidean distance as
where \(\chi _i\) is the marginal PDF of the respective feature. In the following, we show that it is possible to write the definition more compactly as
In addition, we define
Proof
We prove that Eq. (B17) is equal to the first line of Eq. (B16) in the respective case:
\(\square \)
Proof
Next, we prove that Eq. (B17) is equal to the third line of Eq. (B16) in the respective case. The four equations from Eq. (B23) to Eq. (B24) only use the case’s assumption and provide, with Eq. (B20),
\(\square \)
Proof
The proof that Eq. (B17) is equal to the second line of Eq. (B16) is analogous to the last proof. \(\square \)
Proof
Finally, we prove that Eq. (B17) is equal to the fourth line of Eq. (B16) in the remaining case.
\(\square \)
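To make the four cases of Eq. (B16) concrete, the following is a Monte Carlo sketch of the squared mean Euclidean distance for a single coordinate, replacing each missing value by an expectation under the marginal \(\chi _i\). The marginal distribution, the missing-value encoding, and the sample size are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
chi_i = lambda n: rng.normal(0.0, 1.0, size=n)   # assumed marginal of coordinate i (standard normal)

def sq_mean_dist_coord(x_i, y_i, n=100_000):
    """Squared mean Euclidean distance contribution of one coordinate.
    None marks a missing entry; the four branches mirror Eq. (B16)."""
    if x_i is not None and y_i is not None:          # both values known
        return (x_i - y_i) ** 2
    if x_i is None and y_i is None:                  # both missing: expectation over two draws
        return np.mean((chi_i(n) - chi_i(n)) ** 2)
    known = x_i if x_i is not None else y_i          # exactly one value missing
    return np.mean((known - chi_i(n)) ** 2)

print(sq_mean_dist_coord(0.5, -0.2))   # exact: 0.49
print(sq_mean_dist_coord(0.5, None))   # approx. 0.5**2 + 1 = 1.25
print(sq_mean_dist_coord(None, None))  # approx. 2 * variance = 2.0
```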
Appendix C: Details of performance measures
Several methods for measuring clustering effectiveness exist. Because internal measures, such as inertia or the silhouette coefficient, are unsuitable for evaluating non-convex cluster families, we used 12 external measures on reference labels. Three large groups of external measures exist: (i) those that examine whether object pairs are in the same group as in the reference, (ii) those that build on classical information theory, and (iii) those adopted from supervised learning. Of the first group, we employed the Rand index (RI), adjusted Rand index (ARI), and Fowlkes-Mallows index (FMI); of the second, normalized mutual information (NMI) and adjusted mutual information (AMI); of the third, accuracy. For NMI and AMI, we tested four aggregator functions (min, max, mean, geometric).
Our final choice of measures is motivated by our own findings and by the literature, e.g., by Romano et al. (2016): while the specific values can vary between those measures, they usually rank the methods in the same way per dataset, as substantiated in Figs. 5 and 8. The advantage of ARI and AMI is that they correct for chance: they approach zero for random labeling as the number of objects increases, whereas unadjusted measures approach one. FMI adjusts for chance but is considerably less strict, which reduces its explanatory power. AMI’s min and geometric aggregators are not reliable for particularly small or large values and thus cannot be used for parameterization, so we use the classic max aggregator (the mean aggregator yields very similar results).
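Most of these external measures are available in scikit-learn (Pedregosa et al. 2011). The following is a minimal sketch with made-up label vectors, not the paper’s evaluation code; accuracy additionally requires matching predicted to reference labels (e.g., via the Hungarian algorithm) and is therefore omitted here.

```python
import numpy as np
from sklearn import metrics

# Hypothetical reference labels and predicted cluster labels.
labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 0, 1, 2, 2, 2])

scores = {
    "RI": metrics.rand_score(labels_true, labels_pred),
    "ARI": metrics.adjusted_rand_score(labels_true, labels_pred),
    "FMI": metrics.fowlkes_mallows_score(labels_true, labels_pred),
    # average_method selects the aggregator: "min", "geometric", "arithmetic" (mean), or "max".
    "NMI(max)": metrics.normalized_mutual_info_score(labels_true, labels_pred, average_method="max"),
    "AMI(max)": metrics.adjusted_mutual_info_score(labels_true, labels_pred, average_method="max"),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```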
Appendix D: Synthetic datasets with known label structure (Sect. 4.2): Additional measurements
In the paper, we presented experiments with the four labeled benchmark datasets aggregationB01, R15, moons, and flame. Fig. 8 displays additional results for those datasets: here, we display the adjusted Rand index and the adjusted mutual information for the cases where we consider (i) only incomplete objects and (ii) all objects, including the outliers. When we consider only incomplete objects, the results do not differ much from those obtained when considering all non-outliers. This is because many objects are outliers, as seen on the top x-axis. When we consider all objects, including outliers, the performance of our methods is mostly the same as before; however, the other methods are no longer far behind. The reason is simply that those other methods are able to label the outliers as outliers but are not able to recover the main clusters as soon as missing values are introduced. Thus, their seemingly good results are due to their ability to detect outliers, not their ability to cluster correctly.
Appendix E: Real-life data (Sect. 4.3): Distribution of each measure for each incompleteness type
To perform a significance test, we want to determine the range in which a measure will fall with 95% probability. Thus, we need to know the distribution of each measure (ARI, AMI, accuracy) for each incompleteness type (MCAR, MAR, MNAR):
Steinley et al. (2016) showed that a normal distribution approximates the ARI very closely, especially for the construction of confidence intervals. However, they studied the ARI for a constant number of (reference) clusters, a constant number of objects in each cluster, and under the assumption of a null model for randomness that is constructed from a generalized hypergeometric distribution (Hubert and Arabie 1985). This so-called permutation model generates clusterings uniformly at random via permutations of labels, as described by Romano et al. (2014). They, and others, study the AMI under the same assumptions. Dietterich (1998) describes proportions of misclassifications as normally distributed, which carries over to accuracy.
While those results provide some evidence for using a normal distribution in our case, the assumptions and scenarios are not the same. Thus, to add experimental evidence, we created about 50 different datasets from the real-life data of Sect. 4.3 by randomly deleting 20% of the values for each incompleteness type. This produced nine measure samples (each with 50 measurements) – one for each combination of ARI, AMI, and accuracy with each incompleteness type.
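As an illustration of the MCAR variant of this deletion step (MAR and MNAR require mechanisms that depend on observed or unobserved values and are not sketched here), the following is a minimal sketch; the matrix shape, seed, and function name are illustrative, not the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_mcar(X, frac=0.2):
    """Return a copy of X with a fraction `frac` of all entries set to NaN,
    chosen completely at random (MCAR)."""
    X_miss = X.astype(float).copy()
    X_miss[rng.random(X.shape) < frac] = np.nan
    return X_miss

X = rng.normal(size=(1000, 8))          # stand-in for the real-life data matrix
X_mcar = delete_mcar(X, frac=0.2)       # one of the ~50 incomplete copies
print(np.isnan(X_mcar).mean())          # close to 0.2
```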
The Cullen and Frey graphs in Fig. 9 display the kurtosis and the square of the skewness for each of the nine distributions of the measure samples as well as for some theoretical distributions. Additionally, for each measure sample, we used bootstrapping to create 50 resampled samples, which are also displayed in the graphs in terms of their distributions’ kurtosis and squared skewness. The graphs show that the measure sample distributions have a kurtosis close to 3 and little to no skewness. Tab. 1 reports exact values. This suggests that the measure distributions can be modeled with normal distributions.
Next, we min-max normalized the measure samples into the range [0, 1] and fitted different theoretical distributions to the normalized samples. The QQ plots in Fig. 10 compare the quantiles of the normalized measure sample distributions with the quantiles of the theoretical distributions. The plots show that the normal and the logistic distributions model the measures well.
Finally, we performed hypothesis tests for different theoretical distributions: for each measure and incompleteness type, Tab. 2 provides the p-values of the Anderson-Darling hypothesis test for different distributions. The Anderson-Darling tests show that each measure for each incompleteness type is modeled best by either the normal or the logistic distribution. Tab. 2 shows such large p-values that we cannot reject the null hypothesis that the measure samples follow a normal or logistic distribution.
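As a rough illustration of these checks (a sketch with a synthetic sample, not the paper’s analysis pipeline): SciPy provides the sample skewness, the kurtosis, and an Anderson-Darling test for the normal and logistic families. Note that scipy.stats.anderson reports critical values rather than p-values, so the p-values in Tab. 2 require an additional step (e.g., a parametric bootstrap).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.8, scale=0.05, size=50)   # stand-in for one measure sample (e.g., ARI under MCAR)

# Moments inspected in the Cullen and Frey graphs (Pearson kurtosis of a normal is 3).
print("skewness:", stats.skew(sample))
print("kurtosis:", stats.kurtosis(sample, fisher=False))

# Anderson-Darling test against the normal and logistic families.
# SciPy returns the statistic and critical values, not p-values.
for dist in ("norm", "logistic"):
    res = stats.anderson(sample, dist=dist)
    print(dist, res.statistic, res.critical_values, res.significance_level)
```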
When we plot the 95% confidence intervals based on those distributions, they are practically indistinguishable for variances between 0 and 1. This is the case here because the AMI, ARI, and accuracy can only take values between 0 and 1.
Appendix F: Real-life data (Sect. 4.3): Statistical significance and model variance
The experiments on the real-life data of Sect. 4.3 in the main document demonstrated that the competing methods are statistically significantly worse than our method. Now, we can additionally formulate (and later reject) the null hypothesis that our methods are not better than the competitors. For that, we determine the standard deviation of the respective measures for the competitors as well, not just for our methods. We calculate the range in which a measure of the respective method will fall with 95% probability. In Fig. 11, we plot these ranges. The lower bars in Fig. 11 denote the threshold for the 0.025 p-value. This is twice as strict as the p-value threshold of 0.05, yet we can reject the null hypothesis – most of the time by a large margin – and claim that our methods are significantly better than the competitors.
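A minimal sketch of such a range check under the normal assumption from Appendix E follows; the sample values, sample sizes, and the interval-separation criterion are illustrative simplifications, not the exact procedure behind Fig. 11.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ours = rng.normal(loc=0.85, scale=0.02, size=50)        # repeated-run scores of our method
competitor = rng.normal(loc=0.70, scale=0.05, size=50)  # repeated-run scores of a competitor

def range_95(x):
    """Two-sided 95% range under a normal assumption: mean +/- 1.96 * std."""
    m, s = x.mean(), x.std(ddof=1)
    z = stats.norm.ppf(0.975)
    return m - z * s, m + z * s

lo_ours, hi_ours = range_95(ours)
lo_comp, hi_comp = range_95(competitor)
print("ours:       [%.3f, %.3f]" % (lo_ours, hi_ours))
print("competitor: [%.3f, %.3f]" % (lo_comp, hi_comp))

# A conservative check: the competitor's upper bound falls below our lower bound.
print("competitor significantly worse:", hi_comp < lo_ours)
```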
There is the conspicuous exception that we cannot reject the null hypothesis for BRMM. However, this is due to BRMM’s high model variance: repeated application of the algorithm yields very different results. Thus, one should reject the conclusion that our method is not better and instead take into account the additional evidence of whether a method manages to keep its model variance low. In fact, our methods are particularly consistent in their results, with excellently low model variance, as demonstrated by the error bars in Fig. 11.
About this article
Cite this article
Leibrandt, R., Günnemann, S. Generalized density attractor clustering for incomplete data. Data Min Knowl Disc 37, 970–1009 (2023). https://doi.org/10.1007/s10618-022-00904-6