Abstract
Mean shift is a popular and powerful clustering method for implementing density attractor clustering (DAC). However, DAC is underdeveloped in terms of modeling definitions and methods for incomplete data. Due to DAC’s importance, solving this common issue is crucial. This work makes DAC more versatile by making it applicable to incomplete data: First, using formal modeling definitions, we propose a unifying framework for DAC. Second, we propose new methods that implement the definitions and perform DAC for incomplete data more efficiently and stably than others. We discuss and compare our methods and the closest competitor using theoretical analyses. We quantify the performance of our methods using synthetic datasets with known structures and real-life business data for three missing value types. Finally, we analyze Stack Overflow’s 2021 survey to extract clusters of programmers from India and the USA. The experiments verify our methods’ superiority to six alternatives. Code, Data: https://bit.ly/genDAC
Notes
In addition to this acronym, warp kernels get their name in analogy to celestial bodies in astronomy: they describe how their objects’ (information) mass “warps” the target space (see Fig. 1), causing objects to “gravitate” toward the modes, which are mass accumulations.
An affine space projection (a vector-valued function) of a point onto the affine space is the unique point of that space such that the difference between the point and its projection is an element of the complementary subspace. In other words, every vector can be decomposed uniquely as the sum of an element of the affine space’s direction subspace and an element of the complementary subspace. Here, the implementation is the orthogonal projection.
These formulas are taken verbatim from the original publication. Their symbols do not adhere to our notation; to avoid confusion, we set their symbols and notation in typewriter font. Symbols denoting their mathematical objects may overlap with ours in terms of letters, but remain distinguishable from ours via the font.
See footnote 3.
See footnote 3.
References
Abdallah L, Shimshoni I (2014) Mean shift clustering algorithm for data with missing values. In: International Conference on Data Warehousing and Knowledge Discovery, vol 8646. Springer, pp 426–438
Agamennoni G (2013) Bayesian clustering with outliers and missing values. Report ACFR-TR-2013-001, Australian Centre for Field Robotics
Bacher J, Pöge A, Wenzig K (2000) Clusteranalyse 3.A.: Anwendungsorientierte Einführung in Klassifikationsverfahren. Oldenbourg Wissenschaftsverlag
Banerjee A, Dhillon I, Ghosh J et al (2007) A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986
Biessmann F, Rukat T, Schmidt P et al (2019) Datawig: missing value imputation for tables. J Mach Learn Res 20(175):1–6
van Buuren S, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 18(6):681–694
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining. Springer, pp 160–172
Carreira-Perpiñán MÁ (2015) A review of mean-shift algorithms for clustering. In: CRC Handbook of cluster analysis. CRC Press, Boca Raton, Florida
Chacón JE, Duong T (2020) Multivariate kernel smoothing and its applications, Monogr. Stat. Appl. Probab., vol 160. Chapman and Hall/CRC
Chau VTN, Loc PH, Tran VTN (2015) A robust mean shift-based approach to effectively clustering incomplete educational data. In: International conference on advanced computing and applications (ACOMP), pp 12–19
Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
Fashing M, Tomasi C (2005) Mean shift is a bound optimization. IEEE Trans Pattern Anal Mach Intell 27(3):471–474
Fukunaga K, Hostetler LD (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 21(1):32–40
Günnemann S, Müller E, Raubach S et al (2011) Flexible fault tolerant subspace clustering for data with missing values. In: 11th IEEE International Conference on Data Mining, pp 231–240
Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Cybern 31(5):735–744
Helm MS, Dankovich TM, Mandad S et al (2021) A large-scale nanoscopy and biochemistry analysis of postsynaptic dendritic spines. Nat Neurosci 24:1151–1162
Himmelspach L, Conrad S (2010) Clustering approaches for data with missing values: comparison and evaluation. In: 5th International conference on digital information management (ICDIM)
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Jadhav A, Pramod D, Ramanathan K (2019) Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell 33(1):913–933
Jäger S, Allhorn A, Bießmann F (2021) A benchmark for data imputation methods. Frontiers in Big Data 4
Leibrandt K, Lorenz T, Nierhoff T et al (2013) Modelling human gameplay at pool and countering it with an anthropomorphic robot. In: Social robotics. Springer, pp 30–39
Leibrandt R, Günnemann S (2018) Making kernel density estimation robust towards missing values in highly incomplete multivariate data without imputation. In: SIAM International Conference on Data Mining
Leibrandt R, Günnemann S (2020) Gauss shift: Density attractor clustering faster than mean shift. In: Eur. Conf. Princ. Pract. Knowl. Discov. Databases
Liao L, Li K, Li K et al (2018) A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics. BMC Syst Biol 12(111)
Loader CR (1999) Bandwidth selection: classical or plug-in? Ann Stat 27(2):415–438
Muzellec B, Josse J, Boyer C et al (2020) Missing data imputation using optimal transport. In: International Conference on Machine Learning. PMLR, pp 7130–7140
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Poulos J, Valle R (2018) Missing data imputation for supervised learning. Appl Artif Intell 32(2):186–196
Romano S, Bailey J, Nguyen V et al (2014) Standardized mutual information for clustering comparisons: one step further in adjustment for chance. In: International Conference on Machine Learning, pp 1143–1151
Romano S, Vinh NX, Bailey J et al (2016) Adjusting for chance clustering comparison measures. J Mach Learn Res 17(134):1–32
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Schelter S, Rukat T, Biessmann F (2020) Learning to validate the predictions of black box classifiers on unseen data. In: ACM SIGMOD International Conference on Management of Data, pp 1289–1299
Schnupp P, Leibrandt U (1988) Expertensysteme: Nicht nur für Informatiker. Springer, Springer Compass
Shortliffe EH, Buchanan BG (1975) A model of inexact reasoning in medicine. Math Biosci 23(3–4):351–379
Stack Overflow (2021) Stack Overflow developer survey 2021. https://insights.stackoverflow.com/survey
Steinley D, Brusco MJ, Hubert L (2016) The variance of the adjusted Rand index. Psychol Methods 21(2):261–272
Timm H, Döring C, Kruse R (2002) Fuzzy cluster analysis of partially missing datasets. In: 2nd Int. W. on Hybr. Meth. for Adap. Sys. I, pp 426–431
Wagstaff KL (2004) Clustering with missing values: no imputation required. In: Meet. Int. Fed. Classif. Soc., pp 649–658
Wand M, Jones MC (1995) Kernel Smoothing. Chapman and Hall/CRC
Xue Z, Wang H (2021) Effective density-based clustering algorithms for incomplete data. Big Data Min Anal 4(3):183–194
Yang L, Hou K (2018) A method of incomplete data three-way clustering based on density peaks. In: International conference on computer-aided design, manufacturing, Modeling and Simulation, p 020008
Additional information
Responsible editor: Albrecht Zimmermann and Peggy Cellier.
Appendices
Appendix A: In-depth discussion of msMED and proof that Eqs. (7) to (13) are equal to the original formulation
A.1 Terminology
In Appendix A, we distinguish between a warp kernel that is used to construct a density (the density warp kernel) and a warp kernel that is used during the mean shift search (the search warp kernel). Likewise, we distinguish the classic standard kernels used in the construction of the density warp kernel and of the search warp kernel, respectively.
A.2 Derivation of the mean Euclidean distance of Eq. (7)
We combine Eqs. (1) to (3),Footnote 3 namely
from Abdallah and Shimshoni (2014), into Eq. (7)
\(\chi _i\) in Eq. (2) denotes the distribution of the i-th coordinate according to Abdallah and Shimshoni (2014). Their Eq. (2) deviates in notation from their Eq. (3); our Eq. (7) harmonizes this discrepancy. The reason for taking the square root of the original distances (i.e., for raising the left-hand side of Eq. (7) to the power of two) is discussed in Sect. A.3.
In addition, using Eq. (21), we define
A.3 Discussion of errors and issues in previous work
After inspecting the equations and wording in Abdallah and Shimshoni (2014) and considering their code, we suspect two sets of typographical errors and issues:
(i) On the one hand, Abdallah and Shimshoni (2014) define their distances – not the square roots thereof – to be the respective right-hand sides in Eq. (7). In the case of complete data, their distance is therefore the squared Euclidean distance, which does not fulfill the triangle inequality. Comaniciu and Meer (2002) (on which Abdallah and Shimshoni (2014) seem to build) use the squared Euclidean distance as well.
In practice, both publications start with original standard kernels that square the input and redefine them by removing the square; in the following, the kernels we refer to are of this redefined kind. For example, the Gaussian kernel \(\exp (-d^2/2)\) (up to normalization) would be redefined as \(\exp (-d/2)\). This results in the same algorithm as if they had used the original standard kernel (e.g., a true Gaussian kernel) together with the Euclidean distance, since plugging the squared distance into the redefined kernel yields the original kernel evaluated at the distance.
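As a quick numerical illustration of this equivalence (a sketch with an illustrative Gaussian profile, not code from the paper): applying the de-squared kernel to the squared Euclidean distance produces the same weight as applying the original kernel to the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
d = np.linalg.norm(x - y)              # Euclidean distance

k_orig = lambda t: np.exp(-t**2 / 2)   # original kernel: squares its input (normalization omitted)
k_desq = lambda t: np.exp(-t / 2)      # redefined kernel: square removed

# Feeding the squared distance into the de-squared kernel gives the same
# weight as feeding the plain distance into the original kernel.
assert np.isclose(k_desq(d**2), k_orig(d))
```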
On the other hand, in their formulas (11) and (13), revisited as Eqs. (A4) and (A8) in Sects. A.4, A.5.1 and A.5.2, Abdallah and Shimshoni (2014) use a non-squared version of the distances, which is the Euclidean distance if no values are missing.
To correct this inconsistency, one should consistently choose either the squared or the non-squared distances. We believe it is beneficial to keep the theoretical foundations simple; in this case, it makes more sense to ensure that the defined distance equals the Euclidean distance when no values are missing. In other words, one should use the square roots of the right-hand sides of Eqs. (1), (2), and (3). This is what our reformulations remedy. Accordingly, the non-squared distance applies in Sect. 1; it would also apply for Eqs. (A4) and (A8) in Sects. A.4, A.5.1 and A.5.2, were it not for a second issue in the case of Eq. (A4):
(ii) In the formulas between (12) and (13) of Abdallah and Shimshoni (2014), the bandwidth \( \texttt{h} \) is missing. There are two possible ways to rectify this. One way is to change their equation (Abdallah and Shimshoni 2014, Section 3) accordingly. However, to stay consistent with the rest of their work and with Comaniciu and Meer (2002), this does not seem to be the right approach. Instead, we reintroduce the bandwidth by defining it appropriately for their Eq. (13), revisited as Eq. (A4) in Sect. A.4, and use it there.
A.4 Proof of Eqs. (8) and (12)
In their equation (13),Footnote 4 Abdallah and Shimshoni (2014) define the mean shift vector for the \( \texttt{l} \)-th coordinate at the position \( \texttt{x} \) as
where the counts refer to the objects for which the \( \texttt{l} \)-th coordinate is known, the objects for which the \( \texttt{l} \)-th coordinate is missing, and the number of objects in total. The superscript refers to the respective coordinate; e.g., \(\mu ^{1}\) refers to the corresponding quantity for the first coordinate in our notation. To fit better with the rest of our notation, we replace the symbols accordingly and rewrite Eq. (A4) to
With
and with the constant cancelling out of the numerator and denominator, we get
The constant makes no practical difference yet, but is required for theoretical reasons, as seen in the following sections.
A.5 Proof of Eqs. (9) and (13)
A.5.1 Original formulation of the densities in Abdallah and Shimshoni (2014)
We first consider the density as defined by Abdallah and Shimshoni (2014) in their equation (6),Footnote 5 namely
for complete datasets, and in their equation (11), namely
for incomplete datasets, where each object is a tuple of feature entries, \( \texttt{h} \) is the bandwidth, and a normalization constant accompanies the kernel.
A.5.2 Reformulating the density and density warp kernel
As discussed in Sect. A.3, we decide that
Definition 5
(Profile kernel) We want the profile kernels to be formulated such that their input generalizes the Euclidean distance – a distance that equals the Euclidean distance if no values are missing. Formally, we will use a profile kernel
applies and, to fit better with the rest of our notation, we rewrite Eq. (A7) to
Similarly, using the same notation, we rewrite the density for incomplete data from Eq. (A8) to
Considering the definition of the density in Definition 3, the density warp kernel of Abdallah and Shimshoni (2014) is
A.5.3 Reformulating the search warp kernel
To reformulate the search warp kernel in common terminology, we build on the definitions
from which follows
with which we reformulate Eq. (A5) to
Next, we discuss the constant. For that, we require a condition that links \({}_{{\textsf {MED}}} {\textit{w}}_{\omega }\) and the search warp kernel. Keeping Eq. (5) in mind, we compare \({}_{{\textsf {MED}}} {\textit{w}}_{\omega }\) and the search warp kernel with
and
We recognize that the constant can be chosen arbitrarily, as long as (i) it is indeed a constant (among other things, it must not depend on \({i} \)), (ii) it satisfies the corresponding condition, and (iii) Eq. (A14) holds. This means that the relationship between \({}_{{\textsf {MED}}} {\textit{w}}_{\omega }\) and the search warp kernel depends on the choice of the constant – and vice versa. Following this, we first make a reasonable decision on the relationship and determine the constant afterwards. Similar to Eq. (A14), the derivative of the tube warp kernel for the density estimation (Eq. (18)), namely
is proportional to the tube warp kernel for mean shift search, namely \(\hat{{\textit{v}}}_{i}\) (the two tube warp kernels – for the density and for the mean shift search – are identical). Eqs. (A14) and (19) motivate us to define the relationship as
Definition 6
(Relationship warp kernels) Similar to Eq. (5), namely
we define
which provides
and thus, using this relationship, a choice of the constant
under which our condition from Definition 5 holds true.
A.6 Proof of Eqs. (10) and (11)
Eqs. (10) and (11) follow directly from Eqs. (A6), (7) and (A12): instead of the mean Euclidean distance, we use the total mean Euclidean distance. This results in the corresponding term in Eq. (11) instead of the one in Eq. (A12); likewise, this results in Eq. (10) instead of Eq. (A6).
Appendix B: Proof that Eqs. (7) and (20) are equal
The binary operation of convolution \(*\) is defined as \((f * g)(x) = \int f(\tau )\, g(x - \tau )\, \mathrm {d}\tau \), where each of f and g can be a function or a Schwartz distribution (also known as a generalized function). The Dirac delta \(\delta \) (a Schwartz distribution) is the neutral element of convolution, meaning \(f * \delta = f\); in addition, convolving with a shifted delta translates the function, \((f * \delta _{y})(x) = f(x - y)\).
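As a discrete analogue of the neutral-element property (a sketch only; the continuous statement concerns Schwartz distributions, which have no direct NumPy counterpart):

```python
import numpy as np

f = np.array([0.2, 0.5, 0.3])    # a discrete "function" (e.g., a histogram)
delta = np.array([1.0])          # discrete unit impulse

# Convolving with the unit impulse returns f unchanged: f * delta = f.
assert np.allclose(np.convolve(f, delta), f)

# Convolving with a shifted impulse shifts f, mirroring (f * delta_y)(x) = f(x - y).
print(np.convolve(f, np.array([0.0, 0.0, 1.0])))   # [0.  0.  0.2 0.5 0.3]
```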
Abdallah and Shimshoni (2014) define (pp. 4–6) the squared mean Euclidean distance as
where \(\chi _i\) is the marginal PDF of the respective feature. In the following, we show that it is possible to write the definition more compactly as
In addition, we define
Proof
We prove that Eq. (B17) is equal to the first line of Eq. (B16) in the respective case:
\(\square \)
Proof
Next, we prove that Eq. (B17) is equal to the third line of Eq. (B16) in the respective case. The four equations from Eq. (B23) to Eq. (B24) only use the case’s assumption and provide, with Eq. (B20),
\(\square \)
Proof
The proof that Eq. (B17) is equal to the second line of Eq. (B16) is analogous to the last proof. \(\square \)
Proof
Finally, we prove that Eq. (B17) is equal to the fourth line of Eq. (B16) in the remaining case.
\(\square \)
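To make the four cases of Eq. (B16) concrete, the following is a Monte Carlo sketch of the squared mean Euclidean distance for a single coordinate, replacing each missing value by an expectation under the marginal \(\chi _i\). The marginal distribution, the missing-value encoding, and the sample size are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
chi_i = lambda n: rng.normal(0.0, 1.0, size=n)   # assumed marginal of coordinate i (standard normal)

def sq_mean_dist_coord(x_i, y_i, n=100_000):
    """Squared mean Euclidean distance contribution of one coordinate.
    None marks a missing entry; the four branches mirror Eq. (B16)."""
    if x_i is not None and y_i is not None:          # both values known
        return (x_i - y_i) ** 2
    if x_i is None and y_i is None:                  # both missing: expectation over two draws
        return np.mean((chi_i(n) - chi_i(n)) ** 2)
    known = x_i if x_i is not None else y_i          # exactly one value missing
    return np.mean((known - chi_i(n)) ** 2)

print(sq_mean_dist_coord(0.5, -0.2))   # exact: 0.49
print(sq_mean_dist_coord(0.5, None))   # approx. 0.5**2 + 1 = 1.25
print(sq_mean_dist_coord(None, None))  # approx. 2 * variance = 2.0
```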
Appendix C: Details of performance measures
Several methods for measuring clustering effectiveness exist. Because internal measures, such as inertia or the silhouette coefficient, are unsuitable for evaluating non-convex cluster families, we used 12 external measures on reference labels. Three large groups of external measures exist: (i) those that examine whether object pairs are in the same group as in the reference, (ii) those that build on classical information theory, and (iii) those adopted from supervised learning. Of the first group, we employed the Rand index (RI), adjusted Rand index (ARI), and Fowlkes-Mallows index (FMI); of the second, normalized mutual information (NMI) and adjusted mutual information (AMI); of the third, accuracy. For NMI and AMI, we tested four aggregator functions (min, max, mean, geometric).
Our final choice of measures is motivated by our own findings and by the literature, e.g., by Romano et al. (2016): while the specific values can vary between those measures, they usually rank the methods in the same way per dataset, as substantiated in Figs. 5 and 8. The advantage of ARI and AMI is that they correct for chance: they approach zero for random labeling as the number of objects increases, whereas unadjusted measures approach one. FMI adjusts for chance but is considerably less strict, which reduces its explanatory power. AMI’s min and geometric aggregators are not reliable for particularly small or large values and thus cannot be used for parameterization, so we use the classic max aggregator (the mean aggregator yields very similar results).
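Most of these external measures are available in scikit-learn (Pedregosa et al. 2011). The following is a minimal sketch with made-up label vectors, not the paper’s evaluation code; accuracy additionally requires matching predicted to reference labels (e.g., via the Hungarian algorithm) and is therefore omitted here.

```python
import numpy as np
from sklearn import metrics

# Hypothetical reference labels and predicted cluster labels.
labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 0, 1, 2, 2, 2])

scores = {
    "RI": metrics.rand_score(labels_true, labels_pred),
    "ARI": metrics.adjusted_rand_score(labels_true, labels_pred),
    "FMI": metrics.fowlkes_mallows_score(labels_true, labels_pred),
    # average_method selects the aggregator: "min", "geometric", "arithmetic" (mean), or "max".
    "NMI(max)": metrics.normalized_mutual_info_score(labels_true, labels_pred, average_method="max"),
    "AMI(max)": metrics.adjusted_mutual_info_score(labels_true, labels_pred, average_method="max"),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```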
Appendix D: Synthetic datasets with known label structure (Sect. 4.2): Additional measurements
In the paper, we presented experiments with the four labeled benchmark datasets aggregationB01, R15, moons, and flame. Fig. 8 displays additional results for those datasets: here, we display the adjusted Rand index and the adjusted mutual information for the cases where we consider (i) only incomplete objects and (ii) all objects, including the outliers. When we consider only incomplete objects, the results do not differ much from those obtained when considering all non-outliers. This is because many objects are outliers, as seen on the top x-axis. When we consider all objects, including outliers, the performance of our methods is mostly the same as before; however, the other methods are no longer far behind. The reason is simply that those other methods are able to label the outliers as outliers but are not able to recover the main clusters as soon as missing values are introduced. Thus, their seemingly good results are due to their ability to detect outliers, not their ability to cluster correctly.
Appendix E: Real-life data (Sect. 4.3): Distribution of each measure for each incompleteness type
To perform a significance test, we want to determine the range in which a measure will fall with 95% probability. Thus, we need to know the distribution of each measure (ARI, AMI, accuracy) for each incompleteness type (MCAR, MAR, MNAR):
Steinley et al. (2016) showed that a normal distribution approximates the ARI very closely, especially for the construction of confidence intervals. However, they studied the ARI for a constant number of (reference) clusters, a constant number of objects in each cluster, and under the assumption of a null model for randomness that is constructed from a generalized hypergeometric distribution (Hubert and Arabie 1985). This so-called permutation model generates clusterings uniformly at random via permutations of labels, as described by Romano et al. (2014). They, and others, study the AMI under the same assumptions. Dietterich (1998) describes proportions of misclassifications as normally distributed, which carries over to accuracy.
While those results provide some evidence for using a normal distribution in our case, the assumptions and scenarios are not the same. Thus, to add experimental evidence, we created about 50 different datasets from the real-life data of Sect. 4.3 by randomly deleting 20% of the values for each incompleteness type. This produced nine measure samples (each with 50 measurements) – one for each combination of ARI, AMI, and accuracy with each incompleteness type.
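As an illustration of the MCAR variant of this deletion step (MAR and MNAR require mechanisms that depend on observed or unobserved values and are not sketched here), the following is a minimal sketch; the matrix shape, seed, and function name are illustrative, not the paper’s code.

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_mcar(X, frac=0.2):
    """Return a copy of X with a fraction `frac` of all entries set to NaN,
    chosen completely at random (MCAR)."""
    X_miss = X.astype(float).copy()
    X_miss[rng.random(X.shape) < frac] = np.nan
    return X_miss

X = rng.normal(size=(1000, 8))          # stand-in for the real-life data matrix
X_mcar = delete_mcar(X, frac=0.2)       # one of the ~50 incomplete copies
print(np.isnan(X_mcar).mean())          # close to 0.2
```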
The Cullen and Frey graphs in Fig. 9 display the kurtosis and the square of the skewness for each of the nine distributions of the measure samples as well as for some theoretical distributions. Additionally, for each measure sample, we used bootstrapping to create 50 resampled samples, which are also displayed in the graphs in terms of their distributions’ kurtosis and squared skewness. The graphs show that the measure sample distributions have a kurtosis close to 3 and little to no skewness. Tab. 1 reports exact values. This suggests that the measure distributions can be modeled with normal distributions.
Next, we min-max normalized the measure samples into the range [0, 1] and fitted different theoretical distributions to the normalized samples. The QQ plots in Fig. 10 compare the quantiles of the normalized measure sample distributions with the quantiles of the theoretical distributions. The plots show that the normal and the logistic distributions model the measures well.
Finally, we performed hypothesis tests for different theoretical distributions: for each measure and incompleteness type, Tab. 2 provides the p-values of the Anderson-Darling hypothesis test for different distributions. The Anderson-Darling tests show that each measure for each incompleteness type is modeled best by either the normal or the logistic distribution. Tab. 2 shows such large p-values that we cannot reject the null hypothesis that the measure samples follow a normal or logistic distribution.
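As a rough illustration of these checks (a sketch with a synthetic sample, not the paper’s analysis pipeline): SciPy provides the sample skewness, the kurtosis, and an Anderson-Darling test for the normal and logistic families. Note that scipy.stats.anderson reports critical values rather than p-values, so the p-values in Tab. 2 require an additional step (e.g., a parametric bootstrap).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.8, scale=0.05, size=50)   # stand-in for one measure sample (e.g., ARI under MCAR)

# Moments inspected in the Cullen and Frey graphs (Pearson kurtosis of a normal is 3).
print("skewness:", stats.skew(sample))
print("kurtosis:", stats.kurtosis(sample, fisher=False))

# Anderson-Darling test against the normal and logistic families.
# SciPy returns the statistic and critical values, not p-values.
for dist in ("norm", "logistic"):
    res = stats.anderson(sample, dist=dist)
    print(dist, res.statistic, res.critical_values, res.significance_level)
```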
When we plot the 95% confidence intervals based on those distributions, they are practically indistinguishable for variances between 0 and 1. This is the case here because the AMI, ARI, and accuracy can only take values between 0 and 1.
Appendix F: Real-life data (Sect. 4.3): Statistical significance and model variance
The experiments on the real-life data of Sect. 4.3 in the main document demonstrated that the competing methods are statistically significantly worse than our method. Now, we can additionally formulate (and later reject) the null hypothesis that our methods are not better than the competitors. For that, we determine the standard deviation of the respective measures for the competitors as well, not just for our methods. We calculate the range in which a measure of the respective method will fall with 95% probability. In Fig. 11, we plot these ranges. The lower bars in Fig. 11 denote the threshold for the 0.025 p-value. This is twice as strict as the p-value threshold of 0.05, yet we can reject the null hypothesis – most of the time by a large margin – and claim that our methods are significantly better than the competitors.
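A minimal sketch of such a range check under the normal assumption from Appendix E follows; the sample values, sample sizes, and the interval-separation criterion are illustrative simplifications, not the exact procedure behind Fig. 11.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ours = rng.normal(loc=0.85, scale=0.02, size=50)        # repeated-run scores of our method
competitor = rng.normal(loc=0.70, scale=0.05, size=50)  # repeated-run scores of a competitor

def range_95(x):
    """Two-sided 95% range under a normal assumption: mean +/- 1.96 * std."""
    m, s = x.mean(), x.std(ddof=1)
    z = stats.norm.ppf(0.975)
    return m - z * s, m + z * s

lo_ours, hi_ours = range_95(ours)
lo_comp, hi_comp = range_95(competitor)
print("ours:       [%.3f, %.3f]" % (lo_ours, hi_ours))
print("competitor: [%.3f, %.3f]" % (lo_comp, hi_comp))

# A conservative check: the competitor's upper bound falls below our lower bound.
print("competitor significantly worse:", hi_comp < lo_ours)
```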
There is the conspicuous exception that we cannot reject the null hypothesis for BRMM. However, this is due to BRMM’s high model variance: repeated application of the algorithm yields very different results. Thus, one should reject the conclusion that our method is not better and instead take into account the additional evidence of whether a method manages to keep its model variance low. In fact, our methods are particularly consistent in their results, with excellently low model variance, as demonstrated by the error bars in Fig. 11.
About this article
Cite this article
Leibrandt, R., Günnemann, S. Generalized density attractor clustering for incomplete data. Data Min Knowl Disc 37, 970–1009 (2023). https://doi.org/10.1007/s10618-022-00904-6