Probabilistic k-Median Clustering in Data Streams

Lammersen, Christiane; Schmidt, Melanie; Sohler, Christian

doi:10.1007/s00224-014-9539-7

Probabilistic k-Median Clustering in Data Streams

Published: 12 April 2014

Volume 56, pages 251–290, (2015)
Cite this article

Theory of Computing Systems Aims and scope Submit manuscript

Christiane Lammersen¹,
Melanie Schmidt² &
Christian Sohler²

335 Accesses
5 Citations
Explore all metrics

Abstract

The focus of our work is introducing and constructing probabilistic coresets. A probabilistic coreset can contain probabilistic points, and the number of these points should be polylogarithmic in the input size. However, the overall storage size is also influenced by representation size of the propability distribution of each point. So, our first observation is that the size of probabilistic coresets shall be restricted in the number of points and in the representation size of the points. We propose the first (k, ε)-coreset constructions for the probabilistic k-median problem in the metric and Euclidean case. The coresets are of size poly(ε ⁻¹, k, log(W/(p _min⋅δ))), where W is the expected total weight of the weighted probabilistic input points when all weights are scaled to be at least one, p _min is the probability of a point to be realized at a certain location, and δ is the error probability of the construction. Our coreset for the Euclidean problem can be maintained in data streams.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Institutional subscriptions

Algorithms for k-median Clustering over Distributed Streams

Accurate Low-Space Approximation of Metric k-Median for Insertion-Only Streams

Improvements on approximation algorithms for clustering probabilistic data

Article 30 August 2021

Notes

¹ And, as remarked in the contribution paragraph, algorithms for the metric case cannot be used due to the difference between finitely many center candidates and center candidates from the (infinite) Euclidean space.

References

Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. ACM 51(4), 606–635 (2004)
Article MATH MathSciNet Google Scholar
Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 21(5), 609–623 (2009)
Article Google Scholar
Arora, S.: Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. ACM 45(5), 753–782 (1998)
Article MATH MathSciNet Google Scholar
Arora, S., Raghavan, P., Rao, S.: Approximation schemes for Euclidean k-medians and related problems. In: Proceedings 30th STOC, pp. 106–113 (1998)
Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., Pandit, V.: Local search heuristics for k-median and facility location problems. SIAM J. Comput. 33(3), 544–562 (2004)
Article MATH MathSciNet Google Scholar
Bȧdoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings 34th STOC, pp. 250–257 (2002)
Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithm. 1(4), 301–358 (1980)
Article MATH MathSciNet Google Scholar
Charikar, M., Guha, S.: Improved combinatorial algorithms for facility location problems. SIAM J. Comput 34(4), 803–824 (2005)
Article MATH MathSciNet Google Scholar
Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65(1), 129–149 (2002)
Article MATH MathSciNet Google Scholar
Chau, M., Cheng, R., Kao, B., Ng, J.: Uncertain data mining: an example in clustering location data. In: Proceedings 10th PAKDD, pp. 199–204 (2006)
Chen, K.: On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput. 39(3), 923–947 (2009)
Article MATH MathSciNet Google Scholar
Cormode, G., McGregor, A.: Approximation algorithms for clustering uncertain data. In: Proceedings 27th PODS, pp. 191–200 (2008)
Edmonds, J., Karp, R.M.: Theoretical improvements in algorithmic efficiency for network flow problems. J. ACM 19(2), 248–264 (1972)
Article MATH Google Scholar
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings 2nd ACM SIGKDD, pp. 226–231 (1996)
Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings 23rd SoCG, pp. 11–18 (2007)
Forgey, E.: Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics 768(21) (1965)
Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proceedings 37th STOC, pp. 209–217 (2005)
Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)
Article Google Scholar
Guha, S., Munagala, K.: Exceeding expectations and clustering uncertain data. In: Proceedings 28th PODS, pp. 269–278 (2009)
Günnemann, S., Kremer, H., Seidl, T.: Subspace clustering for uncertain data. In: Proceedings SIAM International Conference on Data Mining, pp. 385–396 (2010)
Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom. 37(1), 3–19 (2007)
Article MATH MathSciNet Google Scholar
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings 36th STOC, pp. 291–300 (2004)
Haussler, D.: Decision theoretic generalizations of the pac model for neural net and other learning applications. Inf. Comput. 100(1), 78–150 (1992)
Article MATH MathSciNet Google Scholar
Indyk, P.: Sublinear time algorithms for metric space problems. In: Proceedings 31st STOC, pp. 428–434 (1999)
Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location problems. In: Proceedings 34th STOC, pp. 731–740 (2002)
Jain, K., Vazirani, V.V.: Primal-dual approximation algorithms for metric facility location and k-median problems. In: Proceedings 40th FOCS, pp. 2–13 (1999)
Kolliopoulos, S.G., Rao, S.: A nearly linear-time approximation scheme for the Euclidean k-median problem. SIAM J. Comput. 37(3), 757–782 (2007)
Article MATH MathSciNet Google Scholar
Kriegel, H.P., Pfeifle, M.: Density-based clustering of uncertain data. In: Proceedings 11th ACM SIGKDD, pp. 672–677 (2005)
Kriegel, H.P., Pfeifle, M.: Hierarchical density-based clustering of uncertain data. In: IEEE International Conference on Data Mining (ICDM), pp. 689–692 (2005)
Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2) (2010)
Lloyd, S.P.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–136 (1982)
Google Scholar
Mettu, R.R., Plaxton, C.G.: Optimal time bounds for approximate clustering. Mach. Learn. 56(1-3), 35–60 (2004)
Article MATH Google Scholar
Ngai, W.K., Kao, B., Chui, C.K., Cheng, R., Chau, M., Yip, K.Y.: Efficient clustering of uncertain data. In: Proceedings 6th IEEE ICDM, pp. 436–445 (2006)
Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: Proceedings 6th ICCV, pp. 59–66 (1998)
Xu, H., Li, G.: Density-based probabilistic clustering of uncertain data. In: Proceedings 1st CSSE, vol. 4, pp. 474–477 (2008)

Download references

Acknowledgments

The authors thank the anonymous referees for their detailed and useful comments, especially for suggesting to try to extend Theorem 2 to the non-uniform case and for pointing out that the proof can be shortened.

Author information

Authors and Affiliations

Simon Fraser University, Burnaby, BC, Canada
Christiane Lammersen
TU Dortmund University, Dortmund, Germany
Melanie Schmidt & Christian Sohler

Authors

Christiane Lammersen
View author publications
You can also search for this author in PubMed Google Scholar
Melanie Schmidt
View author publications
You can also search for this author in PubMed Google Scholar
Christian Sohler
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Melanie Schmidt.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lammersen, C., Schmidt, M. & Sohler, C. Probabilistic k-Median Clustering in Data Streams. Theory Comput Syst 56, 251–290 (2015). https://doi.org/10.1007/s00224-014-9539-7

Download citation

Published: 12 April 2014
Issue Date: January 2015
DOI: https://doi.org/10.1007/s00224-014-9539-7

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Institutional subscriptions

Probabilistic k-Median Clustering in Data Streams

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Algorithms for k-median Clustering over Distributed Streams

Accurate Low-Space Approximation of Metric k-Median for Insertion-Only Streams

Improvements on approximation algorithms for clustering probabilistic data

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Probabilistic k-Median Clustering in Data Streams

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Algorithms for k-median Clustering over Distributed Streams

Accurate Low-Space Approximation of Metric k-Median for Insertion-Only Streams

Improvements on approximation algorithms for clustering probabilistic data

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation