Abstract
Neighborhood is a central concept in data mining, and a bunch of definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects × attributes binary table in order to establish which inter-attribute relations are fortuitous, and which ones are meaningful, without requiring any pre-defined statistical model, while taking into account the empirical distributions. It ensues a robust and statistically validated graph. We present a full-scale experiment on one of the public access Reuters test corpus. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative “counter-relations” between words, i.e. words which “steer clear” one from another. We characterize in the same way the counter-relation graph. At last we generate the couple of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Bavaud, F.: Modèles et données: une introduction à la Statistique uni-, bi- et trivariée. L’Harmattan (1998)
Benzécri, J.: Construction d’une classification ascendante hiérarchique par la recherche en chaîne des voisins réciproques. Les Cahiers de l’Analyse des Données, pp. 208–218 (1982)
Cadot, M.: A Simulation Technique for extracting Robust Association Rules. In: CSDA 2005 (2005)
Cadot, M.: Extraire et valider les relations complexes en sciences humaines: statistiques, motifs et règles d’association. Ph.D. thesis, Université de Franche-Comté (2006)
Cadot, M., Napoli, A.: Une optimisation de l’extraction d’un jeu de règles s’appuyant sur les caractéristiques statistiques des données. In: RSTI, série RIA-ECA, pp. 631–656 (2003)
Cobb, G., Chen, Y.: An application of Markov chain Monte Carlo to community ecology. The American Mathematical Monthly, pp. 264–288 (2003)
Connor, E., Simberloff, D.: The assembly of species communities: Chance or competition? Ecology, 1132–1140 (1979)
Delaunay, B.: Sur la sphère vide. Izvestia Akademii Nauk SSSR, pp. 793–800 (1934)
Droesbeke, J., Finne, J.: Inférence non-paramétrique - Les statistiques de rangs. Editions de l’Université de Bruxelles (1996)
Fisher, R.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 179–188 (1936)
Gabriel, K., Sokal, R.: A new statistical approach to geographic variation analysis. Systematic Zoology, 259–270 (1969)
Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data (2007)
Goodman, J., O’Rourke, J.: Handbook of Discrete and Computational Geometry. CRC Press, Boca Raton (2004)
Jensen, D., Cohen, P.: Multiple Comparisons in Induction Algorithms. Machine Learning, 309–338 (2000)
Lelu, A.: Analyse en composantes locales et graphes de similarité entre textes. In: Purnelle, G. (ed.) JADT 2004 (2004)
Lelu, A., Cuxac, P., Cadot, M.: Document stream clustering: an optimal and fine-grained incremental approach. In: COLLNET 2006 / International Workshop on Webometrics, Informetrics and Scientometrics (2006)
Lerman, I.-C., Peter, P.: Indice probabiliste de vraisemblance du lien entre objets quelconques. Analyse comparative entre deux approches. Revue de Statistique Appliquée, pp. 5–35 (2003)
Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 361–397 (2004)
Manly, B.: Randomization, Bootstrap and Monte Carlo methods in Biology. Chapman and Hall/CRC (1997)
Morineau, A., Nakache, J.-P., Krzyzanowski, C.: Le modèle log-linéaire et ses applications. Cisia-Ceresta (1996)
Newman, M.: Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 323–351 (2005)
Pons, P., Latapy, M.: Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 191–218 (2006)
Press, J.: The role of Bayesian and frequentist multivariate modeling in statistical Data Mining. Statistical Data Mining and Knowledge Discovery, 1–14 (2004)
Ryser, H.: Recent Advances in Matrix Theory, Madison (1964)
Scuturici, M., Clech, J., Scuturici, V.-M., Zighed, D.: Topological Representation Model for Image Database Query. Journal of Experimental and Theoretical Artificial Intelligence, 145–160 (2005)
Snijders, T.: Enumeration and simulation methods for 0-1 matrices with given marginals. Psychometrika, 397–417 (2004)
Toussaint, G.: The relative neighbourhood graph of a finite planar set. Pattern Recognition, 261–268 (1980)
Watts, D., Strogatz, S.: Collective dynamics of ’small-world’ networks. Nature, 95–118 (1998)
Yates, F.: Contingency table involving small numbers and the Chi2 test. Journal of the Royal statistical society (supplement), 217–235 (1934)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Lelu, A., Cadot, M. (2010). Statistically Valid Links and Anti-links BetweenWords and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_18
Download citation
DOI: https://doi.org/10.1007/978-3-642-00580-0_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00579-4
Online ISBN: 978-3-642-00580-0
eBook Packages: EngineeringEngineering (R0)