Abstract
Transactional data are ubiquitous. Several methods, including frequent itemset mining and co-clustering, have been proposed to analyze transactional databases. In this work, we propose a new research problem to succinctly summarize transactional databases. Solving this problem requires linking the high level structure of the database to a potentially huge number of frequent itemsets. We formulate this problem as a set covering problem using overlapped hyperrectangles (a concept generally regarded as tile according to some existing papers); we then prove that this problem and its several variations are NP-hard, and we further reveal its relationship with the compact representation of a directed bipartite graph. We develop an approximation algorithm Hyper which can achieve a logarithmic approximation ratio in polynomial time. We propose a pruning strategy that can significantly speed up the processing of our algorithm, and we also propose an efficient algorithm Hyper+ to further summarize the set of hyperrectangles by allowing false positive conditions. Additionally, we show that hyperrectangles generated by our algorithms can be properly visualized. A detailed study using both real and synthetic datasets shows the effectiveness and efficiency of our approaches in summarizing transactional databases.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Afrati FN, Gionis A, Mannila H (2004) Approximating a collection of frequent sets. In: KDD, pp 12–19
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499
Agrawal R, Borgida A, Jagadish HV (1989) Efficient management of transitive relationships in large data, knowledge bases. In: SIGMOD Conference, pp 253–262
Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: SIGMOD Conference, pp 207–216
Agrawal A, Mannila H, Srikant R, Toivonen H, Verkamo A (1996) Fast discovery of association rules. Adv Knowl Discov Data Min 307–308
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, pp 94–105
Besson J, Robardet C, De Raedt L, Boulicaut J-F (2006) Mining bi-sets in numerical data. In: KDID, pp 11–23
Boulicaut J-F, Bykowski A, Rigotti C (2003) Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Min Knowl Discov 7(1): 5–22
Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T (2005) Mafia: a maximal frequent itemset algorithm. IEEE Trans Knowl Data Eng 17(11): 1490–1504
Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1): 171–206
Chandola V, Kumar V (2007) Summarization—compressing data into an informative representation. Knowl Inf Syst 12(3): 355–378
Chvátal V. (1979) A greedy heuristic for the set-covering problem. Math Oper Res 4: 233–235
Faloutsos C, Megalooikonomou V (2007) On data mining, compression, kolmogorov complexity. Data Min Knowl Discov 15(1): 3–20
Gao Byron J, Ester M (2006) Turning clusters into patterns: rectangle-based discriminative data description. In: ICDM, pp 200–211
Gao Byron J, Ester M, Cai JY, Schulte O, Xiong H (2007) The minimum consistent subset cover problem, its applications in data mining. In: KDD, pp 310–319
Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Discovery science, pp 278–289
Gionis A, Mannila H, Seppänen JK (2004) Geometric, combinatorial tiles in 0-1 data. In: PKDD, pp 173–184
Han J, Kamber M (2006) Data mining: concepts, techniques, second edition. Morgan Kaufmann, San Francisco
Harper LH (1964) Optimal assignments of numbers to vertices. J Soc Ind Appl Math 12(1): 131–135
Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337): 123–129
Jin R, Xiang Y, Fuhry D, Dragan FF (2008) Overlapping matrix pattern visualization: a hypergraph approach. In: ICDM, pp 313–322 (© IEEE, 2008. doi:10.1109/ICDM.2008.102)
Jin R, Xiang Y, Liu L (2009) Cartesian contour: a concise representation for a collection of frequent sets. In: KDD, pp 417–426
Johnson D, Krishnan S, Chhugani J, Kumar S, Venkatasubramanian S (2004) Compressing large boolean matrices using reordering techniques. In: VLDB, pp 13–23
Kellerer H, Pferschy U, Pisinger D (2004) Knapsack problems. Springer Verlag, New York
Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson TJ (2002) The generalized mdl approach for summarization. In: VLDB, pp 766–777
Li T (2005) A general model for clustering binary data. In: KDD, pp 188–197
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinf 1(1): 24–45
Minoux M (1977) Accelerated greedy algorithms for maximizing submodular set functions. In: the 8th IFIP Conference on Optimization Techniques
Mirkin B (1996) Mathematical classification and clustering. Kluwer Academic Publishers, Boston
Peeters R (2003) The maximum edge biclique problem is np-complete. Discret Appl Math 131(3): 651–654
Pei J, Dong G, Zou W, Han J (2004) Mining condensed frequent-pattern bases. Knowl Inf Syst 6(5): 570–594
Richardson TJ, Urbanke RL (2008) Modern coding theory. Cambridge University Press, Cambridge
Safro I, Ron D, Brandt A (2006) Graph minimum linear arrangement by multilevel weighted edge contractions. J Algorithms 60(1): 24–41
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: SDM
Steinbach M, Tan P-N, Kumar V (2004) Support envelopes: a technique for exploring the structure of association patterns. In: KDD, pp 296–305
van Leeuwen M, Vreeken J, Siebes A (2006) Compression picks item sets that matter. In: PKDD, pp 585–592
Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: KDD, pp 765–774
Wang J, Karypis G (2006) On efficiently summarizing categorical databases. Knowl Inf Syst 9(1): 19–37
Xiang Y, Jin R, Fuhry D, Dragan FF (2008) Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. In: KDD, pp 758–766 (© ACM, 2008. http://doi.acm.org/10.1145/1401890.1401981)
Xin D, Han J, Yan X, Cheng H (2005) Mining compressed frequent-pattern sets. In: VLDB, pp 709–720
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Bart Goethals.
A preliminary version of this article appeared in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Xiang et al. 2008). This submission has substantially extended the previous paper and contains new and major-value added contribution in comparison with the conference publication.
Rights and permissions
About this article
Cite this article
Xiang, Y., Jin, R., Fuhry, D. et al. Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Disc 23, 215–251 (2011). https://doi.org/10.1007/s10618-010-0203-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-010-0203-9