Abstract
Sequential pattern mining has been introduced by Agrawal and Srikant (in: Proceedings of ICDE’95, pp 3–14, 1995) 2 decades ago, and its usefulness has been widely proved for different mining tasks and application fields such as web usage mining, text mining, bioinformatics, fraud detection and so on. Since 1995, despite numerous optimization proposals, sequential pattern mining remains a costly task that often generates too many patterns. This limit, also reached by itemset mining, was circumvented by pattern sampling. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns that ensures a good interactivity while providing strong statistical guarantees due to its random nature. Curiously, such an approach investigated for different kinds of patterns including itemsets and subgraphs has not yet been applied to sequential patterns. In this paper, we propose the first method dedicated to sequential pattern sampling. In addition to address sequential data, the originality of our approach is to introduce a class of interestingness measures relying on the norm of the sequence, named norm-based utilities. In particular, it enables to add constraints on the norm of sampled patterns to control the length of the drawn patterns and to avoid the pitfall of the “long tail” where the rarest patterns flood the user. We propose a new two-step random procedure integrating this class of measures, named \({\textsc {NUSSampling}}\) that randomly draws sequential patterns according to frequency weighted by a norm-based utility. We demonstrate that this method performs an exact sampling according to the underlying measure. Moreover, despite the use of rejection sampling, the experimental study shows that \({\textsc {NUSSampling}}\) remains efficient. We especially focus on the interest of norm constraints and exponential decays that help to draw general patterns of the “head”. We also illustrate how to benefit from these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state-of-the-art proposals showing the interest of sequential pattern sampling with norm-based utility.











Similar content being viewed by others
Notes
By convention, we consider that \(\left( {\begin{array}{c}n\\ p\end{array}}\right) = 0\) if \(p > n\).
The datasets bms and sign are available at http://www.philippe-fournier-viger.com/spmf and other ones at http://www.mybytes.de/#data.
The dataset reuters is available at ana.cachopo.org/datasets-for-single-label-text-categorization and other ones, at www.mybytes.de/#data.
We recall that for every sequence s, we define s[K] as: \(s[K]= \cap _{k \in K} s[k].\)
References
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of ICDE’95, pp 3–14
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of EDBT vol 96, pp 3–17
Zaki MJ (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60
Pei J, Han J, Mortazavi-Asl B, Pinto H (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of ICDE 2001, pp 215–224
Garofalakis MN, Rastogi R, Shim K (1999) Spirit: sequential pattern mining with regular expression constraints. VLDB 99:7–10
Pei J, Han J, Lakshmanan LV (2001) Mining frequent itemsets with convertible constraints. In: Proceedings of ICDE 2001. IEEE, pp 433–442
Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: Proceedings of ICDE 2004. IEEE, pp 79–90
Yan X, Han J, Afshar R (2003) Clospan: mining: closed sequential patterns in large datasets. In: Proceedings of SDM 2003. SIAM, pp 166–177
Bosc G, Boulicaut J-F, Raïssi C, Kaytoue M (2016) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min Knowl Discov 32:1–47
Al Hasan M, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB 2(1):730–741
Bhuiyan M, Mukhopadhyay S, Hasan MA (2012) Interactive pattern mining on hidden data: a sampling-based solution. In: Proceedings of CIKM 2012, pp 95–104
Giacometti A, Soulet A (2017) Interactive pattern sampling for characterizing unlabeled data. In: Proceedings of IDA 2017, pp 99–111
Dzyuba V, Mv L, Nijssen S, De Raedt L (2014) Interactive learning of pattern rankings. Int J Artif Intell Tools 23(06):32
Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3–4):119–130
Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 534–546
Anderson C (2004) The long tail. Wired Mag 12(10):170–177
Diop L, Diop CT, Giacometti A, Li D, Soulet A (2018) Sequential pattern sampling with norm constraints. In: IEEE international conference on data mining (ICDM)
Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of SIGKDD 2011, pp 582–590
van Leeuwen M (2014) Interactive data exploration using pattern mining. In: Andreas Holzinger, Igor Jurisica (eds)Interactive knowledge discovery and data mining in biomedical informatics. Springer, Berlin, pp 169–182
Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73
Hu Q, Imielinski T (2017) Alpine: progressive itemset mining with definite guarantees. In: Proceedings of the 2017 SIAM international conference on data mining. SIAM, pp 63–71
He Z, Xu X, Huang ZJ, Deng S (2005) Fp-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118
Toivonen H et al (1996) Sampling large databases for association rules. Proc VLDB 96(96):134–145
Luo C, Chung SM (2004) A scalable algorithm for mining maximal frequent sequences using sampling. In: Proceedings of ICTAI 2004. IEEE, pp 156–165
Raissi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Proceedings of ICDM 2007, pp 631–636
Bendimerad AA, Plantevit M, Robardet C (2016) Unsupervised exceptional attributed sub-graph mining in urban data. In: Proceedings of ICDM 2016. IEEE, pp 21–30
Giacometti A, Soulet A (2018) Dense neighborhood pattern sampling in numerical data. In: Proceedings of SDM 2018, pp 756–764
Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of SDM 2010. SIAM, pp 177–188
Li G, Zaki MJ (2012) Sampling minimal frequent boolean (DNF) patterns. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 87–95
Moens S, Goethals B (2013) Randomly sampling maximal itemsets. In: Proceedings of IDEA workshop 2013, pp 79–86
Dzyuba V, van Leeuwen M, De Raedt L (2017) Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Discov 31(5):1266–1293
Gueguen M, Sentieys O, Termier A (2019) Accelerating itemset sampling using satisfiability constraints on FPGA. In: IEEE/ACM design, automation and test in Europe (DATE)
Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 69–77
Moens S, Boley M (2014) Instant exceptional model mining using weighted controlled pattern sampling. In: International symposium on intelligent data analysis. Springer, Berlin, pp 203–214
Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Min Knowl Discov 29(3):732–764
Egho E, Gay D, Boullé M, Voisine N, Clérot F (2017) A user parameter-free approach for mining robust sequential classification rules. Knowl Inf Syst 52(1):53–81
Fournier-Viger P, Gomariz A, Gueniche T, Mwamikazi E, Thomas R (2013) Tks: efficient mining of top-k sequential patterns. In: International conference on advanced data mining and applications. Springer, Berlin, pp 109–120
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30. http://dl.acm.org/citation.cfm?id=1248547.1248548
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, Berlin 1–17
Arimura H, Uno T (2009) Polynomial-delay and polynomial-space algorithms for mining closed sequences, graphs, pictures in accessible set systems. In: Proceedings of the 2009 SIAM international conference on data mining. SIAM, pp 1088–1099
Acknowledgements
This work has been partly supported by the CEA-MITIC (Centre d’Excellence Africain en Mathématiques, Informatique et TIC).
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Diop, L., Diop, C.T., Giacometti, A. et al. Sequential pattern sampling with norm-based utility. Knowl Inf Syst 62, 2029–2065 (2020). https://doi.org/10.1007/s10115-019-01417-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-019-01417-3