Pattern Discovery in Colored Strings

Pattern Discovery in Colored Strings

Authors Zsuzsanna Lipták , Simon J. Puglisi , Massimiliano Rossi



PDF
Thumbnail PDF

File

LIPIcs.SEA.2020.12.pdf
  • Filesize: 2.58 MB
  • 14 pages

Document Identifiers

Author Details

Zsuzsanna Lipták
  • Department of Computer Science, University of Verona, Italy
Simon J. Puglisi
  • Department of Computer Science, University of Helsinki, Finland
Massimiliano Rossi
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Acknowledgements

We thank Johannes Fischer, Travis Gagie, and Ferdinando Cicalese for interesting discussions, and Alessandro Danese for providing an updated data set of traces.

Cite As Get BibTex

Zsuzsanna Lipták, Simon J. Puglisi, and Massimiliano Rossi. Pattern Discovery in Colored Strings. In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 12:1-12:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.SEA.2020.12

Abstract

We consider the problem of identifying patterns of interest in colored strings. A colored string is a string in which each position is colored with one of a finite set of colors. Our task is to find substrings that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically infer properties of the embedded system from the analysis of its simulation traces. We show that the number of interesting patterns is upper-bounded by 𝒪(n²) where n is the length of the string. We introduce a baseline algorithm with 𝒪(n²) running time which identifies all interesting patterns for all colors in the string satisfying certain minimality conditions. When one is interested in patterns related to only one color, we provide an algorithm that identifies patterns in 𝒪(n²log n) time, but is faster than the first algorithm in practice, both on simulated and on real-world patterns.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • property testing
  • suffix tree
  • pattern mining

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. 20th International Conference on Very Large Data Bases (VLDB), volume 1215, pages 487-499. Morgan Kaufmann, 1994. Google Scholar
  2. Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proc. Eleventh International Conference on Data Engineering (ICDE), volume 95, pages 3-14. IEEE Computer Society, 1995. Google Scholar
  3. Franc Brglez, David Bryan, and Krzysztof Kozminski. Combinational profiles of sequential benchmark circuits. In Proc. 1989 IEEE international symposium on circuits and systems (ISACS), volume 3, pages 1929-1934. IEEE, 1989. Google Scholar
  4. Chung-Wen Cho, Ying Zheng, Yi-Hung Wu, and Arbee LP Chen. A tree-based approach for event prediction using episode rules over event streams. In Proc. International Conference on Database and Expert Systems Applications (DEXA 2008), volume 5181 of LNCS, pages 225-240. Springer, 2008. Google Scholar
  5. David Clark. Compact PAT trees. PhD thesis, University of Waterloo, 1997. Google Scholar
  6. Fulvio Corno, Matteo Sonza Reorda, and Giovanni Squillero. Rt-level itc'99 benchmarks and first atpg results. IEEE Design & Test of computers, 17(3):44-53, 2000. Google Scholar
  7. Alessandro Danese, Nicolò Dalla Riva, and Graziano Pravadelli. A-team: Automatic template-based assertion miner. In Proc. 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6. IEEE, 2017. Google Scholar
  8. Alessandro Danese, Tara Ghasempouri, and Graziano Pravadelli. Automatic extraction of assertions from execution traces of behavioural models. In Proc. 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 67-72. IEEE, 2015. Google Scholar
  9. Jasbir Dhaliwal, Simon J Puglisi, and Andrew Turpin. Practical efficient string mining. IEEE Transactions on Knowledge and Data Engineering, 24(4):735-744, 2010. Google Scholar
  10. Lina Fahed, Armelle Brun, and Anne Boyer. DEER: Distant and Essential Episode Rules for early prediction. Expert Systems with Applications, 93:283-298, 2018. Google Scholar
  11. Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465-492, 2011. Google Scholar
  12. Johannes Fischer, Volker Heun, and Stefan Kramer. Fast frequent string mining using suffix arrays. In Proc. Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 609-612. IEEE, 2005. Google Scholar
  13. Johannes Fischer, Volker Heun, and Stefan Kramer. Optimal string mining under frequency constraints. In Proc. European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2006), volume 4213 of LNCS, pages 139-150. Springer, 2006. Google Scholar
  14. Johannes Fischer, Veli Mäkinen, and Niki Välimäki. Space efficient string mining under frequency constraints. In Proc. Eighth IEEE International Conference on Data Mining (ICDM 2008), pages 193-202. IEEE, 2008. Google Scholar
  15. Harry D Foster, Adam C Krolnik, and David J Lacey. Assertion-based design. Springer Science & Business Media, 2004. Google Scholar
  16. Philippe Fournier-Viger, Jerry Chun-Wei Lin, Rage Uday Kiran, Yun Sing Koh, and Rincy Thomas. A survey of sequential pattern mining. Data Science and Pattern Recognition, 1(1):54-77, 2017. Google Scholar
  17. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From Theory to Practice: Plug and Play with Succinct Data Structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), volume 8504 of LNCS, pages 326-337. Springer, 2014. Google Scholar
  18. Dan Gusfield. Algorithms on Strings, Trees, and Sequences : Computer Science and Computational Biology. Cambridge University Press, Cambridge, United Kingdom, 1997. Google Scholar
  19. Koji Iwanuma, Ryuichi Ishihara, Yoh Takano, and Hidetomo Nabeshima. Extracting frequent subsequences from a single long data sequence a novel anti-monotonic measure and a simple on-line algorithm. In Proc. Fifth IEEE International Conference on Data Mining (ICDM 2005), pages 186-195. IEEE, 2005. Google Scholar
  20. Xiaonan Ji, James Bailey, and Guozhu Dong. Mining minimal distinguishing subsequence patterns with gap constraints. Knowledge and Information Systems, 11(3):259-286, 2007. Google Scholar
  21. Srivatsan Laxman, Vikram Tankasali, and Ryen W White. Stream prediction using a generative model based on frequent episodes in event sequences. In Proc. 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 2008), pages 453-461. ACM, 2008. Google Scholar
  22. Lingyi Liu, David Sheridan, Viraj Athavale, and Shobha Vasudevan. Automatic generation of assertions from system level design using data mining. In Proc. Ninth ACM/IEEE International Conference on Formal Methods and Models for Codesign, pages 191-200. IEEE Computer Society, 2011. Google Scholar
  23. Nizar R Mabroukeh and Christie I Ezeife. A taxonomy of sequential pattern mining algorithms. ACM Computing Surveys (CSUR), 43(1):3, 2010. Google Scholar
  24. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design. CUP, 2015. Google Scholar
  25. Heikki Mannila, Hannu Toivonen, and A Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data mining and knowledge discovery, 1(3):259-289, 1997. Google Scholar
  26. OpenCores. Available at https://opencores.org/. Accessed 05-03-2019. URL: https://opencores.org/.
  27. Tinghai Pang, Lei Duan, Jesse Li-Ling, and Guozhu Dong. Mining Similarity-Aware Distinguishing Sequential Patterns from Biomedical Sequences. In Proc. Second International Conference on Data Science in Cyberspace (DSC 2017), pages 43-52. IEEE, 2017. Google Scholar
  28. Jian Pei, Jiawei Han, and Wei Wang. Constraint-based sequential pattern mining: the pattern-growth methods. Journal of Intelligent Information Systems, 28(2):133-160, 2007. Google Scholar
  29. Robert Sedgewick and Kevin Wayne. Algorithms. Addison-Wesley Professional, 2011. Google Scholar
  30. Bill Smyth. Computing Patterns in Strings. Pearson Addison-Wesley, Essex, England, 2003. Google Scholar
  31. Niko Välimäki and Simon J Puglisi. Distributed string mining for high-throughput sequencing data. In Proc. 12th International Workshop on Algorithms in Bioinformatics (WABI 2012), volume 7534 of LNCS, pages 441-452. Springer, 2012. Google Scholar
  32. Shobha Vasudevan, David Sheridan, Sanjay Patel, David Tcheng, Bill Tuohy, and Daniel Johnson. Goldmine: Automatic assertion generation using data mining and static analysis. In Proc. 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), pages 626-629. IEEE, 2010. Google Scholar
  33. Xianming Wang, Lei Duan, Guozhu Dong, Zhonghua Yu, and Changjie Tang. Efficient mining of density-aware distinguishing sequential patterns with gap constraints. In International Conference on Database Systems for Advanced Applications (DASFAA 2014), volume 8421 of LNCS, pages 372-387. Springer, 2014. Google Scholar
  34. Xindong Wu, Xingquan Zhu, Yu He, and Abdullah N Arslan. PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine, 43(5):481-492, 2013. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail