Abstract
This paper presents a PhD research project focused on investigating the use of Kolmogorov complexity approximations as descriptors for various data types, with the aim of addressing inversion problems. The research explores the application of these approximations across different domains while considering the relationship between algorithmic and probabilistic complexities. The study starts with genomic data analysis, where specialized data compressors are employed to improve taxonomic identification, classification, and organization. The research then extends to analysing artistic paintings, utilizing information-based measures to attribute authorship, categorize styles, and describe the content. Additionally, the research examines Turing Machine-generated data, providing insights into the relationship between algorithmic and probabilistic complexities. A method for increasing probabilistic complexity without affecting algorithmic complexity is also proposed. Lastly, a methodology for identifying programs capable of generating outputs approximating given input strings is introduced, offering potential solutions to inversion problems. The paper highlights this research's diverse applications and findings, contributing to understanding the relationship between algorithmic and probabilistic complexities in data analysis.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Hutter, M.: Algorithmic information theory. Scholarpedia 2, 2519 (2007)
Nalbantoglu, Ö., Russell, D., Sayood, K.: Data compression concepts and algorithms and their applications to bioinformatics. Entropy 12(1), 34–52 (2009). https://doi.org/10.3390/e12010034
Silva, M., Pratas, D., Pinho, A.J.: Efficient DNA sequence compression with neural networks. GigaScience 9(11), giaa119 (2020). https://doi.org/10.1093/gigascience/giaa119
MacKay, D.J.C., Mac Kay, D.J.C.: Information theory, inference and learning algorithms. Cambridge University Press, Cambridge (2003)
Cohen, A.R., Bjornsson, C.S., Temple, S., Banker, G., Roysam, B.: Automatic summarization of changes in biological image sequences using algorithmic information theory. IEEE Trans. Pattern Anal. Mach. Intell. 31(8), 1386–1403 (2009). https://doi.org/10.1109/TPAMI.2008.162
Maurer, U.: Information-theoretic cryptography. In: Wiener, M. (ed.) Advances in Cryptology — CRYPTO’ 99. LNCS, vol. 1666, pp. 47–65. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48405-1_4
Yeboah-Ofori, A., Agbodza, C.K., Opoku-Boateng, F.A., Darvishi, I., Sbai, F.: Applied cryptography in network systems security for cyberattack prevention. In: 2021 International Conference on Cyber Security and Internet of Things (ICSIoT), pp. 43–48 (2021)
Tenreiro Machado, J., Lopes, A.M.: Artistic painting: a fractional calculus perspective. Appl. Math. Model. 65, 614–626 (2019)
Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and its Applications. TCS, Springer, New York (2008). https://doi.org/10.1007/978-0-387-49820-1
Voss, R.F.: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68(25), 3805–3808 (1992). https://doi.org/10.1103/PhysRevLett.68.3805
Ciliberti, S., Martin, O.C., Wagner, A.: Innovation and robustness in complex regulatory gene networks. Proc. Natl. Acad. Sci. 104(34), 13591–13596 (2007). https://doi.org/10.1073/pnas.0705396104
Adami, C., Ofria, C., Collier, T.C.: Evolution of biological complexity. Proc. Natl. Acad. Sci. 97(9), 4463–4468 (2000). https://doi.org/10.1073/pnas.97.9.4463
Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
RossQuinlan, J., Rivest, R.L.: Inferring decision trees using the minimum description length principle. Inf. Comput. 80(3), 227–248 (1989). https://doi.org/10.1016/0890-5401(89)90010-2
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1(2), 224–227 (1979). https://doi.org/10.1109/TPAMI.1979.4766909
Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: how to generate strong keys from biometrics and other noisy data. In: Cachin, C., Camenisch, J.L. (eds.) Advances in Cryptology - EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24676-3_31
Chaitin, G.J.: Algorithmic information theory. IBM J. Res. Dev. 21(4), 350–359 (1977)
Bruce, S.: Applied cryptography: Protocols, Algorthms, and Source Code in c.-2nd (1996)
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Probl. Inf. Trans. 1(1), 1–7 (1965)
Chaitin, G.J.: On the length of programs for computing finite binary sequences: statistical considerations. J. ACM (JACM). 16(1), 145–159 (1969)
Calude, C.S.: Information and Randomness: An Algorithmic Perspective. Springer Science & Business Media, Heidelberg (2002). https://doi.org/10.1007/978-3-662-03049-3
Sayood, K.: Introduction. In: Introduction to data compression, pp. 1–10. Elsevier (2018). https://doi.org/10.1016/B978-0-12-809474-7.00001-X
Moffat, A.: Word-based text compression. Softw. Pract. Exp. 19(2), 185–198 (1989). https://doi.org/10.1002/spe.4380190207
Knoll, B., de Freitas, N.: A machine learning perspective on predictive coding with PAQ8. In: 2012 Data Compression Conference, pp. 377–386. IEEE (2012)
Carrasco, R.C., Oncina, J.: Learning stochastic regular grammars by means of a state merging method. In: Carrasco, R.C., Oncina, J. (eds.) Grammatical Inference and Applications. LNCS, vol. 862, pp. 139–152. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-58473-0_144
Silva, J.M., Pratas, D., Caetano, T., Matos, S.: The complexity landscape of viral genomes. GigaScience 11, 1–16 (2022). https://doi.org/10.1093/gigascience/giac079
Silva, J.M., Pratas, D., Antunes, R., Matos, S., Pinho, A.J.: Automatic analysis of artistic paintings using information-based measures. Pattern Recogn. 114, 107864 (2021). https://doi.org/10.1016/j.patcog.2021.107864
Wallace, C.S.: Minimum message length and kolmogorov complexity. Comput. J. 42(4), 270–283 (1999). https://doi.org/10.1093/comjnl/42.4.270
Hutter, M.: Universal algorithmic intelligence: a mathematical top→down approach. In: Goertzel, B., Pennachin, C. (eds.) Artificial general intelligence, pp. 227–290. Springer Berlin Heidelberg, Heidelberg (2007). https://doi.org/10.1007/978-3-540-68677-4_8
Silva, J.M., Almeida, J.R.: The value of compression for taxonomic identification. In: 2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS), pp. 276–281. IEEE (2022)
Zenil, H., Delahaye, J.-P.: An algorithmic information theoretic approach to the behaviour of financial markets. J. Econ. Surv. 25(3), 431–463 (2011)
Silva, J.M., Pratas, D., Caetano, T., Matos, S.: Feature-based classification of archaeal sequences using compression-based methods. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds.) Pattern Recognition and Image Analysis, pp. 309–320. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-04881-4_25
jorgeMFS. Complexity ANalysis VirAl Sequences (C.A.N.V.A.S.) Repository (2021). https://github.com/jorgeMFS/canvas
jorgeMFS. Classification and identification of Archaea (ARCHAEA2) Repository (2021). https://github.com/jorgeMFS/Archaea2
bioinformatics ua. COMPressor tAxonomic ClassificaTion (C.O.M.P.A.C.T.) Repository (2021). https://github.com/bioinformatics-ua/COMPACT
CANVAS Website. CANVAS Website (2021). https://asilab.github.io/canvas/
asilab. Measuring probabilistic-algorithmic information of artistic paintings (PANTHER) Repository (2021). https://github.com/asilab/panther
PANTHER Website. PANTHER Website (2021). http://panther.web.ua.pt/
Silva, J.M., Pinho, E., Matos, S., Pratas, D.: Statistical complexity analysis of turing machine tapes with fixed algorithmic complexity using the best-order Markov model. Entropy. 22(1), 105 (2020)
asilab. TMCompression Repository (2021). https://github.com/asilab/TMCompression
jorgeMFS.Turing Machine Recreator (TMRecreator) (2021). https://github.com/jorgeMFS/TMRecreator
jorgeMFS. SPTTM (2021). https://github.com/jorgeMFS/spttm
bioinformatics ua. TM Neural Finder (2021). https://github.com/bioinformatics-ua/TM-Neural-Finder
Acknowledgments
This work was partially funded by National Funds through the FCT - Foundation for Science and Technology, in the context of the project UIDB/00127/2020. Furthermore, this work has received funding from the EC under grant agreement 101081813, Genomic Data Infrastructure. J.M.S. acknowledges the FCT grant SFRH/BD/141851/2018. National funds funded D.P. through FCT – Fundação para a Ciência e a Tecnologia, I.P., under the Scientific Employment Stimulus - Institutional Call - reference CEECINST/00026/2018.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, J.M., Pratas, D., Matos, S. (2023). Exploring Kolmogorov Complexity Approximations for Data Analysis: Insights and Applications. In: Camarinha-Matos, L.M., Ferrada, F. (eds) Technological Innovation for Connected Cyber Physical Spaces. DoCEIS 2023. IFIP Advances in Information and Communication Technology, vol 678. Springer, Cham. https://doi.org/10.1007/978-3-031-36007-7_12
Download citation
DOI: https://doi.org/10.1007/978-3-031-36007-7_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36006-0
Online ISBN: 978-3-031-36007-7
eBook Packages: Computer ScienceComputer Science (R0)