{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,19]],"date-time":"2025-03-19T15:28:42Z","timestamp":1742398122421,"version":"3.38.0"},"reference-count":57,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2024,10,4]],"date-time":"2024-10-04T00:00:00Z","timestamp":1728000000000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["SW"],"published-print":{"date-parts":[[2024,10,4]]},"abstract":"Artificial intelligence systems are not simply built on a single dataset or trained model. Instead, they are made by complex data science workflows involving multiple datasets, models, preparation scripts, and algorithms. Given this complexity, in order to understand these AI systems, we need to provide explanations of their functioning at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys from these workflows. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. 
Finally, we report on a user survey to reflect on the challenges and opportunities presented by computational data journeys for explainable AI.","DOI":"10.3233\/sw-233407","type":"journal-article","created":{"date-parts":[[2023,6,16]],"date-time":"2023-06-16T14:59:03Z","timestamp":1686927543000},"page":"1057-1083","source":"Crossref","is-referenced-by-count":4,"title":["Data journeys: Explaining AI workflows through abstraction"],"prefix":"10.1177","volume":"15","author":[{"given":"Enrico","family":"Daga","sequence":"first","affiliation":[{"name":"The Open University, United Kingdom"}]},{"given":"Paul","family":"Groth","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Netherlands"}]}],"member":"179","reference":[{"key":"10.3233\/SW-233407_ref1","doi-asserted-by":"publisher","DOI":"10.1145\/3460210.3493578"},{"key":"10.3233\/SW-233407_ref3","unstructured":"ACM US Public Policy Council, Statement on algorithmic transparency and accountability, 2017."},{"key":"10.3233\/SW-233407_ref4","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.449"},{"key":"10.3233\/SW-233407_ref5","doi-asserted-by":"crossref","unstructured":"S.\u00a0Al Manir, J.\u00a0Niestroy, M.A.\u00a0Levinson and T.\u00a0Clark, Evidence graphs: Supporting transparent and FAIR computation, with defeasible reasoning on data, methods, and results, in: Provenance and Annotation of Data and Processes, Springer, 2020, pp.\u00a039\u201350.","DOI":"10.1007\/978-3-030-80960-7_3"},{"key":"10.3233\/SW-233407_ref6","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445736"},{"key":"10.3233\/SW-233407_ref8","doi-asserted-by":"crossref","unstructured":"M.\u00a0Atzeni and M.\u00a0Atzori, CodeOntology: RDF-ization of source code, in: International Semantic Web Conference, Springer, 2017, 
pp.\u00a020\u201328.","DOI":"10.1007\/978-3-319-68204-4_2"},{"key":"10.3233\/SW-233407_ref9","doi-asserted-by":"publisher","first-page":"82","DOI":"10.1016\/j.inffus.2019.12.012","article-title":"Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI","volume":"58","author":"Barredo Arrieta","year":"2020","journal-title":"Information Fusion"},{"key":"10.3233\/SW-233407_ref10","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1016\/j.websem.2015.01.003","article-title":"Using a suite of ontologies for preserving workflow-centric research objects","volume":"32","author":"Belhajjame","year":"2015","journal-title":"Journal of Web Semantics"},{"key":"10.3233\/SW-233407_ref12","unstructured":"S.\u00a0Chari, D.M.\u00a0Gruen, O.\u00a0Seneviratne and D.L.\u00a0McGuinness, Directions for explainable knowledge-enabled systems, in: Knowledge Graphs for EXplainable Artificial Intelligence: Foundations, Applications and Challenges, IOS Press, 2020, pp.\u00a0245\u2013261."},{"key":"10.3233\/SW-233407_ref13","unstructured":"E.\u00a0Daga, E.\u00a0Blomqvist, A.\u00a0Gangemi, E.\u00a0Montiel, N.\u00a0Nikitina, V.\u00a0Presutti and B.\u00a0Villaz\u00f3n-Terrazas, D2.5.2 Pattern Based Ontology Design: Methodology and Software Support, 2008."},{"key":"10.3233\/SW-233407_ref14","doi-asserted-by":"crossref","unstructured":"E.\u00a0Daga, M.\u00a0d\u2019Aquin, A.\u00a0Adamou and E.\u00a0Motta, Addressing exploitability of smart city data, in: 2016 IEEE International Smart Cities Conference (ISC2), IEEE, 2016, pp.\u00a01\u20136.","DOI":"10.1109\/ISC2.2016.7580764"},{"key":"10.3233\/SW-233407_ref16","doi-asserted-by":"crossref","unstructured":"E.\u00a0Daga, M.\u00a0d\u2019Aquin, A.\u00a0Gangemi and E.\u00a0Motta, Propagation of policies in rich data flows, in: Proceedings of the 8th International Conference on Knowledge Capture, 2015, 
pp.\u00a01\u20138.","DOI":"10.1145\/2815833.2815839"},{"key":"10.3233\/SW-233407_ref17","doi-asserted-by":"crossref","unstructured":"E.\u00a0Daga, M.\u00a0d\u2019Aquin and E.\u00a0Motta, Propagating data policies: A user study, in: Proceedings of the Knowledge Capture Conference, 2017, pp.\u00a01\u20138.","DOI":"10.1145\/3148011.3148022"},{"issue":"2","key":"10.3233\/SW-233407_ref18","doi-asserted-by":"publisher","first-page":"163","DOI":"10.3233\/SW-170266","article-title":"Reasoning with data flows and policy propagation rules","volume":"9","author":"Daga","year":"2018","journal-title":"Semantic Web"},{"key":"10.3233\/SW-233407_ref19","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.5770310"},{"key":"10.3233\/SW-233407_ref21","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.139"},{"key":"10.3233\/SW-233407_ref22","doi-asserted-by":"publisher","first-page":"338","DOI":"10.1016\/j.future.2013.09.018","article-title":"Common motifs in scientific workflows: An empirical analysis","volume":"36","author":"Garijo","year":"2014","journal-title":"Future Generation Computer Systems"},{"key":"10.3233\/SW-233407_ref23","doi-asserted-by":"publisher","DOI":"10.1145\/3533028.3533303"},{"key":"10.3233\/SW-233407_ref24","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00726-w"},{"key":"10.3233\/SW-233407_ref25","unstructured":"S.\u00a0Grafberger, J.\u00a0Stoyanovich and S.\u00a0Schelter, Lightweight inspection of data preprocessing in native machine learning pipelines, in: 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, Online Proceedings, January 11\u201315, 2021, www.cidrdb.org, 2021, http:\/\/cidrdb.org\/cidr2021\/papers\/cidr2021_paper27.pdf."},{"issue":"6","key":"10.3233\/SW-233407_ref26","doi-asserted-by":"publisher","first-page":"881","DOI":"10.1007\/s00778-017-0486-1","article-title":"A survey on provenance: What for? What form? 
What from?","volume":"26","author":"Herschel","year":"2017","journal-title":"The VLDB Journal"},{"key":"10.3233\/SW-233407_ref27","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300830"},{"key":"10.3233\/SW-233407_ref30","doi-asserted-by":"publisher","DOI":"10.1145\/2452376.2452475"},{"key":"10.3233\/SW-233407_ref32","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2020\/726"},{"key":"10.3233\/SW-233407_ref33","doi-asserted-by":"publisher","first-page":"43","DOI":"10.1016\/j.websem.2015.01.001","article-title":"The data mining optimization ontology","volume":"32","author":"Keet","year":"2015","journal-title":"Journal of Web Semantics"},{"key":"10.3233\/SW-233407_ref34","doi-asserted-by":"publisher","DOI":"10.1093\/gigascience\/giz095"},{"key":"10.3233\/SW-233407_ref36","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-37177-7_1"},{"key":"10.3233\/SW-233407_ref37","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-37177-7"},{"key":"10.3233\/SW-233407_ref39","unstructured":"R.\u00a0Liepin\u0161, M.\u00a0Grasmanis and U.\u00a0Bojars, OWLGrEd ontology visualizer, in: Proceedings of the 2014 International Conference on Developers, Vol.\u00a01268, CEUR-WS.org, 2014, pp.\u00a037\u201342."},{"key":"10.3233\/SW-233407_ref40","doi-asserted-by":"publisher","DOI":"10.1145\/3236386.3241340"},{"key":"10.3233\/SW-233407_ref41","doi-asserted-by":"publisher","DOI":"10.1145\/3329486.3329489"},{"key":"10.3233\/SW-233407_ref42","unstructured":"S.M.\u00a0Lundberg and S.-I.\u00a0Lee, A unified approach to interpreting model predictions, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS\u201917, Curran Associates Inc., Red Hook, NY, USA, 2017, pp.\u00a04768\u20134777. 
ISBN 9781510860964."},{"key":"10.3233\/SW-233407_ref43","doi-asserted-by":"publisher","DOI":"10.1145\/3387166"},{"key":"10.3233\/SW-233407_ref44","doi-asserted-by":"crossref","unstructured":"L.\u00a0Moreau, The Foundations for Provenance on the Web, Now Publishers Inc, 2010.","DOI":"10.1561\/9781601983879"},{"issue":"4","key":"10.3233\/SW-233407_ref46","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1145\/1330311.1330323","article-title":"The provenance of electronic data","volume":"51","author":"Moreau","year":"2008","journal-title":"Communications of the ACM"},{"key":"10.3233\/SW-233407_ref47","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1902.01876"},{"key":"10.3233\/SW-233407_ref48","doi-asserted-by":"publisher","first-page":"71","DOI":"10.1007\/978-3-319-16462-5_6","volume-title":"noWorkflow: Capturing and Analyzing Provenance of Scripts","author":"Murta","year":"2015"},{"key":"10.3233\/SW-233407_ref49","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403205"},{"issue":"1","key":"10.3233\/SW-233407_ref50","doi-asserted-by":"publisher","first-page":"87","DOI":"10.1016\/0004-3702(82)90012-1","article-title":"The knowledge level","volume":"18","author":"Newell","year":"1982","journal-title":"Artificial intelligence"},{"key":"10.3233\/SW-233407_ref51","doi-asserted-by":"publisher","DOI":"10.1145\/3184900"},{"issue":"5","key":"10.3233\/SW-233407_ref52","doi-asserted-by":"publisher","first-page":"1222","DOI":"10.1007\/s10618-014-0363-0","article-title":"Ontology of core data mining entities","volume":"28","author":"Panov","year":"2014","journal-title":"Data Mining and Knowledge Discovery"},{"issue":"3","key":"10.3233\/SW-233407_ref53","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1109\/MCSE.2007.53","article-title":"IPython: A system for interactive scientific computing","volume":"9","author":"P\u00e9rez","year":"2007","journal-title":"Computing in Science and 
Engineering"},{"key":"10.3233\/SW-233407_ref54","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517886"},{"key":"10.3233\/SW-233407_ref55","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24794-1_3"},{"key":"10.3233\/SW-233407_ref57","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1606.05386"},{"key":"10.3233\/SW-233407_ref58","doi-asserted-by":"crossref","unstructured":"P.\u00a0Ristoski and H.\u00a0Paulheim, Rdf2vec: Rdf graph embeddings for data mining, in: International Semantic Web Conference, Springer, 2016, pp.\u00a0498\u2013514.","DOI":"10.1007\/978-3-319-46523-4_30"},{"key":"10.3233\/SW-233407_ref59","doi-asserted-by":"crossref","unstructured":"S.\u00a0Samuel, F.\u00a0L\u00f6ffler and B.\u00a0K\u00f6nig-Ries, Machine learning pipelines: Provenance, reproducibility and FAIR data principles, in: Provenance and Annotation of Data and Processes, Springer, 2020, pp.\u00a0226\u2013230.","DOI":"10.1007\/978-3-030-80960-7_17"},{"key":"10.3233\/SW-233407_ref61","first-page":"1","article-title":"Semantic web technologies for explainable machine learning models: A literature review","volume":"2465","author":"Seeliger","year":"2019","journal-title":"PROFILES\/SEMEX@ ISWC"},{"issue":"2","key":"10.3233\/SW-233407_ref62","doi-asserted-by":"publisher","first-page":"172","DOI":"10.1108\/DTA-04-2021-0106","article-title":"A review of data mining ontologies","volume":"56","author":"Sinha","year":"2022","journal-title":"Data Technologies and Applications"},{"key":"10.3233\/SW-233407_ref63","doi-asserted-by":"publisher","DOI":"10.3233\/DS-210053"},{"issue":"12","key":"10.3233\/SW-233407_ref65","doi-asserted-by":"publisher","first-page":"3474","DOI":"10.14778\/3415478.3415570","article-title":"Responsible data management","volume":"13","author":"Stoyanovich","year":"2020","journal-title":"Proc. 
VLDB Endow."},{"key":"10.3233\/SW-233407_ref66","doi-asserted-by":"publisher","DOI":"10.1109\/VL\/HCC50065.2020.9127207"},{"key":"10.3233\/SW-233407_ref67","unstructured":"I.\u00a0Tiddi et al., Foundations of explainable knowledge-enabled systems, Knowl. Graph. eXplainable Artif. Intell.: Found. Appl. Challenges 47 (2020), 23."},{"key":"10.3233\/SW-233407_ref68","doi-asserted-by":"crossref","unstructured":"I.\u00a0Tolovski, S.\u00a0D\u017eeroski and P.\u00a0Panov, Semantic annotation of predictive modelling experiments, in: International Conference on Discovery Science, Springer, 2020, pp.\u00a0124\u2013139.","DOI":"10.1007\/978-3-030-61527-7_9"},{"key":"10.3233\/SW-233407_ref69","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445728"},{"key":"10.3233\/SW-233407_ref70","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2103.06312"},{"key":"10.3233\/SW-233407_ref71","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1323"}],"container-title":["Semantic Web"],"original-title":[],"link":[{"URL":"https:\/\/content.iospress.com\/download?id=10.3233\/SW-233407","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,11]],"date-time":"2025-03-11T07:37:13Z","timestamp":1741678633000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.3233\/SW-233407"}},"subtitle":[],"editor":[{"given":"Roberto","family":"Confalonieri","sequence":"additional","affiliation":[{"name":"University of Padua, Italy"}]},{"given":"Oliver","family":"Kutz","sequence":"additional","affiliation":[{"name":"Free University of Bozen-Bolzano, Italy"}]},{"given":"Diego","family":"Calvanese","sequence":"additional","affiliation":[{"name":"Ume\u00e5 University, Sweden"},{"name":"Free University of Bozen-Bolzano, Italy"}]},{"given":"Jose M.","family":"Alonso","sequence":"additional","affiliation":[{"name":"University of Santiago de Compostela, CiTIUS, 
Spain"}]},{"given":"Shang-Ming","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Plymouth, UK"}]}],"short-title":[],"issued":{"date-parts":[[2024,10,4]]},"references-count":57,"journal-issue":{"issue":"4"},"URL":"https:\/\/doi.org\/10.3233\/sw-233407","relation":{},"ISSN":["2210-4968","1570-0844"],"issn-type":[{"type":"electronic","value":"2210-4968"},{"type":"print","value":"1570-0844"}],"subject":[],"published":{"date-parts":[[2024,10,4]]}}}