Abstract
Many resource-intensive analytics processes evolve over time as new versions of the reference datasets and software dependencies they use are released. We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case in high-throughput genomics, where the same process is used to analyse large cohorts of patient genomes, or cases. Because a version change is unlikely to affect the entire population, an efficient strategy for restoring the currency of the outcomes must first identify the scope of the change, i.e., the subset of affected data products. In this paper we describe a generic and reusable provenance-based approach to this scope discovery problem. It applies to scenarios where the process consists of complex hierarchical components, where different input cases are processed using different version configurations of each component, and where separate provenance traces are collected for the executions of each component. We show how a new data structure, called a restart tree, is computed and exploited to manage the change scope discovery problem.
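As a rough illustration of what "scope of a change" means, the following minimal sketch filters a flat list of provenance-style usage records to find the cases, and the components within them, that consumed a dependency at the version being superseded. The record layout, the case and component names, and the `change_scope` function are all hypothetical and are not taken from the paper. In particular, this sketch does not build the restart tree, whose structure the abstract does not define.

```python
from collections import defaultdict

# Hypothetical usage records: (case_id, component, dependency, version_used).
# All names and values are illustrative assumptions, not data from the paper.
USAGE = [
    ("patient-1", "align", "reference-genome", "v37"),
    ("patient-1", "annotate", "variant-db", "2017-01"),
    ("patient-2", "align", "reference-genome", "v38"),
    ("patient-2", "annotate", "variant-db", "2017-01"),
    ("patient-3", "align", "reference-genome", "v37"),
]


def change_scope(usage, dependency, old_version):
    """Map each affected case to the components that used the old version."""
    scope = defaultdict(set)
    for case_id, component, dep, version in usage:
        # A case is in scope only if it actually consumed the superseded version.
        if dep == dependency and version == old_version:
            scope[case_id].add(component)
    return {case: sorted(components) for case, components in scope.items()}


if __name__ == "__main__":
    print(change_scope(USAGE, "reference-genome", "v37"))
```

Under these assumptions, only the cases returned by `change_scope` would be candidates for re-computation. Cases already processed with a different version of the dependency fall outside the scope.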
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Cała, J., Missier, P. (2018). Provenance Annotation and Analysis to Support Process Re-computation. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_1
DOI: https://doi.org/10.1007/978-3-319-98379-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98378-3
Online ISBN: 978-3-319-98379-0
eBook Packages: Computer Science, Computer Science (R0)