Abstract
Big Data or Data-Intensive applications (DIAs) seek to mine, manipulate, extract or otherwise exploit the potential intelligence hidden behind Big Data. However, several practitioner surveys remark that the potential of DIAs is still untapped because their design, quality assessment and continuous refinement are very difficult and costly. To address this shortcoming, we propose the use of a UML domain-specific modeling language or profile specifically tailored to support the design, assessment and continuous deployment of DIAs. This article illustrates our DIA-specific profile and outlines its usage in the context of DIA performance engineering and deployment. For DIA performance engineering, we rely on the Apache Hadoop technology, while for DIA deployment, we leverage the TOSCA language. We conclude that the proposed profile offers a powerful language for data-intensive software and systems modeling, quality evaluation and automated deployment of DIAs on private or public clouds.
Notes
TOSCA is a language to specify deployable blueprints in line with the emerging Infrastructure-as-Code (IasC) paradigm [18].
Modeling and Analysis of Real-Time Embedded Systems.
Dependability Analysis and Modeling.
The DIA library is described in the technical “Appendix B”.
See “Appendix B” for details on data types.
In Fig. 5, stereotypes with dark gray background have been taken from MARTE and the light gray ones from DAM.
Yet Another Resource Negotiator.
We say “at least” because we use Erlang-k distributions for the firing times. These can be represented in a CTMC, but doing so increases the number of states even further, as a function of the number of Erlang-k transitions and the value of k.
References
Ajmone-Marsan, M., Balbo, G., Conte, G., Donatelli, S., Franceschinis, G.: Modeling with Generalized Stochastic Petri Nets. Wiley, New York (1994)
Ardagna, D., Bernardi, S., Gianniti, E., Karimian Aliabadi, S., Perez-Palacin, D., Requeno, J.I.: Modeling performance of hadoop applications: a journey from queueing networks to stochastic well formed nets. In: International Conference on Algorithms and Architectures for Parallel Processing, pp. 599–613. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49583-5_47
Ardagna, D., Di Nitto, E., Casale, G., Petcu, D., Mohagheghi, P., Mosser, S., Matthews, P., Gericke, A., Ballagny, C., D’Andria, F., Nechifor, C.-S., Sheridan, C.: Modaclouds: a model-driven approach for the design and execution of applications on multiple clouds. In: Proceedings of the 4th International Workshop on Modeling in Software Engineering, MiSE’12, pp. 50–56. IEEE Press, Piscataway, NJ (2012). http://dl.acm.org/citation.cfm?id=2664431.2664439
Artac, M., Borovsak, T., Di Nitto, E., Guerriero, M., Perez-Palacin, D., Tamburri, D.A.: Infrastructure-as-code for data-intensive architectures: a model-driven development approach. In: IEEE International Conference on Software Architecture, ICSA 2018, Seattle, WA, April 30–May 4, 2018, pp. 156–165. IEEE Computer Society (2018). https://doi.org/10.1109/ICSA.2018.00025
ATC. Athens Technology Center Website (2018). https://www.atc.gr/default.aspx?page=home. Accessed Dec 2018
Baresi, L., Guinea, S., Quattrocchi, G., Tamburri, D.A.: Microcloud: A container-based solution for efficient resource management in the cloud. In: 2016 IEEE International Conference on Smart Cloud (SmartCloud), pp. 218–223, Nov 2016. https://doi.org/10.1109/SmartCloud.2016.42
Bell, G., Hey, T., Szalay, A.: Beyond the data deluge. Science 323(5919), 1297–1298 (2009)
Bernardi, S., Dominguez, J.L., Gómez, A., Joubert, C., Merseguer, José, Perez-Palacin, D., Requeno, J.I., Romeu, A.: A systematic approach for performance assessment using process mining. Empir. Softw. Eng. (2018) (accepted for publication). https://doi.org/10.1007/s10664-018-9606-9
Bernardi, S., Requeno, J.I., Joubert, C., Romeu, A.: A systematic approach for performance evaluation using process mining: the Posidonia Operations case study. In: Proceedings of the 2nd International Workshop on Quality-Aware DevOps, QUDOS 2016, pp. 24–29. ACM, New York, NY (2016). https://doi.org/10.1145/2945408.2945413
Bernardi, S., Merseguer, J., Petriu, D.C.: A dependability profile within MARTE. Softw. Syst. Model. 10(3), 313–336 (2011)
Bernardi, S., Merseguer, J., Petriu, D.C.: Model-Driven Dependability Assessment of Software Systems. Springer, New York (2013)
Blu Age. Blu Age, Make IT Digital (2018). https://www.bluage.com. Accessed Dec 2018
Casale, G., et al.: DICE: Quality-driven development of data-intensive cloud applications. In: Proceedings of the Seventh International Workshop on Modeling in Software Engineering, pp. 78–83, IEEE Press, NJ (2015). http://dl.acm.org/citation.cfm?id=2820489.2820507
Chandrasekaran, K., Santurkar, S., Arora, A.: Stormgen—a domain specific language to create ad-hoc storm topologies. In: Ganzha, M., Maciaszek, L.A., Paprzycki, M. (eds.) FedCSIS, pp. 1621–1628 (2014). http://dblp.uni-trier.de/db/conf/fedcsis/fedcsis2014.html#ChandrasekaranSA14
Chen, C.L.P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)
Chiola, G., Dutheillet, C., Franceschinis, G., Haddad, S.: Stochastic well-formed colored nets and symmetric modeling applications. IEEE Trans. Comput. 42(11), 1343–1360 (1993). https://doi.org/10.1109/12.247838
Clements, P., Kazman, R., Klein, M.: Evaluating Software Architectures: Methods and Case Studies. Addison-Wesley, Boston (2001)
Cois, C.A., Yankel, J., Connell, A.: Modern devops: optimizing software development through effective system interactions. In: IPCC, pp. 1–7. IEEE (2014). http://dblp.uni-trier.de/db/conf/ipcc/ipcc2014.html#CoisYC14
Colas, M., Finck, I., Buvat, J., Nambiar, R., Singh, R.R.: Cracking the data conundrum: how successful companies make big data operational. Technical report, Capgemini consulting (2015). https://www.capgemini-consulting.com/cracking-the-data-conundrum
Cortellessa, V., Di Marco, A., Inverardi, P.: Model-Based Software Performance Analysis. Springer, New York (2011)
Dean, J., Ghemawat, S.: Mapreduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
Di Nitto, E., Matthews, P., Petcu, D., Solberg, A. (eds.): Model-Driven Development and Operation of Multi-Cloud Applications. PoliMI SpringerBriefs. Springer, New York (2017)
Dipartimento di Informatica, Università di Torino. GRaphical Editor and Analyzer for Timed and Stochastic Petri Nets, Dec 2015. www.di.unito.it/~greatspn/index.html
Gilmore, S., Hillston, J., Kloul, L., Ribaudo, M.: Pepa nets: a structured performance modelling formalism. Perform. Eval. 54(2), 79–104 (2003). https://doi.org/10.1016/S0166-5316(03)00069-5
Gómez, A., Merseguer, J., Di Nitto, E., Tamburri, D.A.: Towards a UML profile for data intensive applications. In: Proceedings of the 2nd International Workshop on Quality-Aware DevOps, QUDOS 2016, pp. 18–23, ACM, New York, NY (2016). https://doi.org/10.1145/2945408.2945412
Juniper Project: Experimental: models for big data stream processing (2015). Juniper Project Tutorial. http://forge.modelio.org/projects/juniper/wiki/Tutorial_on_Models_for_Big_Data_stream_processing. Accessed Dec 2018
Kroß, J., Brunnert, A., Krcmar, H.: Modeling big data systems by extending the palladio component model. Softwaretechnik-Trends 35(3) (2015)
Kroß, J., Krcmar, H.: Modeling and simulating Apache Spark streaming applications. Softwaretechnik-Trends 36(4), 1–3 (2016)
Lagarde, F., Espinoza, H., Terrier, F., Gérard, S.: Improving UML profile design practices by leveraging conceptual domain models. In: 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), Atlanta (USA), ACM, Nov 2007, pp. 445–448
Langheinrich, M.: Privacy by design. In: Abowd, G.D., Brumitt, B., Shafer, A. (eds.) UBICOMP 2001, pp. 273–291. Springer, New York (2001)
Lazowska, E.D., Zahorjan, J., Graham, G.S., Sevcik, K.C.: Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Upper Saddle River (1984)
Lipton, P., Palma, D., Rutkowski, M., Tamburri, D.A.: TOSCA solves big problems in the cloud and beyond. IEEE Cloud 21(11), 31–39 (2016)
López-Grao, J.P., Merseguer, J., Campos, J.: From UML activity diagrams to stochastic petri nets: application to software performance engineering. In: Proceedings of the 4th International Workshop on Software and Performance, WOSP’04, pp. 25–36, ACM, New York, NY (2004). https://doi.org/10.1145/974044.974048
Morris, K.: Infrastructure as Code: Managing Servers in the Cloud. O’Reilly Media, Sebastopol (2016)
Palma, D., Rutkowski, M., Spatzier, T.: Tosca simple profile in YAML version 1.0. Technical report, OASIS Committee Specification (2016). http://docs.oasis-open.org/tosca/TOSCA-Simple-Profile-YAML/v1.0/cs01/TOSCA-Simple-Profile-YAML-v1.0-cs01.html
Perez-Palacin, D., Ridene, Y., Merseguer, J.: Quality assessment in DevOps: automated analysis of a tax fraud detection system. In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion, ICPE’17 Companion, pp. 133–138, ACM, New York, NY (2017)
Petriu, D.C., Alhaj, M., Tawhid, R.: Software Performance Modeling. Lecture Notes in Computer Science, vol. 7320. Springer, Berlin (2012)
Prodevelop: Prodevelop-Integrating Tech (2018). https://www.prodevelop.es/en. Accessed Dec 2018
Rajbhoj, A., Kulkarni, V., Bellarykar, N.: Early experience with model-driven development of MapReduce based big data application. In: 2014 21st Asia-Pacific Software Engineering Conference (APSEC), vol. 1, pp. 94–97 (Dec 2014). https://doi.org/10.1109/APSEC.2014.23
Ranjan, R.: Modeling and simulation in performance optimization of big data processing frameworks. IEEE Cloud Comput. 1(4), 14–19 (2014)
Requeno, J.I., Merseguer, J., Bernardi, S., Perez-Palacin, D., Giotis, G., Papanikolaou, V.: Quantitative analysis of apache storm applications: the NewsAsset case study. Inf. Syst. Front. (2018) (accepted for publication). https://doi.org/10.1007/s10796-018-9851-x
Requeno, J.-I., Merseguer, J., Bernardi, S.: Performance analysis of apache storm applications using stochastic petri nets. In: IEEE International Conference on Information Reuse and Integration (IRI), pp. 411–418 (2017). http://ieeexplore.ieee.org/document/8102965/, https://doi.org/10.1109/IRI.2017.64
Sanders, W.H., Meyer, J.F.: Stochastic Activity Networks: Formal Definitions and Concepts. Lecture Notes in Computer Science, vol. 2090. Springer, Berlin (2001)
Sandmann, G., Thompson, R.: Development of AUTOSAR software components within model-based design. SAE Technical Paper 04 (2008). https://doi.org/10.4271/2008-01-0383
Santurkar, S., Arora, A., Chandrasekaran, K.: Stormgen—a domain specific language to create ad-hoc storm topologies. In: 2014 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 1621–1628 (Sept 2014). https://doi.org/10.15439/2014F278
Scheidgen, M., Zubow, A.: Map/reduce on EMF models. In: MDHPCL@MoDELS. ACM (2012). http://dblp.uni-trier.de/db/conf/models/mdhpcl2012.html#ScheidgenZ12
Selic, B.: A systematic approach to domain-specific language design using UML. In: Tenth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2007), 7–9 May 2007, Santorini Island, Greece, pp. 2–9. IEEE Computer Society (2007)
Selic, B., Gerard, S. (eds.): Modeling and Analysis of Real-Time and Embedded Systems with UML and MARTE. Morgan Kaufmann, Boston (2014)
Smith, C.U., Williams, L.G.: Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA (2002)
The Apache Software Foundation. Apache Cassandra. http://cassandra.apache.org/. Accessed Dec 2018
The Apache Software Foundation. Apache Hadoop. http://hadoop.apache.org/. Accessed Dec 2018
The Apache Software Foundation. Apache Kafka. http://kafka.apache.org/. Accessed Dec 2018
The Apache Software Foundation. Apache Spark. http://spark.apache.org/. Accessed Dec 2018
The Apache Software Foundation. Apache Storm. http://storm.apache.org/. Accessed Dec 2018
The Apache Software Foundation. Apache Tez. http://tez.apache.org/. Accessed Dec 2018
The DICE Consortium. DICE Models Repository, Jan 2017. https://github.com/dice-project/DICE-Models
The DICE Consortium. DICE Profiles Repository, Sept 2017. https://github.com/dice-project/DICE-Profiles
The DICE Consortium. DICE Profiles, Sept 2017. https://github.com/dice-project/DICE-Profiles
The DICE Consortium. DICE Simulation tool, Oct 2017. https://github.com/dice-project/DICE-Simulation
The DICE Consortium. DICE-Rollout, Sept 2017. https://github.com/dice-project/DICER
The Object Management Group (OMG): Model-Driven Architecture Specification and Standardisation. Technical report (2018). http://www.omg.org/mda/
The DICE Consortium. DICE simulation tools. Technical report, European Union’s Horizon 2020 research and innovation programme (2017). http://wp.doc.ic.ac.uk/dice-h2020/wp-content/uploads/sites/75/2017/08/D3.4_DICE-simulation-tools-Final-version.pdf
The DICE Consortium. DICE transformations to Analysis Models. Technical report, European Union’s Horizon 2020 research and innovation programme (2016). http://wp.doc.ic.ac.uk/dice-h2020/wp-content/uploads/sites/75/2016/08/D3.1_Transformations-to-analysis-models.pdf
UML Profile for MARTE: Modeling and Analysis of Real-Time and Embedded Systems (June 2011). Version 1.1, OMG document: formal/2011-06-02
Unified Modeling Language: Infrastructure, 2017. Version 2.5.1, OMG document: formal/2017-12-05
Wang, K., Khan, M.M.H.: Performance prediction for Apache Spark platform. In: 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), pp. 166–173 (2015)
Wettinger, J., Breitenbücher, U., Leymann, F.: Standards-based DevOps automation and integration using TOSCA. In: 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, pp. 59–68, Dec 2014. https://doi.org/10.1109/UCC.2014.14
WikiMedia project. Wikistats, Dec 2016. https://www.mediawiki.org/wiki/Analytics/Wikistats
Wille, R.: Formal concept analysis as mathematical theory of concepts and concept hierarchies. In: Formal Concept Analysis, pp. 1–33 (2005)
Woodside, C.M., Petriu, D.C., Merseguer, J., Petriu, D.B., Alhaj, M.: Transformation challenges: from software models to performance models. Softw. Syst. Model. 13(4), 1529–1552 (2014). https://doi.org/10.1007/s10270-013-0385-x
XLAB. XLAB, R&D (2018). https://www.xlab.si. Accessed Dec 2018
Acknowledgements
This work is supported by the European Commission Grant No. 644869 (H2020, Call 1), DICE. D. Perez-Palacin, J. Merseguer and J.I. Requeno have been supported by the project CyCriSec [TIN2014-58457-R] and Aragon Government Ref. T27-DISCO Research Group.
Additional information
Communicated by Prof. Dorina Petriu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: MARTE and DAM profiles
MARTE [64] is a standard profile that extends UML for the performance and schedulability analysis of a system. MARTE consists of three main parts: MARTE Foundations, MARTE Design Model and MARTE Analysis Model. The Analysis Model is of interest to us since it enables QoS assessment by allowing the definition of QoS metrics and properties. The Analysis Model consists of a Generic Quantitative Analysis and Modelling (GQAM) profile and its specialization, the Performance Analysis and Modelling (PAM) profile. In addition, two other features are important for our DIA profile.
The first one is that MARTE enables the specification of quantitative non-functional properties (NFP) in UML models through its Value Specification Language (VSL). The VSL is useful for specifying the values of constraints, properties and stereotype attributes, particularly those related to NFPs. Moreover, VSL can express basic types, data types, values (such as time and composite values), as well as variables, constants and expressions. This means that using VSL we can define complex metrics and requirements to express, for example, response times, utilizations or throughputs. MARTE also defines a library of primitive data types, a set of predefined NFP types and units of measure. Our DIA profile inherits the VSL in its entirety.
To understand the VSL expressions that appear in this paper, it is useful to briefly recall its syntax. An example of a VSL expression for a host demand tagged value of type NFP_Duration is:
This expression specifies that the Reducing activity in Fig. 13 demands 6 (1) milliseconds (2) of processing time, whose mean value (3) is obtained from an estimation in the real system (4). We could, for example, replace the value 6 with a variable \(\$host\_dem\) to parameterize the analysis of the model with different values for this host demand.
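As an illustration only, the four numbered fields described above can be assembled into a VSL-style tuple. The sketch below assumes the MARTE NFP attribute names `value`, `unit`, `statQ`, `source` and `expr`; it is not a reproduction of the expression printed in the article, and the helper `vsl_duration` is hypothetical.

```python
def vsl_duration(value, unit="ms", stat_q="mean", source="est"):
    """Render a VSL-style tuple for an NFP_Duration tagged value.

    Hypothetical helper: a string starting with '$' is treated as a VSL
    variable and placed in the 'expr' field (an assumption); anything
    else is placed in the 'value' field.
    """
    is_variable = isinstance(value, str) and value.startswith("$")
    key = "expr" if is_variable else "value"
    return f"({key}={value}, unit={unit}, statQ={stat_q}, source={source})"


fixed = vsl_duration(6)                    # fixed host demand of 6 ms
parameterized = vsl_duration("$host_dem")  # host demand left as a variable
print(fixed)
print(parameterized)
```

Leaving the demand as a variable is what allows the same annotated model to be analyzed repeatedly with different host demand values.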
The second feature is that the DAM [10] profile specializes MARTE-GQAM for dependability analysis (i.e., availability, reliability, safety and maintainability). Consequently, the DAM profile also inherits the VSL. Like MARTE, DAM consists of a library and a set of extensions to be applied at model specification level. Our DIA profile inherits DAM with the purpose of addressing reliability analysis for DIAs.
Appendix B: DIA Profile Library
In this Appendix, we present the DIA library. The library defines the data types, basic and complex, used in the attributes of the stereotypes proposed for the three abstraction levels, DPIM, DTSM and DDSM. Basic types appear in Fig. 22, while complex ones in Fig. 23. From DAM, we have imported the DAM Library [10], which also imports the MARTE Library [64].
Appendix C: TOSCA
TOSCA provides a flexible and highly extensible DSL for modeling resources and software components. TOSCA blueprints are executable IasC composed of node templates and relationships, which define the topology of a hardware/software system. Node templates and relationships are instances of node types and relationship types, which are either normative (i.e., defined in the standard), provided by the specific engine that executes a blueprint (the orchestrator), or an extension of one of the above, as in our case, with DIA-specific node and relationship types. Node types are essentially used to describe hardware or virtual resources (machines or VMs) and software components. Relationship types predicate on the association between node types. For instance, a TOSCA node type representing the Wordpress CMS must be associated with a node type representing VMs through the relationship hosted_On. Each node type and relationship type can also specify interfaces, which are composed of operations that have to be carried out at specific stages of the deployment orchestration. Typical examples of interface operations include installing, configuring or starting components, and they may take the form of Python/bash scripts or pointers to Chef recipes. Node and relationship templates are free to provide their own interface operations, extending or overriding the behavior defined in the corresponding types. TOSCA is supported by a number of orchestrators that, given a TOSCA blueprint and all the node and relationship types used in it, are able to execute it, deploying the corresponding system and managing its lifecycle. Examples of such orchestrators are Cloudify, ARIA TOSCA, Indigo, Apache Brooklyn and ECoWare [6].
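The role of node templates, hosting relationships and interface operations can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not TOSCA syntax and not the behavior of any particular orchestrator: the class `NodeTemplate`, the function `deployment_plan`, the operation names in `LIFECYCLE`, the type names and the script paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed order in which interface operations are executed for one node.
LIFECYCLE = ("install", "configure", "start")


@dataclass
class NodeTemplate:
    name: str
    node_type: str                    # name of the node type being instantiated
    hosted_on: Optional[str] = None   # target of a hosting relationship, if any
    operations: Dict[str, str] = field(default_factory=dict)  # operation -> script


def deployment_plan(templates: List[NodeTemplate]) -> List[str]:
    """Order operations so that each host is deployed before what it hosts.

    No cycle detection is performed; a cyclic hosting chain would recurse forever.
    """
    by_name = {t.name: t for t in templates}
    done: List[str] = []
    plan: List[str] = []

    def visit(t: NodeTemplate) -> None:
        if t.name in done:
            return
        if t.hosted_on is not None:
            visit(by_name[t.hosted_on])  # deploy the host first
        for op in LIFECYCLE:
            if op in t.operations:
                plan.append(f"{t.name}.{op} -> {t.operations[op]}")
        done.append(t.name)

    for t in templates:
        visit(t)
    return plan


blueprint = [
    NodeTemplate("wordpress", "example.nodes.WordPress", hosted_on="vm",
                 operations={"install": "scripts/install_wp.sh",
                             "start": "scripts/start_wp.sh"}),
    NodeTemplate("vm", "example.nodes.Compute",
                 operations={"install": "scripts/provision_vm.sh"}),
]
plan = deployment_plan(blueprint)
for step in plan:
    print(step)
```

Although the Wordpress template is listed first, the plan provisions the VM before installing and starting Wordpress, which is the ordering constraint that the hosting relationship expresses.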
Appendix D: Transformation of a DTSM design to a performance model
Stochastic Well-formed Nets (SWN) [16] are a modeling formalism suitable for performance analysis purposes. An SWN model is a bipartite graph formed by places and transitions. Places are graphically depicted as circles and may contain tokens. A token distribution in the places of an SWN, namely a marking, represents a state of the modeled system. The dynamics of the system are governed by the transition enabling and firing rules, where places represent pre- and post-conditions for transitions. In particular, the firing of a transition removes from its input places (adds to its output places) as many tokens as the weights of the corresponding input (output) arcs. Transitions can be immediate, those that fire in zero time; or timed, those that fire after a delay which is sampled from a random variable with a given probability distribution function. Immediate transitions are graphically depicted as black thin bars, while timed ones are depicted as white thick bars. Tokens may also have an associated color, i.e., a data type, which enriches the expressiveness of the net and restricts the movement of tokens to compatible places and transitions.
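The enabling and firing rules can be made concrete with a minimal sketch. It deliberately ignores token colors, timing and inhibitor arcs, so it models a plain place/transition net rather than a full SWN; the place names and the functions `is_enabled` and `fire` are illustrative assumptions.

```python
from typing import Dict

Marking = Dict[str, int]  # place name -> number of tokens


def is_enabled(marking: Marking, inputs: Dict[str, int]) -> bool:
    """A transition is enabled if every input place holds at least the arc weight."""
    return all(marking.get(place, 0) >= weight for place, weight in inputs.items())


def fire(marking: Marking, inputs: Dict[str, int], outputs: Dict[str, int]) -> Marking:
    """Return the marking reached by firing a transition once."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, weight in inputs.items():
        new_marking[place] = new_marking.get(place, 0) - weight  # remove from input places
    for place, weight in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + weight  # add to output places
    return new_marking


# Hypothetical net: a task needs one waiting token and one free core to start.
m0 = {"waiting": 2, "cores": 1, "running": 0}
start_inputs = {"waiting": 1, "cores": 1}
start_outputs = {"running": 1}

m1 = fire(m0, start_inputs, start_outputs)
print(m1)
print(is_enabled(m1, start_inputs))
```

After one firing the single core is taken, so the same transition is no longer enabled even though a task is still waiting.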
Figure 24 depicts a schema of how Apache Hadoop stereotypes in UML-profiled models (left) are transformed into an analyzable model such as an SWN (right). Each stereotype is transformed into a sub-net by taking into account the information contained in the tags. For each transformation pattern in the figure, the part of the Petri net inside the blue box corresponds to the part that the transformation creates. The part of the Petri net outside the blue box corresponds to referenced parts, which are in turn created by other stereotypes. Figure 24 depicts only the specific non-functional annotations for Apache Hadoop; the functional part of the UML diagram is transformed according to the works in [33, 70]. Finally, all the sub-nets are composed into a single closed Petri net such as the one in Fig. 16.
A Hadoop cluster accepts several categories of users, whose jobs may be subdivided into different numbers of map–reduce tasks or be assigned different amounts of hardware resources. Every user \(<i>\) has \(\$nC_i\) jobs waiting in the scheduler queue. The Hadoop scheduler periodically launches a new job at a given \(\$rate\), following a scheduling policy defined by the scenario (e.g., a shared common FIFO queue for all users). By default, our transformation assumes an independent FIFO queue for each user and always guarantees that a job of each user is taken.
Jobs are labeled with the user they belong to (loop \(<i>\)-\(<i+1>\) in the net, where \(<i>\) represents each user). The scheduler waits until resources have been assigned to all tasks in the reduce phase of the preceding job \(<i>\) before launching the next job \(<i+1>\) (inhibitor arc section). This scheduling allows concurrency among jobs while giving preceding jobs priority over resources. Job \(<i>\) is divided into \(\$m_i\) map tasks and \(\$r_i\) reduce tasks, which run simultaneously on up to \(\$p_i\) cores (\(\sum _{i=1}^{n} \$p_i \ge \$host\), where \(\$host\) is the total number of cores in the cluster). We use the notation \(<i>\) to express the color of a token; for instance, each user is represented by a different color in the SWN. The notation \(\$m_i\) is used to express numerical values; for instance, the number of map tasks into which a job of type \(<i>\) is divided.
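The default policy of one independent FIFO queue per user can be sketched as follows. The round-robin reading of "a job of each user is taken" is an assumption made for this illustration, and the sketch ignores the launch rate and the wait for resource assignment modeled by the inhibitor arcs; the function `launch_order` and the job names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def launch_order(queues: Dict[str, Deque[str]]) -> List[str]:
    """Take one job per user in turn until every per-user FIFO queue is empty.

    The queues are consumed in place.
    """
    order: List[str] = []
    while any(queues.values()):          # an empty deque is falsy
        for user, queue in queues.items():
            if queue:
                order.append(queue.popleft())  # FIFO within each user
    return order


queues = {
    "user1": deque(["u1-job1", "u1-job2"]),
    "user2": deque(["u2-job1"]),
}
order = launch_order(queues)
print(order)
```

In the SWN, the same alternation is obtained with token colors: the job tokens of each user carry that user's color, and the loop from \(<i>\) to \(<i+1>\) moves the scheduler to the next color.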
Appendix E: Usability of the profile
So far, the DIA profile has been validated from the point of view of its adequacy for QoS assessment and deployment. However, we also consider it important to learn about the usability of the profile, in terms of ease of use for engineers. It often happens that tools, although offering the required functionality, do not meet expectations until they reach a certain degree of maturity in this regard.
The DIA profile has been used by engineers in four organizations: Prodevelop [38], ATC [5], BluAge [12] and XLAB R&D [71]. We prepared eight questions, see Table 4, for a total of eight engineers who have used the DIA profile extensively in the context of the DICE project to carry out industrial applications. From the answers, we see that the profile has been useful for the engineers, especially for automatic deployment. However, the main shortcoming concerns the Papyrus implementation (see question #6), which also constrains the profile implementation. In fact, the engineers' advice (see question #8) was to improve the Papyrus implementation of the profile.
Cite this article
Perez-Palacin, D., Merseguer, J., Requeno, J.I. et al. A UML Profile for the Design, Quality Assessment and Deployment of Data-intensive Applications. Softw Syst Model 18, 3577–3614 (2019). https://doi.org/10.1007/s10270-019-00730-3
DOI: https://doi.org/10.1007/s10270-019-00730-3