Data-intensive architecture for scientific knowledge discovery

Distributed and Parallel Databases

Abstract

This paper presents a data-intensive architecture and demonstrates its ability to support applications from a wide range of domains, as well as the different types of users involved in defining, designing and executing data-intensive processing tasks. The prototype architecture is introduced, and the pivotal role of DISPEL as a canonical language is explained. The architecture promotes the exploration and exploitation of distributed and heterogeneous data and spans the complete knowledge discovery process, from data preparation through analysis to evaluation and reiteration. The architecture was evaluated with large-scale applications from astronomy, cosmology, hydrology, functional genetics, image processing and seismology.

Notes

  1. The Square Kilometre Array (http://www.skatelescope.org) will generate about 200 GB of raw data per second, and the LOFAR (http://www.lofar.org/) low band antennas generate 1.6 TB of raw data per second.

  2. The Euclid Imaging Consortium (http://www.ias.u-psud.fr/imEuclid) will generate 1 PB of data per year, and the Large Synoptic Survey Telescope (http://www.lsst.org) will generate several petabytes of new image and catalogue data every year.

  3. Sloan Digital Sky Survey: http://www.sdss.org/.

  4. ADMIRE project: http://www.admire-project.eu/.

  5. ADMIRE prototype: http://sourceforge.net/projects/admire/.

  6. ADMIRE publications: http://www.admire-project.eu/admire-library/index.html.

  7. Eclipse: http://www.eclipse.org/.

  8. These are functions that, when supplied with parameters such as processing elements (PEs), generate graphs containing those PEs. A generated graph is then treated just like any other.

  9. Virtual Earthquake and seismology Research Community e-science environment in Europe: http://www.verce.eu/.
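
Note 8 describes functions that take PEs as parameters and return a graph built from them. The following is a minimal Python sketch of that idea only; it is not DISPEL and does not reflect the ADMIRE implementation. The names `PE`, `Graph`, `make_pipeline` and `describe` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PE:
    """Hypothetical stand-in for a processing element, identified by name."""
    name: str


@dataclass
class Graph:
    """A directed graph of PEs, stored as a node list and an edge list."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)


def make_pipeline(*pes: PE) -> Graph:
    """Graph-generating function: chain the supplied PEs into a linear pipeline."""
    graph = Graph(nodes=list(pes))
    # Connect each PE to the one that follows it in the argument list.
    graph.edges = list(zip(pes, pes[1:]))
    return graph


def describe(graph: Graph) -> str:
    """Accepts any Graph, whether built by hand or produced by a function."""
    return ", ".join(f"{src.name}->{dst.name}" for src, dst in graph.edges)


pipeline = make_pipeline(PE("Read"), PE("Filter"), PE("Write"))
print(describe(pipeline))
```

The point of the sketch is that `describe` cannot tell a generated graph from a hand-built one, which mirrors the statement that the resulting graph is treated like any other.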

Acknowledgements

The work presented in this paper is supported by the ADMIRE project (funded by EU FP7-ICT-215024) and the e-Science Core Programme Senior Research Fellow programme (funded by the UK EPSRC EP/D079829/1).

Author information

Corresponding author

Correspondence to Chee Sun Liew.

Additional information

Communicated by Judy Qiu and Dennis Gannon.

About this article

Cite this article

Atkinson, M., Liew, C.S., Galea, M. et al. Data-intensive architecture for scientific knowledge discovery. Distrib Parallel Databases 30, 307–324 (2012). https://doi.org/10.1007/s10619-012-7105-3

  • DOI: https://doi.org/10.1007/s10619-012-7105-3
