As a main subfield of cloud computing applications, Internet services require large-scale data computing. Their workloads can be divided into two classes: customer-facing query-processing interactive tasks that serve hundreds of millions of users within a short response time and backend data analysis batch tasks that involve petabytes of data. Hadoop, an open-source software suite, is used by many Internet services as the main data computing platform. Hadoop is also used by academia as a research platform and an optimization target. This paper presents five research directions for optimizing Hadoop: improving performance, utilization, power efficiency, availability, and support for different consistency constraints. The survey covers both backend analysis and customer-facing workloads. A total of 15 innovative techniques and systems are analyzed and compared, focusing on main research issues, innovative techniques, and optimized results.<\/p>","DOI":"10.4018\/ijcac.2011010104","type":"journal-article","created":{"date-parts":[[2011,10,19]],"date-time":"2011-10-19T16:07:05Z","timestamp":1319040425000},"page":"45-61","source":"Crossref","is-referenced-by-count":6,"title":["Beyond Hadoop"],"prefix":"10.4018","volume":"1","author":[{"given":"Zhiwei","family":"Xu","sequence":"first","affiliation":[{"name":"Chinese Academy of Sciences, China"}]},{"given":"Bo","family":"Yan","sequence":"additional","affiliation":[{"name":"Chinese Academy of Sciences, China"}]},{"given":"Yongqiang","family":"Zou","sequence":"additional","affiliation":[{"name":"Tencent Research, China"}]}],"member":"2432","reference":[{"key":"ijcac.2011010104-0","doi-asserted-by":"crossref","unstructured":"Brewer, E. A. (2000). Towards robust distributed systems. In Proceedings of the 19th ACM Symposium on Principles of Distributed Computing, Portland, OR (p. 7).","DOI":"10.1145\/343477.343502"},{"key":"ijcac.2011010104-1","unstructured":"Cassandra (n. d.). The apache cassandra project. 
Retrieved from http:\/\/cassandra.apache.org\/"},{"key":"ijcac.2011010104-2","unstructured":"Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., et al. (2006). Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation. (Vol. 7, pp. 205-218)."},{"issue":"2","key":"ijcac.2011010104-3","first-page":"1277","article-title":"PNUTS: Yahoo!'s hosted data serving platform.","volume":"1","author":"B. F.Cooper","year":"2006","journal-title":"Very Large Data Base Endowment"},{"key":"ijcac.2011010104-4","doi-asserted-by":"crossref","unstructured":"DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., & Pilchin, A. (2007). Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (Vol. 21, pp. 205-220).","DOI":"10.1145\/1323293.1294281"},{"key":"ijcac.2011010104-5","doi-asserted-by":"crossref","unstructured":"Fan, B., Tantisiriroj, W., Xiao, L., & Gibson, G. (2009). DiskReduce: RAID for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage, Portland, OR (pp. 6-10).","DOI":"10.1145\/1713072.1713075"},{"key":"ijcac.2011010104-6","unstructured":"Hadoop (n. d.). The apache hadoop project. Retrieved from http:\/\/hadoop.apache.org\/"},{"key":"ijcac.2011010104-7","unstructured":"Hadoop ZooKeeper. (n. d.). The apache hadoop zookeeper project. Retrieved from http:\/\/hadoop.apache.org\/zookeeper\/"},{"key":"ijcac.2011010104-8","unstructured":"HBase. (n. d.). The apache hbase project. Retrieved from http:\/\/hbase.apache.org\/"},{"key":"ijcac.2011010104-9","doi-asserted-by":"crossref","unstructured":"Kim, S., Han, H., Jung, H., Eom, H., & Yeom, H. Y. (2010). Harnessing input redundancy in a MapReduce framework. In Proceedings of the ACM Symposium on Applied Computing, Sierre, Switzerland (pp. 
362-366).","DOI":"10.1145\/1774088.1774167"},{"key":"ijcac.2011010104-10","doi-asserted-by":"crossref","unstructured":"Lakshman, A., & Malik, P. (2009). Cassandra: Structured storage system over a P2P network. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (p. 5).","DOI":"10.1145\/1582716.1582722"},{"key":"ijcac.2011010104-11","doi-asserted-by":"publisher","DOI":"10.1145\/1773912.1773922"},{"key":"ijcac.2011010104-12","doi-asserted-by":"crossref","unstructured":"Leverich, J., & Kozyrakis, C. (2009). On the energy (in)efficiency of Hadoop clusters. In Proceedings of the Workshop on Power Aware Computing and Systems, Big Sky, MT (pp.61-65).","DOI":"10.1145\/1740390.1740405"},{"key":"ijcac.2011010104-13","doi-asserted-by":"publisher","DOI":"10.1145\/1773912.1773923"},{"key":"ijcac.2011010104-14","doi-asserted-by":"crossref","unstructured":"Silberstein, A., Cooper, B. F., Srivastava, U., Vee, E., Yerneni, R., & Ramakrishnan, R. (2008). Efficient bulk insertion into a distributed ordered table. In Proceedings of the International Conference on Management of Data (pp. 765-778).","DOI":"10.1145\/1376616.1376693"},{"key":"ijcac.2011010104-15","first-page":"149","article-title":"Chord: A scalable peer-to-peer lookup service for internet applications. In","volume":"29","author":"I.Stoica","year":"2001","journal-title":"Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications"},{"key":"ijcac.2011010104-16","doi-asserted-by":"crossref","unstructured":"Tian, C., Zhou, H., He, Y., & Zha, L. (2009). A dynamic MapReduce scheduler for heterogeneous workloads. In Proceedings of the 8th International Conference on Grid and Cooperative Computing, Lanzhou, China (pp. 
218-224).","DOI":"10.1109\/GCC.2009.19"},{"issue":"1","key":"ijcac.2011010104-17","first-page":"682","article-title":"Adaptively parallelizing distributed range queries.","volume":"2","author":"Y.Vigfusson","year":"2009","journal-title":"Very Large Data Base Endowment"},{"key":"ijcac.2011010104-18","doi-asserted-by":"crossref","unstructured":"Zaharia, M., Borthakur, D., Sarma, J. S., Elmeleegy, K., Shenker, S., & Stoica, I. (2010). Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, Paris, France (pp. 265-278).","DOI":"10.1145\/1755913.1755940"},{"key":"ijcac.2011010104-19","unstructured":"Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. H., & Stoica, I. (2008). Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (pp. 29-42)."},{"key":"ijcac.2011010104-20","doi-asserted-by":"crossref","unstructured":"Zou, Y. Q., Liu, J., Wang, S. C., Zha, L., & Xu, Z. W. (2010). CCIndex: A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In Proceedings of the 7th IFIP International Conference on Network and Parallel Computing, Zhengzhou, China (pp. 
247-261).","DOI":"10.1007\/978-3-642-15672-4_22"}],"container-title":["International Journal of Cloud Applications and Computing"],"original-title":[],"language":"ng","link":[{"URL":"https:\/\/www.igi-global.com\/viewtitle.aspx?TitleId=53142","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,6,2]],"date-time":"2022-06-02T00:55:59Z","timestamp":1654131359000},"score":1,"resource":{"primary":{"URL":"https:\/\/services.igi-global.com\/resolvedoi\/resolve.aspx?doi=10.4018\/ijcac.2011010104"}},"subtitle":["Recent Directions in Data Computing for Internet Services"],"short-title":[],"issued":{"date-parts":[[2011,1,1]]},"references-count":21,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2011,1]]}},"URL":"https:\/\/doi.org\/10.4018\/ijcac.2011010104","relation":{},"ISSN":["2156-1834","2156-1826"],"issn-type":[{"value":"2156-1834","type":"print"},{"value":"2156-1826","type":"electronic"}],"subject":[],"published":{"date-parts":[[2011,1,1]]}}}