Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization | Journal of Computer Science and Technology Skip to main content
Log in

Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

We describe a data deduplication system for backup storage of PC disk images, named in-RAM metadata utilizing deduplication (IR-MUD). In-RAM hash granularity adaptation and miniLZO based data compression are firstly proposed to reduce the in-RAM metadata size and thereby reduce the space overheads required by the in-RAM metadata caches. Secondly, an in-RAM metadata write cache, as opposed to the traditional metadata read cache, is proposed for further reducing metadata-related disk I/O operations and improving deduplication throughput. During deduplication, the metadata write cache is managed following the LRU caching policy. For each manifest that is hit in the metadata write cache, an expensive manifest reloading operation from the disk is avoided. After deduplication, all the manifests in the metadata write cache are cleared and stored on the disk. Our experimental results using 1.5 TB real-world disk image dataset show that 1) IR-MUD achieved about 95% size reduction for the deduplication metadata, with a small time overhead introduced, 2) when the metadata write cache was not utilized, with the same RAM space size for the metadata read cache, IR-MUD achieved a 400% higher RAM hit ratio and a 50% higher deduplication throughput, as compared with the classic Sparse Indexing deduplication system where no metadata utilization approaches are utilized, and 3) when the metadata write cache was utilized and enough RAM space was available, IR-MUD achieved a 500% higher RAM hit ratio compared with Sparse Indexing and a 70% higher deduplication throughput compared with IR-MUD with only a single metadata read cache. The in-RAM metadata harnessing and metadata write caching approaches of IR-MUD can be applied in most parallel deduplication systems for improving metadata caching efficiency.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  1. Black J. Compare-by-hash: A reasoned analysis. In Proc. the USENIX Annual Technical Conference (ATC), May 2006, pp.85-90.

  2. Meister D, Kaiser J, Brinkmann A, Cortes T, Kuhn M, Kunkel J. A study on data deduplication in HPC storage systems. In Proc. the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2012, Article No. 7.

  3. Bloom B H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, July 1970, 13(7): 422-426.

  4. Lillibridge M, Eshghi K, Bhagwat D, Deolalikar V, Trezise G, Camble P. Sparse Indexing: Large scale, inline deduplication using sampling and locality. In Proc. the 7th USENIX Conference on File and Storage Technologies (FAST), February 2009, pp.111-123.

  5. Tanenbaum A S. Modern Operating Systems (2nd edition). Prentice Hall PTR, 2001.

  6. Zhou B, Wen J. Hysteresis re-chunking based metadata harnessing deduplication of disk images. In Proc. the 42nd IEEE International Conference on Parallel Processing (ICPP), October 2013, pp.389-398.

  7. Rabin M O. Fingerprinting by random polynomials. Technical Report, TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.

  8. Muthitacharoen A, Chen B, Mazi`eres D. A low-bandwidth network file system. In Proc. the 18th ACM Symposium on Operating Systems Principles, October 2001, pp.174-187.

  9. Roma´nski B, Heldt T, Kilian W et al. Anchor-driven subchunk deduplication. In Proc. the 4th Annual International Conference on Systems and Storage (SYSTOR), May 2011, pp.16:1-6:13.

  10. Tolia N, Kozuch M, Satyanarayanan M, Karp B, Bressoud T, Perrig A. Opportunistic use of content addressable storage for distributed file systems. In Proc. the USENIX Annual Technical Conference (ATC), June 2003, pp.127-140.

  11. Bhagwat D, Eshghi K, Long D D E, Lillibridge M. Extreme Binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.

  12. Shilane P, Huang M, Wallace G, Hsu W. WAN-optimized replication of backup datasets using stream-informed delta compression. Transactions on Storage, 2012, 8(4): 13:1-13:26.

  13. Tuduce I C, Gross T. Adaptive main memory compression. In Proc. the USENIX Annual Technical Conference (ATEC), April 2005, pp.237-250.

  14. Wilson P R, Kaplan S F, Smaragdakis Y. The case for compressed caching in virtual memory systems. In Proc. the USENIX Annual Technical Conference (ATEC), June 1999, pp.101-116.

  15. Gupta D, Lee S, Vrable M, Savage S, Snoeren A C, Varghese G, Voelker G M, Vahdat A. Difference engine: Harnessing memory redundancy in virtual machines. Commun. ACM, 2010, 53(10): 85-93.

    Article  Google Scholar 

  16. Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 1977, 23(3): 337-343.

    Article  MathSciNet  MATH  Google Scholar 

  17. Jin K, Miller E L. The effectiveness of deduplication on virtual machine disk images. In Proc. the 2nd Annual International Systems and Storage Conference: The Israeli Experimental Systems Conference (SYSTOR), May 2009, pp.7:1-7:12.

  18. Zhu B, Li K, Patterson H. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. the 6th USENIX Conference on File and Storage Technologies (FAST), February 2008, Article No. 18.

  19. Wildani A, Miller E, Rodeh O. HANDS: A heuristically arranged non-backup inline deduplication system. In Proc. the 29th IEEE International Conference on Data Engineering, April 2013, pp.446-457.

  20. Botelho F C, Shilane P, Gang N, Hsu W. Memory efficient sanitization of a deduplicated storage system. In Proc. the 11th USENIX Conference on File and Storage Technologies (FAST), February 2013, pp.81-84.

  21. Alameldeen A R, Wood D A. Adaptive cache compression for high-performance processors. In Proc. the 31st Annual International Symposium on Computer Architecture (ISCA), June 2004, pp.212-223.

  22. Debnath B, Sengupta S, Li J. ChunkStash: Speeding up inline storage deduplication using flash memory. In Proc. the USENIX Annual Technical Conference (ATC), June 2010, Article No. 16.

  23. Meister D, Brinkmann A. dedupv1: Improving deduplication throughput using solid state drives (SSD). In Proc. the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), May 2010.

  24. Chen F, Luo T, Zhang X. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th USENIX Conference on File and Stroage Technologies (FAST), February 2011, Article No. 6.

  25. Agrawal N, Prabhakaran V, Wobber T, Davis J D, Manasse M, Panigrahy R. Design tradeoffs for SSD performance. In Proc. the USENIX Annual Technical Conference (ATC), June 2008, pp.57-70.

  26. Kaiser J, Brinkmann A, Süβ T, Meister D. Deriving and comparing deduplication techniques using a model-based classification. In Proc. the 10th European Conference on Computer Systems (EuroSys), April 2015, pp.11:1-11:13.

  27. Srinivasan K, Bisson T, Goodson G, Voruganti K. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. the 10th USENIX Conference on File and Storage Technologies (FAST), February 2012, pp.299-312.

  28. Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage, 2006, 2(4): 424-448.

    Article  Google Scholar 

  29. Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In Proc. the 8th USENIX Conference on File and Storage Technologies (FAST), February 2010, Article No. 18.

  30. Lu G, Jin Y, Du D. Frequency based chunking for data deduplication. In Proc. IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), August 2010, pp.287-296.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Bing Zhou.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhou, B., Wen, JT. Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization. J. Comput. Sci. Technol. 31, 805–819 (2016). https://doi.org/10.1007/s11390-016-1664-0

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-016-1664-0

Keywords

Navigation