In today’s enterprise world, information about business entities, such as a customer’s or patient’s name, address, and social security number, is often present in both relational databases and content repositories. In databases, this information is generally well protected by well-defined, fine-grained access control. Current document retrieval systems, however, do not provide user-specific, fine-grained redaction of documents to prevent leakage of information about business entities. This leaves companies with only two choices: either provide complete access to a document, risking information leakage, or prohibit access to the document altogether, accepting a potentially negative impact on business processes. In this paper, we present ZoRRo, an add-on for document retrieval systems that dynamically redacts sensitive information about the business entities referenced in a document, based on the access control defined for those entities. ZoRRo exploits the fine-grained, label-based access-control mechanism of database systems to identify and redact sensitive information in unstructured text according to the access privileges of the user viewing it. To make on-the-fly redaction feasible, ZoRRo combines the concept of \(k\)-safety with Lucene-based indexing and scoring. We demonstrate the efficiency and effectiveness of ZoRRo through a detailed experimental study.
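As a rough illustration of the idea of user-specific redaction driven by labels on entity attributes, the following Python sketch masks attribute values in a document when their label exceeds the viewer's clearance. It is a minimal sketch under stated assumptions, not ZoRRo's implementation: the level names, the `LabeledValue` type, and the `redact` function are hypothetical, matching is plain case-insensitive string matching, and neither \(k\)-safety nor Lucene-based indexing and scoring is modeled.

```python
import re
from dataclasses import dataclass

# Hypothetical security levels, ordered from least to most restricted.
LEVELS = {"public": 0, "confidential": 1, "secret": 2}


@dataclass(frozen=True)
class LabeledValue:
    value: str  # attribute value of a business entity, as stored in the database
    label: str  # security label attached to that attribute (assumed scheme)


def redact(text, entity_values, user_level, mask="[REDACTED]"):
    """Mask every occurrence of a value whose label exceeds the user's level."""
    hidden = [v.value for v in entity_values
              if LEVELS[v.label] > LEVELS[user_level]]
    if not hidden:
        return text
    # Try longer values first so a short value never splits a longer one.
    alternatives = sorted(hidden, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in alternatives),
                         re.IGNORECASE)
    # A single pass avoids re-matching text inside masks already inserted.
    return pattern.sub(lambda _match: mask, text)


# Illustrative data only.
patient = [
    LabeledValue("Jane Doe", "confidential"),
    LabeledValue("123-45-6789", "secret"),
    LabeledValue("Springfield", "public"),
]
note = "Jane Doe (SSN 123-45-6789) was admitted in Springfield."

for level in ("public", "confidential", "secret"):
    print(level, "->", redact(note, patient, level))
```

The same document yields a different view per user: a viewer cleared only for public data sees both the name and the social security number masked, while a viewer with the highest clearance sees the text unchanged.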

Name withheld due to legal and privacy reasons.
The original as well as the redacted documents can be downloaded from https://docs.google.com/file/d/0B_Kz16TTj09IMzVLRGdFMkJwZFU/edit?usp=sharing.
Deshpande, P.M., Joshi, S., Dewan, P. et al. The Mask of ZoRRo: preventing information leakage from documents. Knowl Inf Syst 45, 705–730 (2015). https://doi.org/10.1007/s10115-014-0811-6
DOI: https://doi.org/10.1007/s10115-014-0811-6