Abstract
In today’s enterprise world, information about business entities, such as a customer’s or patient’s name, address, and social security number, is often present in both relational databases and content repositories. In databases, this information is generally well protected by well-defined, fine-grained access control. However, current document retrieval systems do not provide user-specific, fine-grained redaction of documents to prevent information about business entities from leaking through documents. This leaves companies with only two choices: either provide complete access to a document, risking information leakage, or prohibit access to the document altogether, accepting a potentially negative impact on business processes. In this paper, we present ZoRRo, an add-on for document retrieval systems that dynamically redacts sensitive information about the business entities referenced in a document, based on the access control defined for those entities. ZoRRo exploits the fine-grained, label-based access-control mechanism of database systems to identify and redact sensitive information in unstructured text according to the access privileges of the user viewing it. To make on-the-fly redaction feasible, ZoRRo exploits the concept of \(k\)-safety in combination with Lucene-based indexing and scoring. We demonstrate the efficiency and effectiveness of ZoRRo through a detailed experimental study.







Notes
Name withheld due to legal and privacy reasons.
The original as well as the redacted documents can be downloaded from https://docs.google.com/file/d/0B_Kz16TTj09IMzVLRGdFMkJwZFU/edit?usp=sharing.
Cite this article
Deshpande, P.M., Joshi, S., Dewan, P. et al. The Mask of ZoRRo: preventing information leakage from documents. Knowl Inf Syst 45, 705–730 (2015). https://doi.org/10.1007/s10115-014-0811-6
DOI: https://doi.org/10.1007/s10115-014-0811-6