Abstract
In this chapter, we give a brief overview of the relatively new discipline of computer forensics and describe an investigation of forensic authorship attribution (identification) undertaken on a corpus of multi-author, multi-topic e-mail documents. We use an extended set of e-mail document features, such as structural characteristics and linguistic patterns, together with a Support Vector Machine as the learning algorithm. Experiments on e-mail documents written by different authors on a set of topics gave promising results for multi-topic and multi-author categorisation.
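As a rough illustration of what "structural characteristics and linguistic patterns" might look like in practice, the sketch below turns an e-mail body into a small dictionary of numeric style features. The function name `email_style_features`, the specific features, and the short function-word list are assumptions made for illustration only; the abstract does not enumerate the chapter's actual feature set. Vectors of this kind could then be passed to a Support Vector Machine learner, which is not shown here.

```python
import re
import string

# Hypothetical function-word list, chosen for illustration only;
# it is not the feature set used in the chapter.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "for")


def email_style_features(body: str) -> dict:
    """Compute a few structural and linguistic style features of an e-mail body."""
    lines = body.splitlines()
    words = re.findall(r"[A-Za-z']+", body.lower())
    # Guard the denominators so that an empty message does not divide by zero.
    n_lines = max(len(lines), 1)
    n_words = max(len(words), 1)
    n_chars = max(len(body), 1)

    features = {
        # Structural characteristics (layout of the message).
        "num_lines": len(lines),
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "has_greeting": float(
            bool(re.match(r"\s*(hi|hello|dear)\b", body, re.IGNORECASE))
        ),
        # Linguistic patterns (word- and character-level style markers).
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocab_richness": len(set(words)) / n_words,
        "punctuation_ratio": sum(c in string.punctuation for c in body) / n_chars,
        "uppercase_ratio": sum(c.isupper() for c in body) / n_chars,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = words.count(fw) / n_words
    return features


if __name__ == "__main__":
    sample = "Hi Bob,\n\nThe report is in the folder.\n\nRegards,\nAl"
    for name, value in email_style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Features of this kind are meant to describe how a message is written rather than what it is about, which is one plausible reason such features are attractive when the same author writes on several topics.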
Copyright information
© 2002 Springer Science+Business Media New York
About this chapter
Cite this chapter
de Vel, O., Anderson, A., Corney, M., Mohay, G. (2002). E-Mail Authorship Attribution for Computer Forensics. In: Barbará, D., Jajodia, S. (eds) Applications of Data Mining in Computer Security. Advances in Information Security, vol 6. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0953-0_9
DOI: https://doi.org/10.1007/978-1-4615-0953-0_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-5321-8
Online ISBN: 978-1-4615-0953-0
eBook Packages: Springer Book Archive