1 Introduction

Data Mining is a subfield of Knowledge Discovery which can be defined (Frawley et al. 1992) as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. Data Mining can be applied in many contexts. It is thus highly suitable for extracting implicit and useful information from the very large amount of data available on the World Wide Web (Web), e.g. user behavior.

2 Book description

The reviewed book is entitled “Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage.” It shows how Data Mining techniques can be applied to the Web. The book’s content is relatively close to that of other books dealing with Web Data Mining. It is organized into three sections that echo the book title: “Web Structure Mining” (two chapters), “Web Content Mining” (three chapters), and “Web Usage Mining” (four chapters). Every chapter includes examples and numerous exercises allowing readers to evaluate their understanding of the chapter.

2.1 Section 1: Web structure mining

This section is composed of two chapters addressing the way documents (web pages) can be retrieved from the Web based on their content and on the Web hyperlink structure. It mainly deals with Information Retrieval techniques.

2.1.1 Chapter 1: Information retrieval and web search

This chapter introduces the main Web Information Retrieval techniques. It presents the way information can be collected on the Web and then processed in order to retrieve and exploit it effectively. First, Web crawling techniques that exploit links to collect available documents are outlined. But collecting documents is not an end in itself: collected documents have to be processed in order to characterize their content. To do this, the chapter then introduces indexing and keyword search techniques. These techniques aim at identifying (indexing) the document content through a vector of weighted terms (relying on the vector space model). To construct this vector, the chapter covers stop word removal (removing words that do not reflect the real content, like a, to…) and the way terms are weighted (thanks to the tf-idf formula). The chapter also presents the way documents can be retrieved (ranked) for a specific keyword query. Several techniques aimed at improving this document ranking process are proposed:

  • Techniques that exploit users’ feedback,

  • Techniques that exploit the HTML document structure to improve document indexing. For example, text occurring in an H1 tag can be considered more important, from the indexing (tf-idf) point of view, than text occurring in a P tag.
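The tf-idf weighting described above can be sketched in a few lines. This is a minimal illustration of the general technique, not the book’s own code; the function name, stop-word list, and toy documents are mine:

```python
import math

def tfidf_vectors(docs, stop_words=frozenset({"a", "to", "the", "of"})):
    """Build tf-idf weighted term vectors for a small document collection."""
    # Tokenize and remove stop words
    tokenized = [[t for t in d.lower().split() if t not in stop_words]
                 for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = {}
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for toks in tokenized:
        vec = {}
        for t in set(toks):
            tf = toks.count(t)           # raw term frequency
            idf = math.log(n / df[t])    # inverse document frequency
            vec[t] = tf * idf
        vectors.append(vec)
    return vectors

docs = ["the web is a graph of pages",
        "mining the web uncovers patterns",
        "graph mining finds patterns"]
vecs = tfidf_vectors(docs)
```

Note how a term occurring in every document gets an idf of zero, which is exactly the intuition behind down-weighting uninformative words.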

The last two parts of this chapter address two main issues in the Information Retrieval field: evaluation techniques (precision/recall...) and similarity search techniques. Similarity search relies on well-known measures such as the Cosine and Jaccard similarities.
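The two similarity measures mentioned above can be sketched directly; this is a generic illustration (sparse vectors as dictionaries, my own toy data), not code from the book:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(u, v):
    """Jaccard similarity between the term sets of two documents."""
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = {"web": 1.0, "mining": 2.0}
d2 = {"web": 2.0, "usage": 1.0}
```

Cosine compares weighted vectors (directions), while Jaccard ignores weights and only compares term sets.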

To sum up, this chapter introduces the main techniques covering the Information Retrieval process as a whole. It presents the main techniques for characterizing document content, which are then exploited in the processes developed in later chapters. So it is not surprising that this chapter is the longest one in the book.

2.1.2 Chapter 2: Hyperlink-based ranking

This chapter is very important. It concerns the way the hyperlink structure existing between documents can be exploited to improve document ranking. It introduces the basics of Social Network Analysis and points out hyperlink techniques used by famous search engines such as Google. The PageRank (link-based ranking) and HITS (mixed content-based and link-based ranking) algorithms are illustrated, as well as the concepts of authorities and hubs.
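The PageRank algorithm discussed in this chapter can be sketched as a simple power iteration. This is a minimal generic implementation (my own graph representation and damping factor of 0.85), not the book’s version:

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)  # split rank among out-links
                for q in outs:
                    new[q] += d * share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(graph)
```

Here C ends up with the highest rank, since it is pointed to by both A and B: a page’s importance comes from the importance of the pages linking to it.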

2.2 Section 2: Web content mining

This section addresses the way to turn “Web Data into Web Knowledge”. It deals with clustering and classification (categorization). These techniques combine information retrieval, machine learning, and data mining approaches to organize Web content. This section is composed of three chapters covering, respectively, clustering, the evaluation of clustering, and classification techniques.

2.2.1 Chapter 3: Clustering

Clustering is an unsupervised learning method aimed at organizing documents into (hierarchical) clusters, i.e. groups of homogeneous/similar documents. This chapter illustrates several clustering techniques such as:

  • Hierarchical agglomerative clustering (HAC), which builds a hierarchy of clusters by iteratively merging the most similar ones,

  • k-Means clustering, which exploits statistical measures (centroids) to construct a flat partition of the documents,

  • Probability-based techniques,

  • Collaborative Filtering, aiming at constructing clusters of users according to their judgments/ratings of documents.
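The k-means technique listed above can be sketched in pure Python. This is a minimal generic version on 2-D points (my own toy data), not the book’s document-clustering code:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)  # pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance)
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if a cluster becomes empty
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
```

On these two well-separated groups, the algorithm converges to the obvious partition in a few iterations regardless of the initial centroids.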

2.2.2 Chapter 4: Evaluating clustering

This chapter is directly linked to the previous one since it presents common approaches used to evaluate the clusters obtained by clustering techniques. It is organized around the different types of approaches aiming at measuring the “quality” of clusters:

  • Similarity-based criterion functions (sum of squared errors),

  • Probabilistic criterion functions (Category utility),

  • MDL-based model and feature evaluation (evaluating regularities in data),

  • Classes-to-clusters evaluation (accuracy),

  • Precision, recall, F-measure (error cost),

  • Entropy (“impurity” of cluster content).
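The entropy criterion listed above has a very short generic implementation: a pure cluster (one class) has entropy 0, a maximally mixed one has the highest entropy. A minimal sketch with my own toy labels, not the book’s code:

```python
import math

def cluster_entropy(labels):
    """Entropy of the class labels inside one cluster (0 = perfectly pure)."""
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

pure = cluster_entropy(["sports"] * 4)            # one class only
mixed = cluster_entropy(["sports", "news"] * 2)   # 50/50 split of two classes
```

A 50/50 split of two classes gives exactly 1 bit of entropy, the worst case for two classes.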

2.2.3 Chapter 5: Classification

Classification consists in associating documents with existing labeled classes through supervised machine learning methods. This chapter illustrates the most common classification techniques as well as the evaluation of such techniques:

  • Cross-validation evaluation measure (Leave-one-out cross validation LOO-CV)

  • Nearest-neighbor algorithm (k-NN),

  • Feature selection (based on Entropy/Information gain),

  • Naïve Bayes algorithm (Probabilistic method),

  • Numerical approaches such as regression and Support Vector Machines (SVM),

  • Relational learning (rule-based learning) like FOIL.
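Two of the items listed above, k-NN classification and leave-one-out cross-validation, combine naturally into one short sketch. This is a generic illustration with my own toy data, not the book’s code:

```python
def knn_predict(train, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(train,
                     key=lambda pl: sum((a - b) ** 2
                                        for a, b in zip(pl[0], x)))[:k]
    labels = [l for _, l in nearest]
    return max(set(labels), key=labels.count)

def loo_accuracy(data, k=3):
    """Leave-one-out cross-validation: each point is classified by all the others."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
acc = loo_accuracy(data)
```

With well-separated classes like these, LOO-CV accuracy is 1.0; the value drops as soon as the classes overlap, which is exactly what the evaluation is meant to detect.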

2.3 Section 3: Web usage mining

This section addresses the important issue of understanding the use of Web content in order to identify issues or patterns. It is principally based on log files storing the activity of the different users. It is composed of four chapters covering the different steps of the data mining process: log file content and format, log file preprocessing, exploratory data analysis, and lastly, modeling for web usage.

2.3.1 Chapter 6: Introduction to web usage mining

This chapter first introduces the basics of a standard Web Usage Mining process and the clickstream analysis issue. Then the log file content is illustrated and every field of a common log file is explained. This chapter is important for readers who want to quickly implement a Web Usage Mining tool.
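The log file fields the chapter explains follow the widely used Common Log Format, which can be parsed with a single regular expression. A minimal sketch (the sample line and field names are my own), not code from the book:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Parse one Common Log Format entry into a field dictionary."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = ('192.168.0.1 - jdoe [10/Oct/2007:13:55:36 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326')
entry = parse_log_line(line)
```

Each named group maps to one of the fields the chapter describes (remote host, user, time stamp, HTTP request, status code, response size).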

2.3.2 Chapter 7: Preprocessing for web usage mining

This chapter addresses the difficult issue of preprocessing log data. Indeed, raw data cannot be exploited directly: they have to be cleaned to limit noise and so improve the usage modeling.

This chapter illustrates the main steps:

  • Data cleaning and filtering (extracting the relevant fields such as date/time and HTTP request, adding a time stamp to order the entries...),

  • De-spidering the web log file (removing non-relevant entries, such as those generated by a crawler, identified for instance thanks to the user-agent field),

  • User identification (labeling data with a user ID),

  • Session identification (splitting the user data into different sessions).

Once this preprocessing is done, the data can be analyzed.
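The session identification step above is commonly implemented with a timeout heuristic: a gap longer than, say, 30 minutes between two requests of the same user starts a new session. A minimal sketch under that assumption (my own data layout), not the book’s code:

```python
def sessionize(requests, timeout=1800):
    """Split one user's time-ordered requests into sessions (30-minute timeout)."""
    sessions = []
    for ts, url in sorted(requests):
        # Gap since the last request too long (or no session yet): open a new one
        if not sessions or ts - sessions[-1][-1][0] > timeout:
            sessions.append([])
        sessions[-1].append((ts, url))
    return sessions

requests = [(0, "/home"), (120, "/products"), (300, "/cart"),
            (5000, "/home"), (5100, "/contact")]
sessions = sessionize(requests)
```

Here the 4700-second gap splits the five requests into two sessions; the timeout value is a tunable assumption, not a universal constant.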

2.3.3 Chapter 8: Exploratory data analysis for web usage mining

This chapter is dedicated to statistical techniques that can be used in order to analyze data. It illustrates five different indicators:

  • Number of visit actions,

  • Session duration,

  • Relationship between visit actions and session duration,

  • Average time per page,

  • Duration for individual pages.

Thanks to these indicators, modeling techniques can be applied to extract unknown information such as patterns or user behavior.
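Several of the indicators listed above can be computed directly from a sessionized log. A minimal sketch with my own session representation (list of timestamped page requests), not the book’s code:

```python
def session_stats(session):
    """Basic usage indicators for one session: [(timestamp, url), ...] sorted by time."""
    times = [ts for ts, _ in session]
    n_actions = len(session)                 # number of visit actions
    duration = times[-1] - times[0]          # session duration
    # Time per page is only observable between consecutive requests,
    # so the last page of a session has no measurable duration.
    page_times = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    avg_time_per_page = (sum(page_times) / len(page_times)
                         if page_times else 0)
    return {"actions": n_actions, "duration": duration,
            "avg_time_per_page": avg_time_per_page}

stats = session_stats([(0, "/home"), (60, "/products"), (180, "/cart")])
```

The comment about the last page illustrates a well-known limitation of log-based indicators: the viewing time of the final page of a session is unknown.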

2.3.4 Chapter 9: Modeling for web usage mining

This chapter concerns the goal of all the previous steps (data preprocessing and analysis): extracting unknown information. It illustrates clustering techniques and rule-based approaches (regression trees and association rules) that can be applied to log data. Thus, this chapter presents:

  • Clustering method: the BIRCH clustering algorithm,

  • Affinity analysis: the Apriori algorithm (association rules),

  • Discretizing numerical variables: binning,

  • Regression trees: the CART and C4.5 algorithms.
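The Apriori technique listed above rests on one idea: an itemset can only be frequent if all its subsets are, so candidates are grown level by level. A minimal generic sketch with my own toy transactions (pages visited together), not the book’s code:

```python
def frequent_itemsets(transactions, min_support=2):
    """Apriori-style level-wise search for frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    frequent, size = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count each candidate's support (number of transactions containing it)
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Generate next-level candidates only from surviving itemsets
        keys = list(level)
        candidates = list({a | b for a in keys for b in keys
                           if len(a | b) == size})
    return frequent

transactions = [frozenset(t) for t in
                [{"/home", "/cart"}, {"/home", "/products"},
                 {"/home", "/cart", "/products"}, {"/cart", "/products"}]]
freq = frequent_itemsets(transactions)
```

Frequent itemsets like these are then turned into association rules (e.g. visitors of /home also visit /cart) by checking the confidence of each candidate rule.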

3 Discussion

The purpose of this book is to concretely show how Data Mining techniques can be applied to the Web in order to discover patterns in Web content, structure, and usage. It deals with a good range of practical, relevant, and mostly current issues facing Web Data Mining. Thanks to its concrete examples and exercises, this book offers good coverage of Web Data Mining for advanced undergraduate and graduate level courses. Moreover, since it can be read in a non-linear fashion, this book can be used in a variety of ways.

I can say that I really enjoyed reading this book. The style is accessible and easy to understand. I think that an introductory statistics and computing course is required to understand all the algorithms and statistical computations, particularly for undergraduate students.

For several reasons detailed below, I think that this book is mainly suited to people who want to quickly implement Web Data Mining tools. In my opinion, many aspects of this book appear superficial. Indeed, even if it is not a research book, many elements are missing.

First of all, I think the bibliography is quite limited. It could be completed with more references to allow the reader to obtain more precise information or a more formal presentation of each presented technique or model. For instance, Chap. 1, dedicated to Information Retrieval techniques, does not quote famous references like Baeza-Yates and Ribeiro-Neto (1999), Agosti and Smeaton (1996), or Grossman and Frieder (2004). These references would give readers many pointers to a more complete discussion of IR models and techniques. Additional readings relevant to many aspects detailed in this book are given in the next section (every reference is accompanied by a short comment explaining why I quote it). These additional readings concern association rules (Adamo 2001), Social Network Analysis (Wasserman and Faust 1994), and Web communities (Zhang et al. 2006).

Moreover, it is a bit surprising that most equations and algorithms are not associated with the related publications.

Furthermore, structured document IR (applied, for instance, to XML), which can be applied to “clean” web content, is not discussed in this book.

Lastly, the organization of this book means that the reader sometimes becomes disoriented. Indeed, many concepts are repeated within the book with different definitions (such as recall/precision or entropy), and many elements are expected in a specific chapter but appear in another one. For example, I do not really understand the organization of Chap. 1. Moreover, Chap. 4 is dedicated to clustering evaluation, whereas Chap. 5 combines classification and its evaluation.

All in all, the book is a success. Even if it does not really match my expectations, this book is really interesting for people who want to develop Web Data Mining tools. The strength of this book lies in:

  • Concrete examples that illustrate concepts and algorithms. Some of them are based on common software (Weka and Clementine/SPSS); the use of only “free” software would have been more appreciated,

  • Numerous and relevant exercises (with solutions on the Companion Web site),

  • An associated Companion Web site that can make the reading lively, since it should gather exercise solutions, slides.... (Note: the web site was still not online when I wrote this review, but the authors can send readers slides and data collections. It works! I tried it! ☺).

For all these reasons, I consider this book a great educational resource for students and teachers.

From a researcher’s point of view, I prefer the book by Liu (1998), which has similar content, because its discussion is deeper and it better corresponds to my expectations.