1. Introduction
In the real-world, many scenarios, such as transportation networks, social networks, communication networks, etc., can be effectively stored and represented using a graph structure. In the graph, nodes represent the objects in the scenario and edges represent the relationships between objects [
1]. Therefore, research can be conducted on analyzing the structure and properties of graph structures to understand such real-world complex systems. With the increasing complexity of real-world data, graph structures are characterized as massive, high-dimensional, sparse, heterogeneous, complex, and dynamic [
2]. Both symmetry and asymmetry information exist in the graph. For example, in the social network, the marriage relationship is symmetric while the filiation relationship is asymmetric [
3]. Therefore, how to learn useful knowledge from the graph has become one of the hot spots of research in artificial intelligence.
Graph embedding is an important technical tool for graph knowledge discovery. It aims to map graph data into a low-dimensional vector that can correctly represent some important information of the original graph [
4]. Moreover, graph embedding enables practical application problems such as node/graph classification [
5], node clustering [
6], connection prediction [
7], and recommendation systems [
8]. Initially, much of the research on graph embedding focuses on the homogeneous networks, which consist of only one node type and edge type [
9]. However, the relationships in the real-world are naturally modeled as heterogeneous networks, which contain richer and more complex semantics and structures. As more than one node type and edge type are provided, the graph embedding techniques for homogeneous networks cannot be directly applied to heterogeneous networks [
10]. Therefore, how to learn embedding of heterogeneous networks has become one of the hot topics of research in artificial intelligence [
11].
Initially, matrix decomposition methods [
12], i.e., singular value decomposition (SVD) [
13] and non-negative matrix factorization (NMF) [
14], are used to generate potential dimensional features in heterogeneous networks. The main idea of this kind of method is to represent the association information of a graph as a matrix, and then matrix decomposition is implemented on the association matrix to generate a low-dimensional vector of each node. However, the computational cost of decomposing large-scale matrices is usually very expensive, and there are also statistical performance drawbacks [
15]. At the same time, matrix decomposition is not easy to incorporate features of different types of nodes and contexts, which makes it impossible for matrix decomposition to obtain enough effective information. Subsequently, random walk-based approaches are proposed to learn the embedding of nodes in the graph. This kind of method evolves from the Word2vec model in natural language processing [
16], where different walk strategies are used to obtain the node sequence of the graph, and then the Skip-Gram word embedding model is used to complete the node embedding learning. For example, DeepWalk [
17] first randomly generates the neighbors of nodes in the network to form a fixed-length random walk sequence, while Node2vec model [
18] obtains the random walk sequence by adjusting the parameters of Breadth-First Search (BFS) and Depth-First Search (DFS) strategies. However, these two random walk strategies do not distinguish the forward direction, and when applied to a heterogeneous graph with various node types, it is easy to lead to statistical bias. Therefore, Metapath2vec [
19] is proposed to generate a sequence of heterogeneous nodes by introducing meta-paths to guide random walk, so that it can be used to capture the semantic and structural information between different types of nodes to ensure the correctness of semantic changes, and then the nodes in the sequence are treated as words and the Skip-Gram model is adopted to learn the embeddings of the heterogeneous graph. Indeed, Metapath2vec essentially transforms the heterogeneous graph embedding into a word embedding problem and can fully capture the structural and semantic relevance of different types of nodes and relations. Among them, word embedding models are the key to achieving good embedding results. For example, the Skip-Gram model [
20] is a local context window approach for learning word vectors and is a mainstream model for word vector learning. The global matrix decomposition model [
21] (e.g., LSA) is a word vector learning method, which is trained on separate local context windows. However, the former may be better at analogy tasks, but they are trained in separate local context windows with limited window lengths and do not make good use of global statistics of the data. Especially, the latter makes effective use of the statistical information but fails to capture the semantic information and so performs poorly in tasks such as the similarity of words [
22]. Then, Global Vectors for Word Representation (GloVe) model [
23], which incorporates the global statistical information of matrix decomposition (LSA), has the advantage of co-occurrence windows and global priori statistical information. Its core idea is to use the number of co-occurrences between words for training. It is efficient in training the learning of word vectors and can carry more semantic information. Zheng Yanan [
24] and others use GloVe to extract text features compared with Word2vec word vector, and then use SVM to classify text. Experiments show that using GloVe to extract features in text classification is better. Shishir Kulkarni [
25] and others use second-order random walk to create a corpus, and then generate node embeddings of the graph by transplanting the GloVe algorithm. Experiments show that the feature extraction by GloVe is also effective in extracting node features of the graph.
In this paper, an algorithm named Metapath–GloVe is proposed to improve the efficiency of the Metapath2vec algorithm. The main idea of this method is to train the heterogeneous node sequences, which are obtained by random walk based on Metapath2vec algorithm, and then use the GloVe model to complete the learning of node embedding in heterogeneous graphs [
26]. Following that, the proposed algorithm is applied to analyze the MOOC course data, which are collected from the MOOC online learning platform [
27]. The dataset is modeled into a heterogeneous graph by extracting the users, videos, course, and the asymmetric relationship among them. Based on the constructed heterogeneous graph, the proposed algorithm is used to transform the structured information of the heterogeneous graph into vector information. Subsequently, based on the learned user and video embedding vectors, a joint Spectral Co-clustering algorithm is used to perform association analysis of users and videos. Then, a classification-based link prediction model is constructed by obtaining the embedding vectors of user–video links to make video recommendations to users.
The main contributions of this paper can be summarized as follows: (1) A heterogeneous graph embedding learning algorithm is proposed by training GloVe word embedding model on meta-path random walks, which can improve the efficiency of heterogeneous node embedding learning. (2) The proposed algorithm is applied on the MOOC course data to learn embeddings of users and videos, which lays the foundation for the realization of subsequent analysis tasks. (3) The proposed method has better results than traditional methods for user and video association analysis and video recommendation experiments on MOOC course data.
The rest of the paper is organized as follows:
Section 2 introduces an improved Metapath2vec algorithm proposed in this paper in the context of the MOOC data used in this paper;
Section 3 presents the application of the algorithm proposed in this paper to MOOC data analysis;
Section 4 gives the experimental results of this paper and its analysis;
Section 5 makes a summary of the whole paper.
2. Learning MOOC Course Data Node Embedding Based on Metapath–GloVe
In this section, the proposed algorithm Metapath–GloVe is applied to the context of MOOC data. Firstly, the relationship between users, videos, and courses in MOOC data is extracted, and a heterogeneous graph of MOOC data is constructed. Secondly, random walk sequences, which contain nodes in the type user and video, are obtained from the heterogeneous graph by implementing meta-path random walk. Then, in view of the advantages of the GloVe model, this model is used to learn the node embedding of users and videos based on the meta-path sequences. Finally, an overall framework of the proposed algorithm for learning node embeddings in heterogeneous graphs is given.
2.1. The Construction of Heterogeneous Graph from MOOC Course Data
In general, a graph is structured data, which is essentially a collection of vertices or nodes connected together by edges. Usually, a graph is represented by , where and represent the set of all nodes and the set of all edges, respectively. Then, the types of nodes and the types of edges in the graph are described by defining the node type mapping functions such that , are the sets of node types, and the edge type mapping functions , , are the sets of edge types, respectively. Generally, graphs can be classified into homogeneous and heterogeneous graphs according to the types of nodes and edges. Homogeneous graphs refer to graphs with only one node type and one edge type, i.e., , which usually have a simple network structure. While heterogeneous graphs have more than one node type or edge type, i.e., , which contain richer semantic information and more complex network structure.
In this paper, users, videos, courses, and their relationships are extracted from MOOC course data, and a heterogeneous graph is built as shown in
Figure 1. The heterogeneous graph contains three types of nodes: users, videos, and courses, and different types of relationships exist between different nodes. Specifically, if a user watches a video, we consider that there is a link between the user and the video, and if the video belongs to a course, we consider that there is a link between the video and the course. In the following experiments, take this heterogeneous graph as the object, we carry out the main analysis of the relationship between the users and the videos, and courses are treated as labels to evaluate the proposed algorithms. Thus, we adopt
to represent the object graph, where
and
stand for users and videos, respectively, and
is the relationship between users and videos.
2.2. The Acquisition of Meta-Path Sequences Based on Heterogeneous Random Walk
When a heterogeneous graph is obtained, heterogeneous random walk is adopted to acquire node sequences with multiple types. Firstly, a pre-set meta-path scheme
is introduced in the heterogeneous graph to guide the process of random walk to generate node sequences that capture the semantic and structural correlations between different types of nodes. As shown in Equation (1), it is a defined meta-path scheme
.
where
denotes the combination of meta-paths between node
and node
. Finally, a meta-path scheme can be generated from the combination with a higher-order relation (nodes in different types). It is able to capture more complex and rich semantic relations than the first-order relation (nodes in same type) approach.
When a meta-path scheme
is given, the
-th transition probability of the meta-path random walk is defined in Equation (2).
where
is the node of type
,
denotes neighbors of node
in type
. Obviously, random walk is a biased under the predefined meta-path scheme
. Thus, meta-path based random walk can be used to capture the semantic and structural information among different types of nodes to ensure the correctness of semantic changes, and also to effectively avoid the statistical bias caused by the high proportion of a certain type of nodes, so that the heterogeneous graph can be effectively integrated into the subsequent model to complete the embedding of nodes.
In our MOOC course heterogeneous graph
, two meta-paths are constructed to form a meta-path scheme as shown in Equation (3).
In details, describes the relationship between users who have clicked on the same video describes the relationship among videos that have been clicked by the same user. Biased random walks are guided by the meta-path scheme to obtain the node sequences with multiple node types.
2.3. Learning Node Embedding Based on GloVe Model
When node sequences are obtained, they are treated as corpus and words, and GloVe model [
18] is adopted to learn the node embeddings. The GloVe model adopts both the overall statistics feature and the local context feature of the corpus. The progress of the GloVe model is shown as follows.
Step 1: The node sequences generated by random walk are treated as corpus and set as the input of the model.
Step 2: A global co-occurrence matrix is generated. A context window is set to traverse from the beginning to the end of the corpus, and then the number of simultaneous occurrences of two nodes are counted. For example, the co-occurrence matrix is represented as with element denote the number of times the words and appear together in a window.
Step 3: An approximate relationship between the word vectors and the co-occurrence matrix is constructed as shown in Equation (4).
where
and
are the word vectors of the word
, and
,
, and
are the bias terms of the two word vectors. The equation means the inner product of word vectors converge to the logarithmic value of the co-occurrence matrix.
Step 4: The loss function
is constructed based on the approximate relationship using the mean square error, as shown in Equation (5).
where
is the dimension of the co-occurrence matrix. Here,
is the weight function with the characteristics shown in Equation (6).
is non-degeneracy, and the weights do not decrease as the number of co-occurrences increases. In this paper,
.
In this paper, AdaGrad [
28] is adopted to train the model. Initially, all non-zero pairs
and
in the co-occurrence matrix
are randomly sample as training data, the corresponding word vectors
,
and two biases
and
are randomly initialized. Then the initial loss value can be calculated. In the following, the gradient can be calculated by a given learning rate; thus,
,
,
,
can be updated. The above process continues until the given iteration conditions reached. Finally, the output vectors are the embeddings feature representation of each node. This obtained embeddings can correctly express the semantic information of the whole heterogeneous network and the relationship between each node, and can be used for node classification output, clustering, or similarity search.
2.4. An Overall Process of Metapath–GloVe Algorithm
In this paper, the proposed Metapath–GloVe algorithm is used to analyze the MOOC course data, which are collected from the MOOC platform. The overall process of learning the node embeddings of MOOC course data based on Metapath–GloVe is shown in
Figure 2. Firstly, users, videos, courses, and the relationship among them are extracted from MOOC coursed data to construct a heterogeneous graph of MOOC course data as shown in
Figure 1. In the graph, we focus on the relationship between users and videos, while courses are treated as labels for evaluation. Then, meta-paths are pre-set to guide random walk to obtain heterogeneous node sequences. The obtained heterogeneous node sequences are input to the GloVe model as a text corpus, and the co-occurrence matrix is obtained statistically. Then, the word vector training is performed based on this co-occurrence matrix, and the final embedding feature vector matrix is obtained through the optimization of the loss function. Based on the learned node embeddings, tasks such as association segmentation and user–video link prediction can be achieved.
3. Data Analysis of MOOC Course Data Based on Metapath–GloVe
After learning the node embeddings based on Metapath–GloVe, two methods are proposed to analyze the MOOC course data: the node embedding-based association partitioning method and the node embedding-based user–video link prediction method. The details are shown in the following.
3.1. User/Video Association Analysis
Based on the learned node embeddings, the clustering algorithm can be adopted to achieve association analysis. The specific experimental process is shown in
Figure 3.
Figure 4 demonstrates the effect of clustering analysis on MOOC course data. In details,
Figure 4a shows the adjacency matrix of a graph,
Figure 4b shows the learned node embedding feature matrix, and
Figure 4c shows the result of clustering analysis. It can be seen from the figures,
Figure 4a,b do not contain the regular information, while in
Figure 4c, the association between users and videos can be extracted easily after clustering analysis.
In this paper, Spectral Co-clustering [
29] is implemented on the embeddings of the user and video to perform association partitioning. The processing is shown as follows.
Step 1: An input matrix is constructed by set users and videos as the rows and columns of the matrix, respectively. In the matrix, each element can be obtained by calculating the cosine similarity of corresponding user and video embeddings.
Step 2: The input matrix
is preprocessed as follows.
where
is a diagonal matrix, where the elements
are equal to
, and
is a diagonal matrix, where the elements
are equal to
.
Step 3: Singular value decomposition is implanted on
A as shown in Equation (8).
It yields a partition of in rows and columns, which a subset of the singular vectors on the left gives row partitions, and a subset of the singular vectors on the right gives column partitions.
Step 4: Calculate matrix
, which provides the required partitioning information by Equation (9).
Among them, the columns of are , and the columns of also have similar characteristics.
Step 5: The clustering result can be obtained by using k-means on all rows of .
3.2. User–Video Link Prediction Method Based on Node Embeddings
In this section, link prediction task is implemented on the MOOCCube data to evaluate the performance of Metapath–GloVe algorithm. Users are treated as nodes U, videos are treated as nodes V, and the relation between them are set as user–video links (U-V). The process of link prediction is shown in
Figure 5.
Step 1: The user–video pairs with existing links are considered as positive node pairs, forming a set E of link relationships, and all information is integrated to build a user–video bipartite graph .
Step 2: Randomly select 20% of positive node pairs as test samples and the remaining 80% of positive node pairs as training samples.
Step 3: All unlinked user–video pairs are treated as negative links, from which the same number of negative links are randomly selected to form the test and training sets, respectively.
Step 4: Generate a new user–video bipartite graph by removing the positive test link and learn the node embeddings of by using Metapath–GloVe.
Step 5: The edge embedding values of all positive and negative node pairs samples in the training and test sets are computed as the following formula.
where
is the feature vector of user nodes in a link, and
is the feature vector of video nodes in this link.
Step 6: The positive node pairs in the training and test sets are assigned a label value of 1, and the negative node pairs are assigned a label value of 0.
Step 7: The embedding values are used as input and the label values are used as output, which are input to different classifiers for training.
Step 8: The trained classifiers, i.e., Bagging classifier, Stacking classifier, and Neural Network (MLP) classifier, are used to predict the links in the test set for link prediction.