1 Introduction

Network embedding aims to learn a low-dimensional continuous representation for each node in a network from its link information. The obtained vectors reconstruct the intrinsic properties of the original network and can further be fed into traditional machine learning algorithms as features for downstream network analysis tasks such as node classification, clustering, and community detection.

However, in the real world, nodes in networks often carry a large number of attributes in addition to links, and these attributes often provide vital information, such as vocabularies in citation networks and user profiles in social networks. Traditional network embedding methods that consider the network structure alone can therefore hardly capture enough information and yield unsatisfactory performance. Take an intuitive example: a paper is more likely to fall into the same category as a paper with a similar vocabulary, which is difficult to judge from the reference relationship alone. Therefore, learning an embedding vector for each node that effectively preserves the heterogeneous information of network structure and attributes becomes a significant issue, which we call attributed network embedding.

Recently, there has been a surge of related work with great success [11, 20, 31]. Despite this success, three main problems remain. (1) Network structure consists of local structure, such as first-/second-order neighbors, and global structure, such as high-order proximity. Many attributed network embedding methods focus on only one of them, which leads to insufficient mining of the network structure. AANE [10] applies Laplacian Eigenmaps to the learned representations, so it only attends to the first-order neighbors of each node. SEANO [13] leverages the Skip-gram model to preserve the network structure, so the learned node representation focuses more on high-order proximity and pays little attention to the local structure. (2) To some extent, network structure and attributes are highly correlated, and there exist potential associations between them, which we call affinity. Nevertheless, most attributed network embedding methods seldom take this into account. For example, TADW [31] performs matrix factorization on the text information based on DeepWalk [21] and then combines the resulting topological structure embedding and attribute embedding as the final network representation. CAN [17] adopts variational autoencoders and regards the adjacency matrix and the attribute matrix as reconstruction targets without considering the correlation between them. (3) The learned network representation should be discriminative and suitable for various downstream network mining tasks, while some methods [12, 26] are end-to-end frameworks that are hard to apply to general tasks. More recently, graph neural networks have attracted much attention [12, 26, 30, 35], but they require label information to ensure the validity of the learned representation. In the real world, however, labels are rarely provided, so an unsupervised approach makes more sense than a supervised one.

To address the issues mentioned above, we propose a novel unsupervised structure-guided attributed network embedding with “centroid” enhancement (SANECE) to learn a discriminative embedding while preserving network structure and attribute information effectively and seamlessly. Specifically, we assume that each node has some similarity with its neighbor nodes. Thus we regard the local neighborhood of each node as a “cluster” and, by passing it through a graph convolutional network, we calculate a local “centroid” for each node. Based on the “centroid”, we design two objectives. For the first one, motivated by K-means, we impose a gap constraint between each node and its corresponding “centroid”. More specifically, we leverage a triplet regularizer and propose two negative sampling strategies, based on network structure or attributes, to find appropriate negative samples. The second is to enrich each node with its “centroid” to obtain complete information, and then feed it into a Skip-gram model to preserve the global structure of the network. By jointly optimizing the two building blocks, the final embedding can effectively reconstruct the local and global network structure. Moreover, attribute information is aggregated under the guidance of the network structure, so our SANECE model can capture the affinity underlying them and learn a smooth representation for each node. To sum up, our contributions are as follows:

  • We propose a novel unsupervised attributed network embedding model that effectively yet efficiently preserves network structure and attribute information while capturing the affinity between them. Besides, SANECE learns a discriminative network representation that favors downstream tasks.

  • We design two objectives based on the “centroid”; by optimizing them, the learned representation can reconstruct the local and global network structure. The ablation study shows that these two modules are complementary to each other, and both play important roles in learning a high-quality network representation.

  • We conduct extensive experiments on eight real-world attributed networks to verify the performance of our proposed model in terms of node classification, node clustering, and network visualization. The experimental results demonstrate the effectiveness of our model compared to state-of-the-art methods.

2 Related work

Network embedding aims at learning a representative, low-dimensional embedding of a high-dimensional network to support downstream tasks [5]. It can be traced back to earlier manifold learning algorithms such as LE [1], LLE [22], and Isomap [25], which extract low-dimensional manifold structures embedded in high-dimensional data. In recent years, network embedding has attracted much more research interest for dealing with network data. DeepWalk [21] first introduces the idea of word embedding [18] into network embedding by performing truncated random walks and training a Skip-gram model. Based on DeepWalk, node2vec [7] further improves the random walk by introducing parameters to explore the relevance and isomorphism between nodes. LINE [24] sets loss functions for first- and second-order proximity, respectively. GraRep [3] decomposes the high-order normalized adjacency matrix through singular value decomposition; however, its time complexity is too high to be applied to large-scale networks. NEU [32] proposes a simple yet efficient framework to preserve high-order proximity. Besides, HOPE [19] focuses on preserving asymmetric relationship information in the network. M-NMF [29] integrates the mesoscopic community structure into the embedding. However, all of the algorithms above only take the network structure into account and ignore the attribute information.

Some existing works incorporate node attributes in addition to the topological structure to obtain high-quality network embeddings. TADW [31] first considers text information based on DeepWalk under the framework of matrix factorization. Pan et al. [20] propose a tri-party network embedding algorithm that utilizes node information, text information, and labels to learn the network representation. AANE [10] learns a joint embedding of structure and attribute information through matrix factorization in a distributed way. However, these are all shallow models that can hardly capture the complex semantic features underlying the data. Recently, some models have proposed specific neural networks for attributed network embedding. By leveraging a neighbor enhancement stacked autoencoder and an attribute-aware Skip-gram model, ANRL [34] can capture the complicated relationship between structure and attributes. CAN [17] makes use of variational autoencoders to learn a unified representation for structure and attribute information; nonetheless, it hardly captures the affinity between topological structure and attributes. Other similar works include DANE [6], attri2vec [33], and PGE [9].

Recently, graph neural networks have also attracted considerable attention. GCN [12] is a classic representative that learns the representation of each node by gathering information from its neighbors. GAT [26] allocates attention to each neighbor on top of GCN. DGCN [35] uses a siamese GCN and adds a regularizer over its two outputs to align them. DEMO-Net [30] proposes a degree-specific graph neural network to solve node- and graph-level classification in attributed networks. Nevertheless, these are all end-to-end methods and need to be provided with label information for some nodes. Hamilton et al. [8] propose GraphSAGE, which can deal with unseen nodes while aggregating the features of local neighbors into the target node. DGI [27] is another unsupervised graph neural network method, whose main idea is to maximize the mutual information between a node's local patch and the high-level graph representation.

Fig. 1 The overview framework of SANECE

3 Method

3.1 Problem definition

Let \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}},{\mathbf {X}})\) denote an attributed network, where \({\mathcal {V}}=\{v_1,v_2,...,v_n\}\) is the set of n nodes and \({\mathcal {E}}\) is the set of edges between the nodes. \({\mathbf {X}}\in {\mathbb {R}}^{n \times m}\) is the node attribute matrix, where m is the number of attributes of each node and \({\mathbf {x}}_i\) represents the attribute vector of \(v_i\). \(\varGamma (\cdot )\) denotes the neighbor set of a particular node, and \(\Vert \cdot \Vert _F\) is the Frobenius norm. We define the attributed network embedding problem as follows:

Definition 1

(Attributed network embedding) Given an attributed network \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}},{\mathbf {X}})\), attributed network embedding aims to learn a representation \({\mathbf {h}}_i\in {\mathbb {R}}^{d}\) for each node \(v_i\in {\mathcal {V}}\), where d (\(d < m\)) is the dimension of the embedding, such that the obtained embedding matrix \({\mathbf {H}}\) preserves both the network structure and the attribute information of the original network \({\mathcal {G}}\) and supports downstream network mining tasks.

3.2 The proposed model: SANECE

In this section, we introduce our novel unsupervised attributed network embedding approach, SANECE. Figure 1 provides an overview. As shown in the figure, we feed the graph structure and attribute information into a graph convolutional network (GCN) and a fully connected network (FC) to calculate the “centroid” and the individual low-dimensional representation, respectively. We then optimize two objectives based on the obtained “centroid”: a “centroid”-based triplet regularizer and a “centroid”-augmented Skip-gram model, which together preserve the local-global network structure and the attribute information. The details are discussed in the following.

3.2.1 Preprocessing

Before extracting the various kinds of information, we preprocess the original attribute vector of each node with TF-IDF (term frequency–inverse document frequency), since it evaluates the importance of each attribute in the attribute set and yields a more effective attribute vector. In the following, we use \({\mathbf {x}}_i\) to denote the processed attribute vector.
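For concreteness, the following Python sketch illustrates this preprocessing step, assuming the raw attributes are bag-of-words counts; the use of scikit-learn's TfidfTransformer and the normalization choice are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

def preprocess_attributes(X_raw):
    """Re-weight a bag-of-words attribute matrix with TF-IDF.

    X_raw: (n_nodes, n_attributes) array of raw term counts.
    Returns a dense (n_nodes, n_attributes) TF-IDF matrix whose rows are x_i.
    """
    tfidf = TfidfTransformer(norm="l2")          # row-normalized TF-IDF weights
    return tfidf.fit_transform(X_raw).toarray()  # dense matrix for the encoders

# toy usage: 4 nodes, 5 vocabulary terms
X_raw = np.array([[1, 0, 2, 0, 0],
                  [0, 1, 0, 1, 0],
                  [1, 1, 0, 0, 3],
                  [0, 0, 0, 2, 1]])
X = preprocess_attributes(X_raw)
print(X.shape)  # (4, 5)
```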

3.2.2 Neighborhood “centroid” encoder

We believe there exists some potential similarity between a node and its local neighbors. For example, in a citation network, most of the papers within a second-order reference relationship are about the same topic. In this paper, we regard each neighborhood within the range of first- and second-order neighbors as a “cluster” and employ a two-layer GCN [12] to calculate its “centroid”, since GCN can effectively aggregate the information in the neighborhood. We input the adjacency matrix and the attribute matrix into the GCN, and during propagation the attribute information of the neighborhood is aggregated under the guidance of the network structure. In particular, the two-layer GCN is defined as:

$$\begin{aligned} {\mathbf {Y}}^{(1)}&= \mathrm{ELU}(\tilde{{\mathbf {A}}}{\mathbf {X}}{\mathbf {W}}_{G}^{(1)}),\\ {\mathbf {Y}}^{(2)}&= \tilde{{\mathbf {A}}}{\mathbf {Y}}^{(1)}{\mathbf {W}}_{G}^{(2)}, \end{aligned}$$
(1)

where \({\mathbf {y}}^{(2)}_i\) is the low-dimensional “centroid” representation of node \(v_i\) obtained by the GCN, and \(\mathrm{ELU}\) [4] is a non-linear activation function. \(\tilde{{\mathbf {A}}}={\mathbf {D}}^{-\frac{1}{2}}\hat{{\mathbf {A}}} {\mathbf {D}}^{-\frac{1}{2}}\) is the normalized adjacency matrix with \(\hat{{\mathbf {A}}} = {\mathbf {A}}+ {\mathbf {I}}\), where \({\mathbf {I}}\) is the identity matrix, \({\mathbf {D}}\) is a diagonal matrix with \({\mathbf {D}}_{ii}=\sum _{j}\hat{{\mathbf {A}}}_{ij}\), and \({\mathbf {W}}_{G}^{(\cdot )}\) are the trainable convolutional filters of the different layers.

The two-layer graph convolution aggregates the neighborhood attribute information within two hops, which preserves the local network structure well. In this way, we obtain a low-dimensional vector that represents the attribute information of the first- and second-order neighbors of each node, which we call the local “centroid”.
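The following PyTorch sketch shows the two-layer propagation of Eq. (1) with the stated normalization; layer sizes and variable names are our own illustrative choices and not the authors' code.

```python
import torch
import torch.nn as nn

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used in Eq. (1)."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = torch.pow(A_hat.sum(dim=1), -0.5)
    return torch.diag(d_inv_sqrt) @ A_hat @ torch.diag(d_inv_sqrt)

class CentroidEncoder(nn.Module):
    """Two-layer GCN that outputs the local 'centroid' y_i^(2) for every node."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W2 = nn.Linear(hid_dim, out_dim, bias=False)
        self.act = nn.ELU()

    def forward(self, A_norm, X):
        Y1 = self.act(A_norm @ self.W1(X))   # first-hop aggregation
        Y2 = A_norm @ self.W2(Y1)            # second-hop aggregation (no activation)
        return Y2

# toy usage
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = torch.rand(3, 8)
enc = CentroidEncoder(8, 16, 4)
Y2 = enc(normalize_adj(A), X)  # (3, 4) centroid representations
```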

3.2.3 Individual attributes encoder

The other input of the SANECE model is the incidental attributes of each node. We feed them into an FC network:

$$\begin{aligned} \begin{aligned}&{\mathbf {Q}}^{(l)}= \mathrm{ELU}({\mathbf {W}}_F^{(l)}{\mathbf {Q}}^{(l-1)}+{\mathbf {b}}^{(l)}), \end{aligned} \end{aligned}$$
(2)

where \({\mathbf {W}}_{F}^{(\cdot )}\) and \({\mathbf {b}}^{(\cdot )}\) are the weights and biases of the fully connected network, and \({\mathbf {q}}_i^{(l)}\) denotes the low-dimensional representation of \(v_i\) after the l-th layer of the FC network. Note that the input of the first layer, \({\mathbf {Q}}^{(0)}\), is the attribute matrix \({\mathbf {X}}\).
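A minimal sketch of this attribute encoder follows; the layer sizes passed to the constructor are assumptions for illustration.

```python
import torch.nn as nn

class AttributeEncoder(nn.Module):
    """Fully connected encoder of Eq. (2): maps raw attributes x_i to q_i^(l)."""
    def __init__(self, dims):
        # dims, e.g. [m, 256, 128]: attribute size followed by hidden/output sizes
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ELU()]   # ELU after every layer
        self.net = nn.Sequential(*layers)

    def forward(self, X):
        return self.net(X)  # row i is q_i^(l)
```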

3.2.4 “Centroid”-based triplet regularizer

Motivated by K-means, a classic distance-based clustering algorithm, we apply a gap constraint between each node and its corresponding “centroid” to control the training of the neural networks (GCN and FC). Specifically, for a particular “centroid”, we adopt a triplet regularizer to reduce its distance from the center node of the “cluster” while increasing its distance from the nodes outside the “cluster”. For a particular node \(v_i\) and its “centroid” \({\mathbf {y}}_i^{(2)}\), we consider \({\mathbf {q}}_i^{(l)}\) as the positive sample and all nodes outside the “cluster” as negative samples. We define the triplet loss function as follows:

$$\begin{aligned} \begin{aligned} L_{cen} =&\sum _{v_i\in {\mathcal {V}}}[\Vert {\mathbf {y}}_i^{(2)}-{\mathbf {q}}_i^{(l)}\Vert _F^2 - \frac{1}{|{\mathcal {N}}_i|}\sum _{v_j\in {\mathcal {N}}_i }\Vert {\mathbf {y}}_i^{(2)}-{\mathbf {q}}_j^{(l)}\Vert _F^2\\&+ \alpha ]_+, \end{aligned} \end{aligned}$$
(3)

where \({\mathcal {N}}_i\) denotes the node set outside the “cluster” centered on \(v_i\), and \(\alpha \) represents the margin value.

However, directly computing \(L_{cen}\) is expensive because all nodes outside the “cluster” are used as negative samples. It is also irrational, since similar nodes that are far away from \(v_i\) are treated as negative samples and pushed apart in the embedding space. A simple solution is to randomly select some negative samples; this effectively improves efficiency, but the rationality of negative sample selection remains unsolved. Our solution is to select negative samples from the perspective of structure or attributes.

Structural negative sampling For the structural strategy, we adopt the Adamic/Adar (AA) index as the metric to select negative samples, assigning a weight to each node according to the degrees of the common neighbor nodes. Formally, the AA index is calculated as:

$$\begin{aligned} A(v_i,v_j)=\sum _{v_k\in \varGamma (v_i)\cap \varGamma (v_j)}\frac{1}{\log |\varGamma (v_k)|}, \end{aligned}$$
(4)

It is easy to see that the AA index measures the structural similarity of two nodes through the degrees of their common neighbors. We define the set of 1-hop and 2-hop neighbors centered on each node as a “cluster”, and find negative nodes outside the “cluster” through a specific negative sampling strategy. In principle, we could compute the similarity (by the AA index) between the center node and every node outside the “cluster” to obtain the negative node. However, the network contains structural outliers, such as nodes with degree 0, and selecting such outliers as negative samples is far less meaningful than other negative nodes; we therefore do not search the entire graph, which also prevents excessive attention to structural outliers. On the other hand, searching the whole graph would be inefficient. Therefore, we restrict the range of negative sampling from everything outside the “cluster” to the set of 3-hop neighbors of the given center node, and select the node with the lowest AA index with respect to the center node as the negative node.
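A minimal sketch of this structural strategy, assuming the graph is held in a NetworkX graph with integer node ids; function names are ours, and the handling of degree-1 common neighbors (where the AA term is undefined) is an illustrative assumption.

```python
import math
import networkx as nx

def aa_index(G, u, v):
    """Adamic/Adar index of Eq. (4): sum over common neighbors of 1/log(degree)."""
    common = set(G.neighbors(u)) & set(G.neighbors(v))
    return sum(1.0 / math.log(G.degree(k)) for k in common if G.degree(k) > 1)

def structural_negative(G, center):
    """Pick the 3-hop neighbor of `center` with the lowest AA index to it.

    1-/2-hop neighbors form the "cluster" and are excluded; restricting the
    search to the 3-hop ring avoids structural outliers and a whole-graph scan.
    """
    hops = nx.single_source_shortest_path_length(G, center, cutoff=3)
    ring3 = [v for v, d in hops.items() if d == 3]
    if not ring3:                       # fall back if no 3-hop neighbor exists
        return None
    return min(ring3, key=lambda v: aa_index(G, center, v))
```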

Attributed negative sampling For the attributed negative sampling strategy, we likewise define the set of 1-hop and 2-hop neighbors as a “cluster”; for a center node \(v_i\), we choose the node outside the “cluster” whose attribute vector is farthest from \({\mathbf {x}}_i\) as the attributed negative node. Similar to the structural strategy, to avoid excessive attention to attributed outliers (nodes with abnormal attribute vectors) and excessive computation, we narrow the negative sampling range from the entire graph to the 3-hop neighbors of each center node. We use the cosine distance in this paper.
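The attributed strategy mirrors the structural sketch above; the sketch below assumes nodes are indexed 0..n-1 so that rows of the attribute matrix X align with node ids.

```python
import numpy as np
import networkx as nx

def attributed_negative(G, X, center):
    """Pick the 3-hop neighbor whose attribute vector is farthest (cosine distance)
    from the center node's attribute vector X[center]."""
    hops = nx.single_source_shortest_path_length(G, center, cutoff=3)
    ring3 = [v for v, d in hops.items() if d == 3]
    if not ring3:
        return None
    xc = X[center]
    def cos_dist(v):
        denom = np.linalg.norm(xc) * np.linalg.norm(X[v]) + 1e-12  # avoid divide-by-zero
        return 1.0 - float(xc @ X[v]) / denom
    return max(ring3, key=cos_dist)
```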

In this way, the loss function for the triplet regularizer can be improved as:

$$\begin{aligned} \begin{aligned} L_{cen} = \sum _{v_i\in {\mathcal {V}}}[\Vert {\mathbf {y}}_i^{(2)}-{\mathbf {q}}_i^{(l)}\Vert _F^2 - \Vert {\mathbf {y}}_i^{(2)}-{\mathbf {q}}_j^{(l)}\Vert _F^2\ + \alpha ]_+, \end{aligned} \end{aligned}$$
(5)

where \(v_j\) is the negative sample selected by the strategy mentioned above.
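A minimal PyTorch sketch of the hinge-style regularizer in Eq. (5); `neg_idx[i]` is assumed to hold the negative node chosen by one of the sampling strategies above, and the default margin is illustrative.

```python
import torch

def centroid_triplet_loss(Y2, Ql, neg_idx, alpha=0.8):
    """L_cen of Eq. (5).

    Y2:      (n, d) centroid representations y_i^(2) from the GCN.
    Ql:      (n, d) individual representations q_i^(l) from the FC network.
    neg_idx: (n,) long tensor, neg_idx[i] = index j of the negative sample for node i.
    alpha:   margin.
    """
    pos = ((Y2 - Ql) ** 2).sum(dim=1)           # ||y_i^(2) - q_i^(l)||^2
    neg = ((Y2 - Ql[neg_idx]) ** 2).sum(dim=1)  # ||y_i^(2) - q_j^(l)||^2
    return torch.clamp(pos - neg + alpha, min=0).sum()
```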

3.2.5 “Centroid”-augmented Skip-gram model

So far, we have addressed the preservation of the local network structure. However, two main challenges remain. First, the model is still incapable of dealing with higher-order similarity. Second, because of the graph convolution operation, the learned network representation may be over-smoothed, which is disadvantageous for downstream network mining tasks. To this end, we propose a “centroid”-augmented Skip-gram model. We conduct truncated random walks to imitate sentences in NLP, as in previous works [7, 21], and feed them into the Skip-gram model. Given a particular center node \(v_i\), Skip-gram aims to maximize the probabilities of its context nodes \(v_c\). The loss function is defined as:

$$\begin{aligned} L_{glo}= -\sum _{v_i\in {\mathcal {V}}}\sum _{v_c\in {\mathcal {C}}_i} \log p(v_c| {\mathbf {y}}_i^{(2)}, {\mathbf {q}}_i^{(l)}) \end{aligned}$$
(6)

where \({\mathcal {C}}_i\) is the set of context nodes within a window of size b for \(v_i\), generated by truncated random walks, and \(p(v_c| {\mathbf {y}}_i^{(2)}, {\mathbf {q}}_i^{(l)})\) is the likelihood of the target context node \(v_c\) given the center node \(v_i\). Here we take both the individual and the “centroid” representations of \(v_i\) as input to enrich each node's information, so that the performance of the Skip-gram model can be improved. The probability is formally defined as:

$$\begin{aligned} p(v_c|{\mathbf {y}}_i^{(2)}, {\mathbf {q}}_i^{(l)}) = \frac{\exp {({\mathbf {v}}_c^\mathrm {T}({\mathbf {y}}_i^{(2)}\oplus {\mathbf {q}}_i^{(l)}))}}{\sum _{v\in {\mathcal {V}}}\exp {({\mathbf {v}}^{\mathrm {T}}({\mathbf {y}}_i^{(2)}\oplus {\mathbf {q}}_i^{(l)}))}}, \end{aligned}$$
(7)

where \({\mathbf {v}}_c\) denotes the representation of \(v_c\) when it acts as a context node and \(\oplus \) is the concatenation operation. Given the individual representation \({\mathbf {q}}_i^{(l)}\) and the “centroid” representation \({\mathbf {y}}_i^{(2)}\), we concatenate them as the enriched node representation of \(v_i\). The dot product between this representation and the representation of the target context node \(v_c\) gives the similarity between \(v_i\) and \(v_c\); the softmax function, with the sum of the similarities between \(v_i\) and all other nodes as the denominator, then yields the probability of each target context node \(v_c\) given \(v_i\). Due to the high computational cost of this softmax, we replace Eq. (6) with the negative sampling objective proposed in Word2vec [18]:

$$\begin{aligned} \begin{aligned} L_{glo} =&-\sum _{v_i\in {\mathcal {V}}}\sum _{v_c\in {\mathcal {C}}_i} \{\log \sigma ({\mathbf {v}}_c^\mathrm {T}({\mathbf {y}}_i^{(2)}\oplus {\mathbf {q}}_i^{(l)}))\\&+\sum _{s=1}^{|neg|}{\mathbb {E}}_{v^{'}\sim P_{n}(v)}\log \sigma (-{{\mathbf {v}}^{'}}^{\mathrm {T}}({\mathbf {y}}_i^{(2)}\oplus {\mathbf {q}}_i^{(l)}))\}, \end{aligned} \end{aligned}$$
(8)

where \(\sigma (x) = 1/(1+\exp (-x))\) and the negative samples \(v^{'}\) are drawn from the noise distribution \(P_{n}(v)\propto d_v^{0.75}\), where \(d_v\) is the degree of vertex v. By training the parameters with this objective, the learned representation not only preserves high-order proximity but also alleviates the over-smoothing problem caused by the GCN.
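The sketch below computes the negative-sampling objective of Eq. (8) for one batch of (center, context) pairs; keeping a separate trainable embedding table for nodes acting as contexts, and the way walks are batched, are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def skipgram_ns_loss(H, ctx_emb, centers, contexts, negatives):
    """L_glo of Eq. (8) for a batch of (center, context) pairs.

    H:         (n, 2d) enriched representations h_i = y_i^(2) ⊕ q_i^(l).
    ctx_emb:   (n, 2d) trainable embeddings of nodes acting as contexts (v_c, v').
    centers:   (B,) long tensor of center node indices from random walks.
    contexts:  (B,) long tensor of their context nodes within the window.
    negatives: (B, K) long tensor of K noise nodes per pair, drawn ~ d_v^0.75.
    """
    h = H[centers]                                   # (B, 2d)
    pos_score = (ctx_emb[contexts] * h).sum(dim=1)   # (B,)
    neg_score = torch.bmm(ctx_emb[negatives],        # (B, K, 2d)
                          h.unsqueeze(2)).squeeze(2) # (B, K)
    loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=1))
    return loss.sum()
```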

3.2.6 A joint learning model: SANECE

Overall, SANECE is a joint learning model. Taking the network structure and attribute information as inputs and passing them through a graph convolutional network and a fully connected network, we encode an individual representation and a corresponding local “centroid” for each node, respectively. Based on them, two branches learn the embedding jointly. As shown in Fig. 1, the upper branch is a “centroid”-based triplet regularizer, which aims to make the nodes in the network form a suitable distribution. The lower branch is a “centroid”-augmented Skip-gram model, which preserves the higher-order proximity of the network. By updating the parameters of the GCN and the FC network through these two objectives, SANECE fully reconstructs the structural and attribute information. Besides, throughout the model the network structure implicitly guides the aggregation of attribute information, which guarantees the affinity between them. Consequently, we combine the two loss functions and add a regularization term to prevent overfitting:

$$\begin{aligned} L_{reg}= \sum _{h=1}^{l}(\Vert {\mathbf {W}}_F^{(h)}\Vert _F^2+\Vert {\mathbf {b}}^{(h)}\Vert _F^2), \end{aligned}$$
(9)

thus the final loss function for SANECE is defined as:

$$\begin{aligned} \begin{aligned} L=&L_{cen}+\lambda L_{glo}+\eta L_{reg}\\ =&\sum _{v_i\in {\mathcal {V}}}[\Vert {\mathbf {y}}_i^{(2)}-{\mathbf {q}}_i^{(l)}\Vert _F^2- \Vert {\mathbf {y}}_i^{(2)}-{\mathbf {q}}_j^{(l)}\Vert _F^2 + \alpha ]_+\\ -&\lambda \sum _{v_i\in {\mathcal {V}}}\sum _{v_c\in {\mathcal {C}}_i} \{\log \sigma ({\mathbf {v}}_c^\mathrm {T}{\mathbf {h}}_i) +\sum _{s=1}^{|neg|}{\mathbb {E}}_{v^{'}\sim P_{n}(v)}\log \sigma (-{{\mathbf {v}}^{'}}^{\mathrm {T}}{\mathbf {h}}_i)\} \\ +&\eta \sum _{h=1}^{l}(\Vert {\mathbf {W}}_F^{(h)}\Vert _F^2+\Vert {\mathbf {b}}^{(h)}\Vert _F^2), \end{aligned} \end{aligned}$$
(10)

where \(\lambda \) and \(\eta \) are trade-off parameters and \({\mathbf {h}}_i={\mathbf {y}}_i^{(2)}\oplus {\mathbf {q}}_i^{(l)}\). We adopt the Adam optimizer to minimize the loss function in Eq. (10) until convergence.

Finally, we take the concatenated vector \({\mathbf {h}}_i\) as the final embedding vector, which preserves the local-global network structure and the node attributes while ensuring the affinity between them. The learning algorithm of SANECE is summarized in Algorithm 1.

Algorithm 1 The learning algorithm of SANECE
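A high-level sketch of the joint optimization is given below, reusing the helper functions from the earlier sketches in this section (CentroidEncoder, AttributeEncoder, centroid_triplet_loss, skipgram_ns_loss); batching, walk generation, and default hyperparameters are illustrative assumptions rather than the released training procedure.

```python
import torch

def train_sanece(gcn, fc, ctx_emb, A_norm, X, neg_idx, walk_batches,
                 lam=1.0, eta=5.0, alpha=0.8, lr=1e-3, epochs=100):
    """Jointly minimize Eq. (10).

    gcn, fc:       CentroidEncoder and AttributeEncoder from the sketches above.
    ctx_emb:       trainable (n, 2d) tensor (requires_grad=True) for context nodes.
    walk_batches:  callable yielding (centers, contexts, negatives) index batches
                   produced from truncated random walks.
    """
    params = list(gcn.parameters()) + list(fc.parameters()) + [ctx_emb]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for centers, contexts, negatives in walk_batches():
            Y2, Ql = gcn(A_norm, X), fc(X)
            H = torch.cat([Y2, Ql], dim=1)                     # h_i = y_i^(2) ⊕ q_i^(l)
            l_cen = centroid_triplet_loss(Y2, Ql, neg_idx, alpha)
            l_glo = skipgram_ns_loss(H, ctx_emb, centers, contexts, negatives)
            l_reg = sum((w ** 2).sum() for w in fc.parameters())  # Eq. (9)
            loss = l_cen + lam * l_glo + eta * l_reg              # Eq. (10)
            opt.zero_grad(); loss.backward(); opt.step()
    return torch.cat([gcn(A_norm, X), fc(X)], dim=1).detach()     # final embeddings h_i
```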

3.3 Complexity analysis

In this section, we analyze the complexity of SANECE. We denote by n, e, and m the number of nodes, edges, and attributes, respectively, and by d the final dimension of the node representation. Thanks to sparse matrix operations, the complexity of the GCN is \(O(emd^g_1d^g_2...d)\) [2], and that of the FC network is \(O(nmd^f_1d^f_2...d)\), where \(d^g_i\) and \(d^f_i\) denote the dimension of the i-th hidden layer of the GCN and the FC network, respectively. Besides, thanks to the negative sampling strategy, the complexity of Eq. (8) is \(O(ndb(1+|neg|))\), where b is the window size of the random walk and |neg| is the number of negative samples. Accordingly, treating the hidden dimensions as constants, the overall time complexity of the SANECE model is \(O(n+e)\), i.e., linear in the number of nodes and edges, which demonstrates that SANECE can be applied to large-scale networks.

Table 1 Statistics of datasets

4 Experiments

In this section, we conduct extensive experiments on eight attributed networks to verify the effectiveness of our proposed model. The code is publicly available (Footnote 1).

4.1 Datasets

We employ eight widely used datasets (Footnote 2). The statistics are shown in Table 1, and the details are as follows:

  • Citation Networks: Cora [12], Citeseer [15], and Pubmed [23] are three citation networks. Each node represents a paper, and each edge represents a citation link. The attribute vector of each node is the bag-of-words vector of the paper. Besides, each paper is assigned an academic area as its label.

  • Web Networks: Wiki [14] and WebKB [28] are both web networks in which each node is a web page and each edge represents a hyperlink between web pages. The attribute vector of each node is the vocabulary of the corresponding web page, and each web page belongs to a particular category. WebKB contains four university subnetworks, i.e., Cornell, Texas, Washington, and Wisconsin.

4.2 Experimental settings

In this section, we first introduce several state-of-the-art baselines. Then we describe the parameter settings of our proposed model and other baselines.

4.2.1 Baseline methods

  • DeepWalk, LINE, HOPE only consider the topological structure. DeepWalk [21] is the pioneer of network embedding, which utilizes the Skip-gram model proposed in Word2vec [18]. LINE [24] preserves both first-order and second-order proximity with two different loss functions. HOPE [19] obtains the embedding by decomposing a similarity matrix based on the Katz index.

  • DeepWalk*, LINE*, HOPE* are the attribute-incorporated versions of the corresponding methods without ‘*’: we reduce the attribute vectors to 128-dimensional vectors by SVD and concatenate them with the embeddings generated by the structure-only methods.

  • TADW, AANE, ANRL, CAN are attributed network embedding methods. TADW [31] takes text feature into account while conducting matrix factorization. AANE [10] learns representation based on the decomposition of attribute affinity. ANRL [34] utilizes a neighbor enhancement autoencoder and attribute-aware Skip-gram to learn the embedding. CAN [17] adopts a variational autoencoder to embed structure and attribute information.

  • GraphSAGE, DGI are graph neural networks-based attributed network embedding methods. GraphSAGE [8] aggregates the attributes of neighbors by sampling. DGI [27] learns representation by maximizing mutual information between patch- and graph-level representation.

  • SANECE Variants are variants of SANECE with different negative sampling strategies. SANECE-R samples randomly, while SANECE-S and SANECE-A select negative samples via the structural or attributed negative sampling strategy, respectively.

Regarding the complexity of the baselines, DeepWalk is \(O(n\log n)\), AANE is \(O(n^2)\), and TADW is \(O(n\cdot e)\). Their training time is not linear in the number of nodes or edges, so they take a long time to train on large-scale networks. The complexity of GraphSAGE and ANRL is O(n), and that of LINE, HOPE, and DGI is O(e). As for CAN and our SANECE variants, the computational complexity is \(O(n+e)\). Therefore, the complexity of these methods is linear in the number of nodes or edges, and they can easily be applied to large-scale networks.

Table 2 The detailed architecture of neural networks. For each dataset, the first column corresponds to GCN layer. The second column corresponds to FC layer

4.2.2 Parameter settings

Regarding the number of GCN layers, since the GCN layers aggregate neighborhood messages, we adopt two layers so that the second-order neighborhood defines a “cluster”. As for the number of neurons in each GCN and FC layer, we obtain the optimal setting through empirical experiments. The architecture of the neural networks of the SANECE model for all datasets is shown in Table 2. \(\eta \) is the trade-off parameter that controls the regularization term; we set it to 5 to prevent the model from overfitting. We grid-search the trade-off parameter \(\lambda \) over (0, 0.5, 1, 1.5, 2) and the margin \(\alpha \) over (0, 0.2, 0.4, 0.6, 0.8, 1.0). We generate \(\gamma = 20\) truncated random walks starting at each node, with a walk length of 10 and a window size of 20, to ensure that the distance between context nodes and the center node is sufficiently large to emphasize high-order similarity and global structure preservation. The learning rate is set to 0.001, and the size of the final node embedding is 128, which best preserves the various kinds of information in the network. More details on parameter sensitivity are given in Sect. 4.7. Besides, we adjust the batch size in different downstream tasks to achieve the optimal performance of the SANECE model.

We adopt the source code of the baselines from the original authors. For LINE, we combine the first- and second-order embeddings as the final embedding. For GraphSAGE, we use the mean aggregator. Moreover, we choose the WAN variant of ANRL. We adopt the following metrics to evaluate the performance of the different models: Micro-F1 and Macro-F1 for classification, and accuracy for clustering.

Table 3 Node classification results for Cora, Citeseer, Wiki, Pubmed w.r.t. Micro-F1 and Macro-F1
Table 4 Node classification results for WebKB w.r.t. Micro-F1 and Macro-F1

4.3 Node classification

A popular application of network embedding is node classification. After learning the network representation, we randomly sample \(30\%\) of the nodes with labels to train an SVM classifier, and the rest are used for testing. We conduct node classification on Cora, Citeseer, Wiki, and Pubmed. We set the batch size to 1024 for Pubmed and 512 for the others. Tables 3 and 4 summarize the classification results. From the results, we observe:

  • From the top 6 rows of the classification results, methods with ‘*’ show better performance than the corresponding structure-only algorithms without ‘*’, confirming the validity of incorporating attribute information into network embedding.

  • However, they do not reach the performance of the other attributed network embedding methods, indicating that simply concatenating structural and attribute information cannot capture the affinity between them and thus hurts the quality of the node representation.

  • LINE shows the worst performance among all baselines since it only preserves the local network structure but ignores the global one and attribute information.

  • SANECE variants outperform the competitors above in most cases, demonstrating the effectiveness of our proposed model. Both SANECE-A and SANECE-S perform better than SANECE-R, which shows that negative nodes chosen by attribute information or network structure are more meaningful than randomly selected ones. In addition, SANECE-A performs better than SANECE-S in most situations, indicating that in our model attributes contain more useful information and better reflect the relationships between nodes than the structure does.
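As a concrete reference for the evaluation protocol described above (30% of labeled nodes to train an SVM, the rest for testing), a minimal sketch follows; the linear-kernel SVM and stratified split are our own assumptions.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate_classification(H, labels, train_ratio=0.3, seed=0):
    """Train an SVM on `train_ratio` of the embeddings and report Micro-/Macro-F1."""
    H_tr, H_te, y_tr, y_te = train_test_split(
        H, labels, train_size=train_ratio, random_state=seed, stratify=labels)
    pred = LinearSVC().fit(H_tr, y_tr).predict(H_te)
    return (f1_score(y_te, pred, average="micro"),
            f1_score(y_te, pred, average="macro"))
```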

Table 5 Node clustering results for Cora, Citeseer, Wiki, WebKB w.r.t. accuracy

4.4 Node clustering

We conduct an unsupervised network mining task, node clustering, to evaluate our proposed model. After the learning process, we adopt K-means as the clustering algorithm and report the performance measured by accuracy. The batch size is set to 512 for Pubmed and 256 for the others. We report the clustering results in Table 5 and make the following key observations:

  • Similar to the classification results, traditional algorithms such as DeepWalk perform much worse than their counterparts with ‘*’ for lack of attribute information. Also, their comparison with other state-of-the-art attributed network embedding methods once again proves the importance of the affinity between network structure and attribute information.

  • Neural network-based attributed network embedding methods such as GraphSAGE, ANRL, CAN, and DGI show excellent performance on node clustering, which shows that neural networks can effectively extract the complex semantic information in the network.

  • From the clustering results, it is evident that SANECE outperforms ANRL, which indicates that our “centroid”-based triplet regularizer and “centroid”-augmented Skip-gram module preserve the network structure and attribute information better than the modules proposed in ANRL.

  • Consistent with the classification results, the SANECE variants achieve competitive performance against the state-of-the-art algorithms, confirming their capacity to generate meaningful embeddings for attributed networks. Moreover, SANECE-A performs the best among them, which demonstrates that it is more likely to find valuable negative nodes based on attributes than on structure, and that attribute information plays a more important role than network structure during negative sampling in our model.
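A minimal sketch of the clustering evaluation follows; it assumes clustering accuracy is computed with the usual Hungarian matching between predicted clusters and ground-truth labels, a detail not spelled out in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best one-to-one mapping between cluster ids and labels (Hungarian algorithm)."""
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)    # maximize matched pairs
    return cost[rows, cols].sum() / len(y_true)

def evaluate_clustering(H, labels, seed=0):
    """Run K-means on the embeddings and report accuracy against integer labels."""
    pred = KMeans(n_clusters=len(set(labels)), random_state=seed).fit_predict(H)
    return clustering_accuracy(np.asarray(labels), pred)
```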

Fig. 2 t-SNE visualization for Cora

4.5 Network visualization

To further demonstrate the rationality of our approach, we visualize the learned representation of the Cora dataset with the t-SNE [16] visualization tool. Each point represents a node in the network, and its color indicates the category to which the node belongs. We compare our SANECE-A with three representative methods: DeepWalk, ANRL, and DGI. As shown in Fig. 2, DeepWalk learns a confusing embedding in which the nodes are relatively scattered. The visualization of ANRL shows a general clustering division, as a result of taking the neighborhood structure and attribute information of nodes into account. DGI also shows good visualization performance, demonstrating the power of graph neural networks; however, many nodes still float between different clusters, which is disadvantageous for downstream tasks. In accord with our conjecture, our proposed model learns a more discriminative embedding with compact clusters and clear borders, which is more advantageous for downstream tasks.
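A minimal sketch of producing such a plot from the learned embeddings; the perplexity, colormap, and output path are our own illustrative defaults, and integer class labels are assumed.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(H, labels, out_path="cora_tsne.png"):
    """Project embeddings to 2-D with t-SNE and color points by class label."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(H)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
    plt.axis("off")
    plt.savefig(out_path, dpi=300)
```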

Table 6 Ablation study results on Cora with different training ratio

4.6 Ablation study

We further examine the effect of each component of SANECE. Specifically, we adopt SANECE-A as the default SANECE model and conduct node classification on the Cora dataset with different training ratios, reporting Micro-F1. The results are presented in Table 6. We use SANECE_cen and SANECE_glo to denote the models that use only the “centroid”-based triplet regularizer or only the “centroid”-augmented Skip-gram model, respectively. We observe that SANECE outperforms both SANECE_cen and SANECE_glo, proving that the two components are complementary and that jointly learning them improves the performance of the model. In addition, we feed the adjacency vector instead of the attribute vector into the SANECE model, then combine the resulting 128-dimensional embedding with the 128-dimensional attribute vector obtained by SVD as the final embedding, and denote this variant as SANECE*. It is easy to see that it performs worse than SANECE, indicating that there is a potential correlation between network structure and attribute information, namely affinity. Directly combining these two kinds of information loses this correlation, whereas our proposed method leverages the network structure to guide the aggregation of attribute information, adequately capturing the affinity and improving the performance of attributed network embedding.

Fig. 3 Analysis on \(\lambda \), \(\alpha \), the size of the local neighborhood and the final embedding

4.7 Parameter sensitivity

In this section, we investigate parameter sensitivity and evaluate the effect of \(\lambda \), \(\alpha \), the size t of the local neighborhood that we regard as a “cluster”, and the size of the final embedding. We report the Micro-F1 and accuracy for node classification and node clustering on Cora, Citeseer, Wiki, and Pubmed. The parameter \(\lambda \) represents the balance between the “centroid”-based triplet regularizer and the “centroid”-augmented Skip-gram model; the larger \(\lambda \), the more the SANECE model focuses on the latter module. We tune \(\lambda = (0,0.5,1,1.5,2)\); note that when \(\lambda \) equals 0, SANECE learns the representation using only the “centroid”-based triplet regularizer. From Fig. 3a, b, there is an interesting phenomenon: when \(\lambda =0\), the performance is poor, mainly due to the over-smoothing problem of the GCN. As \(\lambda \) grows larger, performance improves, which shows that the other module, by paying attention to the global network structure, can effectively alleviate this problem. Performance peaks at \(\lambda = 1\), showing that both modules are essential to reconstruct the local-global network structure and attribute information. The parameter \(\alpha \) controls the distance difference between the “centroid” and the positive and negative nodes; the larger \(\alpha \), the larger the margin. We grid-search \(\alpha \) over (0, 0.2, 0.4, 0.6, 0.8, 1.0). From Fig. 3c, d, when \(\alpha =0\) the performance is poor, because the “centroid” cannot clearly separate positive from negative nodes, which hurts the model. As \(\alpha \) increases, the performance does not change significantly, so the model is not sensitive to \(\alpha \). For the parameter t, we tune it over first-, second-, and third-order neighborhoods to choose an appropriate extent for the local neighborhood, with the remaining parameters unchanged. From Fig. 3e, f, second-order performs best in most cases; we believe first-order neighbors gather too little information, while third-order neighbors may introduce noise. We also analyze how the dimension of the final node representation affects performance in Fig. 3g, h. We adjust the number of neurons in the last layer of the neural networks to (32, 64, 128), so the final embedding size is (64, 128, 256). Performance improves as the embedding size increases, because more information can be stored with more dimensions; however, as the dimension grows further, noise may be introduced into the node representation, which affects downstream tasks. We choose \(\lambda =1\), \(\alpha =0.8\), the second-order neighborhood as a “cluster”, and a 128-dimensional final embedding vector as the defaults.

5 Conclusion

In this paper, we present SANECE, a novel unsupervised method for attributed network embedding that effectively preserves network structure and attribute information. Specifically, we calculate a “centroid” for each node according to its local structure through a graph convolutional network. Based on that, two modules, namely the “centroid”-based triplet regularizer and the “centroid”-augmented Skip-gram model, are applied to jointly exploit the local and global network structure. Besides, throughout the model, attribute information is aggregated under the guidance of the network structure to ensure the affinity between them and learn a smooth representation for each node. We conduct extensive experiments on several attributed networks, and the experimental results demonstrate that our SANECE model learns a discriminative network representation that is more advantageous for downstream network mining tasks compared to state-of-the-art methods.