Abstract
Named entity recognition (NER) is a fundamental task in natural language processing, which aims to detect mentions of real-world entities in text and classify them into predefined types. Recently, research on overlapped and discontinuous named entity recognition has received increasing attention. However, we note that few studies have considered both overlapped and discontinuous entities. In this paper, we propose a novel sequence-to-sequence model that is capable of recognizing both overlapped and discontinuous entities based on machine reading comprehension. The model uses the machine reading comprehension formulation to encode significant prior information about the entity category. The input sequence then passes through a question-answering model that predicts the mention relevance of the given source sentence to the query. Finally, we incorporate the mention relevance into a BART-based generation model. We conduct experiments on three types of NER datasets to show the generality of our model. The experimental results demonstrate that our model beats almost all the current top-performing baselines and achieves a substantial performance boost over current SOTA models on overlapped and discontinuous NER datasets.
1 Introduction
Named entity recognition (NER) is a fundamental task for many natural language processing applications, including information extraction, question answering, syntactic analysis and machine translation. It aims to identify mention spans in text input according to pre-defined entity categories, such as location, person, and organization [1, 2]. With the popularity of deep learning, NER methods have been extensively investigated and the current state-of-the-art NER models are well established; NER is generally solved as a sequence labeling problem. However, this framework struggles to handle complex real-world scenarios such as overlapped and discontinuous NER.
Traditionally, many previous approaches regarded NER as a sequence labeling problem [3,4,5], assigning a tag to each token in the sentence. Their underlying assumption is that an entity mention should be a short span of text and should not overlap with other mentions. However, overlapped and discontinuous entities appear in corpora from many domains, especially clinical corpora. Traditional approaches are unfortunately incapable of handling these special types of entities. For example, Fig. 1 shows an example that contains flat, overlapped and discontinuous entities. For overlapped NER, if a token belongs to multiple entities, it needs to be assigned to multiple categories. For discontinuous entities, if several spans belong to one mention, a method is needed to connect these spans.
Recently, various approaches for overlapped or discontinuous NER have been proposed. Quite a few researchers revised sequence labeling models to support overlapped entities using different strategies [6,7,8]. Some works adopt span-level methods [9,10,11,12,13], which enumerate all possible spans and perform span-level classification. However, these methods suffer from serious weaknesses. On the one hand, enumeration produces an excessive number of invalid spans. On the other hand, the pipeline strategy leads to error accumulation, since incorrectly segmented entity boundaries cause classification errors. Most importantly, these methods cannot recognize overlapped and discontinuous entities at the same time.
Different from the above studies, Li et al. [14, 15] first treated NER as a machine reading comprehension (MRC) task and proposed a new framework that is capable of handling both flat and overlapped NER. Though their models achieve state-of-the-art performance on multiple datasets, they still fail to recognize discontinuous entities. Recently, numerous studies have focused on sequence generation models and have achieved significant progress. Phan et al. [16] proposed a method named NER2QUES that uses NER types to automatically generate questions for a low-resource language such as Vietnamese. Cui et al. [17] considered NER as a sequence-to-sequence language model problem and presented a template-based NER method using BART. Yan et al. [18] utilized a novel and simple Seq2Seq framework with the pointer mechanism to generate the entity sequence directly. Though this framework is capable of handling both overlapped and discontinuous NER tasks, the categories which partially match with entities are not effectively utilized.
After carefully rethinking the common characteristics of all three types of NER, we find that the bottleneck of sequence-generation-based unified NER lies mainly in the lack of prior information in the generation phase. Inspired by work on MRC and query-focused summarization (QFS) [19], we propose a sequence-to-sequence model that is capable of recognizing both overlapped and discontinuous entities based on mention relevance attention (MRA). The model uses the machine reading comprehension (MRC) formulation to encode significant prior information about the entity category, then passes the input through a state-of-the-art QA model [20] to predict the mention relevance of the given source sentence to the query, and finally incorporates the mention relevance into the BART-based generation model. We conducted experiments on three types of NER datasets to show the generality of our model. The experimental results demonstrate that our model can effectively recognize not only flat NER but also overlapped and discontinuous NER. Our contributions in this work can be summarized as follows:
-
We propose a novel sequence-to-sequence model that is capable of recognizing both overlapped and discontinuous entities at the same time.
-
We propose an effective method to calculate the mention relevance score of the given source sentence to the query and incorporate it into the encoder-decoder attention of pre-trained language models, which provides more prior information. To the best of our knowledge, we are the first to propose a mention relevance score calculation method for the NER task.
-
We conducted experiments on three types of NER datasets. The results show that our method can effectively recognize overlapped and discontinuous entities and gives competitive results compared with other state-of-the-art approaches on multiple datasets.
2 Related works
Named entity recognition plays a key role in natural language processing (NLP). The task enables a computer to detect mentions of real-world entities in a given text and classify them into predefined types. There are three common families of named entity recognition methods: sequence labeling, span-based classification and Seq2Seq-based methods.
2.1 Sequence labeling methods
In the field of natural language processing, named entity recognition is usually treated as a sequence labeling task [3,4,5, 14, 21, 22], which assigns a predefined label to each token and decodes entities from the labeled sequence. With well-designed features, machine learning algorithms such as hidden Markov models (HMMs) [23] and conditional random fields (CRFs) [5] can achieve excellent performance. Recently, the development of pre-trained language models has brought research in the field of NLP to a new level. Contextualized word embeddings such as BERT [24] and ELMo [25] further enhanced the performance of NER, yielding state-of-the-art results [22, 25]. However, overlapped entities were already noticed back in 2003 [26]. Straková et al. [27] tried to apply the sequence labeling method to recognize overlapped entities by concatenating labels, but prediction becomes difficult because the number of labels grows exponentially.
2.2 Span-based classification
In order to identify overlapped and discontinuous entities in text, many scholars abandoned the sequence labeling method and turned to span-level classification. These methods enumerate all possible spans and determine whether they are valid mentions and, if so, their types. Li et al. [13] traversed all possible text spans to recognize entity fragments and then performed relation classification to judge whether a given pair of entity fragments overlaps or is in succession. Wang et al. [11] decomposed the extraction of discontinuous entities into subtasks; two neural components are designed for these subtasks and learned jointly using a shared text encoder. Li et al. [14] regard NER as an MRC task and extract entities as answer spans. Luan et al. [10] proposed a framework, DyGIE, for overlapped entity recognition that uses a dynamically constructed span graph to share span representations. However, span-based classification methods usually focus on identifying the boundary information of entities, and due to their exhaustive enumeration, they are less capable of handling long entity spans.
2.3 Sequence generation methods
Gillick et al. [28] proposed a sequence-to-sequence model that predicts the entity’s start, span length and label for the NER task. Straková et al. [27] proposed a Seq2Seq model for overlapped NER, where the decoder predicts labels one by one for each token until it outputs the “\(<eow>\)” (end of word) label and moves to the next token. Yan et al. [18] utilized a novel and simple Seq2Seq framework with the pointer mechanism to generate the entity sequence directly. Inspired by Yan et al., Chen et al. [29] proposed a lightweight tuning paradigm for few-shot NER via pluggable prompting (LightNER). Cui et al. [17] treated NER as a language model ranking problem in a Seq2Seq framework and proposed a template-based method, where the target sequence consists of templates filled with candidate named entities. Fei et al. [30] presented a model for irregular (e.g., overlapped or discontinuous) NER based on pointer networks, where the pointer simultaneously decides whether the token at each decoding frame constitutes an entity mention and where the next constituent token is.
2.4 Other methods
Different from the above approaches, Dai et al. proposed a transition-based neural model [31] for discontinuous NER. Wang et al. proposed a hypergraph model [32] and utilized deep neural networks to enhance it. Li et al. [33] proposed W2NER, which simultaneously solves the three named entity recognition subtasks by classifying the relationships between words.
Seq2Seq-based methods have achieved the best results and are currently the most popular methods in the NER community. However, existing methods pay too much attention to the boundary information of entities while ignoring the deep semantic relationship between entities and labels. In this paper, we incorporate mention relevance attention into a BART-based generation model. To the best of our knowledge, we are the first to leverage mention relevance in an MRC-based generative NER task. In addition, our approach can effectively recognize both overlapped and discontinuous entities.
3 Unified NER framework
In this part, we first give the formal definition of generative named entity recognition. Then we introduce the method of generating mention relevance scores and present our approach to incorporating the mention relevance attention into the Transformer-based model, as shown in Fig. 2. Finally, we present the overall framework of our proposed model in Fig. 3 and introduce the details of the model.
3.1 Task formalization
Given an input sequence \(X = \{ x_{1}, x_{2},..., x_{n} \}\), where n denotes the length of the sequence, we follow Li et al. [14] and associate each tag \(t \in T\) with a natural language query \(q = \{ q_{1}, q_{2},..., q_{m} \} \in Q\), where m is the length of the query, T is a predefined list of all possible tag types and Q is the set of natural language queries associated with the tags in T. Our goal is to find every entity of tag t according to query q. In order to formulate all three kinds of named entities, we define the output sequence as \(Y = \{s_{11}, e_{11},..., s_{1j}, e_{1j}, t,..., s_{i1}, e_{i1},..., s_{ij}, e_{ij}, t\}\), where s and e are the start and end indexes of a span. Since an entity may contain one or more spans, each entity is represented as \([s_{i1}, e_{i1},..., s_{ij}, e_{ij}]\).
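To make the formalization concrete, the following minimal sketch (our own illustration, not the authors' code) builds the target index sequence Y from a set of entity annotations; encoding a tag as an index offset by the sentence length follows the pointer mechanism of Sect. 3.3 and is otherwise an assumption.

```python
from typing import List, Tuple

def build_target_sequence(entities: List[Tuple[List[Tuple[int, int]], int]],
                          sentence_len: int) -> List[int]:
    """entities: list of (spans, tag_id); spans is a list of (start, end)
    token indexes, so a discontinuous entity simply has several spans.
    Tag indexes are offset by the sentence length (an illustrative choice)."""
    target = []
    for spans, tag_id in entities:
        for start, end in spans:              # emit s_ij, e_ij for every span
            target.extend([start, end])
        target.append(sentence_len + tag_id)  # close the entity with its tag index
    return target

# "have muscle pain and fatigue" (5 tokens): "muscle pain" is a flat ADE,
# "muscle ... fatigue" is a discontinuous ADE overlapping it on "muscle".
print(build_target_sequence([([(1, 2)], 0), ([(1, 1), (4, 4)], 0)], sentence_len=5))
# -> [1, 2, 5, 1, 1, 4, 4, 5]
```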
3.2 Mention relevance attention
In this section, we describe our approach to calculating the mention relevance score of the given source sentence to the query. BERT-MRC [14] explored different query construction methods and found that Annotation Guideline Notes achieved the highest F1-measure. Therefore, we follow them and generate the query using Annotation Guideline Notes. For example, the query for the tag ORG is “find organizations including companies, agencies and institutions.” The original sentence is concatenated with the query to form the input sequence. Then, we followed [20] to pretrain a QA model on multiple NER datasets and generate the mention relevance score with this model. Given a sentence that contains n tokens and a query that contains m tokens, the model outputs a distribution \(s \in (0, 1)\) for each word’s probability of being the start token of a mention and a distribution \(e \in (0, 1)\) for its probability of being the end token of a mention. To generate the mention relevance score r for each token, we average the two distributions:

\[ r = \frac{s + e}{2} \]
where \(r \in (0, 1)\).
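A minimal sketch of this scoring step (our own illustration, assuming the QA model exposes per-token start and end probabilities):

```python
import torch

def mention_relevance_scores(start_probs: torch.Tensor,
                             end_probs: torch.Tensor) -> torch.Tensor:
    """start_probs, end_probs: shape (n,), probabilities from the pretrained
    QA model that each sentence token starts / ends a mention of the queried
    type. The relevance r for each token is the average of the two."""
    return (start_probs + end_probs) / 2.0

# e.g. a 5-token sentence queried for adverse drug events
r = mention_relevance_scores(torch.tensor([0.1, 0.8, 0.3, 0.05, 0.7]),
                             torch.tensor([0.1, 0.2, 0.9, 0.05, 0.7]))
print(r)  # tensor([0.1000, 0.5000, 0.6000, 0.0500, 0.7000]), all in (0, 1)
```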
The sequence-to-sequence Transformer model [34] was introduced in 2017 by a team at Google Brain and is increasingly the model of choice for NLP problems. The Transformer contains three attention mechanisms, namely encoder self-attention, masked decoder self-attention and encoder-decoder attention. Its core component is the scaled dot-product attention:

\[ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left( \frac{QK^{\top }}{\sqrt{d}}\right) V \]
where d is the dimension of the key matrix K. Attention was first proposed by Bahdanau et al. [35] for neural machine translation. The mechanism is particularly useful for machine translation because the most relevant words for the output often occur at similar positions in the input sequence. The Transformer encoder layer consists of self-attention layers, where all keys, values and queries come from the original input sentence. This lets every token in the input attend to all other tokens. The Transformer decoder layer differs from the encoder layer in that it has an additional encoder-decoder attention layer. In the encoder-decoder attention layer, queries come from the decoder’s self-attention layer, and all keys and values come from the output of the encoder layer. This allows every generated token to attend not only to tokens in the input sequence but also to tokens in the generated output sequence.
In this paper, we propose a mention relevance attention (MRA) mechanism to incorporate the token-level mention relevance score into the Transformer decoder. Given an input sentence with n tokens, we generate a mention sequence with a maximum length of t tokens. Let \(x^{l} \in \mathbb {R}^{n \times d}\) denote the output of the l-th Transformer encoder layer and \(y^{l} \in \mathbb {R}^{t \times d}\) denote the output of the l-th Transformer decoder layer’s self-attention layer. The encoder-decoder attention \(\alpha ^{l} \in \mathbb {R}^{t \times n}\) can be computed as:

\[ \alpha ^{l} = \operatorname{softmax}\left( \frac{(y^{l} W_{Q})(x^{l} W_{K})^{\top }}{\sqrt{d_{k}}} + A_{ar}\right) \]
where \(W_{Q}\) and \(W_{K} \in \mathbb {R}^{d_{k} \times d_{k}}\) are parameter weights and \(A_{a r} \in \mathbb {R}^{t \times n}\) is our mention relevance score matrix. Since the original mention relevance score is an n-dimensional vector, we repeat it t times to generate a t-by-n attention matrix, which means our mention relevance attention is the same for every generated token.
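The sketch below (a single-head, unbatched illustration under our assumptions) shows one way to realize this: the n-dimensional relevance vector is expanded to a t-by-n matrix and added to the attention logits before the softmax; the exact point at which the relevance is injected is our assumption, not a detail spelled out above.

```python
import math
import torch
import torch.nn.functional as F

def mra_cross_attention(y_l: torch.Tensor,   # (t, d) decoder self-attention output
                        x_l: torch.Tensor,   # (n, d) encoder layer output
                        w_q: torch.Tensor,   # (d, d) query projection weights
                        w_k: torch.Tensor,   # (d, d) key projection weights
                        r: torch.Tensor      # (n,)  token-level mention relevance
                        ) -> torch.Tensor:
    """Encoder-decoder attention augmented with mention relevance (sketch).
    The relevance vector is repeated t times so every generated token sees
    the same bias, and is added to the scaled dot-product logits."""
    d_k = x_l.size(-1)
    logits = (y_l @ w_q) @ (x_l @ w_k).transpose(0, 1) / math.sqrt(d_k)  # (t, n)
    a_mr = r.unsqueeze(0).expand(y_l.size(0), -1)                        # (t, n)
    return F.softmax(logits + a_mr, dim=-1)                              # attention weights

# toy shapes: 3 generated tokens, 5 input tokens, hidden size 8
t, n, d = 3, 5, 8
alpha = mra_cross_attention(torch.randn(t, d), torch.randn(n, d),
                            torch.randn(d, d), torch.randn(d, d),
                            torch.tensor([0.1, 0.5, 0.6, 0.05, 0.7]))
print(alpha.shape)  # torch.Size([3, 5])
```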
3.3 MRA-based sequence-to-sequence NER
We consider NER as a natural language generation (NLG) task under the sequence-to-sequence framework. Generative pre-trained models have shown remarkable performance in NLG, including text summarization. We choose to combine our mention relevance attention with BART, a denoising autoencoder built as a Seq2Seq model, for two reasons: (1) [17, 18] demonstrated the effectiveness of BART on NER tasks and achieved state-of-the-art performance on several irregular NER datasets; (2) BART follows the standard Transformer encoder-decoder architecture, so we can easily add the mention relevance as explicit attention to the encoder-decoder attention layers. We incorporate the same mention relevance attention into all Transformer decoder layers. The overall architecture is detailed in Fig. 3.
Given a sentence X and the query Q, in order to enable the model to capture the relevance between each token and the query, the input sequence is formatted in the following way:

\[ \tilde{X} = \{\langle s \rangle , x_{1}, \ldots , x_{n}, \langle /s \rangle , q_{1}, \ldots , q_{m}, \langle /s \rangle \} \]
Then, the input sequence is fed into the BART encoder, a Transformer-based model, to obtain word representations. The mention relevance score for the sentence is incorporated through the mention relevance attention. Afterwards, the decoder uses the pointer mechanism to generate indexes of the original sentence and tags.
Since we formulate the NER task in a generative way, we can view it as the following equation:

\[ P(Y \mid X) = \prod _{t=1}^{|Y|} P\left( y_{t} \mid X, y_{0}, y_{1}, \ldots , y_{t-1}\right) \]
where \(y_0\) is the special “start of sentence” control token. We use the Seq2Seq framework with the pointer mechanism to tackle this task. Therefore, our model consists of two components:
3.3.1 Encoder
In our approach, we concatenate the external query Q to the end of the input sentence X to form the input sequence \(\tilde{X}\). Each word \(x_i\left( 1 \le i \le n+m+3\right)\) is represented by adding a word embedding \(x_i^w\) and a position embedding \(x_i^p\). The encoder then encodes the input sequence into vectors \(H^e\), formulated as follows:

\[ \textbf{H}^{e} = \operatorname{Encoder}\left( \tilde{X}\right) \]
where \(\textbf{H}^{e} \in \mathbb {R}^{n \times d}\), and d is the hidden dimension. In the encoder, bidirectional attention layers are used to enable interaction between every pair of tokens and produce the encoding for the context.
3.3.2 Decoder
The generation process models the conditional probability of selecting a new token given the previous tokens and the input to the encoder. The decoder produces the index probability distribution for each step, \(P_{t}=P\left( y_{t} \mid X, Y_{<t}\right)\). However, since \(Y_{<t}\) contains pointer and tag indexes, it cannot be input to the decoder directly. We follow [18] and use the Index2Token conversion to convert indexes into tokens:

\[ \hat{y}_{t}= \left\{ \begin{array}{ll} X_{y_{t}}, &{} \text {if } y_{t} \le n \ (y_{t} \text { is a pointer index})\\ G_{y_{t}-n}, &{} \text {if } y_{t} > n \ (y_{t} \text { is a tag index}) \end{array}\right. \]
where \(G = [g_1, g_2,..., g_l]\) is the set of entity categories (such as “Person” and “Organization”), whose tokens are the answer words corresponding to the entity categories. After converting each \(y_t\) this way, we can get the last hidden state \(\textbf{h}_{t}^{d} \in \mathbb {R}^{d}\) with \(\hat{Y}_{<t}=\left[ \hat{y}_{1}, \ldots , \hat{y}_{t-1}\right]\) as follows:

\[ \textbf{h}_{t}^{d} = \operatorname{Decoder}\left( \textbf{H}^{e}; \hat{Y}_{<t}\right) \]
Then, we use a pointer-generator mechanism to obtain the index probability distribution \(P_t\). Our pointer-generator mechanism follows the pointer-generator network proposed by [36], as it allows both copying words via pointing and generating words from a fixed tag list. We define a hyper-parameter \(\alpha \in \mathbb {R}\) to balance copying a word from the input sequence, via the encoder distribution, against generating a word from the tag list. We obtain the following probability distribution over the extended vocabulary:

\[ \begin{aligned} \textbf{E}^{e}&= \operatorname{TokenEmbed}(X) \\ \bar{\textbf{H}}^{e}&= \alpha \textbf{H}^{e} + (1-\alpha ) \textbf{E}^{e} \\ \textbf{G}^{d}&= \operatorname{TokenEmbed}(G) \\ P_{t}&= \operatorname{softmax}\left( \left[ \bar{\textbf{H}}^{e} \otimes \textbf{h}_{t}^{d};\ \textbf{G}^{d} \otimes \textbf{h}_{t}^{d}\right] \right) \end{aligned} \]
where TokenEmbed denotes the embeddings shared between the encoder and decoder; \(\textbf{H}^{e} \in\) \(\mathbb {R}^{n \times d}\); \(\textbf{G}^{d} \in \mathbb {R}^{l \times d};\) \([\cdot ; \cdot ]\) means concatenation in the first dimension; and \(\otimes\) means the dot product. During the training phase, we use the negative log-likelihood loss and teacher forcing. During inference, we generate the target sequence in an autoregressive manner.
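A compact sketch of one decoding step under these equations (our own unbatched illustration; the blending weight alpha and the exact score combination follow our reading of the formulas above and should be treated as assumptions):

```python
import torch
import torch.nn.functional as F

def pointer_step(h_t: torch.Tensor,     # (d,)   decoder hidden state h_t^d
                 h_enc: torch.Tensor,   # (n, d) encoder outputs H^e
                 e_enc: torch.Tensor,   # (n, d) token embeddings E^e of the sentence
                 g_emb: torch.Tensor,   # (l, d) tag embeddings G^d
                 alpha: float = 0.5) -> torch.Tensor:
    """Returns P_t over n pointer indexes (copy) followed by l tag indexes
    (generate): scores are dot products against h_t, concatenated and
    normalized jointly with a single softmax."""
    h_bar = alpha * h_enc + (1.0 - alpha) * e_enc   # blend contextual and embedding views
    copy_scores = h_bar @ h_t                       # (n,)
    tag_scores = g_emb @ h_t                        # (l,)
    return F.softmax(torch.cat([copy_scores, tag_scores]), dim=0)

# toy shapes: 5 sentence tokens, 4 tag words, hidden size 8
p_t = pointer_step(torch.randn(8), torch.randn(5, 8),
                   torch.randn(5, 8), torch.randn(4, 8))
print(p_t.shape)  # torch.Size([9]), a distribution summing to 1
```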
4 Experiments
4.1 Datasets
We show the statistics of the datasets in Tables 1, 2, 3.
Flat NER Datasets For flat NER, we conduct experiments on two English datasets, CoNLL-2003 and OntoNotes 5.0. CoNLL-2003 is one of the most classic named entity recognition datasets. It consists of Reuters news stories from August 1996 to August 1997, annotated with place names, person names, organization names and other entities. In this paper, the training set and validation set are combined to train the proposed model. The OntoNotes 5.0 dataset, a collaborative project between BBN Technologies, the University of Colorado, the University of Pennsylvania and the Information Sciences Institute at the University of Southern California, annotates a large corpus of various types of text in three languages (English, Chinese, and Arabic), containing structural information and shallow semantics. We use the English portion and remove the parts that do not contain entities.
Overlapped NER Datasets For overlapped NER, we conduct experiments on the ACE 2004, ACE 2005 and GENIA corpora. The ACE 2004 and ACE 2005 datasets contain a full set of English, Arabic and Chinese training data for the evaluation of automatic content extraction (ACE) techniques. The corpora consist of various types of data, annotated with entities and relations. They mainly contain 7 entity types, and sentences containing overlapped named entities account for about 30%. For ACE 2004 and ACE 2005, we use the same data split, with a train/development/test ratio of 8:1:1. For the GENIA dataset, we use GENIA corpus 3.02p. Following the protocols in [37], we use five types of entities and split the train/dev/test sets as 8.1:0.9:1.0.
Discontinuous NER Datasets For discontinuous NER, we conduct experiments on three benchmark datasets from the biomedical domain: the CADEC, ShARe13 and ShARe14 corpora. CADEC is derived from AskaPatient, a forum where patients can discuss their medication experiences. The entity types include adverse drug events (ADEs), diseases and symptoms. Since only the ADE annotations contain discontinuous entities, only these entities are considered in this paper, which also allows us to directly compare our results with those of previous models. ShARe13 and ShARe14 focus on disease identification in clinical records, including discharge summaries, ECGs, echocardiograms and radiology reports. Although the three datasets come from similar domains, the nature of CADEC is very different from that of the ShARe datasets. In general, laypeople (i.e., CADEC) tend to use idioms to describe their feelings, while professional practitioners (i.e., ShARe) tend to use concise terms to communicate effectively. This also leads to different characteristics of discontinuous mentions across these datasets.
4.2 Baseline methods and evaluation metrics
4.2.1 Baseline methods
In order to validate whether the MRA-based generative named entity recognition model can extract entities, we compare the proposed model with several baseline models.
Sequence Labeling Models Traditional sequence labeling models, which assign a predefined label to each token, are usually used to identify flat entities. We selected typical BIO/BIOES benchmark models, such as LSTM-CRF [3] and CNN-BiLSTM-CRF [5].
Span-based Models These models enumerate all possible spans and determine whether they are valid mentions and their types. Their variants can solve the problems of overlapped and discontinuous entity identification. The span-based BERT model [13] performs relation classification to judge whether a given pair of entity spans overlaps or is in succession. MRC-BERT [14] formulates the NER task as a machine reading comprehension task. Biaffine-BERT [38] scores all spans in terms of pairs of start and end tokens in a sentence using a biaffine model.
Sequence Generation Models These models generate entity sequences at the decoder side. BART-NER [18] formulates the NER subtasks as an entity span sequence generation task. Template-based NER [17] treats NER as a language model ranking problem in a sequence-to-sequence framework. The pointer network model [30] employs Seq2Seq with a pointer network for discontinuous NER.
Other Models These models are different from the methods above, such as transition-based [31], hyper-graph [32] and W2NER [33] methods.
4.2.2 Evaluation metrics
In terms of evaluation metrics, we adopt precision (P), recall (R) and F1-measure (F1), as in prior works [11, 14, 18]. A predicted entity is counted as a true positive if its boundary and type match those of a gold entity. For a discontinuous entity, each span should match a span of the gold entity. The calculation formulas are as follows:

\[ P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad F_{1} = \frac{2 \times P \times R}{P+R} \]
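The sketch below (our own illustration) implements this matching criterion at the span level: an entity is a type plus a set of (start, end) spans, and a discontinuous prediction counts as a true positive only if its type and every span match a gold entity exactly.

```python
from typing import FrozenSet, Set, Tuple

Entity = Tuple[str, FrozenSet[Tuple[int, int]]]  # (type, set of (start, end) spans)

def span_f1(pred: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    tp = len(pred & gold)                          # exact match of type and all spans
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("ADE", frozenset({(1, 2)})), ("ADE", frozenset({(1, 1), (4, 4)}))}
pred = {("ADE", frozenset({(1, 2)})), ("ADE", frozenset({(4, 4)}))}  # misses one span
print(span_f1(pred, gold))  # (0.5, 0.5, 0.5)
```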
4.3 Implementation details
For all the experiments, we use the BART-large version to implement our models; its encoder and decoder each have 12 layers. The network parameters are optimized by AdamW [39] with a learning rate of 1e-5. The batch size is fixed to 16. All the hyper-parameters are tuned on the dev set. We run our experiments on an NVIDIA GeForce RTX 3090 GPU for at most 50 epochs and choose the model with the best performance on the dev set to output results on the test set. We report the test score of the run with the median dev score among 5 randomly initialized runs. We report the span-level F1.
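For reference, the hyper-parameters reported above can be collected as follows (the dictionary keys and the HuggingFace-style model name are illustrative, not taken from the authors' code):

```python
# Training configuration reported in Sect. 4.3 (keys are illustrative).
config = {
    "pretrained_model": "facebook/bart-large",  # 12 encoder + 12 decoder layers
    "optimizer": "AdamW",                       # [39]
    "learning_rate": 1e-5,
    "batch_size": 16,
    "max_epochs": 50,
    "num_random_seeds": 5,       # report the run with the median dev score
    "model_selection": "best dev F1",
    "device": "NVIDIA GeForce RTX 3090",
}
```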
4.4 Main results
In this section, we illustrate the performance of the proposed model on the main datasets. The best model on the development set is used to evaluate the test set. In the tables, we mark the best result for each dataset in bold.
4.4.1 Results for flat NER
Table 4 shows the results for the flat NER datasets. Flat entity recognition is the most classic NER task, and the compared methods cover three different technical routes. As seen, Huang et al. [3], Ma et al. [5] and Li et al. [14] adopted BIOES-based end-to-end models. These sequence labeling models do not perform better than the recently proposed sequence generation model [18]. In contrast, on the CoNLL-2003 and English OntoNotes 5.0 datasets, our model achieves the best performance, with 94.95% and 91.34% \(F_1\), respectively. In particular, our model outperforms the previous unified NER framework of Yan et al. [18] by +1.71% on the CoNLL-2003 dataset, which demonstrates that MRA plays an important role in our model.
4.4.2 Results for overlapped NER
Table 5 presents the results for the overlapped NER datasets. The BERT-MRC result comes from Li et al. [14] and the W2NER result comes from our re-implementation using their code. As seen, our model outperforms the previous works, including sequence labeling models [14], span-based models [13] and sequence generation models [18], and achieves SOTA \(F_1\) performance, with +0.89% and +0.07% on the ACE 2004 and ACE 2005 datasets, respectively. It also achieves a competitive \(F_1\) on GENIA.
4.4.3 Results for discontinuous NER
We evaluate our model on three discontinuous NER datasets. Table 6 presents the comparisons between our model and other baselines. As shown in Table 3, only around 10% of mentions are discontinuous in all datasets, which is far fewer than the continuous entity mentions. Therefore, we report the results on sentences that include at least one discontinuous mention. The results show that our model outperforms the previous best model [33] by +0.08%, +0.11% and +0.35% on CADEC, ShARe13 and ShARe14, respectively. This demonstrates that our model again beats the baseline models in terms of \(F_1\).
4.5 Ablation study
4.5.1 Effectiveness of mention relevance attention
In this work, we followed [14] and concatenated the external query at the end of the input sentence. This means that the datasets are expanded by a factor of n, where n is the number of entity categories. This produces many negative examples, i.e., sentences in which no mention corresponds to the query. To show the effectiveness of the prior knowledge and mention relevance attention (MRA), we ablate each part of our model on the CoNLL-2003, ACE 2004 and CADEC datasets.
The results in Table 7 show that the external knowledge and MRA help to improve the F1. We find that: (1) The proportion of negative examples greatly affects the training results. When the proportion of negative samples is too high, the decoder tends to predict negative samples, which means that the recall of negative samples is high while the recall of positive samples is low. With an appropriate proportion of negative examples, the external query context improves the \(F_1\) measure by 0.94%, 0.19% and 0.06% on CoNLL-2003, ACE 2004 and CADEC, respectively. This is because it provides prior knowledge to the BART encoder. (2) On top of a suitable proportion of negative examples, MRA improves the \(F_1\) measure by 0.26%, 1.23% and 0.76% on CoNLL-2003, ACE 2004 and CADEC, respectively. This is because it encodes the correlation between the original input sentence and the tags, which helps detect entity words more accurately. We plot an example of the mention relevance scores in Fig. 4. As can be seen, the relevance between the query and each word of the given sentence is captured by the mention relevance attention.
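A sketch of how such a negative-sampling step might look (our own illustration; the text only states that an "appropriate proportion" of negatives is kept, so the ratio below is a hypothetical value):

```python
import random
from typing import Dict, List

def downsample_negatives(examples: List[Dict], neg_ratio: float = 0.3,
                         seed: int = 42) -> List[Dict]:
    """Keep every positive (sentence, query) pair and only a fraction of the
    negatives, i.e. pairs whose query has no matching mention in the sentence."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex["entities"]]
    negatives = [ex for ex in examples if not ex["entities"]]
    kept = rng.sample(negatives, k=int(len(negatives) * neg_ratio))
    return positives + kept
```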
4.5.2 Inference efficiency
The sequence-to-sequence architecture is generation-based. In the inference step, we use beam search to increase performance, which unfortunately suffers from a potential decoding efficiency problem. In this section, we compare the training and inference time of our proposed model with other baseline models, including sequence labeling, span-based and sequence generation models. We use the BART-base version and measure the seconds needed to iterate one epoch (one epoch means iterating over the whole training set) and the seconds needed to evaluate the development set. The comparison is presented in Table 8.
As can be seen, compared to the sequence labeling method, during the evaluation phase we have to generate tokens autoregressively, which makes inference slow. Therefore, further work such as using non-autoregressive methods can be studied to speed up decoding [40].
5 Conclusion and future work
In this paper, we reformalize the NER task as a sequence generation problem. This formalization comes with three key advantages: (1) it is capable of addressing overlapped or discontinuous entities; (2) the query encodes significant prior knowledge about the entity category to extract; (3) the mention relevance score relates each token in the sentence to the query, which enhances decoding accuracy. The proposed method obtains SOTA results on both overlapped and discontinuous NER datasets, which indicates its effectiveness. In the future, we would like to improve the performance by exploring variants of the model architecture, including improving inference efficiency and addressing the challenges of Chinese NER. In addition, the proposed approach can be extended to other NLP tasks such as relation extraction and event extraction.
Data availability
The authors declare that the datasets analyzed during this study were derived from public domain resources. They are available within the article.
References
Sang EFTK, Meulder FD (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. In: Proceedings of the seventh conference on natural language learning. https://aclanthology.org/W03-0419/
Alex B, Haddow B, Grover C (2007) Recognising nested named entities in biomedical text. In: Biological translational and clinical language processing BioNLP@ACL, pp 65–72. https://aclanthology.org/W07-1009/
Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. CoRR arXiv:abs/1508.01991
Chiu JPC, Nichols E (2016) Named entity recognition with bidirectional lstm-cnns. Trans Assoc Comput Linguist 4:357–370
Ma X, Hovy EH (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. In: Proceedings of the 54th annual meeting of the association for computational linguistics. https://doi.org/10.18653/v1/p16-1101
Ju M, Miwa M, Ananiadou S (2018) A neural layered model for nested named entity recognition. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1 (Long Papers), pp 1446–1459. https://doi.org/10.18653/v1/N18-1131
Straková J, Straka M, Hajič J (2019) Neural architectures for nested ner through linearization. arXiv preprint arXiv:1908.06926
Wang J, Shou L, Chen K, Chen G (2020) Pyramid: a layered model for nested named entity recognition. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 5918–5928
Lu W, Roth D (2015) Joint mention extraction and classification with mention hypergraphs. In: Conference on empirical methods in natural language processing
Luan Y, Wadden D, He L, Shah A, Ostendorf M, Hajishirzi H (2019) A general framework for information extraction using dynamic span graphs. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 3036–3046. https://doi.org/10.18653/v1/n19-1308
Wang B, Lu W (2019) Combining spans into entities: a neural two-stage approach for recognizing discontiguous entities. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, pp 6215–6223. https://doi.org/10.18653/v1/D19-1644
Yu J, Bohnet B, Poesio M (2020) Named entity recognition as dependency parsing. In: Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, Online, 5–10 July 2020, pp 6470–6476. https://doi.org/10.18653/v1/2020.acl-main.577
Li F, Lin Z, Zhang M, Ji D (2021) A span-based model for joint overlapped and discontinuous named entity recognition. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, pp 4814–4828. https://doi.org/10.18653/v1/2021.acl-long.372
Li X, Feng J, Meng Y, Han Q, Wu F, Li J (2020) A unified MRC framework for named entity recognition. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 5849–5859. https://doi.org/10.18653/v1/2020.acl-main.519
Li X, Sun X, Meng Y, Liang J, Wu F, Li J (2020) Dice loss for data-imbalanced NLP tasks. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 465–476. https://doi.org/10.18653/v1/2020.acl-main.45
Phan TH, Do P (2021) Ner2ques: combining named entity recognition and sequence to sequence to automatically generating vietnamese questions. Neural Comput Appl 1–20
Cui L, Wu Y, Liu J, Yang S, Zhang Y (2021) Template-based named entity recognition using BART. In: Findings of the association for computational linguistics: ACL/IJCNLP 2021, vol ACL/IJCNLP 2021, pp 1835–1845. https://doi.org/10.18653/v1/2021.findings-acl.161
Yan H, Gui T, Dai J, Guo Q, Zhang Z, Qiu X (2021) A unified generative framework for various NER subtasks. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing. https://doi.org/10.18653/v1/2021.acl-long.451
Su D, Yu T, Fung P (2021) Improve query focused abstractive summarization by incorporating answer relevance. In: Findings of the association for computationa linguistics: ACL/IJCNLP 2021, pp 3124–3131. https://doi.org/10.18653/v1/2021.findings-acl.275
Su D, Xu Y, Winata G.I, Xu P, Kim H, Liu Z, Fung P (2019) Generalizing question answering system with pre-trained language model fine-tuning. In: Proceedings of the 2nd workshop on machine reading for question answering, pp 203–211. https://doi.org/10.18653/v1/D19-5827
Liu L, Shang J, Ren X, Xu FF, Gui H, Peng J, Han J (2018) Empower sequence labeling with task-aware neural language model. In: Proceedings of the thirty-second AAAI conference on artificial intelligence, pp 5253–5260. AAAI Press. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17123
Sapci AOB, Tastan Ö, Yeniterzi R (2021) Focusing on possible named entities in active named entity label acquisition. CoRR arXiv:abs/2111.03837
Zhou G, Su J (2002) Named entity recognition using an hmm-based chunk tagger. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 473–480. https://aclanthology.org/P02-1060/
Devlin J, Chang M, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 4171–4186. https://doi.org/10.18653/v1/n19-1423
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 2227–2237. https://doi.org/10.18653/v1/n18-1202
Kim J, Ohta T, Tateisi Y, Tsujii J (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. In: Proceedings of the eleventh international conference on intelligent systems for molecular biology, pp 180–182. http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i180?etoc
Straková J, Straka M, Hajic J (2019) Neural architectures for nested NER through linearization. In: Proceedings of the 57th conference of the association for computational linguistics, pp 5326–5331. https://doi.org/10.18653/v1/p19-1527
Gillick D, Brunk C, Vinyals O, Subramanya A (2016) Multilingual language processing from bytes. In: NAACL HLT 2016, The 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1296–1306. https://doi.org/10.18653/v1/n16-1155
Chen X, Li L, Deng S, Tan C, Xu C, Huang F, Si L, Chen H, Zhang N (2022) Lightner: a lightweight tuning paradigm for low-resource ner via pluggable prompting. In: Proceedings of the 29th international conference on computational linguistics, pp 2374–2387
Fei H, Ji D, Li B, Liu Y, Ren Y, Li F (2021) Rethinking boundaries: end-to-end recognition of discontinuous mentions with pointer networks. In: Proceedings of the AAAI conference on artificial intelligence, vol 35, pp 12785–12793
Dai X, Karimi S, Hachey B, Paris C (2020) An effective transition-based model for discontinuous NER. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 5860–5870. https://doi.org/10.18653/v1/2020.acl-main.520
Wang B, Lu W (2018) Neural segmental hypergraphs for overlapping mention recognition. In: Riloff E, Chiang D, Hockenmaier J, Tsujii J (eds) Proceedings of the 2018 conference on empirical methods in natural language processing, Brussels, Belgium, October 31–November 4, 2018, pp. 204–214. Association for Computational Linguistics. https://doi.org/10.18653/v1/d18-1019
Li J, Fei H, Liu J, Wu S, Zhang M, Teng C, Ji D, Li F (2021) Unified named entity recognition as word-word relation classification. CoRR arXiv:abs/2112.10070
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems 30: annual conference on neural information processing systems 2017, pp 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: 3rd International conference on learning representations. arxiv: http://arxiv.org/abs/1409.0473
See A, Liu PJ, Manning CD (2017) Get to the point: summarization with pointer-generator networks. In: Proceedings of the 55th annual meeting of the association for computational linguistics, pp 1073–1083. https://doi.org/10.18653/v1/P17-1099
Katiyar A, Cardie C (2018) Nested named entity recognition revisited. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 861–871. https://doi.org/10.18653/v1/n18-1079
Yu J, Bohnet B, Poesio M (2020) Named entity recognition as dependency parsing. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 6470–6476. https://doi.org/10.18653/v1/2020.acl-main.577
Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: 7th International conference on learning representations. https://openreview.net/forum?id=Bkg6RiCqY7
Gu J, Bradbury J, Xiong C, Li VOK, Socher R (2018) Non-autoregressive neural machine translation. In: 6th International conference on learning representations
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.