1 Introduction

Coronavirus disease 2019 (COVID-19), named by the World Health Organization, refers to pneumonia caused by the 2019 novel coronavirus (2019-nCoV) [1, 2]. As of August 5, 2021, more than 200 million confirmed cases and 4.2 million confirmed deaths had been recorded worldwide. More importantly, the numbers of newly confirmed cases and deaths remain high, and SARS-CoV-2 variants that are more transmissible and possibly more virulent have appeared worldwide [3]. With global vaccine production still insufficient, finding a fast and effective COVID-19 detection method is very important.

The gold standard for diagnosing COVID-19 is reverse transcription polymerase chain reaction (RT–PCR) [4]. However, existing data show that the sensitivity of RT–PCR to COVID-19 infection is not high [5]. As a result, many infected patients are mistakenly classified as uninfected during rapid screening, which hinders COVID-19 prevention and treatment.

As a common medical imaging tool, computed tomography (CT) is a sensitive diagnostic method for COVID-19, and chest CT is an important supplement to RT–PCR in COVID-19 diagnosis [6]. Long et al. [7] reported a CT sensitivity of 97.2%, while the initial rRT-PCR sensitivity was 83.3%. In addition, CT reveals pulmonary manifestations at different stages of infection, which assists doctors in diagnosis and treatment [8].

Although chest CT can serve as an auxiliary diagnostic tool for COVID-19, screening requires a large number of experienced radiologists, which increases their workload and makes misdiagnosis due to fatigue and other causes more likely. With successful deep learning (DL) applications in image classification [9] and natural language processing [10, 11], considerable progress has been achieved in DL-based medical image processing [12]. Deep learning is widely viewed as a crucial tool in COVID-19 diagnosis [13, 14] because it provides auxiliary diagnostic information; this setting differs markedly from other computer vision tasks such as head pose estimation [15], intelligent recommendation [16, 17], robot vision [18], and infrared imaging enhancement [19].

Diagnosing COVID-19 is difficult because the shape, size, and Hounsfield unit (Hu) values of lesions on CT images vary widely, as shown in Fig. 1. Furthermore, CT images of COVID-19 patients resemble those of other diseases (such as community-acquired pneumonia and H1N1), which also complicates diagnosis, as shown in Fig. 2. In response to these problems, we propose MA-Net, a novel network with mutex attention blocks and fusion attention blocks that achieves effective feature representation for COVID-19. We also design an adaptive weight loss function for a better diagnostic effect. The contributions of this paper are as follows:

  1. Aiming at the problem that COVID-19 CT images are similar to those of other diseases, this paper designs a multi-input network with shared parameters and a mutex attention block. The mutex inputs comprise a pair of positive and negative samples. The purpose of the mutex attention block is to amplify the difference between positive and negative samples, enabling the network to obtain more distinguishable features. It significantly improves the ability of deep networks to diagnose COVID-19.

  2. The fusion attention block is designed to fuse features along the channel dimension, selecting features between the two inputs to obtain more representative features. It further strengthens the network's COVID-19 diagnostic capability.

  3. Joint training of multiple loss functions is used in this work. An adaptive weight adjustment mechanism is proposed to automatically tune the weights of the different losses. The experimental results show that the adaptive weights effectively improve the diagnostic performance of the proposed network.

Fig. 1 COVID-19 Dataset-A CT images

Fig. 2 COVID-19 Dataset-C CT images [20]

2 Related work

Due to the COVID-19 outbreak, many researchers have proposed deep learning methods to analyze COVID-19 CT or X-ray images [21,22,23]. For COVID-19 diagnosis on X-ray images, the methods proposed in [24, 25] achieved good results. For COVID-19 diagnosis on CT images [26,27,28], many methods have been proposed and validated on different COVID-19 datasets.

2.1 COVID-19 diagnosis on CT

For private datasets, Ma et al. [29] proposed a COVID-19 diagnosis network based on a multireceptive-field attention module comprising a pyramid convolution module (PCM), a spatial attention block (SAB) and a channel attention block (CAB). The PCM obtains multireceptive-field feature sets and sends them to the SAB and CAB for feature enhancement. The method was trained and verified on the DTDB dataset provided by Beijing Ditan Hospital, Capital Medical University, which includes 40 patients infected with COVID-19 in different periods and 40 people without COVID-19 infection; 97.12% accuracy, 96.89% specificity and 97.21% sensitivity were obtained. Li et al. [30] developed a 3D deep learning framework for COVID-19 diagnosis (COVID-19, community-acquired pneumonia and non-pneumonia), referred to as COVNet, which consists of a sequence of shared-parameter ResNet50 branches. In experiments on a collected dataset containing 4352 chest CT scans from 3322 patients, the per-scan sensitivity and specificity on the independent test set were 90% and 96%, respectively. Harmon et al. [31] proposed a 3D model to differentiate COVID-19 from other clinical entities. Trained on a diverse multinational cohort of 1280 patients to localize the parietal pleura/lung parenchyma and then classify COVID-19 pneumonia, it achieved up to 90.8% accuracy, with 84% sensitivity and 93% specificity, as evaluated on an independent test set (not included in training and validation) of 1337 patients.

For public datasets, Zhao et al. [20] provided a collated COVID-19 CT dataset containing 349 CT images believed to be positive for COVID-19. They also proposed a DenseNet169-based method combining contrastive self-supervised learning (CSSL) [33] and transfer learning (TL) for COVID-19 diagnosis, using CT images with a lung mask as input. On the proposed COVID-19 CT dataset, the accuracy, F1-score and AUC were 85.0%, 85.9% and 92.8%, respectively, although only a few hundred CT images were available for training. The COVID-19 CT dataset [20] provided by Zhao et al. has been widely used. Due to its limited size, Mittal et al. [32] proposed a new clustering method for COVID-19 diagnosis in which a novel variant of the gravitational search algorithm is employed to obtain optimal clusters; a comparative analysis against recent metaheuristic algorithms validated the proposed variant. Verified on the COVID-19 CT dataset [20], it obtained an accuracy of 0.6441. Because traditional methods underperform, many researchers have focused on deep learning for COVID-19 diagnosis. He et al. [34] proposed the Self-Trans method, which integrates contrastive self-supervised learning and transfer learning to learn strong, unbiased feature representations on the limited-size dataset [20] and reduce the risk of overfitting; 86% accuracy, 85% F1-score and 94% AUC were obtained. Wang et al. [35] proposed a joint learning framework that accurately diagnoses COVID-19 by effectively learning from heterogeneous datasets with distribution differences. A powerful backbone was built by redesigning the recently proposed COVID-Net [36] in terms of network architecture and learning strategy, and a contrastive training objective was applied to enhance the domain invariance of semantic embeddings and boost classification performance on each dataset. On the COVID-19 CT dataset [20], the method achieved 78.69% accuracy, 78.83% F1-score and 85.32% AUC.

Ma et al. [29] used the pyramid convolution module to address the varying sizes and shapes of COVID-19 lesions. Zhao et al. [20], Mittal et al. [32] and He et al. [34] focused on the problem of limited COVID-19 data. Li et al. [30] and Harmon et al. [31] focused on using large datasets for the diagnosis of COVID-19 and other lung diseases. These works have achieved excellent results. However, most of them either used large-scale private datasets or evaluated commonly used deep learning models directly. In this paper, we propose a COVID-19 diagnostic method based on a mutex attention block to distinguish COVID-19 from other pulmonary diseases with limited data.

2.2 Image attention mechanism

The attention mechanism was first used in natural language processing (NLP), where it achieved state-of-the-art results. Since then, attention mechanisms in computer vision have attracted increasing research interest, and many outstanding attention modules have been proposed [37, 38]. Attention mechanisms can be broadly divided into channel attention, spatial attention, mixed attention and special attention models.

SE-Net [39], proposed in 2017, is a typical channel attention model that can be embedded in any basic network. It introduced a novel attention unit, the squeeze-and-excitation module (SE module), which adaptively recalibrates channel feature responses by explicitly modeling the interdependence between channels. Experiments show that embedding the SE module into existing basic models (such as ResNet and VGG) brings significant performance improvements to the most advanced deep architectures at small computational cost. Unlike the SE module, which only recalibrates features along the channel dimension, CBAM [40] introduces attention in both the spatial and channel dimensions. Given a set of input feature maps, the CBAM module sequentially infers attention maps along two independent dimensions (channel and spatial) and then multiplies the attention maps with the input feature maps for adaptive feature refinement.
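To make the channel recalibration described above concrete, the following is a minimal PyTorch sketch of an SE-style module; the reduction ratio of 16 is the common default from the SE-Net paper, not a choice specific to this work.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excitation channel attention (sketch of the SE module [39])."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial average
        self.fc = nn.Sequential(             # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # recalibrate each channel
```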

3 Datasets and methods

3.1 Datasets

We conduct experiments on different COVID-19 datasets to verify our method. All experiments are based on 2D slices. Each dataset is tested separately. The details of the four datasets are shown in Table 1.

Table 1 The number of COVID and NonCOVID CT images in the four COVID-19 datasets

The COVID-19 Dataset-A was provided by The Affiliated Hospital of Qingdao University Medical College and includes 21 CT scans with COVID-19 and 18 CT scans without COVID-19, all labeled by experienced doctors. We extracted 2499 positive-sample slices containing COVID-19 infection from the COVID-19 scans and randomly selected 1908 negative-sample slices from the 18 scans without COVID-19 as “NonCOVID”. We conducted slice-level and patient-level experiments. At the slice level, the dataset was divided into 3527 images for training and 880 for validation. At the patient level, slices from 16 COVID-19 scans and 14 non-COVID-19 scans were used for training, and the rest were used for validation.

The COVID-19 Dataset-B [41] was provided by Ma et al. and includes CT scans of 20 patients. The infection areas were marked by two radiologists and verified by an experienced radiologist. Slices containing COVID-19 infection were extracted as “COVID” (1844 in total); the remaining slices without COVID-19 were regarded as “NonCOVID” (1676 in total). The dataset was divided into 2816 images for training and 704 for validation.

The COVID-19 Dataset-C [20] was provided by Zhao et al. The practicability of the dataset was confirmed by a senior radiologist at Tongji Hospital in Wuhan, China, who diagnosed and treated a large number of patients with COVID-19 during the outbreak from January to April. The dataset provides 8-bit PNG images instead of DICOM files with Hu values, which results in a loss of resolution. In addition, although an original CT scan contains a series of slices, only a few key slices were selected for the dataset, which also negatively affects diagnosis. The COVID-19 Dataset-C is shown in Fig. 2. It contains 349 COVID-19 CT images and 397 CT images without COVID-19. The dataset was divided into 597 images for training and 148 for validation. Because the number of images was so small, we augmented the training data by flipping and rotation.

The COVID-19 Dataset-D [42] is a publicly available COVID-19 CT dataset containing 1252 CT scans positive for SARS-CoV-2 infection (COVID-19) and 1230 CT scans from patients who had other pulmonary diseases or were healthy, for 2482 CT scans in total. The dataset was divided into 1984 images for training and 498 for validation.

3.2 Methodology

The proposed network is based on ResNet50 [43]; its architecture is shown in Fig. 3. Similar to ResNet, the feature extraction layers consist of five mutex attention Res-Layers followed by a global pooling layer and a fully connected layer. The network is designed to take a pair of mutex CT images with opposite labels as input. During training, each mutex pair is randomly selected from the training data. Since the mutex branch does not affect the main forward path, the mutex input is not needed during testing.

Fig. 3 Architecture of the proposed network

The architecture of the mutex attention Res-Layer is shown in the blue dashed box in Fig. 3. As in ResNet, the Res-Layer in mutex attention Res-Layer0 is a 7 × 7 convolution layer with a stride of 2 followed by a max pooling layer with a stride of 2. The Res-Layers in mutex attention Res-Layer1 to Res-Layer4 are made up of different numbers of blocks connected in series. In mutex attention Res-Layer0, the inputs are a pair of mutex CT images; the inputs of mutex attention Res-Layer1 to Res-Layer4 are the output feature maps of the previous mutex attention Res-Layer. Assume the inputs of a mutex attention Res-Layer are Fi and Fm. Frei and Frem are first obtained through the same Res-Layer; the two Res-Layers share parameters, as shown in the blue dashed box of Fig. 3. Frei directly yields Fo. Fam is obtained by feeding Frei and Frem into the mutex attention block (MAB). Then, Fam and Frem are passed through the fusion attention block (FAB) to obtain the output mutex feature maps Fom. This process is described by (1) and (2), as follows:

$$ \begin{array}{@{}rcl@{}} \label {equ1} \left\{\begin{array}{cc} F_{r e i} = W_{r l} * F_{i} \\ F_{r e m} = W_{r l} * F_{m} \end{array}\right. \end{array} $$
(1)
$$ \label {equ2} \left\{\begin{array}{c} F_{o}=F_{r e i} \\ F_{o m}=F A B\left( M A B\left( F_{r e i}, F_{r e m}\right), F_{r e m}\right) \end{array}\right. $$
(2)

where Wrl denotes the Res-Layer parameters, FAB is the fusion attention block, and MAB is the mutex attention block. We introduce them in detail below.
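As a reading aid, the following PyTorch sketch shows how (1) and (2) compose one mutex attention Res-Layer. The submodules `res_layer`, `mab` and `fab` stand for components defined elsewhere (the MAB and FAB are sketched in the next subsections), and routing both inputs through the same `res_layer` realizes the parameter sharing.

```python
import torch.nn as nn

class MutexAttentionResLayer(nn.Module):
    """Sketch of one mutex attention Res-Layer implementing (1)-(2)."""
    def __init__(self, res_layer: nn.Module, mab: nn.Module, fab: nn.Module):
        super().__init__()
        self.res_layer, self.mab, self.fab = res_layer, mab, fab

    def forward(self, f_i, f_m):
        f_rei = self.res_layer(f_i)    # F_rei = W_rl * F_i
        f_rem = self.res_layer(f_m)    # F_rem = W_rl * F_m (shared parameters)
        f_o = f_rei                    # main path passes through unchanged, (2)
        f_am = self.mab(f_rei, f_rem)  # mutex attention, (3)-(5)
        f_om = self.fab(f_am, f_rem)   # fusion attention, (6)-(11)
        return f_o, f_om
```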

3.3 Mutex attention block (MAB)

The inputs of the MAB are the feature maps Frei and Frem (RC×H×W) produced by the Res-Layer, as shown in Fig. 4. First, the squared elementwise difference of Frei and Frem is computed to obtain the distance matrix D (RC×H×W):

$$ d_{c i j}=\left( a_{c i j}^{r e i}-a_{c i j}^{r e m}\right)^{2} $$
(3)

where dcij represents the value of D at position (i, j) on channel c, and \(a_{c i j}^{r e i}\) and \(a_{c i j}^{r e m}\) represent the values of Frei and Frem at (i, j) on channel c, with c ∈ [1, C], i ∈ [1, H], j ∈ [1, W]. Then D is reshaped to C × HW, softmax is performed along the spatial dimension (HW), and the result is reshaped back to RC×H×W to obtain the attention map AD, as shown in (4).

$$ A_{D}=\text{reshape}(\sigma(\text{reshape}(D), \dim=H W)) $$
(4)
Fig. 4 Mutex attention block architecture

where σ is the softmax function.

Elementwise multiplication is then performed between the attention map AD and the input Frem to obtain Fam, as shown in (5).

$$ F_{a m}=A_{D} \odot F_{r e m} $$
(5)

The purpose of the MAB is to reduce the similarity between the input feature maps and the mutex feature maps. Using the distance between mutex feature maps as an attention map effectively increases the discriminability of the two inputs, so the network achieves a better discrimination effect. In the experiments, we visualize the changes in the feature maps to verify the effect of the mutex attention block.
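A minimal PyTorch sketch of the MAB as defined by (3)-(5); tensor names follow the equations, and under this reading the block is parameter-free.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutexAttentionBlock(nn.Module):
    """Sketch of the MAB: a squared-difference map, softmax-normalized
    over the spatial dimension, reweights the mutex features F_rem."""
    def forward(self, f_rei: torch.Tensor, f_rem: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_rei.shape
        d = (f_rei - f_rem) ** 2                                       # distance matrix D, (3)
        a_d = F.softmax(d.view(b, c, h * w), dim=-1).view(b, c, h, w)  # attention map A_D, (4)
        return a_d * f_rem                                             # F_am = A_D ⊙ F_rem, (5)
```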

3.4 Fusion attention block (FAB)

The structure of the FAB is shown in Fig. 5. Its inputs are Fam, obtained from the MAB, and Frem. First, elementwise summation of the two inputs yields the mixed feature maps Fmix.

$$ F_{\text{mix}}=F_{a m}+F_{r e m} $$
(6)
Fig. 5 Fusion attention block architecture

Then, global average pooling and global max pooling are performed on the mixed feature maps, and elementwise addition yields the channel feature vector V (RC):

$$ \left\{\begin{array}{c} V_{c 1}=\frac{1}{H \times W} \cdot {\sum}_{i=1}^{H} {\sum}_{j=1}^{W} F_{\text{mix}}^{c}(i, j) \\ V_{c 2}=\max \left( F_{\text{mix}}^{c}(i, j)\right) \\ V_{c}=V_{c 1}+V_{c 2} \end{array}\right. $$
(7)

where Vc is the value on the c-th channel of the channel feature vector V. Two fully connected layers then perform feature fitting and channel dimension reduction to obtain the channel feature vector Z:

$$ Z=f_{C^{\prime}}^{B R}\left( f_{C}^{B R}(V)\right) $$
(8)

where C in \(f_{C}^{B R}\) represents the number of neurons in the fully connected layer, and BR in \(f_{C}^{B R}\) represents the batch normalization [44] and ReLU [45] layers. The obtained Z is passed through two independent fully connected layers to obtain two weight vectors M and N:

$$ \left\{\begin{array}{l} M=f_{C}(Z) \\ N=f_{C}^{\prime}(Z) \end{array}\right. $$
(9)

where C is the number of neurons in the fully connected layers, and both M and N belong to RC. Softmax is applied to the corresponding channels of M and N to obtain the attention weights A of Fam for each channel; the attention weights of Frem are 1 − A. Channelwise multiplication and addition then produce the output fusion feature maps Ffm, as follows:

$$ a_{c}=\frac{e^{M_{c}}}{e^{M_{c}}+e^{{N_{c}}}} $$
(10)
$$ F_{f m}^{c}= a_{c} \times F_{a m}^{c}+(1-a_{c}) \times F_{r e m}^{c} $$
(11)

where ac is the value of the channel weight A on the c-th channel, and \(F_{f m}^{c}\) is the feature map on the c-th channel of Ffm, c ∈ [1, C].

The function of the FAB is to fuse the two sets of feature maps: through learned parameters, the two sets of input feature maps are weighted along the channel dimension.
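The following PyTorch sketch assembles (6)-(11). The bottleneck width C′ (here `channels // reduction`, with `reduction` = 4) is our assumption, since the paper does not state it.

```python
import torch
import torch.nn as nn

class FusionAttentionBlock(nn.Module):
    """Sketch of the FAB, (6)-(11)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction  # C', an assumed bottleneck width
        self.fc_z = nn.Sequential(   # (8): two FC layers, each with BN + ReLU
            nn.Linear(channels, channels), nn.BatchNorm1d(channels), nn.ReLU(inplace=True),
            nn.Linear(channels, mid), nn.BatchNorm1d(mid), nn.ReLU(inplace=True),
        )
        self.fc_m = nn.Linear(mid, channels)  # (9): weight vector M
        self.fc_n = nn.Linear(mid, channels)  # (9): weight vector N

    def forward(self, f_am: torch.Tensor, f_rem: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_am.shape
        f_mix = f_am + f_rem                                 # (6)
        v = f_mix.mean(dim=(2, 3)) + f_mix.amax(dim=(2, 3))  # (7): GAP + GMP
        z = self.fc_z(v)
        m, n = self.fc_m(z), self.fc_n(z)
        a = torch.softmax(torch.stack([m, n]), dim=0)[0]     # (10): a_c per channel
        a = a.view(b, c, 1, 1)
        return a * f_am + (1 - a) * f_rem                    # (11)
```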

3.5 Loss functions

As shown in Fig. 3, our network contains two losses: LCE, the classification loss of the input image and the mutex input image, and LCS, a cosine similarity loss between the pair of inputs.

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{c} L_{1}=-y_{i} \log y_{i}^{\prime}-\left( 1-y_{i}\right) \log \left( 1-y_{i}^{\prime}\right) \\ L_{2}=-y_{m} \log y_{m}^{\prime}-\left( 1-y_{m}\right) \log \left( 1-y_{m}^{\prime}\right) \end{array}\right. \end{array} $$
(12)
$$ L_{CE}=L_{1} + L_{2} $$
(13)
$$ L_{CS}=\cos (\theta)= \frac{{\sum}_{i=1}^{C \times H \times W}\left( {V_{i}^{i}} \times {V_{i}^{m}}\right)}{\sqrt{{\sum}_{i=1}^{C \times H \times W}\left( {V_{i}^{i}}\right)^{2}} \times \sqrt{{\sum}_{i=1}^{C \times H \times W}\left( {V_{i}^{m}}\right)^{2}}} $$
(14)

where L1 and L2 are cross-entropy losses, yi and ym are the opposite labels of the input and mutex input, and \(y_{i}^{\prime }\), \(y_{m}^{\prime }\) are the predicted values. LCS uses the cosine distance to minimize the similarity between the two feature maps. In (14), \({V_{i}^{i}}\) and \({V_{i}^{m}}\) (RCHW) are the vectors obtained by reshaping the feature maps of the input and the mutex input.

In this paper, the adaptive weight loss is defined by (15) and (16).

$$ \begin{array}{@{}rcl@{}} \left\{\begin{array}{l} a_{1}=\frac{e^{\frac{1}{L_{C E}}}}{e^{\frac{1}{L_{C E}}}+e^{\frac{1}{L_{C S}}}} \\ a_{2}=\frac{e^{\frac{1}{L_{C S}}}}{e^{\frac{1}{L_{C E}}}+e^{\frac{1}{L_{C S}}}} \end{array}\right. \end{array} $$
(15)
$$ L=a_{1} L_{CE}+a_{2} L_{CS} $$
(16)

In this way, the losses can be adjusted adaptively to make the final loss more balanced.
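A sketch of the joint objective (12)-(16) in PyTorch. Using logits with `binary_cross_entropy_with_logits` and detaching the loss values when computing the weights are our assumptions for numerical stability; (15) also assumes both loss values are positive.

```python
import torch
import torch.nn.functional as F

def adaptive_weight_loss(logit_i, logit_m, y_i, y_m, feat_i, feat_m):
    """Joint loss of (12)-(16); feat_i/feat_m are the feature maps of the
    input and the mutex input, flattened per sample for (14)."""
    l_ce = F.binary_cross_entropy_with_logits(logit_i, y_i) \
         + F.binary_cross_entropy_with_logits(logit_m, y_m)                 # (12)-(13)
    l_cs = F.cosine_similarity(feat_i.flatten(1), feat_m.flatten(1)).mean()  # (14)
    # (15): softmax over the reciprocals of the current loss values
    w = torch.softmax(torch.stack([1.0 / l_ce.detach(), 1.0 / l_cs.detach()]), dim=0)
    return w[0] * l_ce + w[1] * l_cs                                         # (16)
```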

3.6 Inference process

In the inference process, since the mutex path has no effect on the main path, only the test images need to be input; the mutex input is not required. The prediction network is then a plain ResNet50, identical to the backbone. Therefore, the time complexity of the proposed method equals that of ResNet50 and is lower than that of SE-Net and CBAM built on ResNet50. The model size, FLOPs and speed are likewise the same as the backbone, so the proposed method adds no cost at test time.

4 Experiment

We conducted experiments on four datasets: COVID-19 dataset-A, COVID-19 dataset-B, COVID-19 dataset-C and COVID-19 dataset-D. For COVID-19 dataset-A and COVID-19 dataset-B, to verify the effectiveness of the proposed attention, we compared against state-of-the-art attention methods for image classification and the backbone. For the public COVID-19 dataset-C and COVID-19 dataset-D, we compared against excellent algorithms proposed by other researchers that achieved good results on those datasets. We present both quantitative and qualitative analyses; for the qualitative analysis, we used Grad-CAM [47] for visualization.

4.1 Experimental details

The proposed method was implemented in PyTorch [46]. All COVID-19 CT images in our experiments were resized to 224 × 224. Due to the small amount of data in COVID-19 dataset-C, we performed data augmentation including horizontal flipping, vertical flipping and random rotation. Both inputs and mutex inputs were randomly selected during training. We used SGD [47] as the optimizer with an initial learning rate of 0.01. The learning rate decays with the epoch according to:

$$ l r_{d}=\text{lr} \times \left( 0.5^{\frac{e p o c h}{30}}\right) $$
(17)

where epoch is the epoch index. Momentum was set to 0.9, the batch size was 16, and 120 epochs were run. An NVIDIA GeForce GTX 1070 GPU with 8 GB of memory was used. In the test process, only the test images needed to be input; since nothing in the forward pass depends on the mutex input at test time, it was not required.
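The schedule in (17) maps directly onto a `LambdaLR` multiplier in PyTorch; a minimal sketch follows, with a stand-in module in place of MA-Net.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(1, 1)  # stand-in for the MA-Net instance
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
# (17): lr_d = lr * 0.5**(epoch / 30), applied once per epoch
scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 0.5 ** (epoch / 30))

for epoch in range(120):
    # ... one training pass over the data loader ...
    scheduler.step()  # updates the rate used in the next epoch
```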

Training took 3.1, 2.5 and 2.1 hours on COVID-19 dataset-A, COVID-19 dataset-B and COVID-19 dataset-C, respectively. Testing took 0.032 seconds per image.

4.2 Evaluation

The metrics employed to quantitatively evaluate classification were accuracy, sensitivity, F1-score, AUC and specificity. The equation of accuracy is as follows:

$$ \text { Accuracy }=\frac{\text {TN}+\text {TP}}{\text {N}+\text {P}} $$
(18)

The sensitivity and specificity measure the classifier’s ability to identify positive samples and negative samples, respectively, as shown in (19) and (20):

$$ \text {Sensitivity} =\frac{\text {TP}}{\text {FN}+\text {TP}} $$
(19)
$$ \text {Specificity} =\frac{\text {TN}}{\text {FP}+\text {TN}} $$
(20)

The F1-score is a common measurement for classification problems; in machine learning competitions involving classification, it is often used as the final evaluation metric. It is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0.

$$ \text {Precision}=\frac{\text {TP}}{\text {TP}+\text {FP}} $$
(21)
$$ \text {Recall}=\frac{\text {TP}}{\text {TP}+\text {FN}} $$
(22)
$$ F 1=2 \times \frac{\text {Precision } \times \text { Recall}}{\text {Precision }+\text { Recall}} $$
(23)
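For reference, a small NumPy sketch computing the metrics of (18)-(23) from binary labels and predictions (1 = COVID); edge cases such as an empty class are not handled.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Metrics of (18)-(23) from binary labels/predictions (1 = COVID)."""
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp)                          # (21)
    recall = tp / (tp + fn)                             # (22), same as sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),    # (18)
        "sensitivity": tp / (tp + fn),                  # (19)
        "specificity": tn / (tn + fp),                  # (20)
        "f1": 2 * precision * recall / (precision + recall),  # (23)
    }
```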

5 Results

5.1 COVID-19 dataset-A

COVID-19 Dataset-A was provided by The Affiliated Hospital of Qingdao University Medical College; its details were introduced in Section 3.1. We used state-of-the-art attention methods in image classification and the backbone for comparison, including ResNet [43], ResNet+CBAM [40] and ResNet+SE [39]. The attention module proposed by Ma et al. [29] for COVID-19 diagnosis was also compared. We conducted slice-level and patient-level experiments to verify the effectiveness of our method. Tables 2 and 3 summarize the experimental results.

Table 2 Slice-level results of different methods on COVID-19 Dataset-A
Table 3 Patient-level results of different methods on COVID-19 Dataset-A

For the slice-level experiment, Table 2 shows that compared with the basic ResNet, our method improved accuracy, sensitivity and specificity by 3.07%, 2.41% and 3.96%, respectively. Compared with SE-Net, our method improved accuracy, sensitivity and specificity by 2.16%, 0.6% and 4.20%, respectively. Over ResNet + CBAM, improvements of 2.67%, 2.81% and 2.63% were obtained. Compared with Ma et al. [29], the proposed attention blocks obtained more satisfactory results in terms of accuracy, sensitivity and specificity. These results illustrate that, compared with widely used classification networks and attention modules, our method achieved better accuracy and the fewest false positives. Fig. 6 shows the ROC curves (left) and confusion matrices (right) of the different methods.

Fig. 6 The ROC curve (left) and confusion matrix (right) of different methods on COVID-19 dataset-A at the patient level

As seen in Table 3, first, the proposed method obtained better results at the patient-level compared to state-of-the-art methods, which is similar to the slice-level experiment. Second, there was little difference between patient-level and slice-level experimental results, which verifies the effectiveness of the proposed method.

In particular, we found that SE-Net tended to obtain higher sensitivity, while ResNet + CBAM tended to obtain higher specificity. SE-Net's probability of missed detection was therefore lower than that of ResNet + CBAM, but ResNet + CBAM produced fewer false positives. Statistical analysis showed that all p values were less than 0.001, confirming that our method differed significantly from the other classification methods.

5.2 COVID-19 dataset-B

COVID-19 dataset-B [41] was provided by Ma et al. Similar to COVID-19 dataset-A, we also used the state-of-the-art attention modules in image classification and backbone for comparison, including ResNet [43], ResNet+CBAM [40] and ResNet+SE [39]. Table 4 summarizes the experimental results.

Table 4 Experimental results of different methods on COVID-19 dataset-B

Similar to the results obtained on COVID-19 dataset-A, the proposed method achieved better accuracy, sensitivity and specificity. Compared with ResNet, ResNet + CBAM and SE-Net, the proposed method improved accuracy by 6.98%, 4.26% and 8.81%, sensitivity by 2.59%, 7.04% and 1.90%, and specificity by 12.33%, 1.20% and 16.42%, respectively. Fig. 7 shows the ROC curves (left) and confusion matrices (right) of the different methods.

Fig. 7 The ROC curve (left) and confusion matrix (right) of different methods on COVID-19 dataset-B

5.3 COVID-19 dataset-C

COVID-19 dataset-C [20] was provided by Zhao et al. It is a public dataset used for COVID-19 diagnosis, and its characteristics were explained in detail in Section 3.1. Due to its small amount of data, we compared against papers proposing different solutions; the methods of [20], [32], [34] and [35] were selected for their excellent performance. Table 5 summarizes the experimental results.

Table 5 Experimental results of different methods on COVID-19 dataset-C

Table 5 reports that compared with the method proposed in [20], our method achieves better accuracy, F1-score and AUC without the assistance of a lung mask. Compared with the method proposed in [34], our method uses random initialization and achieves better results without pretraining on other datasets; both the F1-score and AUC were significantly improved. These results confirm that our method can achieve satisfactory results without other complex techniques even when the dataset is small.

5.4 COVID-19 dataset-D

COVID-19 dataset-D [42] was provided by Soares et al. It is a public dataset used for COVID-19 diagnosis, and its characteristics were explained in detail in Section 3.1. Some researchers have achieved excellent results on this dataset; we focused on comparing with papers that propose new methods, choosing Wang et al. [35], [29] and [48] for comparison. We also used state-of-the-art attention modules in image classification for comparison. Table 6 summarizes the experimental results.

Table 6 Experimental results of different methods on COVID-19 dataset-D

The experimental results show that the proposed network achieved better results than the other methods and attention models. Compared with state-of-the-art methods, our method achieves better accuracy, F1-score and AUC, illustrating its effectiveness for COVID-19 discrimination.

5.5 Ablation experiments

To verify the effectiveness of our proposed loss and attention blocks, we conducted two ablation experiments. For the loss function, we verified the influence of LCS on the diagnosis results. The results are shown in Table 7.

Table 7 Diagnosis results using different loss functions (based on COVID-19 dataset-B)

The experimental results show that the cosine similarity loss LCS improves the diagnostic results; in particular, combined with the adaptive weighting method, it significantly improves accuracy.

We also performed ablation experiments on the influence of different attention blocks. Table 8 shows that MAB significantly improved the diagnostic effect of the network on COVID-19, and FAB further improved the diagnostic effect of the network on this basis. This result was consistent with the visualized result in Fig. 9.

Table 8 Diagnosis results using different attention blocks (based on COVID-19 dataset-B)

6 Discussion

In this work, we proposed a new Res-Layer structure for COVID-19 diagnosis on CT images named mutex attention Res-Layer. It is composed of MAB, FAB and a Res-Layer and can extract more distinguishable features to obtain better COVID-19 diagnosis results with the proposed adaptive weight loss. Experiments on four different COVID-19 datasets verified that our method can achieve better results than other state-of-the-art methods. This reflects the effectiveness and robustness of the proposed method.

For the more regular CT image datasets, COVID-19 dataset-A and COVID-19 dataset-B [41], the proposed network obtains 98.18% and 95.88% accuracy, a significant improvement over other state-of-the-art attention models. Furthermore, the proposed method achieves higher sensitivity and specificity, which means that it better avoids false negatives (FNs) and false positives (FPs) than other methods. As shown in Table 2, ResNet+CBAM [40] tends to obtain better specificity; in other words, it produces many FNs, which is not conducive to COVID-19 diagnosis. Although SE-Net [39] obtains very good sensitivity, only 0.6% lower than our method, it produces a large number of FPs, and its specificity is 4.2% lower than ours. Both experiments demonstrate the superiority of the proposed method.

For the noisy COVID-19 dataset-C [20], the proposed method also obtains satisfactory results. Compared with the methods proposed in [34] and [20], our method obtains higher accuracy, F1-score and AUC with fewer labels than [34] and without the transfer learning used in [20]. This shows that our method can still obtain satisfactory results even when the data are scarce and noisy, which fully demonstrates its robustness. On the other public dataset, COVID-19 dataset-D [42], more satisfactory results are obtained than with other state-of-the-art algorithms.

This article proposes two new attention blocks, MAB and FAB, and the experimental results indicate their effectiveness. In Table 8, compared with the basic ResNet, MAB significantly improves accuracy and specificity, by 4.36% and 10.24%, respectively. MAB amplifies the feature differences between categories and tends to produce fewer FPs. On this basis, FAB performs feature fusion to obtain more representative features, improving accuracy by 1.62% and sensitivity by 3.22% while leaving specificity nearly unchanged.

To confirm the experimental results, we visualized the feature maps of different categories extracted by the network together with their COVID-19 images in Fig. 8, using Grad-CAM [47] on the feature maps output by mutex attention Res-Layer4.

Fig. 8 Visualization results of the proposed network using Grad-CAM. The first row shows the input CT images; the second row shows the visualization results

The visualization results show that our method focuses on the lesion area accurately. For example, the lesion area in Column 3 is very small, while the lesion in Column 4 is diffuse across both lungs; the feature map visualizations of these two columns show that the proposed network is robust to lesions of different sizes. The network also localizes different types of lesions satisfactorily: the first column shows mixed ground-glass opacity, the third column ground-glass opacity, and the second column consolidation, and the high-response areas of the visualizations approach the lesion areas in each case. These results confirm that our method extracts representative features for effective COVID-19 diagnosis, which proves its effectiveness.

We also visualized the influence of the different attention blocks on the feature maps, again using the feature maps output by mutex attention Res-Layer4. Since their size is [2048, 7, 7], we averaged over the channels to obtain a 7 × 7 map and upsampled it to 224 × 224 using bilinear interpolation, as shown in Fig. 9. The figure shows that MAB focuses the features on the most significant lesion region but cannot completely cover some subsidiary lesion areas. For example, in the second row, the original feature map has a diffuse appearance over both lungs, but its coverage far exceeds the lesion area. As shown in the third column, MAB focuses the features mostly on the right lung, the most distinctive area; the highest-response region shrinks to the lesion area, but the lesion in the left lung is ignored. The FAB module provides an explainable remedy for this problem: it fuses the original feature maps with the MAB feature maps, which can also be described as feature selection along the channel dimension. In the same example, the visualization in the fourth column of the second row shows that FAB attends to the left lung as well as the right, enhancing the network's diagnostic effectiveness. Furthermore, the proposed method may help to precisely localize and segment lesions.
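The channel-averaging and upsampling step described above can be sketched in a few lines of PyTorch; the function name and signature are ours, introduced for illustration.

```python
import torch
import torch.nn.functional as F

def feature_heatmap(feat: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Average a [2048, 7, 7] feature map over channels, then
    bilinearly upsample to size x size for visualization."""
    heat = feat.mean(dim=0, keepdim=True).unsqueeze(0)  # -> [1, 1, 7, 7]
    heat = F.interpolate(heat, size=(size, size),
                         mode="bilinear", align_corners=False)
    return heat.squeeze()                               # [size, size] map
```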

Fig. 9 Visualization results of feature maps after using different attention blocks in mutex attention Res-Layer4. The first column shows the input CT images; the second, third and fourth columns show the visualizations of features after the Res-Layer, the MAB and the FAB, respectively

7 Conclusion

This paper proposed the COVID-19 diagnosis network MA-Net, which takes a pair of mutex CT images as input. In particular, this paper proposed a mutex attention block that aims to distinguish the features of mutex input pairs, allowing the network to extract distinguishable features and improve its diagnostic effect. A fusion attention block was then designed to perform feature fusion and further improve diagnostic accuracy. Regarding the loss function, the proposed network includes three losses: the cross-entropy classification losses of the two mutex inputs and the cosine similarity loss of the mutex input pairs. Adaptive weights are used to adjust the weights of these losses. Our method is very robust and achieved satisfactory results on multiple datasets. The accuracy, specificity, sensitivity and AUC reached 98.17%, 97.25%, 98.79% and 99.84% on our own COVID-19 dataset-A, and 94.88%, 95.39%, 94.33% and 99.00% on the public COVID-19 dataset-B. The diagnostic results of our method surpass those of state-of-the-art attention modules for classification. On the public COVID-19 dataset-C and COVID-19 dataset-D, we achieved better results than other excellent COVID-19 diagnosis methods. The experiments and analysis indicate that the proposed network can provide auxiliary quantitative analysis in COVID-19 diagnosis.