Abstract
In the context of COVID-19 pandemic prevention and control, it is of vital significance to realize accurate face mask detection via computer vision techniques. In this paper, a novel attention improved Yolo (AI-Yolo) model is proposed, which handles challenges common in complicated real-world scenarios: densely distributed targets, small-size objects, and interference from mask-like occlusions. In particular, a selective kernel (SK) module realizes a convolution-domain soft attention mechanism with split, fusion and selection operations; a spatial pyramid pooling (SPP) module enhances the expression of local and global features and enriches the receptive field information; and a feature fusion (FF) module promotes sufficient fusion of multi-scale features from each resolution branch while relying only on basic convolution operators without excessive computational complexity. In addition, the complete intersection over union (CIoU) loss function is adopted in the training stage for accurate positioning. Experiments are carried out on two challenging public face mask detection datasets, and the results demonstrate the superiority of the proposed AI-Yolo over seven other state-of-the-art object detection algorithms, achieving the best results in terms of mean average precision and F1 score on both datasets. Furthermore, the effectiveness of the meticulously designed modules in AI-Yolo is validated through extensive ablation studies. In short, the proposed AI-Yolo is competent to accomplish face mask detection tasks under extremely complex situations with precise localization and accurate classification.
Keywords: Face occlusion detection, Computer vision, Attention mechanism, Multi-scale feature fusion
1. Introduction
Against the background of the continuous mutation and transmission of Coronavirus Disease 2019 (COVID-19) [1], worldwide pandemic prevention work faces great challenges, and blocking the spread of the COVID-19 virus has become an increasingly significant and urgent issue. Since respiratory droplets are one of the primary routes of virus transmission, wearing masks is an effective and convenient way to defend against COVID-19. In public places, manual inspection can directly monitor whether people are wearing masks, but it wastes considerable manpower.
Benefiting from the rapid development of deep learning techniques [2], [3], [4], [5], [6], computer vision (CV)-based object detection algorithms can serve as competent alternatives, having flourished in the fields of recognition, detection, and segmentation for images/videos [7], [8], [9], [10]. As a result, it is promising to apply CV methods to mask detection tasks, where they may outperform manual inspection by simultaneously checking whether a mask is present, whether the mask has a protective function, and whether the mask is worn in the correct position.
In general, according to their design ideas, two-stage and one-stage methods are the two mainstream families of object detection algorithms. As the name suggests, two-stage methods realize object detection in two steps: the first generates region proposals, and the second derives detection bounding boxes from them. Pioneering two-stage methods are the region convolutional neural network (RCNN) series of algorithms [11], [12], [13], which have spawned many variants such as cascade RCNN [14], libra RCNN [15] and mask RCNN [16]. In one-stage methods, by contrast, the region proposal step is omitted, and the category probabilities and position coordinates of the targets are output directly from dense anchors. Representative work includes the single shot detector (SSD) [17] and the YOLO (You Only Look Once) series of algorithms [18], [19], [20], [21], [22]. It is noticeable that the third version of YOLO (YOLOv3) is a milestone work with wide applications [23], [24], combining advantages of both one- and two-stage methods. To be specific, YOLOv3 proposed a novel feature extraction network, DarkNet-53, which adopts three branches to enrich receptive field information and realize multi-scale object detection, and it further balances precision against inference speed.
Although the aforementioned algorithms have achieved some successful applications in face mask detection tasks [25], several important issues remain challenging, as listed below.
(a) It is difficult to distinguish masks from mask-like occlusions. In real-world scenes, people may wear veils or headscarves, or cover their mouths and noses with their palms, which may be wrongly identified as face masks by the algorithms.
(b) Detection scenes are often complex and changeable. In dense crowds, mask recognition becomes a challenging long-distance, small-object detection task. In addition, crowd movement and illumination changes can further degrade the quality of the input images. Such complex situations make accurate mask detection considerably tougher.
(c) It is also important to detect whether masks are worn correctly. Only a correctly worn mask can effectively block the spread of the virus, yet most existing face mask detection methods can only identify whether a mask is worn or not.
To handle the above issues, a novel attention improved Yolo (AI-Yolo) method based on the Yolo framework is proposed in this paper for face mask detection, which mainly aims at distinguishing masks from other occlusion objects in complex real-world scenarios. In particular, a selective kernel (SK) module is introduced to provide an adaptive attention mechanism in the convolution domain, and a spatial pyramid pooling (SPP) module is set to enhance the feature expression ability of the model for both local and global information. Moreover, the obtained multi-scale features are sufficiently fused in a feature fusion (FF) module so as to avoid performance degradation. In addition, the loss function in AI-Yolo takes various geometric parameters of the bounding boxes into consideration. It is noticeable that AI-Yolo can not only recognize whether there is a mask, but also detect whether the mask is correctly worn.
The main contributions of this paper are listed as follows.
(1) A novel face mask detection framework, AI-Yolo, is proposed, in which a convolution-domain soft attention mechanism, enhanced expression of local and global information, multi-scale feature fusion and accurate positioning capability are realized by the corresponding meticulously designed modules.
(2) The proposed AI-Yolo effectively addresses the shortcomings of current mask detection methods, including low precision in complex environments, difficulty in distinguishing masks from various other occlusions, and inability to accurately recognize whether masks are worn correctly.
(3) Experimental results on two public mask detection datasets fully demonstrate the superiority of AI-Yolo, which outperforms seven other state-of-the-art object detection algorithms, and the effectiveness of the components in AI-Yolo is validated through ablation studies as well.
The remainder of this paper is organized as follows. Related work is introduced in Section 2, and a comprehensive description of the proposed AI-Yolo is presented in Section 3, including the overall framework and elaborations of each vital component. Experimental results and in-depth discussions are provided in Section 4, and finally, conclusions are drawn in Section 5.
2. Related work
In this section, preliminaries of YOLOv3 algorithm are presented for a better understanding, and some representative mask detection methods are introduced as well.
2.1. Principle of YOLOv3 algorithm
YOLOv3 [20] is a masterpiece among one-stage detection algorithms, whose main innovations include an optimized feature extraction network and multi-scale training, achieving a trade-off between detection speed and accuracy. To be specific, YOLOv3 adopts the feature extractor DarkNet-53, which borrows the idea of skip connections from the residual network (ResNet) [26] and contains 53 layers with multiple residual blocks. Furthermore, DarkNet-53 alleviates expensive computation and improves inference speed by reducing both the number of convolution layers in residual blocks and the number of channels of the feature maps. In addition, similar to the feature pyramid network (FPN) [27], up-sampling operations are employed to fuse deep and shallow features, which outputs feature maps at multi-scale resolutions of 13 × 13, 26 × 26 and 52 × 52 (for a 416 × 416 input) so as to detect large, medium and small targets, respectively.
As for the multi-scale training process in YOLOv3, firstly, the input images are divided into grids of size 13 × 13, 26 × 26 and 52 × 52, each of which corresponds to a multidimensional tensor. Secondly, three anchors are assigned to every grid cell, where (5 + C) parameters are included in each anchor. Specifically, the first five parameters depict the horizontal and vertical coordinate values of the center point, the width and height of the anchor, and the confidence of the bounding box, and C is the number of categories. Finally, the bounding box of each target is generated, and the category probability is predicted as well.
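As a concrete illustration, the following sketch shows how one detection head's output tensor decomposes into S × S cells, B anchors and (5 + C) values per anchor. This is a hypothetical snippet, not the authors' code: the channel layout of the raw head output is an assumption, and C = 2 merely matches the two-class WMD-1 setting used later in the paper.

```python
import torch

# Hypothetical shape sketch of one YOLOv3 detection head (416 x 416 input).
# Each of the S x S grid cells carries B = 3 anchors, and every anchor holds
# (5 + C) values: x, y, w, h, objectness confidence, and C class scores.
C = 2                    # e.g., no_mask / mask as on the WMD-1 dataset
B, S = 3, 13             # coarsest scale; the 26 x 26 and 52 x 52 grids work alike
head = torch.randn(1, B * (5 + C), S, S)    # raw network output (assumed layout)
pred = head.view(1, B, 5 + C, S, S).permute(0, 1, 3, 4, 2)
print(pred.shape)        # torch.Size([1, 3, 13, 13, 7])
```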
It is noticeable that three parts are contained in the loss function of YOLOv3, including localization loss, confidence loss and classification loss. The last two are based on the cross-entropy loss function, and the first one is defined as follows:
$$L_{loc}=\lambda\sum_{i=0}^{S\times S}\sum_{j=0}^{B}I_{ij}^{obj}\,(2-w_i h_i)\Big[\big(\sigma(x)-\hat{x}\big)^{2}+\big(\sigma(y)-\hat{y}\big)^{2}+(w-\hat{w})^{2}+(h-\hat{h})^{2}\Big] \tag{1}$$

where $\sigma$ denotes the Sigmoid activation function and $\lambda$ is a weighting factor. For each divided grid, $S \times S$ is the size and $B$ is the number of generated prior bounding boxes. $I_{ij}^{obj}$ takes value from $\{0, 1\}$, indicating whether the $j$th prior box of the $i$th grid is responsible for predicting the target; the width and height ratios of the target to the input image are denoted as $w_i$ and $h_i$, respectively. For a prior bounding box, $w$ and $h$ represent the width and height, and $x$ and $y$ are the horizontal and vertical coordinates of the center point. Notice that the above four parameters are prediction values, and accordingly, $\hat{x}$, $\hat{y}$, $\hat{w}$, $\hat{h}$ are the true values of the corresponding parameters.
2.2. Existing mask detection methods
Recently, with the practical demands of pandemic control and the prosperous development of CV detection algorithms, plenty of methods have emerged for face mask detection tasks.
For one-stage frameworks, a YOLO-Mask detection algorithm has been proposed in [28], which integrates an attention mechanism into the feature extraction network to enhance the representation of salient features and thereby achieve better detection performance and robustness. In [29], the authors have proposed an improved mask-wearing detection model based on YOLOv3, where a spatial pyramid pooling (SPP) structure is adopted to promote feature fusion at different levels, and experimental results have shown the strong robustness of the proposed model in complex cases. In [30], [31], fine-grained semantic features and detail information have been obtained by broadening the feature pyramid structure and adding feature fusion paths, which is effective in tiny face mask detection scenarios. In [32], a novel face mask detection Yolo (FMD-Yolo) model has been developed, which meticulously designs the input, backbone, neck, detection head and post-processing parts. Experimental results have demonstrated the strong generalization ability of FMD-Yolo in various kinds of complicated scenes with high detection precision.
For two-stage methods, a transfer learning technique based on faster RCNN has been designed in [33] to realize automatic localization and type recognition of face masks. A multi-task learning framework based on mask RCNN and a fully convolutional network has been proposed in [34], which compensates for the deficiency that a single detection task cannot achieve pixel-wise face segmentation. In [35], a two-stage hybrid transfer learning approach has been developed, which first completes the coarse extraction of face mask candidate regions based on RCNN and InceptionV2, and then performs the mask-wearing detection task by designing a broad learning system with a binary classification model. It has been proven that the hybrid model can deliver excellent performance even in complicated scenarios.
3. Implementation details of AI-Yolo
In this section, the proposed AI-Yolo is elaborated with a detailed explanation of the three main modules therein, namely the SK, SPP and FF modules. To begin with, the overall framework of AI-Yolo is illustrated, and other implementation details are given subsequently.
3.1. Overall structure of the proposed AI-Yolo
In Fig. 1, the framework of AI-Yolo is illustrated. Its backbone network adopts the DarkNet-53 of YOLOv3, which consists of one DBL (DarkNet Conv2D + Batch Normalization + LeakyReLU) block and five Res-n (residual block with n Res units) blocks. It is noticeable that by stacking a large number of 1 × 1 and 3 × 3 convolution layers, DarkNet-53 can effectively reduce computational complexity while maintaining feature extraction ability.
According to Fig. 1, the outputs of the last three stages of DarkNet-53 are fed into three SK modules independently, where dynamic selection of convolution kernels is completed. In particular, by adaptively adjusting the weight distribution of convolution kernels over different feature maps, effective information differentiation under the kernel attention mechanism is realized in the SK module. Next, preliminary feature fusion is accomplished through up-sampling, concatenation operations and a DBL block, after which the outputs are sent into the SPP module, which is committed to promoting sufficient fusion of multiple receptive fields and enhancing the expression of both global and local information. Finally, the FF module is designed to fully integrate the enhanced features so as to improve the robustness of the model on multi-scale objects, and more details of each module are presented in the following subsections.
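The overall data flow can be summarized by the following simplified sketch. This is an illustrative outline, not the authors' implementation: the function and argument names are hypothetical, and the preliminary up-sample/concatenation/DBL plumbing between the SK and SPP stages is folded into the modules for brevity.

```python
def ai_yolo_forward(image, backbone, sk_modules, spp, ff, heads):
    # DarkNet-53 backbone: keep the outputs of the last three stages
    c3, c4, c5 = backbone(image)            # e.g., 52x52, 26x26, 13x13 maps
    # Kernel attention applied independently on each resolution branch
    p3, p4, p5 = (sk(c) for sk, c in zip(sk_modules, (c3, c4, c5)))
    # Receptive-field enrichment of the (preliminarily fused) deep features
    p5 = spp(p5)
    # Sufficient multi-scale feature fusion, then the three detection heads
    f3, f4, f5 = ff(p3, p4, p5)
    return heads[0](f3), heads[1](f4), heads[2](f5)
```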
3.2. Selective kernel (SK) attention mechanism
Since face masks and other occlusions have different shapes and sizes in real-world scenes, how to obtain dynamic weight adjustment information so that the detector adapts to different complex environments deserves further attention. Consequently, the SK attention mechanism [36] is employed in AI-Yolo, placed at the beginning of the branches with different input resolutions. The implementation of the SK module is illustrated in Fig. 2, where "Split", "Fuse" and "Select" operations are conducted; the details are given as follows, with a code sketch after the list.
• Split. An input feature map is split into two branches, where a group of 3 × 3 and 5 × 5 convolution kernels is set, respectively. Batch normalization is performed in both branches with ReLU as the activation function, and the feature maps obtained after transformation are denoted as $\tilde{U}$ and $\hat{U}$, respectively.

• Fuse. Global average pooling (GAP) and fully connected (FC) operations are carried out in sequence on $\tilde{U}$ and $\hat{U}$ so as to achieve global information fusion of the different kernel features, which can be depicted as:

$$z=\mathcal{F}_{fc}\big(\mathcal{F}_{gap}(\tilde{U}\oplus\hat{U})\big) \tag{2}$$

where $z$ is the output of the "Fuse" operation and $\oplus$ denotes element-wise summation.

• Select. Aiming at $z$, information with different spatial scales is adaptively selected as:

$$V_k=a_k\cdot\tilde{U}_k+b_k\cdot\hat{U}_k \tag{3}$$

where $V_k$ denotes the final feature map in the $k$th channel, and $a_k$ and $b_k$ represent the soft attention vectors of $\tilde{U}$ and $\hat{U}$, respectively, with $a_k + b_k = 1$.
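A compact PyTorch sketch of such a two-branch SK block is given below. It follows the split/fuse/select description above; the reduction ratio `r`, the minimum bottleneck width, and the use of a plain 5 × 5 kernel (rather than a dilated 3 × 3 as in the original SK paper) are assumptions of this sketch, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class SKModule(nn.Module):
    """Minimal two-branch selective kernel block (sketch)."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        d = max(channels // r, 32)  # reduced width of the "Fuse" FC layer (assumed)
        # Split: two convolution branches with different receptive fields
        self.conv3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.conv5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Fuse: global average pooling followed by a bottleneck FC layer
        self.fc_z = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        # Select: one FC head per branch, producing per-channel logits
        self.fc_a = nn.Linear(d, channels)
        self.fc_b = nn.Linear(d, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u_tilde, u_hat = self.conv3(x), self.conv5(x)       # Split
        u = u_tilde + u_hat                                 # element-wise sum
        z = self.fc_z(u.mean(dim=(2, 3)))                   # Fuse: GAP + FC, Eq. (2)
        # Select: softmax across the two branches enforces a_k + b_k = 1
        logits = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)  # (N, 2, C)
        attn = torch.softmax(logits, dim=1)
        a = attn[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = attn[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u_tilde + b * u_hat                      # Eq. (3)
```

Note that the softmax over the stacked branch logits is what realizes the constraint $a_k + b_k = 1$ in Eq. (3).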
3.3. Spatial pyramid pooling (SPP)
In the face mask occlusion detection task, the large differences among target sizes make accurate small-object (i.e., mask) detection extremely challenging. To handle this issue, a spatial pyramid pooling module [37] is applied in AI-Yolo to fuse local and global information, which effectively avoids the feature map distortion caused by down-sampling and other operations. As shown in Fig. 3, three groups of parallel max pooling operations constitute the main framework of the SPP module, as sketched below.
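A minimal sketch of such an SPP block follows; the pooling kernel sizes (5, 9, 13) are the common YOLOv3-SPP choice and are assumed here, since the paper does not list them in this excerpt. Stride-1 pooling with same-padding keeps the spatial resolution unchanged, so all branches can be concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class SPPModule(nn.Module):
    """Three parallel max-pooling branches plus an identity path (sketch)."""

    def __init__(self, pool_sizes=(5, 9, 13)):  # assumed kernel sizes
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity branch keeps local detail; the pooled branches add
        # progressively larger (more global) receptive fields.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```

Since the concatenation quadruples the channel count, a 1 × 1 convolution (e.g., a DBL block) would typically follow to restore the channel width.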
3.4. Feature fusion (FF) module
In general, shallow feature maps have a small receptive field with rich detail information, while deep feature maps have a large receptive field with abstract semantic information. Therefore, integrating the multi-scale features will undoubtedly benefit model performance; however, it is noticeable that a simple concatenation may cause performance degradation in some complicated situations. In this regard, the advantages of individual scale features are fully exploited in the FF module designed in AI-Yolo, whose structure is displayed in Fig. 4.
As is shown, convolution, up-sampling and down-sampling operations are mainly adopted in the FF module to promote effective multi-scale feature fusion without introducing too much complexity, which can be formulated as:
$$F_{out}^{(i)}=\mathcal{C}_{1\times1}\Big(F_{in}^{(i)}+\mathcal{U}\big(F_{in}^{(i+1)}\big)+\mathcal{D}\big(F_{in}^{(i-1)}\big)\Big) \tag{4}$$

where $F_{in}^{(i)}$ and $F_{out}^{(i)}$ represent the input and output feature maps of the $i$th resolution branch, $\mathcal{C}_{1\times1}$ denotes a 1 × 1 convolution with stride equaling one, and $\mathcal{U}$ and $\mathcal{D}$ refer to the up-sample operation (applied to the adjacent deeper branch) and the down-sample operation (applied to the adjacent shallower branch), respectively. As a result, the FF module can effectively avoid the performance degradation caused by the lack of semantic information or the missing of detailed features.
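The following sketch illustrates Eq. (4) for the middle resolution branch; the other two branches follow the same pattern. The channel widths, the use of average pooling for $\mathcal{D}(\cdot)$ and of nearest-neighbor interpolation for $\mathcal{U}(\cdot)$ are assumptions of this illustration, not choices confirmed by the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class FFMiddleBranch(nn.Module):
    """Fusion of three resolution branches onto the middle scale (sketch)."""

    def __init__(self, channels=(128, 256, 512)):  # assumed channel widths
        super().__init__()
        # 1x1 convolutions (stride 1) align channel counts before fusion
        self.align = nn.ModuleList(
            nn.Conv2d(c, channels[1], kernel_size=1, stride=1) for c in channels)

    def forward(self, fine, mid, coarse):
        # Bring every branch to the middle resolution: down-sample the finer
        # branch, up-sample the coarser one, then fuse by element-wise sum.
        f = F.avg_pool2d(self.align[0](fine), kernel_size=2)           # D(.)
        m = self.align[1](mid)
        c = F.interpolate(self.align[2](coarse), scale_factor=2,
                          mode="nearest")                              # U(.)
        return f + m + c                                               # Eq. (4)
```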
3.5. Loss function in AI-Yolo
It is worth mentioning that in addition to the above-mentioned SK, SPP and FF modules, which optimize the model from the perspective of structural configuration, the loss function also plays an important role in determining the overall model performance. Considering that the original loss function of YOLOv3 is susceptible to the influence of target scales, the complete intersection over union (CIoU) loss [38] is utilized in the proposed AI-Yolo, which takes three important geometric factors of the bounding box into account: overlap region, centroid distance and aspect ratio [39]. The CIoU loss is calculated as:
$$L_{CIoU}=1-IoU+\frac{\rho^{2}\big(b,b^{gt}\big)}{c^{2}}+\alpha\nu \tag{5}$$

where $b$ and $b^{gt}$ are the center points of the prediction box and the ground-truth one, respectively, and $\rho(\cdot)$ measures the Euclidean distance. $c$ denotes the diagonal length of the smallest outer rectangle covering the prediction box and the ground truth. $\alpha$ is a weighting factor and $\nu$ measures the similarity of the aspect ratios, which are calculated as:

$$\alpha=\frac{\nu}{(1-IoU)+\nu} \tag{6}$$

$$\nu=\frac{4}{\pi^{2}}\Big(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\Big)^{2} \tag{7}$$

where $w$, $h$ and $w^{gt}$, $h^{gt}$ are the width and height of the prediction box and the ground-truth box, respectively. The degree of overlap between the above two boxes is measured by $IoU$ as:

$$IoU=\frac{|A\cap B|}{|A\cup B|} \tag{8}$$

where $A$ and $B$ represent the area of the prediction box and the ground-truth box, respectively.
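Eqs. (5)–(8) translate directly into code. The sketch below computes the CIoU loss for batched boxes in corner format; treating $\alpha$ as a constant during back-propagation (the `torch.no_grad()` block) follows common reference CIoU implementations and is an assumption here.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss, Eqs. (5)-(8); pred/target are (N, 4) boxes (x1, y1, x2, y2)."""
    # Intersection and union areas -> IoU, Eq. (8)
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and enclosing-box diagonal c^2, Eq. (5)
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v, Eq. (7), and weighting factor alpha, Eq. (6)
    wp = pred[:, 2] - pred[:, 0]
    hp = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    wt = target[:, 2] - target[:, 0]
    ht = (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(wt / ht) - torch.atan(wp / hp)) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()   # Eq. (5)
```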
4. Experiments and discussions
In this section, extensive experiments are carried out to comprehensively verify the effectiveness and superiority of the developed AI-Yolo framework.
4.1. Experimental settings
To evaluate the performance of the proposed AI-Yolo, two public face mask detection datasets are used in this paper, denoted as WMD-1 and WMD-2 for convenience. The former is a combination of the WIDER FACE [40] and MAFA [41] datasets, and the latter is accessible at https://www.kaggle.com/datasets/andrewmvd/face-mask-detection. More details of the two datasets are provided in Table 1.
Table 1. Details of the two face mask detection datasets, with the number of labeled ground-truth boxes per category.

| Dataset | Training set | Validation set | no_mask | mask | without_mask | with_mask | mask_weared_incorrect |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WMD-1 | 6120 | 1530 | 7820 | 8318 | – | – | – |
| WMD-2 | 683 | 170 | – | – | 567 | 2388 | 107 |
To be specific, WMD-1 includes two categories, labeled no_mask and mask, whereas WMD-2 contains three classes: without_mask, with_mask and mask_weared_incorrect. It is worth mentioning that accurate classification on both datasets is arduous due to strong environmental interference such as small object sizes and dense distributions. In addition, there are 6120 and 1530 samples in the training and validation sets of WMD-1, and 683 and 170 samples in those of WMD-2, respectively.
Furthermore, during the training stage, the Adam optimizer is utilized to update the network weights, and 200 epochs are set for training, where the parameters of the backbone are frozen in the first 100 epochs. The initial learning rate is decayed by a factor of 0.1 for the last 100 epochs, and the initial batch size is set to 8, which is adjusted to 4 along with the learning rate decay.
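A sketch of this two-phase schedule is given below. Since the initial learning rate value is not stated in this excerpt, `base_lr` is a placeholder, and the `model.backbone` attribute is a hypothetical handle on the DarkNet-53 parameters.

```python
import torch

def configure_training(model, base_lr=1e-3):  # base_lr is a placeholder value
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def set_backbone_frozen(frozen: bool):
        # Freezing is realized by disabling gradients on the backbone weights
        for p in model.backbone.parameters():  # hypothetical attribute name
            p.requires_grad = not frozen

    return optimizer, set_backbone_frozen

# Two-phase schedule described in the text:
#   epochs   0-99 : backbone frozen, batch size 8
#   epochs 100-199: backbone unfrozen, batch size 4, lr decayed by 0.1
# if epoch == 100:
#     set_backbone_frozen(False)
#     for g in optimizer.param_groups:
#         g["lr"] *= 0.1
```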
All experiments in this study are run under the deep learning framework PyTorch 1.2.0 with Python 3.6. The hardware configuration is an Nvidia GeForce GTX 1080Ti graphics card with 11 GB of video memory and an Intel Xeon(R) E5-2682 processor on the Windows 10 operating system.
4.2. Evaluation metrics
In an object detection task, it is common to evaluate the performance of algorithms from two aspects: classification accuracy and localization precision. In this paper, mean average precision (mAP) and F1 score are selected as evaluation metrics to measure the performance of the proposed AI-Yolo. Both indicators are based on precision (P) and recall (R), where the former indicates, among all samples predicted as a certain category, how many predictions are correct, whereas the latter indicates, among all samples belonging to a certain category, how many are correctly classified. mAP is calculated as:
$$mAP=\frac{1}{N}\sum_{i=1}^{N}\int_{0}^{1}P_i(R_i)\,dR_i \tag{9}$$

where $N$ is the number of target categories, and $P_i$ and $R_i$ stand for the precision and recall values of the $i$th category, respectively.
Another metric, the F1 score, is the harmonic mean of $P$ and $R$, which is regarded as a criterion for evaluating generalization ability and can be obtained as:

$$F1=\frac{2\times P\times R}{P+R} \tag{10}$$
In addition, frames per second (FPS) is employed to measure the model inference speed, and the model complexity is evaluated by the model size and the number of giga floating-point operations (GFLOPs) in this study.
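For reference, a minimal NumPy sketch of the AP/F1 computation behind Eqs. (9) and (10) is given below. This is an illustration, not the authors' evaluation code; the all-point interpolation of the P–R curve is an assumption.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP of one class: area under the P-R curve (the integral in Eq. (9)).

    scores: detection confidences; is_tp: 1 if a detection matches a
    ground-truth box at the IoU threshold, else 0; num_gt: number of
    ground-truth boxes of this class.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # Make precision monotonically non-increasing, then integrate over recall
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))

def f1_score(p, r):
    """Harmonic mean of precision and recall, Eq. (10)."""
    return 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0

# mAP (Eq. (9)) is the mean of the per-class APs, e.g.:
# m_ap = np.mean([average_precision(s, t, n) for (s, t, n) in per_class_stats])
```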
4.3. Results and analysis
First of all, the training loss curves of the proposed AI-Yolo and the original YOLOv3, before and after smoothing, are displayed in Fig. 5, which reflect the convergence tendency during training. As is shown, the proposed AI-Yolo reaches a lower stable loss value than YOLOv3, which means that the predictions of AI-Yolo are more accurate and its convergence ability is stronger.
To further verify the superiority of AI-Yolo, several other state-of-the-art one- and two-stage algorithms are employed for comparison on the same datasets, including Faster R-CNN(ResNet50) [13], Faster R-CNN(VGG16) [13], the deconvolutional single shot detector (DSSD) [42], EfficientDet-D0 [43], YOLOv3 [20], YOLOv4 [21] and RetinaNet [44]. Comparison results in terms of AP, mAP and FPS on the two datasets are reported in Table 2 and Table 3, respectively, where the IoU threshold between target and prediction boxes is set to 0.5.
Table 2. Comparison results on the WMD-1 dataset (IoU threshold 0.5).

| Models | AP (no_mask) | AP (mask) | mAP | FPS |
| --- | --- | --- | --- | --- |
| Faster R-CNN(ResNet50) | 0.876 | 0.788 | 0.832 | 18.7 |
| Faster R-CNN(VGG16) | 0.865 | 0.863 | 0.864 | 20.64 |
| DSSD | 0.811 | 0.781 | 0.796 | 48.25 |
| EfficientDet-D0 | 0.87 | 0.826 | 0.845 | 36.17 |
| YOLOv3 | 0.911 | 0.854 | 0.878 | 41.34 |
| YOLOv4 | 0.938 | 0.905 | 0.921 | 45.72 |
| RetinaNet | 0.924 | 0.89 | 0.906 | 28.69 |
| AI-Yolo(Ours) | 0.927 | 0.956 | 0.941 | 42.86 |
Table 3. Comparison results on the WMD-2 dataset (IoU threshold 0.5).

| Models | AP (without_mask) | AP (with_mask) | AP (mask_weared_incorrect) | mAP | FPS |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN(ResNet50) | 0.806 | 0.831 | 0.827 | 0.821 | 18.72 |
| Faster R-CNN(VGG16) | 0.817 | 0.822 | 0.836 | 0.825 | 20.63 |
| DSSD | 0.716 | 0.788 | 0.786 | 0.763 | 48.24 |
| EfficientDet-D0 | 0.813 | 0.83 | 0.829 | 0.824 | 36.14 |
| YOLOv3 | 0.801 | 0.846 | 0.834 | 0.827 | 41.36 |
| YOLOv4 | 0.851 | 0.895 | 0.823 | 0.856 | 45.74 |
| RetinaNet | 0.786 | 0.817 | 0.806 | 0.803 | 28.7 |
| AI-Yolo(Ours) | 0.862 | 0.945 | 0.917 | 0.907 | 42.87 |
According to Table 2 and Table 3, on dataset WMD-1 the proposed AI-Yolo achieves the best mAP of 94.1%, improving on the sub-optimal model YOLOv4 by 2%. On dataset WMD-2, the proposed AI-Yolo also achieves the best mAP of 90.7%, outperforming the sub-optimal model by nearly 5.1%. In terms of detection speed, AI-Yolo ranks third out of the eight models, which is still a satisfactory result. Therefore, it can be concluded that the proposed AI-Yolo is a competitive object detection model with excellent overall performance, which may be attributed to the three designed modules effectively highlighting important features and reducing redundant information. Moreover, the improved loss function may also promote stable bounding box regression and accurate localization.
According to Fig. 6(a), the proposed AI-Yolo also balances precision and recall well, keeping relatively high F1 scores on each category. To be specific, on both categories of the WMD-1 dataset, the F1 score obtained by AI-Yolo exceeds 90%, the best among all eight algorithms. On dataset WMD-2, compared with the advanced YOLOv4 model, AI-Yolo obtains an equivalent F1 score on one category, while on the other two categories AI-Yolo yields better results, showing the superiority of the proposed model. In addition, the average F1 scores over all categories on the two datasets are illustrated in Fig. 6(b), and it can be clearly observed that the proposed AI-Yolo obtains the best mean F1 scores of 90% and 79% on datasets WMD-1 and WMD-2, improving on the second-ranked YOLOv4 model by 2% and 7%, respectively.
It is noticeable that the three-category classification task of dataset WMD-2 is more difficult than that of WMD-1, yet the improvement in F1 score on WMD-2 is even more significant than that on WMD-1. The reason may lie in the designed SK, SPP and FF modules, where sufficient fusion of multi-scale enhanced features is achieved, which further promotes the model performance in dealing with interference from similar occlusions and in handling small-object detection. Consequently, the proposed AI-Yolo is shown to have strong robustness and to adapt well to complicated face mask detection scenarios.
Furthermore, the precision–recall (P–R) curves of AI-Yolo on the WMD-1 and WMD-2 datasets are presented in Fig. 7, which provide a category-wise evaluation over the five classes listed in Table 1. It is worth mentioning that the P–R curve of a class comprehensively reflects model performance, where the area enclosed by the curve and the coordinate axes represents the average precision (AP) of the corresponding class. In Fig. 7, for a better view, this area has been filled with blue shading, and the AP values of the five categories are 92.73%, 95.56%, 86.15%, 94.37% and 91.66%, respectively. In general, precision tends to decrease as the recall threshold increases, but the developed AI-Yolo is able to maintain a high level of $P$ on each category as $R$ increases, which further demonstrates the practicality of the developed AI-Yolo model in complex and uncertain detection scenes with various kinds of mask occlusions. In Fig. 8, the P–R curves of the other models are illustrated as well, and the P–R curve obtained by the proposed AI-Yolo encloses the largest area with the coordinate axes on both datasets, indicating strong robustness and satisfactory generalization ability.
In addition, the model complexity of the proposed AI-Yolo is compared with that of the seven other advanced models, characterized by model size and GFLOPs, and the results are displayed in Table 4.
Table 4. Comparison of model complexity.

| Models | Model size (MB) | GFLOPs (G) |
| --- | --- | --- |
| Faster R-CNN(ResNet50) | 110 | 297.25 |
| Faster R-CNN(VGG16) | 137 | 370.21 |
| DSSD | 26 | 62.747 |
| EfficientDet-D0 | 4 | 5.234 |
| YOLOv3 | 235 | 66.171 |
| YOLOv4 | 244 | 60.527 |
| RetinaNet | 139 | 68.809 |
| AI-Yolo(Ours) | 80 | 92.69 |
As can be seen from Table 4, AI-Yolo has a model size of 80 MB, which is 155 MB smaller than that of the classical YOLOv3 model. Combined with the results in Table 2 and Table 3, the mAP of AI-Yolo is 6% and 8% higher than that of YOLOv3 on the two datasets, which shows that the developed AI-Yolo can effectively balance computational cost and detection accuracy. This may be because the proposed AI-Yolo effectively reduces parameters by employing the SK attention mechanism, which enriches receptive field information so as to enhance the representation of both local and global features. Although the proposed AI-Yolo is not the lightest model, its size is still 27.5%, 41.8%, 67.3% and 42.6% smaller than that of Faster R-CNN(ResNet50), Faster R-CNN(VGG16), YOLOv4 and RetinaNet, respectively, while yielding better accuracy than the above models. Nevertheless, the proposed AI-Yolo consumes relatively many GFLOPs (92.69 G), which is probably due to the large number of up- and down-sampling operations in the FF module: although sufficient multi-scale information fusion is realized, extra computational cost is introduced as well. Still, according to Fig. 5, the proposed AI-Yolo presents considerable convergence, and its inference speed is even slightly faster than the original YOLOv3 model, indicating an acceptable balance between model complexity and detection accuracy.
In Fig. 9, some visualization results of the developed AI-Yolo on the two datasets are displayed, where the first and second rows present results on datasets WMD-1 and WMD-2, respectively. It should be highlighted that many mask detection scenes are complicated, characterized by high density, multiple disturbances and target confusion. Take Fig. 9(e) as an example, which contains various face targets with multiple occlusion types; the detection challenges simultaneously include small object sizes, different face orientations and incomplete target display. According to the results, all targets in Fig. 9(e) are correctly identified with high confidence, which demonstrates that AI-Yolo presents excellent detection ability and strong robustness in extremely complicated situations.
4.4. Ablation study
In this subsection, ablation studies are carried out to further validate the effectiveness of each module in AI-Yolo, where the original YOLOv3 model is selected as the baseline. In particular, three variants of AI-Yolo are designed, namely YOLOv3-1 (YOLOv3+SK), YOLOv3-2 (YOLOv3+SK+SPP) and YOLOv3-3 (YOLOv3+SK+SPP+FF). It is noticeable that the only distinction between YOLOv3-3 and AI-Yolo is the applied loss function (see Section 3.5). Other basic experimental settings remain unchanged, and the results are reported in Table 5.
Table 5. Ablation study results on the WMD-1 and WMD-2 datasets.

| Models | mAP (WMD-1) | mAP (WMD-2) | F1 (WMD-1) | F1 (WMD-2) |
| --- | --- | --- | --- | --- |
| YOLOv3 | 0.878 | 0.827 | 0.757 | 0.689 |
| YOLOv3-1 | 0.889 (↑1.1%) | 0.865 (↑3.8%) | 0.774 (↑1.7%) | 0.728 (↑3.9%) |
| YOLOv3-2 | 0.911 (↑2.2%) | 0.879 (↑1.4%) | 0.846 (↑7.2%) | 0.769 (↑4.1%) |
| YOLOv3-3 | 0.923 (↑1.2%) | 0.901 (↑2.2%) | 0.874 (↑2.8%) | 0.774 (↑0.5%) |
| AI-Yolo (Ours) | 0.941 (↑1.8%) | 0.907 (↑0.6%) | 0.893 (↑1.9%) | 0.786 (↑1.2%) |
In Table 5, the symbol ↑ denotes the performance improvement over the model in the previous row. As can be seen, after successively adding the designed SK, SPP and FF modules, there is a steady increase in both mAP and F1 score on the two datasets compared with the baseline model, which implies that the three modules do contribute to enhancing model performance. To be specific, on the WMD-1 dataset, the SPP module has a significant influence on both mAP and F1 score, whereas on the WMD-2 dataset, the SPP module shows the strongest positive effect on F1 score, while mAP is mostly promoted by the SK module. When the designed FF module and CIoU loss function are further introduced, the proposed AI-Yolo model obtains the best results on the two datasets in terms of both evaluation metrics, which verifies that the strategies adopted in AI-Yolo play important roles in extracting local and global key features, obtaining rich receptive field information and promoting enhanced feature fusion.
Based on the above discussions, the developed AI-Yolo is proven to deliver considerable performance in terms of both accurate recognition and precise localization, and it is competent to detect face masks under complex circumstances. In the future, we aim to (1) optimize the model structure from the perspective of systems science [45] to pursue a balance between computational complexity and detection accuracy; (2) employ heuristic optimization algorithms [46] to realize automatic hyper-parameter adjustment; and (3) apply the proposed AI-Yolo model to other detection scenarios, such as industrial defect inspection [47], to further verify its practicality.
5. Conclusion
In this paper, a novel face mask detection framework, AI-Yolo, has been put forward to enhance model performance in complex real-world environments with small targets and multiple occlusion interferences. On the basis of the YOLOv3 structure, three components have been designed and embedded: the SK kernel attention mechanism, the SPP information enhancement module, and the FF multi-scale feature fusion module. In brief, adaptive adjustment of convolution kernel sizes according to multi-scale input features is achieved in the SK module, and the subsequent SPP module further enriches receptive field information and enhances feature expression ability after concatenation in each branch of the detection head. Finally, up- and down-sampling with 1 × 1 and 3 × 3 convolution operations are adopted in the FF module to promote effective fusion of the enhanced multi-scale features without greatly increasing the computational cost. In addition, the CIoU loss has replaced the original loss function of YOLOv3 to achieve more accurate positioning. The performance of the developed AI-Yolo has been evaluated on two public face mask detection datasets containing complicated and challenging real-world tasks. Experimental results have shown the superiority of the proposed AI-Yolo over other state-of-the-art models, and the effectiveness of the core modules has been validated through ablation studies as well.
CRediT authorship contribution statement
Hongyi Zhang: Conceptualization, Resources, Writing – review & editing. Jun Tang: Conceptualization, Methodology, Software, Validation, Investigation. Peishu Wu: Methodology, Software, Investigation, Writing – original draft. Han Li: Validation, Writing – original draft. Nianyin Zeng: Project administration, Methodology, Investigation, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
This work was supported in part by the Natural Science Foundation of China under Grant 62073271, the Science and Technology Research Program of Chongqing Municipal Education Commission, China under Grant KJQN202001319, the National Science and Technology Major Project, China under Grant J2019-I-0013-0013, the Independent Innovation Foundation of AECC, China under Grant ZZCX-2018-017, and the Korea Foundation for Advanced Studies, South Korea.
Data availability
Data will be made available on request
References
- 1.Zhao X., Li X., Nie C. Backtracking transmission of COVID-19 in China based on big data source, and effect of strict pandemic control policy. Bull. Chin. Acad. Sci. 2020;35(03):248–255. [Google Scholar]
- 2.Szankin M., Kwasniewska A. Can AI see bias in X-ray images? Int. J. Netw. Dyn. Intell. 2022;1(01):48–64. [Google Scholar]
- 3.Yu N., Yang R., Huang M. Deep common spatial pattern based motor imagery classification with improved objective function. Int. J. Netw. Dyn. Intell. 2022;1(01):73–84. [Google Scholar]
- 4.Hao X., Zhang G., Ma S. Deep learning. Int. J. Semant. Comput. 2020;10(03):248–255. [Google Scholar]
- 5.Tao H., Tan H., Chen Q., Liu H., Hu J. State estimation for memristive neural networks with randomly occurring DoS attacks. Syst. Sci. Control Eng. 2022;10(01):154–165. [Google Scholar]
- 6.Ju Y., Tian X., Liu H., Ma L. Fault detection of networked dynamical systems: a survey of trends and techniques. Internat. J. Systems Sci. 2021;52(16):3390–3409. [Google Scholar]
- 7.Zhang H., Li Y., Guan W., Li J., Zheng J., Zhang X. The optical fringe code modulation and recognition algorithm based on visible light communication using convolutional neural network. Signal Process., Image Commun. 2019;75:128–140. [Google Scholar]
- 8.Benini S., Khan K., Leonardi R., Mauro M., Migliorati P. Face analysis through semantic face segmentation. Signal Process., Image Commun. 2019;74:21–31. [Google Scholar]
- 9.Cao Y., Fu G., Yang J., Cao Y., Yang M. Accurate salient object detection via dense recurrent connections and residual-based hierarchical feature integration. Signal Process., Image Commun. 2019;78:103–112. [Google Scholar]
- 10.Lu P., Song B., Xu L. Human face recognition based on convolutional neural network and augmented dataset. Syst. Sci. Control Eng. 2021;9(s2):29–37. [Google Scholar]
- 11.Girshick R., Donahue J., Darrell T., Malik J. 2014 IEEE Conference on Computer Vision and Pattern Recognition. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation; pp. 248–255. [Google Scholar]
- 12.Girshick R. 2015 IEEE International Conference on Computer Vision. 2015. Fast R-CNN; pp. 1440–1448. [Google Scholar]
- 13.Ren S., He K., Girshick R., Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017;39(06):1137–1149. doi: 10.1109/TPAMI.2016.2577031. [DOI] [PubMed] [Google Scholar]
- 14.Cai Z., Vasconcelos N. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. Cascade R-CNN: delving into high quality object detection; pp. 6154–6162. [Google Scholar]
- 15.Pang J., Chen K., Shi J., Feng H., Ouyang W., Lin D. Libra R-CNN: towards balanced learning for object detection. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; CVPR; 2019. pp. 821–830. [Google Scholar]
- 16.He K., Gkioxari G., Dollár P., Girshick R. Mask R-CNN. 2017 IEEE International Conference on Computer Vision; ICCV; 2017. pp. 2980–2988. [Google Scholar]
- 17.Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C., Berg A. Computer Vision – ECCV 2016. 2016. SSD: single shot multiBox detector; pp. 21–37. [Google Scholar]
- 18.Redmon J., Divvala S., Girshick R., Farhadi A. You Only Look Once: unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition; CVPR; 2016. pp. 779–788. [Google Scholar]
- 19.Redmon J., Farhadi A. YOLO9000: better, faster, stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition; CVPR; 2017. pp. 6517–6525. [Google Scholar]
- 20.Redmon J., Farhadi A. 2018. YOLOv3: an incremental improvement. arXiv:1804.02767, [online] Available: https://arxiv.org/abs/1804.02767. [Google Scholar]
- 21.Bochkovskiy A., Wang C., Mark Liao H. 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934, [online] Available: https://arxiv.org/abs/2004.10934. [Google Scholar]
- 22.Zhu X., Lyu S., Wang X., Zhao Q. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. 2021 IEEE/CVF International Conference on Computer Vision Workshops; ICCVW; 2021. pp. 2778–2788. [Google Scholar]
- 23.Gan F., Yang S., Jiang S. Detection of white blood cells using YOLOV3 network. Processing of 2019 14TH IEEE International Conference on Electronic Measurement & Instruments; ICEMI; 2019. pp. 1683–1688. [Google Scholar]
- 24.Zhao H., Zhou Y., Zhang L., Peng Y., Hu X., Peng H., Cai X. Mixed YOLOv3-LITE: a lightweight real-time object detection method. Sensors (Basel) 2020;20(07) doi: 10.3390/s20071861. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Chen Y., Hu M., Hua C., Zhai G., Zhang J., Li Q., Yang S. Face mask assistant: detection of face mask service stage based on mobile phone. IEEE Sens. J. 2021;21(09):11084–11093. doi: 10.1109/JSEN.2021.3061178. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition; CVPR; 2016. pp. 770–778. [Google Scholar]
- 27.Lin T., Dollár P., Girshick R., He K., Hariharan B., Belongie S. Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition; CVPR; 2017. pp. 936–944. [Google Scholar]
- 28.Cao C., Yuan J. Mask-wearing detection method based on YOLO-mask. Laser Optoelectron. Prog. 2020;58(08):211–218. [Google Scholar]
- 29.Wang Y., Ding H., Li B., Yang Z., Yang J. Mask wearing detection algorithm based on improved YOLOv3 in complex scenes. Comput. Eng. 2020;46(11):12–22. [Google Scholar]
- 30.Zeng C., Yu J., Zhang Y. Improved YOLOv3 detection algorithm for mask wearing. Comput. Eng. Des. 2021;42(05):1455–1462. [Google Scholar]
- 31.Zhang L., Deng C. Multi-scale fusion of YOLOv3 crowd mask wearing detection method. Comput. Eng. Appl. 2021;57(16):283–290. [Google Scholar]
- 32.Wu P., Li H., Zeng N., Li F. FMD-Yolo: an efficient face mask detection method for COVID-19 prevention and control in public. Image Vis. Comput. 2022;117 doi: 10.1016/j.imavis.2021.104341. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Sabir M., Mehmood I., Alsaggaf W., Khairullah E., Alhuraiji S., Alghamdi A., Abd E. An automated real-time face mask detection system using transfer learning with Faster-RCNN in the era of the COVID-19 pandemic. Comput. Mater. Contin. 2022;71(02) [Google Scholar]
- 34.Lin K., Zhao H., Lv J., Li C., Liu X., Chen R., Zhao R. Face detection and segmentation based on improved Mask R-CNN. Discrete Dyn. Nat. Soc. 2020;46:274–280. [Google Scholar]
- 35.Wang B., Zhao Y., Chen C. Hybrid transfer learning and broad learning system for wearing mask detection in the COVID-19 era. IEEE Trans. Instrum. Meas. 2021;70:1–12. doi: 10.1109/TIM.2021.3069844. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Li X., Wang W., Hu X., Yang J. Selective kernel networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; CVPR; 2019. pp. 510–519. [Google Scholar]
- 37.He K., Zhang X., Ren S., Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015;37(09):1904–1916. doi: 10.1109/TPAMI.2015.2389824. [DOI] [PubMed] [Google Scholar]
- 38.Zheng Z., Wang P., Liu W., Li J., Ye R., Ren D. Distance-IoU loss: faster and better learning for bounding box regression. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, no. 07; 2020. pp. 12993–13000. [Google Scholar]
- 39.Zheng Z., Wang P., Ren D., Liu W., Ye R., Hu Q., Zuo W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021:1–13. doi: 10.1109/TCYB.2021.3095305. [DOI] [PubMed] [Google Scholar]
- 40.Yang S., Luo P., Loy C., Tang X. WIDER FACE: a face detection benchmark. IEEE Conference on Computer Vision and Pattern Recognition; CVPR; 2016. pp. 5525–5533. [Google Scholar]
- 41.Ge S., Li J., Ye Q., Luo Z. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. Detecting masked faces in the wild with LLE-CNNs; pp. 2682–2690. [Google Scholar]
- 42.Fu C., Liu W., Ranga A., Tyagi A., Berg A.C. 2017. DSSD: deconvolutional single shot detector. arXiv:1701.06659, [online] Available: https://arxiv.org/abs/1701.06659. [Google Scholar]
- 43.Tan M., Pang R., Le Q.V. EfficientDet: scalable and efficient object detection. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; CVPR; 2020. pp. 10778–10787. [Google Scholar]
- 44.Lin T., Goyal P., Girshick R., He K., Dollár P. Focal loss for dense object detection. 2017 IEEE International Conference on Computer Vision; ICCV; 2017. pp. 2999–3007. [Google Scholar]
- 45.Li H., Wu P., Zeng N., Liu Y., Alsaadi F.E. A survey on parameter identification, state estimation and data analytics for lateral flow immunoassay: from systems science perspective. Internat. J. Systems Sci. 2022 doi: 10.1080/00207721.2022.2083262. [online] Available: [DOI] [Google Scholar]
- 46.Li H., Li J., Wu P., You Y., Zeng N. A ranking-system-based switching particle swarm optimizer with dynamic learning strategies. Neurocomputing. 2022;494:356–367. [Google Scholar]
- 47.Zeng N., Wu P., Wang Z., Li H., Liu W., Liu X. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 2022;71 [Google Scholar]