4.2. YOLOX Network
YOLOX is one of the most popular one-stage anchor-free object detection methods because it detects both large and small objects without anchors. For a moving vehicle, the boxes obtained from a roadside camera vary drastically with distance, which increases the applicability of the anchor-free detection method. In addition, for anchor-based methods, a single anchor box may correspond to multiple IDs, or multiple anchor boxes may correspond to a single ID, introducing a great deal of ambiguity during the training of Re-ID features, which is not optimal for training the model. YOLOX also provides excellent detection accuracy. Object detection focuses on acquiring inter-class information, whereas Re-ID focuses on differentiating intra-class information, so there is a conflict between the two tasks that makes learning them simultaneously challenging. However, a more precise box for the detected sample produces higher detection accuracy, which can in turn result in higher Re-ID accuracy.
Afterward, the two tasks of vehicle detection and vehicle Re-ID are performed simultaneously. The YOLOX detection head is used for detection, achieving excellent accuracy. The two designed modules are then added to complete the vehicle search task. The specific network model framework is shown in
Figure 4.
4.3. Detection Branch
In object detection, classification and regression tasks frequently conflict with each other, which is a well-known issue [
31]. In this section, the YOLOX detection head is employed. The detection head is set to a decoupled structure, and the regression and classification are output separately, which significantly accelerates the model’s convergence.
- (1)
Creating Corresponding Alignment Entities
In the original YOLOX model, features from different levels are used to detect objects of different sizes, which significantly improves the detection accuracy. However, for the Re-ID task, the Re-ID features obtained at different stages are distinct and carry different background features, which has a significant impact on the learned discrimination ability. Using multiple stages for detection also increases model complexity and slows training, neither of which is conducive to the subsequent Re-ID task. Even though low-order features carry less semantic information, they contain sufficient location information. Therefore, the detection framework based on FPN [
32] is modified, low-order and high-order features are combined, and detection with a single detection head is performed. The structure of the detection head is shown in
Figure 5.
To connect the two parts laterally, the $\{C_3, C_4, C_5\}$ features from the ResNet-50 backbone are utilized, and each stage is then upsampled to obtain the $\{P_3, P_4, P_5\}$ features. Here, a 3 × 3 deformable convolution is employed, which can better adapt and adjust the receptive field on the input feature map to produce increasingly precise feature maps, where two 1 × 1 convolutions are used at the lateral connections $\{C_3, C_4, C_5\}$ and 3 × 3 convolutions are used at the outputs $\{P_3, P_4, P_5\}$. Each $P_i$ is represented as the concatenation of the two corresponding features for improved multi-level feature aggregation. In order to achieve a good balance between the performance of the two subtasks of detection and Re-ID, only the largest feature map, generated at $P_3$, is used for detection, sacrificing a certain amount of detection performance. The specific results are detailed in
Section 4.3.
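The lateral merge described above can be sketched with NumPy. This is a minimal illustration assuming nearest-neighbor 2× upsampling followed by channel-wise concatenation; the stage names, channel counts, and spatial sizes are illustrative assumptions, not the paper's exact configuration (in particular, the 1 × 1 and deformable convolutions are omitted):

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x spatial upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def lateral_merge(low, high):
    """FPN-style lateral aggregation: concatenate a low-order feature with the
    upsampled high-order feature along the channel axis."""
    return np.concatenate([low, upsample2x(high)], axis=0)

# Illustrative shapes: low-order C-stage (256, 80, 80), high-order stage (256, 40, 40)
low = np.zeros((256, 80, 80), dtype=np.float32)
high = np.ones((256, 40, 40), dtype=np.float32)
merged = lateral_merge(low, high)  # (512, 80, 80): doubled channels, low-order resolution
```

The merged map keeps the low-order stage's spatial resolution (useful location information) while carrying the high-order stage's semantic channels, matching the single-detection-head design above.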
- (2)
Detection Loss Calculation
GIoU loss is used as the bounding-box regression loss, IoU_loss, when calculating the detection branch's loss:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{GIoU} = \mathrm{IoU} - \frac{|C| - |A \cup B|}{|C|}, \qquad \mathrm{IoU}_{loss} = 1 - \mathrm{GIoU},$$

where $A$ and $B$ are the boxes for calculating $\mathrm{GIoU}$, and $C$ is the outermost (smallest enclosing) box of $A$ and $B$.
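As an illustration, GIoU and its loss can be computed in plain Python for boxes given as (x1, y1, x2, y2) corners. This is a sketch of the standard formula, not the paper's implementation:

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def giou(a, b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle of a and b
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    # C: smallest box enclosing both a and b
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area

def giou_loss(a, b):
    return 1.0 - giou(a, b)
```

Unlike plain IoU, GIoU stays informative (negative) even for non-overlapping boxes, which gives the regression a gradient when the prediction is far from the ground truth.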
BCE loss (Binary Cross-Entropy loss) is utilized for both the object confidence loss, Obj_loss, and the classification loss, Cls_loss:

$$L_{BCE} = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right],$$

where $\hat{y}$ is the model output value, whose size must be between 0 and 1, and $y$ is the real label.
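The BCE formula above can be written directly in Python (a minimal single-sample sketch; the clamping epsilon is an implementation detail added here for numerical stability, not from the paper):

```python
import math

def bce(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one prediction y_hat in (0, 1) and label y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))
```

The loss shrinks as the prediction approaches the label: a confident correct output (e.g. 0.9 for a positive) is penalized far less than an uncertain one.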
4.4. Re-ID Branch
As part of class-based feature comparison, the Re-ID branch is used to extract more discriminative features between vehicles. To accomplish this objective, two modules are designed, each addressing a separate aspect of the Re-ID branch.
- (1)
Camera Grouping Module
Typically, the dataset for a search task is collected from multiple cameras, so multiple perspectives of the same vehicle can be obtained. However, due to the varying installation positions of the cameras, the pictures they capture exhibit significant differences in color, saturation, and brightness. As a result, a camera embedding module is proposed that employs the camera ID for simple grouping and imparts camera information into the features for aggregation, in order to distinguish internal differences between cameras. The insertion position of the camera grouping module is shown in
Figure 4.
Specifically, the dataset contains $N$ cameras, denoted as $\{cam_1, cam_2, \ldots, cam_N\}$. To initialize the module, a randomly generated sequence is utilized. Following initialization, the camera embedding is obtained as $E = \{e_1, e_2, \ldots, e_N\}$, where $e_i \in \mathbb{R}^{C \times H \times W}$, and $H$ and $W$ are the height and width of the corresponding image in the current channel $C$, respectively. The corresponding camera embedding feature for a photo captured by camera $cam_i$ can therefore be expressed as $e_i$. The camera embedding feature $e_i$ is passed to the backbone, and the following expression is obtained:

$$f = f_0 + \lambda e_i,$$

where $f_0$ is the initial backbone feature and $\lambda$ is a balancing hyperparameter of the module; its best value is determined experimentally. Through the incorporation of this module, camera clustering is completed to minimize the impact of camera differences.
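The mixing step can be sketched in a few lines of NumPy. The shapes, seed, and the value λ = 0.1 are illustrative assumptions (the paper tunes λ); only the form f = f₀ + λ·eᵢ comes from the text:

```python
import numpy as np

def add_camera_embedding(f0, cam_embeds, cam_id, lam=0.1):
    """Mix the initial backbone feature f0 with the embedding of the capturing
    camera: f = f0 + lam * e_i (lam is a tunable balancing hyperparameter)."""
    return f0 + lam * cam_embeds[cam_id]

rng = np.random.default_rng(0)
C, H, W, n_cams = 8, 4, 4, 3
# Randomly initialized camera embeddings, one per camera, as in the module
cam_embeds = rng.standard_normal((n_cams, C, H, W))
f0 = np.zeros((C, H, W))
f = add_camera_embedding(f0, cam_embeds, cam_id=1, lam=0.1)
```

Because every image from the same camera receives the same additive offset, features naturally group by camera, letting the network separate camera-specific color/brightness shifts from vehicle identity.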
- (2)
Cross-level Feature Extraction Module
The vehicle’s center point coordinates (x, y) are obtained through detection, and then the object Re-ID feature centered at (x, y) is extracted from the feature map to obtain the vehicle’s frame feature. Observation of the majority of vehicle frames shows that the most distinctive features (logo, headlights, etc.) are located near the center. As shown in
Figure 6, as the receptive field expands, the vehicle’s distinguishing characteristics increase, but so does the amount of background information, which contains more difficult-to-distinguish information. A novel form of progressive central pooling is introduced to process extracted features hierarchically.
To implement the preceding idea, local features must first be set hierarchically.
Figure 6 shows that pooling initially focuses on the center region and then proceeds through successively expanding levels. In the context of hierarchical modules, the information contained in the vehicle’s features grows from less to more, from concentrated to generalized, resulting in more generalized training. Assuming that the lower left corner is the origin $(0, 0)$ of the image, the circular center mask region $M_k$ of the $k$-th level can be expressed as follows:

$$M_k = \left\{ (x, y) \,\middle|\, (x - x_c)^2 + (y - y_c)^2 \le r_k^2 \right\},$$

where $(x_c, y_c)$ is the detected center point and $r_k$ is the radius of the $k$-th circle. The extracted mask features are then utilized to reproject the features, and the final Re-ID features are acquired.
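Generating the nested circular masks is straightforward with NumPy. The image size, center, and radii below are illustrative assumptions; only the nested-circle construction reflects the module:

```python
import numpy as np

def circular_mask(h, w, center, radius):
    """Boolean mask of pixels within `radius` of `center` (x, y), with the
    origin at the lower-left corner of the image."""
    ys, xs = np.mgrid[0:h, 0:w]
    y = (h - 1) - ys  # flip the row index so the origin is the lower-left corner
    return (xs - center[0]) ** 2 + (y - center[1]) ** 2 <= radius ** 2

# Nested masks for three hierarchical levels with increasing radii r_k
h, w = 32, 32
masks = [circular_mask(h, w, center=(16, 16), radius=r) for r in (4, 8, 16)]
```

Each level's mask strictly contains the previous one, so pooling over them progresses from the concentrated central region (logo, headlights) to more generalized context.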
- (3)
Re-ID Loss Calculation
The network is optimized by combining the global-feature OIM loss [
12] (Online Instance Matching loss) and triplet loss [
22]. OIM loss was proposed for the pedestrian search task. Its role is to store all the feature centers of labeled identities in a lookup table (LUT), $V = \{v_1, \ldots, v_L\} \in \mathbb{R}^{D \times L}$, which
represents $L$ $D$-dimensional feature vectors. In addition, a circular queue of $Q$ unlabeled identity features, $U = \{u_1, \ldots, u_Q\} \in \mathbb{R}^{D \times Q}$, is compiled. The following formula is used to calculate the probability of identifying a feature $x$ as the identity with ID $i$ based on the two structures presented above:

$$p_i = \frac{\exp\!\left(v_i^{T} x / \tau\right)}{\sum_{j=1}^{L} \exp\!\left(v_j^{T} x / \tau\right) + \sum_{k=1}^{Q} \exp\!\left(u_k^{T} x / \tau\right)},$$

where $T$ denotes the transpose and $\tau$ is a temperature hyperparameter. The objective of OIM is to minimize the expected negative log-likelihood:

$$L_{OIM} = -\mathrm{E}_x\!\left[\log p_t\right],$$

where $t$ is the target identity of $x$.
Then, the triplet loss function commonly used in Re-ID [
22] is added to distinguish fine-grained features between classes, shortening the distance to the corresponding features stored in the LUT and pushing features outside the LUT to a great distance. After detection, the candidate feature set is first obtained, and then the triplet set $\{f_a, f_p, f_n\}$ is constructed. Consequently, the triplet loss function $L_{tri}$ is as follows:

$$L_{tri} = \max\!\left(0,\; m + d(f_a, f_p) - d(f_a, f_n)\right),$$

where $f_a$ is the anchor feature itself, $f_p$ is the positive sample feature with the same ID as the anchor, $f_n$ is a feature with a different ID from the anchor, $d(\cdot, \cdot)$ is the feature distance, and $m$ is the margin. Finally, the Re-ID branch’s loss combines the two terms:

$$L_{ReID} = L_{OIM} + \beta L_{tri},$$

where $\beta$ is a balancing weight; its best value is determined experimentally.
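The margin-based triplet term can be illustrated in plain Python. The Euclidean distance and the margin value 0.3 are illustrative assumptions, not the paper's stated choices:

```python
import math

def euclid(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, pos, neg, margin=0.3):
    """Margin-based triplet loss: pull the positive (same ID) toward the
    anchor and push the negative (different ID) at least `margin` farther away."""
    return max(0.0, margin + euclid(anchor, pos) - euclid(anchor, neg))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so already well-separated triplets contribute no gradient.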