3.2.1. Introduction to the Parallel Processing Design
After training deep learning (DL) models, the locations of a priori dynamic targets in a scene can usually be obtained with semantic segmentation methods. Accordingly, in recent years, researchers have developed numerous DL-based methods that identify dynamic target regions and then filter dynamic features in visual odometry. However, these methods suffer from serious real-time performance problems.
Figure 3a shows that most existing methods add dynamic region filtering to the tracking thread and create an additional semantic thread to acquire the a priori dynamic region. Although the two threads nominally run in parallel, the tracking task is not truly processed in parallel; the actual workflow is as follows:
Step 1: In the tracking thread, wait for the semantic thread to return the segmentation result of the current frame.
Step 2: In the tracking thread, filter all feature points inside the a priori dynamic target mask area.
Step 3: In the tracking thread, compute the pose based on the remaining projection points.
In this approach, the semantic thread merely dispatches semantic segmentation requests, while the tracking thread must wait for the segmentation task to complete before filtering the dynamic feature points. Thus, despite the additional semantic thread, the pose calculation still waits for the segmentation task.
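The blocking behavior described above can be illustrated with a minimal sketch. All names here (`segment`, `filter_dynamic`, `estimate_pose`) are illustrative stand-ins, not the actual system's API:

```python
# Sketch of the sequential design: the tracking step blocks until the
# semantic thread returns the segmentation mask for the *same* frame.
from concurrent.futures import ThreadPoolExecutor

def segment(frame):
    # stand-in for a slow instance-segmentation network;
    # returns a priori dynamic regions as (x, y, w, h) boxes
    return {"mask": [(120, 40, 80, 200)]}

def filter_dynamic(features, mask):
    # drop feature points that fall inside any a priori dynamic region
    def inside(f, box):
        x, y, w, h = box
        return x <= f[0] < x + w and y <= f[1] < y + h
    return [f for f in features if not any(inside(f, b) for b in mask["mask"])]

def estimate_pose(features):
    return {"n_inliers": len(features)}  # stand-in for graph optimization

def track_sequential(frame, features, pool):
    mask = pool.submit(segment, frame).result()  # Step 1: wait for segmentation
    kept = filter_dynamic(features, mask)        # Step 2: filter dynamic features
    return estimate_pose(kept)                   # Step 3: pose from remaining points

pool = ThreadPoolExecutor(max_workers=1)
pose = track_sequential(frame=0, features=[(10, 10), (150, 100)], pool=pool)
print(pose)  # the feature at (150, 100) lies inside the dynamic box and is dropped
```

The key point is the `.result()` call: even though segmentation runs on another thread, tracking cannot proceed until it returns.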
Since semantic segmentation takes a long time, whereas feature extraction often takes very little, the actual time consumed by the tracking thread is the sum of the semantic segmentation time ($t_{seg}$), the dynamic feature filtering time ($t_{filter}$), and the pose estimation time ($t_{pose}$):

$$t_{track} = t_{seg} + t_{filter} + t_{pose} \quad (1)$$
Among the three, the semantic segmentation time is usually much longer than the sum of the latter two. Therefore, even if the task of sending segmentation requests were merged from the semantic thread into the tracking thread, it would have little impact on the overall system time. In this article, we call this method sequential processing. In contrast, our proposed SOLO-SLAM processes the instance segmentation task and the pose calculation task in parallel.
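The timing argument above can be made concrete with a small back-of-the-envelope calculation. The millisecond values below are assumed for illustration, not measurements from this work:

```python
# Illustrative per-frame timing (milliseconds); the numbers are assumed,
# not measured values from the paper.
t_seg, t_filter, t_pose = 40.0, 2.0, 8.0

# Sequential processing: tracking waits for segmentation of every frame.
t_sequential = t_seg + t_filter + t_pose

# Parallel processing: segmentation runs on key frames in a separate
# thread, so per-frame tracking only pays for filtering and pose.
t_parallel = t_filter + t_pose

print(t_sequential, t_parallel)  # 50.0 10.0
```

Whenever segmentation dominates (as it typically does for deep networks), removing it from the per-frame critical path yields most of the speedup.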
Our proposed SOLO-SLAM tracks based on the motion model after initialization is completed. The motion model [2] is implemented as follows:
Step 1: The pose transformation result obtained from the previous frame of the system is used as the initial value.
Step 2: Project map points to the current frame for matching.
Step 3: According to the matching results, graph optimization is performed based on the re-projection error to obtain the pose of the current frame.
As shown in Figure 2a, sequential processing filters all the projection points in the a priori dynamic region based on the obtained mask. When tracking with the motion model, the projection points are derived from projecting map points. In essence, filtering the projection points inside the a priori dynamic target mask can be transformed into identifying dynamic map points: when a map point is dynamic, filtering can be achieved without using its corresponding projection point.
Therefore, we specifically designed the parallel processing method shown in Figure 2b. Compared to the sequential processing approach, the pose calculation for the current frame can proceed directly, without waiting for the instance segmentation task. The workflow is as follows:
Step 1: In the tracking thread, filter the current-frame feature points matched to dynamic map points.
Step 2: In the tracking thread, calculate the pose from the remaining feature points.
The difference between our method and the sequential processing method illustrated in Figure 1 is that we do not need the semantic segmentation results for every frame. Therefore, the SLAM system can reduce the semantic segmentation time in the tracking thread.
In the above parallel processing steps, when a map point is determined to be dynamic, its corresponding projection matching results are filtered. The dynamic state of a map point is determined by its dynamic probability. Therefore, the problem is further transformed into updating the dynamic probability of map points in the semantic thread.
3.2.2. Dynamic State Update Implementation of Map Points
Algorithm 1 shows the specific process of the dynamic probability update in the semantic thread. Next, we present the details of the dynamic probability update and the dynamic status update of map points.
Dynamic probability update: In the semantic thread, we only perform instance segmentation on representative frames (key frames) and update the dynamic probability of map points according to the a priori dynamic target mask and the locations of the projection points. The specific dynamic probability update strategy [19] is as follows:

$$P_d^{k} = \eta\, S_d\, P_d^{k-1}, \qquad P_s^{k} = \eta\, S_s\, P_s^{k-1}$$

where $P_d^{k-1}$ and $P_s^{k-1}$ are the dynamic and static probabilities of the last update, respectively; $P_d^{k}$ and $P_s^{k}$ are the dynamic and static probabilities, respectively, after the update based on the instance segmentation of the current key frame; $S_d$ and $S_s$ are the dynamic and static observation probabilities derived from that segmentation; $\eta$ is the normalization coefficient; and $\alpha$ is the update factor, which we set to 0.7. The sum of $S_d$ and $S_s$ is 1. When the projection point lies inside the a priori dynamic region, $S_d$ is equal to $\alpha$; when the projection point lies outside the a priori dynamic region, $S_d$ is equal to $1-\alpha$.
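The update strategy above can be expressed compactly in code. This is a minimal sketch, assuming a Bayesian-style multiplicative update with the normalization coefficient chosen so the two probabilities sum to one; the function name is illustrative:

```python
def update_dynamic_probability(p_d_prev, p_s_prev, in_dynamic_region, alpha=0.7):
    """Update a map point's dynamic/static probabilities from one key-frame
    observation, following the update strategy described above (alpha = 0.7)."""
    s_d = alpha if in_dynamic_region else 1.0 - alpha  # observation term
    s_s = 1.0 - s_d                                    # s_d + s_s = 1
    d, s = s_d * p_d_prev, s_s * p_s_prev
    eta = 1.0 / (d + s)   # normalization so the updated probabilities sum to 1
    return eta * d, eta * s

# Repeated observations inside the dynamic mask push p_d toward 1.
p_d, p_s = 0.5, 0.5
for _ in range(3):
    p_d, p_s = update_dynamic_probability(p_d, p_s, in_dynamic_region=True)
print(round(p_d, 3))  # 0.927
```

A single observation is never decisive: with $\alpha = 0.7$, one key frame moves an uncertain point from 0.5 to 0.7, and only accumulated evidence drives the probability toward the decision threshold.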
In the updating process of the dynamic probability, the algorithm involves the projection of map points. The projection here refers to the projection of a point in the world coordinate system to the pixel coordinate system. The relevant formulas are as follows:

$$P_c = T_{cw}\, P_w, \qquad z_c\, p = K\, P_c$$

where $p$ corresponds to the projection point in the pixel coordinate system (in homogeneous coordinates, with depth $z_c$); $P_c$ corresponds to the point in the camera coordinate system; $P_w$ corresponds to the map point in the world coordinate system; $K$ corresponds to the internal parameter matrix of the camera; and $T_{cw}$ corresponds to the transformation matrix from the world coordinate system to the camera coordinate system.
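Numerically, the projection is two steps: a rigid transform into the camera frame, then a pinhole projection with division by depth. The intrinsic and extrinsic values below are made up for illustration:

```python
import numpy as np

# Numeric sketch of the projection above; K and T_cw are made-up values.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])  # camera internal parameter matrix
T_cw = np.eye(4)                       # world -> camera transformation
T_cw[:3, 3] = [0.1, 0.0, 0.0]          # small translation

P_w = np.array([0.0, 0.0, 2.0, 1.0])   # map point, homogeneous world coords
P_c = (T_cw @ P_w)[:3]                 # point in the camera coordinate system
p = K @ P_c                            # un-normalized pixel coordinates
u, v = p[:2] / p[2]                    # divide by depth z_c
print(u, v)  # 345.0 240.0
```

The resulting $(u, v)$ is what gets tested against the a priori dynamic mask during the probability update.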
Dynamic status update: As shown in Figure 4, we consider a map point dynamic when its dynamic probability is greater than 0.75. In the tracking thread, we perform a preliminary filtering operation on the projection matching results of dynamic map points.
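This preliminary filtering reduces to a threshold test on each matched map point's dynamic probability. The class and field names below are illustrative, not the system's actual data structures:

```python
# Preliminary filtering sketch: drop projection matches whose map point is
# considered dynamic (dynamic probability > 0.75, as stated above).
DYNAMIC_THRESHOLD = 0.75

class MapPoint:
    def __init__(self, p_dynamic):
        self.p_dynamic = p_dynamic  # dynamic probability member variable

def filter_matches(matches):
    """matches: list of (map_point, feature) pairs; keep non-dynamic points."""
    return [(mp, f) for mp, f in matches if mp.p_dynamic <= DYNAMIC_THRESHOLD]

matches = [(MapPoint(0.90), "f1"), (MapPoint(0.20), "f2"), (MapPoint(0.76), "f3")]
kept = filter_matches(matches)
print([f for _, f in kept])  # ['f2']
```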
In Algorithm 1, $\mathcal{M}$ is the collection of all map points with matching relationships within the field of view of the current key frame to be optimized; $\mathcal{P}$ is the set of projection positions of $\mathcal{M}$ on the current key frame; $P_{sem}$ and $P_{dyn}$ are member variables of the map points, corresponding to the semantic attribute probability and the dynamic probability, respectively; $\mathcal{R}$ is the collection of all semantic regions of the current key frame; $\mathcal{R}_d$ and $\mathcal{R}_s$ are the a priori dynamic and the a priori static regions of the current key frame, respectively; and $\mathcal{K}$ is the set of key frames awaiting instance segmentation.
As shown in Equation (1), by the design of parallel processing, the time consumed by the SLAM system for pose estimation completely excludes the instance segmentation time. However, updating the dynamic probability of map points depends entirely on the instance segmentation results of key frames, which raises the following two problems.
The map points detected by key frames do not cover all map points generated by normal frames. Therefore, the map point probability update is not comprehensive.
The processing speed of normal frames in the tracking thread is higher than that of key frames in the semantic thread. Therefore, there is a certain lag in updating the probability of map points.
Compared to the sequential execution method, the above two shortcomings make it difficult to completely filter the projection points from dynamic targets during normal frame processing. Therefore, according to the characteristics of the parallel processing method, we introduce a secondary filtering method based on regional dynamic degree and geometric constraints in Section 3.3.
Algorithm 1: Update dynamic probability and semantic properties of map points.