Abstract
Camera geo-localization from a monocular video is a fundamental task for video analysis and autonomous navigation. Although 3D reconstruction is a key technique to obtain camera poses, monocular 3D reconstruction in a large environment tends to result in the accumulation of errors in rotation, translation, and especially in scale: a problem known as scale drift. To overcome these errors, we propose a novel framework that integrates incremental structure from motion (SfM) and a scale drift correction method utilizing geo-tagged images, such as those provided by Google Street View. Our correction method begins by obtaining sparse 6-DoF correspondences between the reconstructed 3D map coordinate system and the world coordinate system, by using geo-tagged images. Then, it corrects scale drift by applying pose graph optimization over \(\mathrm {Sim}(3)\) constraints and bundle adjustment. Experimental evaluations on large-scale datasets show that the proposed framework not only sufficiently corrects scale drift, but also achieves accurate geo-localization in a kilometer-scale environment.
1 Introduction
Camera geo-localization from a monocular video in a kilometer-scale environment is an essential technology for AR, video analysis, and autonomous navigation. To achieve accurate geo-localization, 3D reconstruction from a video is a key technique. Incremental structure from motion (SfM) and visual simultaneous localization and mapping (visual SLAM) achieve large-scale 3D reconstructions by simultaneously localizing camera poses with six degrees of freedom (6-DoF) and reconstructing a 3D environment map [7, 19].
Unlike with a stereo camera, the absolute scale of the real world cannot be derived from a single observation with a monocular camera. Although it is possible to estimate an environment’s relative scale from a series of monocular observations, errors in the relative scale estimation accumulate over time, a problem referred to as scale drift [6, 22].
For accurate geo-localization unaffected by scale drift, previous studies have utilized prior information from a geographic information system (GIS). For example, point clouds, 3D models, building footprints, and road maps have proven effective for correcting reconstructed 3D maps [4, 5, 18, 23, 24]. However, these priors are only available in limited situations, e.g., in an area that has been observed in advance, or in an environment consisting of simply-shaped buildings. Therefore, there is a good chance that other GIS information can help to extend the area in which a 3D map can be corrected.
Hence, in this paper, motivated by the recent availability of massive public repositories of geo-tagged images taken all over the world, we propose a novel framework for correcting the scale drift of monocular 3D reconstruction by utilizing geo-tagged images, such as those in Google Street View [1], and achieve accurate camera geo-localization. Owing to the high coverage of Google Street View, our proposal is more scalable than those in previous studies.
The proposed framework integrates incremental SfM and a scale drift correction method utilizing geo-tagged images. Our correction method begins by computing 6-DoF correspondences between the reconstructed 3D map coordinate system and the world coordinate system, by using geo-tagged images. Owing to significant differences in illumination, viewpoint, and the environment resulting from differences in time, it tends to be difficult to acquire correspondences between video frames and geo-tagged images (Fig. 2). Therefore, a new correction method that can deal with the large scale drift of a 3D map using a limited number of correspondences is required. Bundle adjustment with constraints of global position information, which represents one of the most important correction methods, cannot be applied directly. This is because bundle adjustment tends to get stuck in a local minimum when starting from a 3D map including large errors [22]. Hence, the proposed correction method consists of two coarse-to-fine steps: pose graph optimization over Sim(3) constraints, and bundle adjustment. In these steps, our key idea is to extend the pose graph optimization method proposed for the loop closure technique of monocular SLAM [22], such that it incorporates the correspondences between the 3D map coordinate system and the world coordinate system. This step corrects the large errors, and enables bundle adjustment to obtain precise results. After implementing this framework, we conducted experiments to evaluate the proposal.
The contributions of this work are as follows. First, we propose a novel framework for camera geo-localization that can correct scale drift by utilizing geo-tagged images. Second, we extend the pose graph optimization approach to deal with scale drift using a limited number of correspondences to geo-tags. Finally, we validate the effectiveness of the proposal through experimental evaluations on kilometer-scale datasets.
2 Related Work
2.1 Monocular 3D Reconstruction
Incremental SfM and visual SLAM are important approaches to reconstructing 3D maps from monocular videos. Klein et al. proposed PTAM for small AR workspaces [11]. Mur-Artal et al. developed ORB-SLAM, which can reconstruct large-scale outdoor environments [19]. For accurate 3D reconstruction, the loop closure technique has commonly been employed in recent SLAM approaches [19, 22]. Loop closure deals with errors that accumulate between two camera poses taken at the same location, i.e., when the camera trajectory forms a loop. Lu and Milios [16] formulated this technique as a pose graph optimization problem, and Strasdat et al. [22] extended pose graph optimization to deal with scale drift for monocular visual SLAM. Loop closure can significantly improve 3D maps, but it is only effective if a loop exists in the video.
2.2 Geo-Registration of Reconstructions
Correcting reconstructed 3D maps by using geo-referenced information has been regarded as a geo-registration problem. Kaminsky et al. proposed a method that aligns 3D reconstructions to 2D aerial images [10]. Wendel et al. used an overhead digital surface model (DSM) for the geo-registration of 3D maps [26]. Similar to our work, Wang et al. used Google Street View geo-tagged images and a Google Earth 3D model for the geo-registration of reconstructed 3D maps [25]. However, because all these methods focus on estimating a best-fitting similarity transformation to geo-referenced information, they only correct the global scale in terms of 3D map correction.
Methods for geo-registration using non-linear transformations have also been proposed. To integrate GPS information, Lhuillier proposed incremental SfM using bundle adjustment with GPS constraints [14], and Rehder et al. formulated global pose estimation from stereo visual odometry, inertial measurements, and infrequent GPS readings as a 6-DoF pose graph optimization problem [20]. In terms of correcting camera poses using sparse global information, Rehder’s method is similar to our pose graph optimization approach. However, our 7-DoF pose graph optimization differs in that it focuses on the scale drift arising in monocular 3D reconstruction and utilizes geo-tagged images. In addition to GPS information, various kinds of reference data have been used for the non-linear geo-registration or geo-localization of a video, such as point clouds [5, 18], 3D models [23], building footprints [24], and road maps [4]. In this paper, we present a method that introduces geo-tagged images into the non-linear geo-registration of 3D maps.
3 Proposed Method
Figure 1 provides a flowchart of the proposed framework, which is roughly divided into three parts. The first part is incremental SfM, and is described in Sect. 3.2. The second part computes 6-DoF correspondences between the 3D map coordinate system and the world coordinate system (as defined below), by making use of geo-tagged images (Sect. 3.3). The third part then uses the correspondences to correct the scale drift of the 3D map, by incrementally applying pose graph optimization over Sim(3) constraints (Sect. 3.5) and bundle adjustment (Sect. 3.6). The initialization of the scale drift correction method is described in Sect. 3.4.
3.1 World Coordinate System
In this paper, the world coordinates are represented by 3D coordinates (x, y, z), where the xz-plane corresponds to the Universal Transverse Mercator (UTM) coordinate system, which is an orthogonal coordinate system using meters, and y corresponds to the height from the ground in meters. The UTM coordinates can be converted into latitude and longitude if necessary.
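For reference, the conversion between latitude/longitude and these world coordinates can be performed with a standard projection library. The sketch below uses pyproj and assumes UTM zone 30N (EPSG:32630, which covers Málaga); both the zone and the assignment of easting/northing to the x- and z-axes are illustrative assumptions rather than details from the paper.

```python
from pyproj import Transformer

# WGS84 lat/lon <-> UTM zone 30N (EPSG:32630). The zone is an assumption for
# illustration and must match the area in which the video was captured.
_to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32630", always_xy=True)
_to_wgs84 = Transformer.from_crs("EPSG:32630", "EPSG:4326", always_xy=True)

def geotag_to_world(lat, lon, height=0.0):
    """Map a geo-tag to world coordinates (x, y, z): xz = UTM plane, y = height."""
    easting, northing = _to_utm.transform(lon, lat)
    return easting, height, northing  # assumed mapping: x = easting, z = northing

def world_to_latlon(x, y, z):
    """Convert world coordinates back to latitude/longitude if necessary."""
    lon, lat = _to_wgs84.transform(x, z)
    return lat, lon
```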
3.2 Incremental SfM
For large-scale incremental SfM, we use ORB-SLAM [19] (without real-time constraints), which is one of the best-performing monocular SLAM systems. Frames that are important for 3D reconstruction are selected as keyframes by ORB-SLAM. Every time a new keyframe is selected, our correction method is performed, and the 3D map reconstructed up to that point is corrected. In the 3D reconstruction, we identify 3D map points and their corresponding 2D keypoints in the keyframes.
Our proposed framework does not depend on a particular 3D reconstruction method, and can be applied to other monocular 3D reconstruction methods, such as incremental SfM and feature-based visual SLAM.
Fig. 2. Examples of keypoint matches between keyframes (without blue squares) and geo-tagged images of Google Street View (with blue squares) after kVLD validation. Top: pairs of images where valid matches are found. Yellow lines denote kVLD graph structures, which are composed of inliers. Bottom: rejected pairs of images where a sufficient number of matches is not found because of differences in illumination, viewpoint, and environment, despite being taken at almost the same location. (Color figure online)
3.3 Obtaining Correspondences Between 3D Map and World Coordinates
Here, we describe the second part of the proposed method, which uses geo-tagged images to compute the 6-DoF correspondences between the 3D map coordinate system and the world coordinate system. For this purpose, we modify Agarwal’s method [2] to integrate it into ORB-SLAM. This part consists of the following four steps: geo-tagged image collection, similar geo-tagged image retrieval, keypoint matching, and geo-tagged image localization.
Geo-Tagged Image Collection. Google Street View [1] is a browsable street-level GIS, and one of the largest repositories of geo-tagged images (i.e., images and their associated geo-tags) worldwide. All images are high-resolution RGB panorama images with highly accurate world positions [12]. We make use of this data by converting each panorama image into eight rectilinear images facing eight horizontal directions, with the same field of view as our input video. Note that because each geo-tag has a position and rotation in the world coordinates, we can obtain the 6-DoF correspondences between the 3D map coordinate system and the world coordinate system once geo-tagged images are localized in the 3D map coordinate system.
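The conversion from an equirectangular panorama to rectilinear views is a standard image-warping operation; a minimal sketch is given below. The field of view, output size, and the eight 45° viewing directions are illustrative parameters, not values taken from the paper.

```python
import numpy as np
import cv2

def panorama_to_rectilinear(pano, yaw_deg, fov_deg=90.0, out_size=(1024, 768)):
    """Render one rectilinear (pinhole) view from an equirectangular panorama."""
    w, h = out_size
    pano_h, pano_w = pano.shape[:2]
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # Pixel grid of the virtual camera, centred on the principal point.
    u, v = np.meshgrid(np.arange(w) - w / 2.0, np.arange(h) - h / 2.0)
    x, y, z = u, v, np.full_like(u, f)  # ray directions (x right, y down, z forward)

    # Rotate the rays about the vertical axis by the requested yaw.
    yaw = np.radians(yaw_deg)
    x_r = x * np.cos(yaw) + z * np.sin(yaw)
    z_r = -x * np.sin(yaw) + z * np.cos(yaw)

    # Convert rays to longitude/latitude, then to panorama pixel coordinates.
    lon = np.arctan2(x_r, z_r)                   # [-pi, pi]
    lat = np.arctan2(y, np.hypot(x_r, z_r))      # [-pi/2, pi/2]
    map_x = (lon / (2 * np.pi) + 0.5) * pano_w
    map_y = (lat / np.pi + 0.5) * pano_h

    return cv2.remap(pano, map_x.astype(np.float32), map_y.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)

# Eight views at 45 degree increments, mirroring the eight directions in the text.
# views = [panorama_to_rectilinear(pano, yaw) for yaw in range(0, 360, 45)]
```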
Similar Geo-Tagged Image Retrieval. When a new keyframe is selected, we retrieve the top-k similar geo-tagged images. The retrieval system employs a bag-of-words approach based on SIFT descriptors [2].
Keypoint Matching. Given the pairs of keyframes and retrieved geo-tagged images, we detect ORB keypoints [21] in each pair and perform keypoint matching. Because matching between video frames and Google Street View images tends to include many outliers [17], we use a virtual line descriptor (kVLD) [15], which can reject outliers with a graph matching method even when the inlier rate is around 10%.
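kVLD itself is not available in standard libraries such as OpenCV; the sketch below only illustrates the preceding ORB detection and matching step with a conventional ratio test, as a stand-in for the pipeline up to the point where kVLD validation takes over.

```python
import cv2

def match_orb(img_query, img_ref, max_features=2000, ratio=0.8):
    """ORB keypoint matching with Lowe's ratio test (kVLD validation not included)."""
    orb = cv2.ORB_create(nfeatures=max_features)
    kp1, des1 = orb.detectAndCompute(img_query, None)
    kp2, des2 = orb.detectAndCompute(img_ref, None)
    if des1 is None or des2 is None:
        return []

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des1, des2, k=2)
    # Keep only matches whose best distance is clearly better than the second best.
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in good]
```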
Geo-Tagged Image Localization. To compute the 6-DoF correspondences, we first compute 3D-to-2D correspondences between 3D map points and their corresponding 2D keypoints in the geo-tagged images. In particular, we obtain these correspondences by combining the 2D keypoint matches (computed in the previous step) with the correspondences between 3D map points and their corresponding 2D keypoints in the keyframes (computed in the 3D reconstruction). Then, we obtain the 6-DoF camera poses of the geo-tagged images in the 3D map coordinate system by minimizing the re-projection errors of these 3D-to-2D correspondences using the Levenberg-Marquardt (LM) algorithm. Finally, we obtain the 6-DoF correspondences between the two coordinate systems by combining the estimated camera poses of the geo-tagged images with the 6-DoF world poses of the associated geo-tags.
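A compact way to realize this localization step is perspective-n-point estimation followed by LM refinement, as sketched below with OpenCV (version 4.1 or later is assumed for solvePnPRefineLM); the RANSAC stage and the specific thresholds are our additions for robustness, not details given in the paper.

```python
import cv2
import numpy as np

def localize_geotagged_image(map_points_3d, keypoints_2d, K, dist_coeffs=None):
    """Estimate the 6-DoF pose of a geo-tagged image in the 3D map frame.

    map_points_3d : (N, 3) 3D map points matched to the geo-tagged image.
    keypoints_2d  : (N, 2) corresponding 2D keypoints in the geo-tagged image.
    K             : 3x3 intrinsic matrix of the rectilinear geo-tagged view.
    """
    obj = np.asarray(map_points_3d, dtype=np.float64).reshape(-1, 1, 3)
    img = np.asarray(keypoints_2d, dtype=np.float64).reshape(-1, 1, 2)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj, img, K, dist_coeffs, reprojectionError=4.0, iterationsCount=200)
    if not ok or inliers is None:
        return None
    # Non-linear refinement on the inlier set (LM minimisation of the
    # re-projection error), matching the localization step described above.
    rvec, tvec = cv2.solvePnPRefineLM(obj[inliers[:, 0]], img[inliers[:, 0]],
                                      K, dist_coeffs, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec  # world-to-camera rotation/translation in map coordinates
```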
3.4 Initialization (INIT)
As initialization, two kinds of linear transformations are applied to the 3D map, because the positions and scales of the 3D map coordinates and the world coordinates are significantly different. Initialization is applied once, when the i-th geo-tagged image is localized. We set \(i = 4\).
Given the first to i-th correspondences, the first transformation assumes that all camera poses are approximately located in one plane, and rotates the 3D map so that this plane is aligned with the world xz-plane. The best-fitting plane is estimated by principal component analysis.
Next, we estimate the best-fitting transformation matrix given by Eq. 1, which transforms a point in the 3D map coordinate system to be closer to the corresponding point \(\mathbf {p}_{world,k}\) in the world coordinate system (points are denoted in homogeneous representation). Using the first to i-th correspondences, we estimate the four matrix parameters \([a, b, s, \theta ]\) by minimizing the sum of squared distances between the transformed points and their corresponding world points, using RANSAC [8] and the Levenberg-Marquardt (LM) algorithm.
The camera poses of the geo-tagged images, the keyframes, and the 3D map points can then be transformed using the resulting matrix.
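A plausible concrete form of Eq. 1 and of this cost, assuming a 4-DoF similarity acting in the world xz-plane (rotation \(\theta\) about the vertical axis, uniform scale s, and translation (a, b)), is given below; the symbol \(\mathbf{p}_{map,k}\) and the exact matrix layout are our notation rather than the paper's.

```latex
% Assumed reconstruction of Eq. 1 (4-DoF similarity about the vertical axis)
% and of the RANSAC/LM cost; p_{map,k} is our notation, not the paper's.
\[
\mathbf{p}_{world,k} \approx \mathbf{M}(a, b, s, \theta)\,\mathbf{p}_{map,k},
\qquad
\mathbf{M}(a, b, s, \theta) =
\begin{bmatrix}
 s\cos\theta & 0 & -s\sin\theta & a \\
 0 & s & 0 & 0 \\
 s\sin\theta & 0 & s\cos\theta & b \\
 0 & 0 & 0 & 1
\end{bmatrix},
\]
\[
E_{init}(a, b, s, \theta) = \sum_{k=1}^{i}
\bigl\| \mathbf{p}_{world,k} - \mathbf{M}(a, b, s, \theta)\,\mathbf{p}_{map,k} \bigr\|^{2}.
\]
```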
3.5 Pose Graph Optimization over Sim(3) Constraints (PGO)
We correct the 3D map, focusing on scale drift, by using the newest three 6-DoF correspondences. This correction is performed every time a new correspondence is found after the initialization. To this end, we propose a graph-based non-linear optimization method (pose graph optimization) on Lie manifolds, which simultaneously corrects the scale drift and aligns the 3D map with the world coordinates.
Notation. A 3D rigid body transformation \(\mathbf {G} \in \mathrm {SE}(3)\) and a 3D similarity transformation \(\mathbf {S} \in \mathrm {Sim}(3)\) are defined by Eq. 3, where \(\mathbf {R} \in \mathrm {SO}(3)\), \(\mathbf {t} \in \mathbb {R}^3\), and \(s \in \mathbb {R}^{+}\). Here, \(\mathrm {SO}(3)\), \(\mathrm {SE}(3)\), and \(\mathrm {Sim}(3)\) are Lie groups, and \(\mathfrak {so}(3)\), \(\mathfrak {se}(3)\), and \(\mathfrak {sim}(3)\) are their corresponding Lie algebras. A Lie group element is mapped to its Lie algebra by the logarithm map, and the inverse mapping is given by the exponential map. Each Lie algebra is represented by a vector of its coefficients. For example, \(\mathfrak {sim}(3)\) is represented as the seven-vector \(\varvec{\xi } = (\omega _{1},\omega _{2},\omega _{3}, \sigma , \nu _{1}, \nu _{2}, \nu _{3})^{\mathrm {T}} = (\varvec{\omega }, \sigma , \varvec{\nu })^{\mathrm {T}}\), and the exponential map \(\exp _{\mathrm {Sim}(3)}\) and logarithm map \(\log _{\mathrm {Sim}(3)}\) are defined as in Eqs. 4 and 5, respectively, where \(\mathbf {W}\) is a term similar to that in Rodrigues’ formula. Further details of \(\mathrm {Sim}(3)\) are given in [22].
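The matrix forms referred to as Eq. 3 are the standard ones; for completeness, they can be written as follows (our transcription, following the conventions of [22]).

```latex
% Standard matrix representations of SE(3) and Sim(3) elements (the content of
% Eq. 3), written in the conventions of Strasdat et al. [22].
\[
\mathbf{G} =
\begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\mathrm{T}} & 1 \end{bmatrix}
\in \mathrm{SE}(3),
\qquad
\mathbf{S} =
\begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\mathrm{T}} & 1 \end{bmatrix}
\in \mathrm{Sim}(3).
\]
```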
Proposed Pose Graph Optimization. In a general pose graph optimization approach [16, 20], camera poses and relative transformations between two camera poses are represented as elements of \(\mathrm {SE}(3)\). However, in our approach, 6-DoF camera poses and relative transformations are converted into 7-DoF camera poses, represented by elements of \(\mathrm {Sim}(3)\). This is achieved by leaving the rotation \(\mathbf {R}\) and translation \(\mathbf {t}\) of a camera pose unchanged, and setting the scale s to 1. The idea that camera poses and relative pose constraints can be handled in \(\mathrm {Sim}(3)\) was proposed by Strasdat et al. [22], for dealing with the scale drift problem in monocular SLAM. In this paper, we introduce 7-DoF pose graph optimization, which has previously only been used in the context of loop closure, to correct 3D reconstruction by utilizing sparse correspondences between two coordinate systems. Our pose graph contains two kinds of nodes and three kinds of edges, as follows (see Fig. 3):
- Node \(\mathbf {S}_{n} \in \mathrm {Sim}(3)\), where \(n \in C_{1}\): the camera pose of the \(n^{th}\) keyframe.
- Node \(\mathbf {S}_{m} \in \mathrm {Sim}(3)\), where \(m \in C_{2}\): the camera pose of the \(m^{th}\) geo-tagged image.
- Edge \(\mathbf {e}_{1_{i, j}}\), where \((i, j) \in C_{3}\): the relative pose constraint between the \(i^{th}\) and \(j^{th}\) keyframes (Eq. 6).
- Edge \(\mathbf {e}_{2_{k,l}}\), where \((k, l) \in C_{4}\): the relative pose constraint between the \(k^{th}\) keyframe and the \(l^{th}\) geo-tagged image (Eq. 7).
- Edge \(\mathbf {e}_{3_{m}}\), where \(m \in C_{2}\): the distance error between the position of the \(m^{th}\) geo-tagged image and the world position \(\mathbf {y}_{m}\) of the corresponding geo-tag (Eq. 8).
where \(\mathrm {trans}(\mathbf {S}) \equiv (\mathbf {S}_{1,4}, \mathbf {S}_{2,4}, \mathbf {S}_{3,4})^{\mathrm {T}}\). Here, N is the total number of keyframes, and M is the total number of geo-tagged images that have correspondences to keyframes. The set \(C_{1}\) contains all the keyframes positioned between the two keyframes associated with the newest and the third-newest correspondences. The set \(C_{2}\) contains the geo-tagged images of the newest three correspondences. The set \(C_{3}\) contains the pairs of keyframes that observe the same 3D map point in the 3D reconstruction, and \(C_{4}\) contains pairs of keyframes and their corresponding geo-tagged images. Finally, \(\Delta \mathbf {S}_{i,j}\) is the converted \(\mathrm {Sim}(3)\) relative transformation between \(\mathbf {S}_{i}\) and \(\mathbf {S}_{j}\), which is calculated before the optimization and remains fixed during the optimization.
Note that the nodes \(\mathbf {S}_{m}\), the edges \(\mathbf {e}_{2_{k,l}}\), and the edges \(\mathbf {e}_{3_{m}}\) are newly introduced here relative to Strasdat’s pose graph optimization. Minimizing \(\mathbf {e}_{1_{i, j}}\) and \(\mathbf {e}_{2_{k,l}}\) suppresses changes in the relative transformations between camera poses, with the exception of gradual scale changes. Minimizing \(\mathbf {e}_{3_{m}}\) keeps the positions of the geo-tagged images close to the positions obtained from the associated geo-tags. Our overall cost function \(E_{PGO}\) combines these three kinds of error terms.
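A plausible concrete form of the edge errors (Eqs. 6-8) and of \(E_{PGO}\), written in the style of the \(\mathrm{Sim}(3)\) relative-pose error of Strasdat et al. [22], is given below; the exact expressions, and where the weights \(\lambda_{1}, \lambda_{2}, \lambda_{3}\) of Sect. 4.1 enter, are our assumptions rather than the paper's wording.

```latex
% Assumed forms of Eqs. 6-8 and of E_PGO; a reconstruction following the
% Sim(3) relative-pose error of [22], not quoted from the paper.
\[
\mathbf{e}_{1_{i,j}} = \log_{\mathrm{Sim}(3)}\!\bigl(\Delta\mathbf{S}_{i,j}\,\mathbf{S}_{j}\,\mathbf{S}_{i}^{-1}\bigr),
\qquad
\mathbf{e}_{2_{k,l}} = \log_{\mathrm{Sim}(3)}\!\bigl(\Delta\mathbf{S}_{k,l}\,\mathbf{S}_{l}\,\mathbf{S}_{k}^{-1}\bigr),
\qquad
\mathbf{e}_{3_{m}} = \mathrm{trans}\!\bigl(\mathbf{S}_{m}^{-1}\bigr) - \mathbf{y}_{m},
\]
\[
E_{PGO} = \lambda_{1}\!\!\sum_{(i,j)\in C_{3}}\!\!\bigl\|\mathbf{e}_{1_{i,j}}\bigr\|^{2}
        + \lambda_{2}\!\!\sum_{(k,l)\in C_{4}}\!\!\bigl\|\mathbf{e}_{2_{k,l}}\bigr\|^{2}
        + \lambda_{3}\!\sum_{m\in C_{2}}\bigl\|\mathbf{e}_{3_{m}}\bigr\|^{2}.
\]
```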
The corrected camera poses of keyframes \(\mathbf {S}_{n}\) and geo-tagged images \(\mathbf {S}_{m}\) are obtained by minimizing the cost function \(E_{PGO}\) on Lie manifolds using the LM algorithm. Following this optimization, we also reflect this correction in the 3D map points, as in [22].
3.6 Bundle Adjustment (BA)
Following the pose graph optimization, we refine the 3D reconstruction by applying bundle adjustment with the constraints of the geo-tagged images. Bundle adjustment is a classic method that jointly refines the 3D structure and camera poses (and camera intrinsic parameters) by minimizing the total re-projection errors. Each re-projection error \(\mathbf {r}_{i,j}\) between the \(i^{th}\) 3D point and \(j^{th}\) camera is defined as:
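This error takes the standard pinhole form; a reconstruction using only the quantities defined in the following sentence is:

```latex
% Standard pinhole re-projection error and projection function, assembled from
% the definitions in the surrounding text; the notation is ours, not quoted verbatim.
\[
\mathbf{r}_{i,j} = \mathbf{x}_{i} - \pi\!\bigl(\mathbf{R}_{j}\mathbf{X}_{i} + \mathbf{t}_{j}\bigr),
\qquad
\pi(\mathbf{p}) =
\begin{bmatrix}
 f_{x}\,\mathbf{p}_{x}/\mathbf{p}_{z} + c_{x} \\[2pt]
 f_{y}\,\mathbf{p}_{y}/\mathbf{p}_{z} + c_{y}
\end{bmatrix}.
\]
```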
where \(\mathbf {X}_{i}\) is a 3D point and \(\mathbf {x}_{i}\) is the 2D observation of that 3D point; \(\mathbf {R}_{j}\) and \(\mathbf {t}_{j}\) are the rotation and translation of the \(j^{th}\) camera pose, respectively; \(\mathbf {p} = [\mathbf {p}_{x}, \mathbf {p}_{y}, \mathbf {p}_{z}]^{\mathrm {T}}\) is a 3D point; \(\pi (\cdot ): \mathbb {R}^3 \mapsto \mathbb {R}^2\) is the projection function; \((f_{x}, f_{y})\) is the focal length; and \((c_{x}, c_{y})\) is the center of projection.
To incorporate the global position information of geo-tagged images into bundle adjustment, we add a penalty term corresponding to the constraint for a geo-tagged image [14], and minimize the resulting total cost function.
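A plausible structure for this total cost, assuming a Huber-robustified re-projection term plus a position penalty on the geo-tagged images in the spirit of [14], is sketched below; \(\mathbf{T}_{m}\) denotes the \(\mathrm{SE}(3)\) pose associated with the \(m^{th}\) geo-tagged image and \(\lambda\) its weight, both our notation rather than the paper's.

```latex
% Assumed structure of the constrained bundle adjustment cost; a sketch in the
% spirit of Lhuillier [14], not the paper's exact equation.
\[
E_{BA} = \sum_{j \in C_{1}} \sum_{i \in C_{5}} \rho\!\bigl(\|\mathbf{r}_{i,j}\|^{2}\bigr)
       \;+\; \lambda \sum_{m \in C_{2}}
       \bigl\| \mathrm{trans}\!\bigl(\mathbf{T}_{m}^{-1}\bigr) - \mathbf{y}_{m} \bigr\|^{2}.
\]
```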

where \(\mathbf {T}\) is a camera pose of a keyframe represented as an element of \(\mathrm {SE}(3)\), \(\rho \) is the Huber robust cost function, \(C_{5}\) consists of map points observed by keyframes in \(C_{1}\), and \(C_{1}\) and \(C_{3}\) are defined in Sect. 3.5. Both the positions of 3D points and the camera poses of keyframes are optimized by minimizing the cost function on Lie manifolds using the LM algorithm. This step can potentially correct the 3D map more precisely when it starts from a reasonably good 3D map.
4 Experiments
In this section, we evaluate the proposed method on the Málaga dataset [3], using geo-tagged images obtained from Google Street View. We also investigate the performance of pose graph optimization and bundle adjustment using the KITTI Dataset [9].
4.1 Implementation
We obtained geo-tagged images from Google Street View at intervals of 5 m within the area where the video was captured. We set the cost function weights to \(\lambda _{1} = \lambda _{2} = 1.0 \times {10}^{5}\) and \(\lambda _{3} = 1.0\), and we employed the g2o library [13] for the implementation of the pose graph optimization and bundle adjustment.
4.2 Performance of the Proposed Method
To verify the practical effectiveness of the proposed method, we evaluate it on the Málaga dataset using geo-tagged images obtained from Google Street View.
The Málaga Stereo and Laser Urban Data Set (the Málaga dataset) [3], a large-scale video dataset covering areas where Street View imagery is available, is employed in this experiment. The Málaga dataset contains a driving video captured at a resolution of \(1024\,\times \,768\) at 20 fps in a Spanish urban area. We extracted two video clips (video 1 and video 2) from the video and used them for the evaluation. The two video clips contain no loops, and their trajectories are over 1 km long. All frames in the videos have GPS positions, but these are inaccurate and are sometimes confirmed to contain errors of more than 10 m. Because of these inaccuracies, we manually assigned ground truth positions to selected keyframes by referring to the videos, the inaccurate GPS positions, and the Google Street View 3D map. Figure 4 presents an example of inaccurate GPS data and our assigned ground truth. Because the ground truth positions are assigned by taking into account the lane from which the video was taken, the errors in the ground truth are considered to be within 2 m, which is sufficiently small for this experiment.
Fig. 4. The left figure shows an example of inaccurate GPS data (brown dots) and manually assigned ground truth positions (black crosses) on Google Maps. Although we use Google Maps to visualize the results clearly, the shapes of the roads are not sufficiently accurate. Our ground truth positions are always assigned in the appropriate lane of the road, as seen in the satellite image (white crosses in the center figure). The right figure shows an example of a video frame captured at the left of the two ground truth positions in the left figure. (Color figure online)
We evaluated the proposed method on the two videos by comparing the proposal with a baseline method that uses a similarity transformation (similar to a part of [25]). For the baseline method, we apply the initialization (INIT; described in Sect. 3.4) without the pose graph optimization or bundle adjustment. We did not employ a global similarity transformation as a baseline because it cannot be applied until the end of the whole 3D reconstruction.
To evaluate the proposed method quantitatively, we considered the average (Ave) and standard deviation (SD) of 2D distances between the ground truth positions and corresponding keyframe positions in the UTM coordinate system (in meters).
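The metric itself is straightforward; a minimal sketch (our helper, not from the paper) is:

```python
import numpy as np

def localization_error(gt_xy, est_xy):
    """Average and standard deviation of the 2D UTM distances (in meters).

    gt_xy, est_xy : (N, 2) arrays of ground-truth and estimated keyframe
                    positions in the UTM plane, in corresponding order.
    """
    d = np.linalg.norm(np.asarray(gt_xy) - np.asarray(est_xy), axis=1)
    return d.mean(), d.std()
```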
Table 1 presents the quantitative results, and Fig. 5 visualizes the results on Google Maps. As these results clearly show, the baseline accumulates scale errors, resulting in position errors of over 50 m. This is because the trajectories of these videos are long (greater than 1 km) and contain no loops. The proposed method sufficiently corrects scale drift and significantly improves the 3D map by using geo-tagged images. In (b) and (e) of the visualized results, the 3D map points corrected using the proposed method are projected onto Google Maps, and they are correctly aligned with the map. To visualize all the correspondences between the 3D map coordinate system and the world coordinate system used in the proposal, we also present the correspondences between the positions of geo-tagged images transformed by the initialization and the positions of the corresponding geo-tags. These correspondences are employed incrementally for the correction.
Fig. 5. Results of our proposed method visualized on Google Maps. Top: results on video 1. Bottom: results on video 2. In (a) and (d), red and blue dots (which appear as lines) indicate the positions of keyframes corrected using a global similarity transformation (INIT) and our proposed method (Ours), respectively. In (b) and (e), 3D map points corrected by our method are depicted as green dots. (c) and (f) show all of the employed correspondences between the positions of geo-tagged images transformed using a global similarity transformation (green crosses) and the positions of the corresponding geo-tags (red pin icons). The correspondences are applied incrementally for scale drift correction in our proposed method. (Color figure online)
4.3 Performance of PGO and BA
To investigate the performance of the pose graph optimization and the bundle adjustment in our proposed method, we evaluated different combinations of these components while varying the interval of geo-tagged images.
In the previous experiment, we found that the geo-tag location information of Google Street View and the manually assigned ground truths of the Málaga dataset occasionally had errors of several meters. In this experiment, we control the interval of geo-tagged images and use high-accuracy ground truths and geo-tags from the KITTI dataset. The odometry benchmark of the KITTI dataset [9] contains 11 stereo video sequences with precise location information obtained from an RTK-GPS/IMU system; unfortunately, Google Street View is not available in Germany, where this dataset was captured. The experiment was conducted on the two sequences that yield the largest and second-largest errors when applying ORB-SLAM: sequences 02 and 08 (containing 4660 and 4047 frames, respectively). The left images of the stereo videos are used as input, and pairs of right images and location information serve as geo-tagged images. All the location information associated with keyframes is used as the ground truth. In this experiment with the KITTI dataset, we can compare the performance of the correction methods accurately for the following reasons: the geo-tag information and ground truths are sufficiently precise (open-sky localization errors of RTK-GPS/IMU < 5 cm), and the errors in geo-tagged image localization are sufficiently small, because keypoint matching between corresponding left and right images performs very well.
For comparison, we present the results of the methods employing the initialization plus the pose graph optimization (INIT+PGO), and the initialization plus the bundle adjustment (INIT+BA). The correction method of INIT+BA is the same as that of [14], which is often used with GPS location information. Ours includes the initialization, the pose graph optimization, and the bundle adjustment. We varied the interval of geo-tagged images from 100 frames to 500 frames. For an equal initialization, we placed geo-tagged images at intervals of 50 frames from the first to the \(200^{th}\) frame.
Figure 6 visualizes the ground truth and the keyframe trajectories estimated by INIT+BA, INIT+PGO, and Ours when the interval of geo-tagged images is 300 frames. Table 2 presents the quantitative results of the experiment, where the values represent the average 2D errors between ground truth positions and the corresponding keyframe positions in the UTM coordinate system (in meters). Moreover, we report the errors of the global linear transformation on sequences 02 and 08, obtained by aligning the keyframe trajectory estimated by ORB-SLAM with the ground truths through a similarity transformation: 20.15 m and 25.12 m, respectively. The results show that bundle adjustment with geo-tag constraints, which is typically employed in the fusion of 3D reconstruction and GPS information [14], is not suitable when the interval of geo-tagged images is large. It can also be seen that Ours (the combination of initialization, pose graph optimization, and bundle adjustment) often estimates the keyframe positions more accurately than the other methods.
4.4 Scale Drift Correction
To confirm that scale drift is corrected incrementally, we visualize the change in the scale factor of the proposed method on KITTI sequences 02 and 08. Figure 7 shows that ORB-SLAM with only the initialization accumulates scale errors, whereas our method keeps the scale factor around 1.
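The text does not spell out how the plotted scale factor is computed; one plausible definition, used here for illustration only, is the ratio of estimated to ground-truth travelled distance over a short sliding window of keyframes.

```python
import numpy as np

def scale_factors(est_positions, gt_positions, window=10):
    """Per-keyframe scale factor as the ratio of estimated to ground-truth
    travelled distance over a sliding window (an illustrative definition)."""
    est = np.asarray(est_positions, dtype=float)
    gt = np.asarray(gt_positions, dtype=float)
    est_step = np.linalg.norm(np.diff(est, axis=0), axis=1)
    gt_step = np.linalg.norm(np.diff(gt, axis=0), axis=1)
    scales = []
    for k in range(len(est_step)):
        lo = max(0, k - window + 1)
        scales.append(est_step[lo:k + 1].sum() / max(gt_step[lo:k + 1].sum(), 1e-9))
    return np.array(scales)
```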
5 Conclusion
In this paper, we propose a novel framework for camera geo-localization that can correct scale drift by utilizing massive public repositories of geo-tagged images, such as those provided by Google Street View. By virtue of the expansion of such repositories, this framework can be applied in many countries around the world, without requiring the user to observe the environment in advance. The framework integrates incremental SfM and a scale drift correction method utilizing geo-tagged images. In the correction method, we first acquire sparse 6-DoF correspondences between the 3D map coordinate system and the world coordinate system by using geo-tagged images. Then, we apply pose graph optimization over \(\mathrm {Sim}(3)\) constraints and bundle adjustment. Our experiments on large-scale datasets show that the proposed framework sufficiently improves the 3D map by using geo-tagged images.
Note that our framework not only corrects the scale drift of 3D reconstruction, but also accurately geo-localizes a video. Our results are no less accurate than those of mobile devices (between 5 and 8.5 m) that use a cellular network and low-cost GPS [27], and those using monocular video and road network maps [4] (8.1 m in the KITTI sequence 02 and 45 m in sequence 08). This implies that geo-localization using geo-tagged images is sufficiently useful compared with methods using other GIS information.
References
Google street view. https://www.google.com/streetview/
Agarwal, P., Burgard, W., Spinello, L.: Metric localization using Google street view. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3111–3118. IEEE (2015)
Blanco-Claraco, J.L., Moreno-Dueñas, F.Á., González-Jiménez, J.: The málaga urban dataset: high-rate stereo and LiDAR in a realistic urban scenario. Int. J. Robot. Res. 33(2), 207–214 (2014)
Brubaker, M.A., Geiger, A., Urtasun, R.: Map-based probabilistic visual self-localization. IEEE Trans. Pattern Anal. Mach. Intell. 38(4), 652–665 (2016)
Caselitz, T., Steder, B., Ruhnke, M., Burgard, W.: Monocular camera localization in 3D LiDAR maps. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1926–1931. IEEE (2016)
Clemente, L.A., Davison, A.J., Reid, I.D., Neira, J., Tardós, J.D.: Mapping large loops with a single hand-held camera. In: Robotics: Science and Systems, vol. 2 (2007)
Engel, J., Schöps, T., Cremers, D.: LSD-SLAM: large-scale direct monocular SLAM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_54
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361 (2012)
Kaminsky, R.S., Snavely, N., Seitz, S.M., Szeliski, R.: Alignment of 3D point clouds to overhead images. In: CVPR Workshops 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 63–70 (2009)
Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, ISMAR 2007, pp. 225–234. IEEE (2007)
Klingner, B., Martin, D., Roseborough, J.: Street view motion-from-structure-from-motion. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 953–960 (2013)
Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g2o: a general framework for graph optimization. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 3607–3613. IEEE (2011)
Lhuillier, M.: Incremental fusion of structure-from-motion and GPS using constrained bundle adjustments. IEEE Trans. Pattern Anal. Mach. Intell. 34(12), 2489–2495 (2012)
Liu, Z., Marlet, R.: Virtual line descriptor and semi-local matching method for reliable feature correspondence. In: British Machine Vision Conference 2012, p. 16-1 (2012)
Lu, F., Milios, E.: Globally consistent range scan alignment for environment mapping. Auton. Robots 4(4), 333–349 (1997)
Majdik, A.L., Albers-Schoenberg, Y., Scaramuzza, D.: MAV urban localization from Google street view data. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3979–3986. IEEE (2013)
Middelberg, S., Sattler, T., Untzelmann, O., Kobbelt, L.: Scalable 6-DOF localization on mobile devices. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8690, pp. 268–283. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10605-2_18
Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015)
Rehder, J., Gupta, K., Nuske, S., Singh, S.: Global pose estimation with limited GPS and long range visual odometry. In: 2012 IEEE International Conference on Robotics and Automation (ICRA), pp. 627–633 (2012)
Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2564–2571. IEEE (2011)
Strasdat, H., Montiel, J., Davison, A.J.: Scale drift-aware large scale monocular SLAM. In: Robotics: Science and Systems VI (2010)
Tamaazousti, M., Gay-Bellile, V., Collette, S.N., Bourgeois, S., Dhome, M.: Nonlinear refinement of structure from motion reconstruction by taking advantage of a partial knowledge of the environment. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3073–3080. IEEE (2011)
Untzelmann, O., Sattler, T., Middelberg, S., Kobbelt, L.: A scalable collaborative online system for city reconstruction. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 644–651 (2013)
Wang, C.P., Wilson, K., Snavely, N.: Accurate georegistration of point clouds using geographic data. In: 2013 International Conference on 3DTV-Conference, pp. 33–40 (2013)
Wendel, A., Irschara, A., Bischof, H.: Automatic alignment of 3D reconstructions using a digital surface model. In: 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 29–36. IEEE (2011)
Zandbergen, P.A., Barbeau, S.J.: Positional accuracy of assisted GPS data from high-sensitivity GPS-enabled mobile phones. J. Navig. 64(3), 381–399 (2011)
Acknowledgement
This work was partially supported by VTEC laboratories Inc.