1 Introduction

The co-existence of humans and robots in unstructured environments plays a key role in bringing robotic assistance to the mainstream. We have seen unprecedented improvements in autonomous mobile robot navigation in the presence of static and dynamic obstacles, and this research has paved the way for industries such as autonomous driving to move from proof-of-concept to fieldable systems. While the progress has been significant, several challenges remain that provide unique opportunities to bring the reliability, safety, and performance of autonomous mobile platforms up to human-level performance.

Here, we describe an approach centered on making predictions of the environment as a critical element of mobile robot navigation in highly dynamic and cluttered spaces. Our underlying assumption is that predictions of spatial structures and of other agents in the scene play a vital role in how humans successfully navigate similarly dynamic and dense environments. Specifically, we believe prediction allows humans to plan trajectories that extend beyond their visual line-of-sight as well as to anticipate the movement of other humans in the environment. This suggests errors in prediction can capture ‘surprise’ events that enable humans to alter the risk of their own internal navigation policy.

Here, we summarize and extend the works presented in [1, 2] by providing mobile agents with a version of these predictive capabilities. We develop techniques using generative neural networks that leverage spatial structures as a prior for making map predictions that extend beyond the robot’s line-of-sight. We extend this to include map prediction with multiple hypotheses and an information-theoretic robot exploration algorithm. Finally, we extend prediction to include pedestrian motion and develop a risk-sensitive control algorithm for navigating around humans. Our objective is to show that prediction is a key enabler of more efficient and robust mobile navigation policies.

The remainder of this paper is organized as follows. In Sect. 2, we provide a brief overview of existing work in prediction and robot navigation in human crowds. In Sect. 3, we compare and contrast several approaches to occupancy map prediction. In Sect. 4, we extend approaches to occupancy map prediction to include generation of multiple hypotheses for efficient robot exploration. In Sect. 5, we extend prediction to include pedestrians for adaptive crowd navigation.

2 Related Work

2.1 Deep Learning for Generative Models

Deep neural networks have been used in a number of promising ways to achieve high performance in domains such as vision, speech, and more recently robotic manipulation [3, 4]. Oh et al. used feedforward and recurrent neural networks to perform action-conditional video prediction on Atari games with promising results [5]. Deep networks have also been used in image completion, e.g., by Ulyanov et al. [6]. In addition, generative adversarial networks (GANs) have emerged as a promising method for image generation [7]. Isola et al. proposed an approach for training conditional GANs that translate one image into another [8].

2.2 Deep Learning for Navigation

More recently, several papers have described approaches that combine elements of deep neural networks with autonomous navigation. These include using deep neural networks for model predictive control [3]. Tamar et al. proposed Value Iteration Networks, which embed a planner inside a deep neural network architecture [9]. Several papers investigate the use of deep reinforcement learning to develop collision-free planning without the need for an internal map; however, these are still restricted by the sensor’s FOV [10, 11].

2.3 Crowd Navigation

Crowd navigation has been studied extensively in recent literature (e.g., [12]) and can be classified into three general areas: (1) algorithms that react to moving obstacles in real time, (2) trajectory based approaches that plan paths by anticipating future motion of obstacles, and (3) reinforcement learning based approaches that learn a policy to navigate in crowded environments. Reaction based methods include works such as reciprocal velocity obstacles (RVO) [13] and optimal reciprocal collision avoidance (ORCA) [14]. Trajectory based approaches, such as [15, 16], explicitly propagate estimates of future motion over time and perform trajectory optimization on those future states for collision avoidance. Additionally, several recent works use variations of reinforcement learning to learn policies capable of crowd navigation (e.g., [17,18,19]). Everett et al. [18] developed a decentralized approach to multiagent collision avoidance using a value network that estimates the time to goal for a given state transition. Chen et al. [19] further extended this work by adding an attention mechanism and a novel pooling method to handle a variable number of humans in the scene. Kahn et al. [20] investigate adaptive navigation policies based on uncertainty; however, they only considered environment uncertainty with static obstacles and not navigation in the presence of pedestrians.

3 Occupancy Map Prediction

A major limitation of existing robot navigation algorithms is the limited field-of-view of traditional sensors. This is in contrast with biological systems, which routinely make predictions of their environment. For example, imagine walking through a hallway or corridor. Often, we can predict the space around a corner because we have seen similar hallways and corridors in the past. This prediction allows us to generate trajectories that extend beyond our line-of-sight. In this section, we describe an approach that allows robot systems to generate predicted occupancy maps that extend beyond the sensor’s line-of-sight. We evaluate several neural network architectures, loss functions, and datasets to determine which approaches generate the most accurate predicted occupancy maps. Formally, we are attempting to learn a function that maps an input occupancy map to an expanded occupancy map that extends beyond the FOV of the sensor:

$$ f : x \rightarrow y_i $$

where \(x\) represents the state, in this case the input occupancy map as an image, \(y_i\) represents the output occupancy map, and \(i \in \mathbb{R}\) represents the percent increase of the expanded occupancy map. Components of the function \(f\) include an encoding function \(f_{enc}(x) \rightarrow h \in \mathcal{H}\), which maps the state space (input occupancy maps) to a hidden state, and a decoding function \(f_{dec}(h) \rightarrow y_i\), which maps the hidden state to an expanded, predicted occupancy map.

In our experiments, we compare several different neural network architectures including:

  1. a feedforward network based on a U-Net architecture (unet_ff)

  2. a feedforward network based on the ResNet architecture (resnet_ff)

  3. a GAN using the feedforward network from (1) as the generative network (gan)

3.1 U-Net Feedforward Model

The U-Net feedforward model is based on the network architecture defined by Ronneberger et al. [21] and uses skip connections that directly connect layers \(i\) and \(n-i\), enabling the network to bypass the bottleneck of the downsampling layers in order to perform an identity operation. Similar to [8], the encoder network consists of 8 convolution, batch normalization, and ReLU layers, where each convolution uses a \(4 \times 4\) filter with a stride of 2. The number of filters for the 8 layers in the encoder network is: (64, 128, 256, 512, 512, 512, 512, 512). The decoder network consists of 8 upsampling layers with the following numbers of filters: (512, 1024, 1024, 1024, 1024, 512, 256, 128).
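To make the encoder/decoder structure concrete, the sketch below shows a two-level PyTorch version of this pattern with a single skip connection; the class and variable names and toy channel sizes are ours for illustration, not the exact unet_ff implementation.

```python
import torch
import torch.nn as nn

def enc_block(c_in, c_out):
    # 4x4 convolution with stride 2 halves the spatial resolution
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

def dec_block(c_in, c_out):
    # transposed convolution doubles the spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net illustrating the skip connection between
    layers i and n-i; the full model uses 8 levels on each side."""
    def __init__(self):
        super().__init__()
        self.e1, self.e2 = enc_block(1, 64), enc_block(64, 128)
        self.d1 = dec_block(128, 64)
        self.d2 = dec_block(128, 1)  # input channels doubled by the skip concat

    def forward(self, x):
        h1 = self.e1(x)   # kept for the skip connection
        h2 = self.e2(h1)  # bottleneck
        u1 = self.d1(h2)
        # skip connection: concatenate the encoder feature map, bypassing
        # the bottleneck (a real model would end in Tanh/Sigmoid)
        return self.d2(torch.cat([u1, h1], dim=1))
```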

3.2 ResNet Feedforward Model

The ResNet feedforward model is based on the work by Johnson et al. [22], which consists of 2 convolution layers with stride 2, 9 residual blocks as defined by [23], and two deconvolution layers with a stride of \( \frac{1}{2} \). This network was selected primarily for its ability to learn identity functions, which is key to image translation, as well as for its success in image-to-image translation demonstrated by the CycleGAN network [24].
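The residual pattern that makes identity functions easy to learn is summarized below; this is a minimal sketch of the block pattern from [23], not the exact block configuration used in [22].

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: output = x + F(x). When F(x) is near zero the
    block reduces to the identity, which is what makes identity mappings
    easy to learn."""
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.body(x)  # identity shortcut
```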

3.3 GAN Model

The GAN network is based on the pix2pix architecture [8], which has demonstrated impressive results in general-purpose image translation, including generating street scenes, building facades, and maps from aerial images. This network uses the U-Net feedforward model defined in Sect. 3.1 as the generator and a 6-layer discriminator network with filter sizes: (64, 128, 256, 512, 512, 512).

Fig. 1. Qualitative sample map predictions using the unet_ff architecture compared to the ground truth map.

3.4 Occupancy Map Prediction Experiment and Results

Our approach to testing occupancy map prediction using the networks defined above first involved generating a dataset and then performing qualitative and quantitative analysis of the predicted images compared to the ground truth.

A dataset of approximately 6000 images of occupancy map subsets was created by simulating a non-holonomic robot moving through a two-dimensional map with a planar LIDAR sensor in C++ with ROS and the OctoMap library [25]. Two maps were created in Solidworks with path widths varying between 3.5 m and 10 m. These were converted into OctoMap’s binary tree format using binvox [26, 27] followed by OctoMap’s binvox2bt tool. The result is an occupancy map with all unoccupied space set as free. We require the space outside of the walls to be marked as unknown to provide a ground truth for our estimated maps; these ground truth maps were created by fully exploring the original occupancy maps (Fig. 1).

The robot is modeled as a Dubins car, with state vector \(\mathbf{x} = [x, y, \theta]\) and inputs \(\mathbf{u} = [v, \dot{\theta}]\), where \((x, y)\) is the robot’s position, \(v\) is the velocity, and \(\theta\) and \(\dot{\theta}\) are the heading angle and angular velocity, respectively. For simplicity, the robot is constrained to move at a fixed forward velocity of 0.5 m/s. A planar LIDAR sensor with a scanning area of 270\(^\circ\) and a range of 20 m is used to simulate returns given the robot’s current pose against the ground truth map. These simulated returns are used to create the “estimated” occupancy map. Path planning is done with nonlinear model-predictive control and direct transcription at 10 Hz. At each time step, a subset of the maps (both the estimated and ground truth) is saved; a 5 m by 5 m square centered around the robot’s pose was chosen with a resolution of 0.05 m. The robot’s current state and action space are also logged at each time step. Occupancy maps are expanded over time, so our simulation performs a continuous trajectory and the dataset is built consecutively instead of by randomly sampling throughout a map. A total of six trajectories were simulated: four paths were used for training data (5221 images) and two were used as a test set (1090 images). Ground truth datasets of the expanded occupancy maps were also generated. These expanded occupancy maps range from 1.10x to 2.00x expansion in increments of 0.10x; e.g., a 2.00x expansion results in a 10 m by 10 m square subset centered around the robot.
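For reference, the Dubins-car kinematics reduce to a simple forward-Euler update; the sketch below is ours, written in Python rather than the C++/ROS simulator.

```python
import numpy as np

def dubins_step(state, u, dt=0.1):
    """Propagate the state [x, y, theta] one step under input
    u = [v, theta_dot]; in our setup v is fixed at 0.5 m/s."""
    x, y, theta = state
    v, theta_dot = u
    return np.array([x + v * np.cos(theta) * dt,
                     y + v * np.sin(theta) * dt,
                     theta + theta_dot * dt])

# one step at 10 Hz with a gentle left turn
state = dubins_step(np.array([0.0, 0.0, 0.0]), [0.5, 0.1])
```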

Table 1. SSIM analysis for simulation data

We trained each variant of the neural network from scratch on the expanded ground truth occupancy maps for 200 epochs with a batch size of 1. A total of 15 training sessions were performed to evaluate each of the three neural network architectures across five expansion increases (1.10x, 1.30x, 1.50x, 1.70x, and 2.00x). We use the Adam optimizer with an initial learning rate of 0.0002 and momentum parameters \(\beta_1 = 0.5, \beta_2 = 0.999\). In the feedforward models, L1 loss was used as proposed in PatchGAN [8], and in the GAN model an L1 + discriminator loss was used. The decoder layers of the network used a dropout rate of 0.50, and weights were initialized from a Normal distribution (\(\mu = 0, \sigma = 0.2\)). All models were implemented using PyTorch [28].
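A minimal PyTorch training loop matching these settings might look like the following sketch; the random tensors stand in for (input map, expanded ground-truth map) pairs, and TinyUNet refers to the illustrative model above rather than the full unet_ff.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder data standing in for (input map, expanded map) image pairs
data = TensorDataset(torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64))
loader = DataLoader(data, batch_size=1)

model = TinyUNet()  # stand-in for unet_ff; resnet_ff and gan are analogous
opt = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))
l1 = torch.nn.L1Loss()

for epoch in range(200):
    for x, y in loader:
        opt.zero_grad()
        loss = l1(model(x), y)  # the GAN variant adds a discriminator term
        loss.backward()
        opt.step()
```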

Table 1 provides the structural similarity index metric (SSIM) for each of the networks. Based on the SSIM metric, the U-Net feedforward model outperforms the other networks at 1.10x and 1.30x expansion, confirming the qualitative assessment. The quality of the prediction generally decreases as the expansion percentage increases, and at expansions of 1.50x and above the three networks achieve similar performance. This leads us to believe skip connections across the layers of an autoencoder network play a critical role in generating accurate predictions of future maps.
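SSIM between a predicted and ground-truth map can be computed directly, e.g., with scikit-image; the arrays below are placeholders.

```python
import numpy as np
from skimage.metrics import structural_similarity

pred = np.random.rand(100, 100)   # predicted occupancy map (placeholder)
truth = np.random.rand(100, 100)  # expanded ground-truth map (placeholder)
score = structural_similarity(pred, truth, data_range=1.0)  # 1.0 = identical
```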

4 Uncertainty-Aware Occupancy Map Prediction

In the previous section, we explored properties of the neural network architectures, loss functions, and datasets used to validate the approach of occupancy map prediction. We now extend our neural network architectures to generate multiple hypotheses in order to measure the uncertainty of the predictions [2]. To capture this uncertainty, we modify the single-hypothesis network to output multiple hypothesis predictions by branching \(N\) heads, each capable of making its own prediction. Inspired by [29], the loss function becomes a weighted sum in which the loss of the best-performing head is weighted by \(1-\epsilon\) and the losses of the remaining heads are each weighted by \(\epsilon / (N-1)\). In our experiments, we set \(\epsilon = 0.05\).
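A sketch of this weighted loss follows; it is our transcription of the scheme above, and the function names are ours.

```python
import torch

def multi_hypothesis_loss(head_losses, eps=0.05):
    """Weight the best-performing head's loss by 1 - eps and share
    eps equally across the remaining N - 1 heads."""
    losses = torch.stack(head_losses)  # per-head scalar losses, shape (N,)
    n = losses.numel()
    weights = torch.full_like(losses, eps / (n - 1))
    weights[losses.argmin()] = 1.0 - eps
    return (weights * losses).sum()

# e.g., with an L1 loss per head:
# loss = multi_hypothesis_loss([l1(h, y) for h in heads])
```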

Fig. 2. (a) Input occupancy map based on the lidar’s FOV, (b) prediction using a single hypothesis, resulting in a blurred image, (c) and (d) two hypotheses generated using multiple hypothesis prediction, (e) image representing the variance between the multiple hypotheses.

We show that measuring the uncertainty allows us to develop an information-theoretic exploration policy that drives the robot towards regions of high uncertainty. We modify the unet_ff architecture described in Sect. 3 to generate multiple hypotheses. By forming multiple hypotheses, we can compute the differences across the predicted images to estimate the regions of highest uncertainty. We first evaluate the benefits of making multiple predictions rather than single predictions using real-world datasets provided by Google Cartographer [30], as illustrated in Fig. 2. Both the average structural similarity index and the peak signal-to-noise ratio improved when generating multiple predictions (Table 2).

Table 2. Quantitative analysis with multiple hypotheses prediction

We next evaluate our information-theoretic exploration policy. We use the Gazebo simulation environment, where the goal of the robot is to efficiently explore the environment to generate a map. Our approach is to drive the robot towards the regions of highest uncertainty and compare against existing map exploration algorithms. As an example, in Fig. 2(a), we show the input to our map prediction algorithm. In (b), we show that making only a single prediction when multiple futures could exist results in a blurry image. In (c) and (d), we show that making multiple predictions allows our algorithm to represent the possible futures as distinct hypotheses. In (e), we show the difference across the hypotheses, representing regions of high uncertainty.

We then cluster the image representing regions of high uncertainty and compute the region with the highest variance across the hypotheses. We select this region as the next area to explore and compare with several alternative methods of robot exploration. As seen in Fig. 3, and further summarized in Table 3, we show a significant reduction in the overall path length needed to explore the space compared to the alternative approaches. This shows uncertainty-aware prediction of occupied space can play a critical role in efficient exploration of unknown spaces.
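The core of this exploration heuristic can be sketched as follows; for brevity the sketch skips the clustering step described above and steers directly towards the highest-variance cell.

```python
import numpy as np

def uncertainty_map(hypotheses):
    """Per-cell variance across the N predicted occupancy maps."""
    return np.stack(hypotheses, axis=0).var(axis=0)  # (H, W)

def next_exploration_target(hypotheses):
    """Pick the cell of highest variance as the next exploration goal
    (the full method clusters the variance image first)."""
    var = uncertainty_map(hypotheses)
    return np.unravel_index(var.argmax(), var.shape)  # (row, col)
```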

5 Adaptive Crowd Navigation

In this section, we extend our uncertainty-aware occupancy map prediction to include prediction of humans in the environment. Our underlying assumption is that enabling prediction of pedestrians can play a critical role in adaptive crowd navigation policies that will ultimately lead to fewer collisions. As illustrated in Fig. 4, our goal is to show errors in pedestrian prediction can serve as a measure of ‘surprise’ (i.e. policy uncertainty) to help detect novel pedestrian motion not seen during training. By detecting novel pedestrian motion, we can alter the risk of the robot’s control policy thereby reducing the probability of collision.

Fig. 3. Trajectory of the robot during exploration using (a) Frontier Exploration [31], (b) Information-Theoretic Bayesian Optimization [32, 33], (c) Frontier Exploration using Distance [34], (d) our method.

Table 3. Total path length

Fig. 4. Adaptive crowd navigation policy that uses pedestrian intent and prediction error to adjust the risk profile of a control policy.

5.1 Pedestrian Prediction

Our pedestrian prediction algorithm is summarized in Fig. 5. This architecture consists of a generator and a discriminator network. The generator network includes a recurrent encoder network, a variational autoencoder, a recurrent decoder network, and an intent predictor. The discriminator network classifies sample trajectories as either real or fake (i.e., produced by the generator). Our main contribution lies in combining a probabilistic interpretation of the desired goal with embeddings learned from past trajectories. This approach allows us to make predictions of future pedestrian trajectories that meet or exceed state-of-the-art algorithms. The details of our pedestrian prediction approach are presented in [1].

5.2 Crowd Navigation

For our adaptive crowd navigation algorithm, we leverage the CrowdSim simulation environment provided by [19]. CrowdSim supports a flexible interface that enables learning robot policies through simulated pedestrian motion. The pedestrian controller supports an optimal reciprocal collision avoidance (ORCA) model to simulate pedestrian motion [14]. This model is parameterized by the preferred velocity, the maximum distance and time over which to account for neighboring agents’ behavior, the pedestrian’s radius, and the maximum velocity. In addition, CrowdSim provides an OpenAI Gym-like environment [35] for experimenting with reinforcement learning based policies that control a robot’s actions to reach a target goal while avoiding obstacles.

Fig. 5. Our pedestrian prediction network architecture. The generator network consists of a recurrent encoder network, a variational autoencoder, an intent prediction module, and a recurrent decoder network. The discriminator network consists of a recurrent encoder network to distinguish between real and fake trajectories.

We use a similar state space as described by [19, 36] which consists of the following parameters with respect to the robot position as the origin and the x-axis pointing towards the goal: distance from robot position to goal, robot’s preferred velocity, actual velocity and radius. For each pedestrian, the state includes position, velocity, radius, and distance between pedestrian and robot.

The action space consists of 3 discrete speeds and 6 discrete rotation angles for a total of 18 actions. The reward function encourages the robot to successfully reach the target while avoiding collisions with other pedestrians, and is defined similar to [18, 19] as:

$$ R(\mathbf{s}, \mathbf{a}) = \begin{cases} -0.25 & \text{if } d_{min} < 0 \\ -0.1 - d_{min}/2 & \text{else if } d_{min} < 0.2 \\ 1 & \text{else if robot reached goal} \\ 0 & \text{otherwise} \end{cases} $$

where \(d_{min}\) is the minimum distance separating the robot and the humans during the previous timestep.
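For clarity, this reward function transcribes directly to code; in the sketch below, `reached_goal` is a hypothetical flag assumed to come from the simulator.

```python
def reward(d_min, reached_goal):
    """Reward from the case definition above; d_min is the minimum
    robot-human separation over the previous timestep, in metres."""
    if d_min < 0:        # collision
        return -0.25
    elif d_min < 0.2:    # uncomfortably close
        return -0.1 - d_min / 2
    elif reached_goal:
        return 1.0
    return 0.0
```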

To implement our adaptive crowd navigation policy, we train a risk-averse and an aggressive policy. The aggressive policy uses a preferred velocity of 2.0 m/s, and the risk-averse policy is limited to 1.0 m/s. Our first set of experiments focuses on the performance of the CADRL [36] and SARL [19] algorithms with pedestrian motion that is within the training distribution and then outside of it. We trained both algorithms with preferred robot velocities set to 2.0 m/s and a static pedestrian motion model. The starting and ending positions of the pedestrians were sampled uniformly inside a square of width 10 m. We followed this experiment with a series of tests in which the adaptive control policy uses various methods of detecting novel pedestrian behaviors.

Our first evaluation compares SARL and CADRL with a one-class SVM with a radial basis function kernel [37]. We train the one-class SVM based on the fixed pedestrian motion profile and evaluate its ability to detect novel distributions of pedestrian motion data. Our subsequent tests use deep learning based approaches for novelty detection, including Social GAN [38] and our intent-aware pedestrian prediction algorithm. We trained both deep learning algorithms with the same fixed pedestrian motion profile that was used when training the original RL policies. We then tested pedestrian prediction with a changing distribution of pedestrian motion, described in Table 4, while allowing the robot to navigate towards the goal. We compute an estimate of novelty by thresholding the prediction error, measured using the final displacement error (FDE), by a value \(\alpha\). If the error exceeds \(\alpha\), the policy switches from aggressive to risk-averse behavior with the goal of avoiding collisions. The value of \(\alpha\) was chosen by computing the mean and standard deviation of the FDE on the training set; in our experiments, \(\alpha\) was set to 3 standard deviations above the training-set mean to exclude outliers.
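The novelty test itself is a simple threshold on the FDE; the following sketch follows the 3-sigma rule described above, with function names of our own choosing.

```python
import numpy as np

def fde(pred_traj, true_traj):
    """Final displacement error between a predicted and an observed
    trajectory, each an array of (x, y) points."""
    return np.linalg.norm(np.asarray(pred_traj)[-1] - np.asarray(true_traj)[-1])

def novelty_threshold(train_fdes, k=3.0):
    """alpha = mean + k * std of the training-set FDEs (k = 3 here)."""
    return np.mean(train_fdes) + k * np.std(train_fdes)

def select_policy(error, alpha):
    # fall back to the slower, risk-averse policy on novel motion
    return "risk_averse" if error > alpha else "aggressive"
```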

Table 4. ORCA model parameters

Our metrics for comparing the relative performance of each algorithm consist of the number of successful trials, the number of collisions, the average navigation time, the discomfort level, and the average reward. The discomfort level is defined as the frequency with which the separating distance falls below the desired separation distance, in this case 0.2 m. The results after running 500 test cases are reported in Table 5. We use the notation \(method\text{-}p\), where method describes the method used (i.e., SVM, SARL, etc.) and p is the number of pedestrians in the scene.

The first two rows of Table 5 evaluate the CADRL and SARL algorithms using pedestrian motion that is within the training distribution. These algorithms generally performed well, with SARL resulting in fewer collisions. Rows 3 and 4 then show the results of CADRL and SARL in the presence of out-of-distribution, novel pedestrian motion, for which both algorithms have significantly higher collision and discomfort rates. The one-class SVM used to detect novel pedestrian motion showed only a modest improvement, reducing the number of collisions by 2 compared to the SARL algorithm.

The subsequent rows of Table 5 compare deep neural network approaches to detecting novel pedestrian motion. Specifically, we compare our intent-aware approach with Social GAN. Using Social GAN improved on the one-class SVM, further reducing the number of collisions by 20. Our intent-aware pedestrian prediction algorithm provided the best performance across almost all of the metrics: we further reduced the number of collisions by 5 compared to Social GAN, and by 30 overall compared to the non-adaptive SARL policy. In addition to reducing the number of collisions, we also reduce the overall discomfort rate and increase the overall reward. Further, we show that these benefits scale as the number of pedestrians in the scene increases from 5 to 20.

Table 5. Quantitative analysis of collision avoidance

6 Hardware Prototype

We further assess the real-world applicability of our algorithms by evaluating them in a proof-of-concept physical test environment (Fig. 6). Our physical test bed consists of the MIT Rapid Autonomous Complex-Environment Competing Ackermann-steering Robot (RACECAR) navigating through several pedestrians. The platform consists of a Hokuyo UST-10LX LiDAR, a Sparkfun IMU, a Traxxas 1/10-scale chassis, and an onboard NVidia Jetson processor with GPU.

We use a leg detector algorithm based on an SVM classifier to detect pedestrians with respect to the camera frame and a custom SLAM library to generate maps and estimate the robot’s position. Using the pose of the robot with respect to the global frame and of the pedestrians with respect to the robot, we transform the pedestrians into the global coordinate frame, allowing us to run our trained navigation policies directly from simulation without requiring further training on the physical hardware. As part of this hardware demonstration, we verified that the trained policy can transfer from simulation to the physical robot, execute in real time, operate on noisy sensors, and successfully reach its goal while navigating around pedestrians to avoid collisions.
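This pedestrian-to-global transform is a standard 2D rigid-body transform; a sketch (ours) given the SLAM pose estimate:

```python
import numpy as np

def pedestrian_to_global(robot_pose, ped_xy_robot):
    """Map a pedestrian position from the robot frame into the global
    frame, with robot_pose = (x, y, theta) estimated by SLAM."""
    x, y, theta = robot_pose
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])  # rotation by the robot heading
    return R @ np.asarray(ped_xy_robot) + np.array([x, y])
```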

Fig. 6. Hardware demonstration using the MIT RACECAR navigating around moving pedestrians.

7 Conclusion

In this paper, we highlight the importance of integrating predictive capabilities into mobile robotic systems, as we believe this is a critical capability needed to improve the robustness and efficiency of navigation policies. We first experiment with the ability to predict beyond the robot’s line-of-sight using spatial structures as priors. We investigated several neural network architectures and evaluated the general trade-offs between performance and the generation of high-quality predictions. Next, we show extensions to occupancy map prediction that make multiple hypotheses. Using the multiple hypotheses, we compute regions of high uncertainty as a heuristic for efficient map exploration. We further extend this approach to include prediction of pedestrian motion. We show that by measuring the error between predicted and observed trajectories, we can detect novel pedestrian motion. This enables a risk-sensitive control policy resulting in significantly fewer collisions compared to alternative methods.

While the work presented here shows promising results, there are a number of possible extensions for further exploration. We can extend our map prediction algorithms to predict semantic information beyond occupied spaces. We can also improve information-theoretic exploration by using additional forms of uncertainty estimation, including bootstrapping [39] and stochastic dropout [40, 41]. For pedestrian prediction, we can incorporate semantic information, group and social dynamics, and additional forms of intent estimation to generate better predictions. Finally, we can extend our adaptive, risk-sensitive navigation policy to continually learn in the presence of novel pedestrian motion.