1 Introduction

A common assumption in computer vision is that light travels unaltered from the scene to the camera. In clear weather, this assumption is reasonable: the atmosphere behaves like a transparent medium and transmits light with very little attenuation or scattering. However, inclement weather conditions such as rain fill the atmosphere with particles, producing spatio-temporal artifacts such as attenuation or rain streaks. This noticeably changes the appearance of images (see Fig. 1) and creates additional challenges for computer vision algorithms, which must be robust to these conditions.

Fig. 1 Vision tasks in clear and rain-augmented images. Our synthetic rain rendering framework allows for the evaluation of computer vision algorithms in challenging bad weather scenarios. We render physically-based, realistic rain on images from the KITTI (Geiger et al. 2012) (rows 1–2) and Cityscapes (Cordts et al. 2016) (rows 3–4) datasets, with object detection from MX-RCNN (Yang et al. 2016) (row 2) and semantic segmentation from ESPNet (Sachin Mehta et al. 2018) (row 4). We also present a combined data-driven and physics-based rain rendering approach, which we apply to the nuScenes (Caesar et al. 2020) dataset (rows 5–6) with depth estimation from Monodepth2 (Godard et al. 2019) (row 6). All algorithms are significantly affected by rainy conditions

While the influence of rain on image appearance is well-known and understood (Garg and Nayar 2005), its impact on the performance of computer vision tasks is not. Indeed, how can one evaluate what the impact of, say, a rainfall rate of 100 mm/h (a typical autumn shower) is on the performance of an object detector, when our existing databases all contain images overwhelmingly captured under clear weather conditions? To measure this effect, one would need a labeled object detection dataset where all the images have been captured under 100 mm/h rain. Needless to say, such a “rain-calibrated” dataset does not exist, and capturing one is prohibitive. Indeed, datasets with bad weather information are few and sparse, and typically include only high-level tags (rain or not) without mentioning how much rain is falling. While they can be used to improve vision algorithms under adverse conditions by including rainy images in the training set, they cannot help us in systematically evaluating performance degradation under increasing amounts of rain.

Alternatively, one can attempt to remove the effects of rain from images, i.e., create a “clear weather” version of the image, prior to applying subsequent algorithms. For example, rain can be detected and attenuated in images (Garg and Nayar 2007; Barnum et al. 2010; Yang et al. 2017; Zhang et al. 2019; Liu et al. 2019); we experiment with this approach in Sect. 7.4. Another option is to employ programmable lighting to reduce rain visibility by shining light between raindrops (de Charette et al. 2012). Unfortunately, these solutions either add significant processing time to already constrained time budgets, or require custom hardware. Instead, if we could systematically study the effect of weather on images, we could better understand the robustness of existing algorithms and, potentially, increase it.

In this paper, we propose methods to realistically augment existing image databases with rainy conditions. We rely on well-understood physical models as well as on recent image-to-image translations to generate visually convincing results. First, we experiment with our novel physics-based approach, which is the first to allow controlling the amount of rain in order to generate arbitrary amounts, ranging from very light rain (5 mm/h rainfall) to very heavy storms (200+ mm/h). This key feature allows us to produce weather-augmented datasets, where the rainfall rate is known and calibrated. Subsequently, we augment two existing datasets (KITTI Geiger et al. 2012 and Cityscapes Cordts et al. 2016) with rain, and evaluate the robustness of popular object detection and segmentation algorithms on these augmented databases. Second, we experiment with a combination of physics- and learning-based approaches, where a popular unpaired image-to-image translation method (Zhu et al. 2017) is used to convey a sense of “wetness” to the scene, and physics-based rain is subsequently composited on the resulting image. Here, we augment the nuScenes dataset (Caesar et al. 2020), and use it to evaluate the robustness of object detection and depth estimation algorithms. Finally, we also use the latter to refine algorithms using curriculum learning (Bengio et al. 2009), and demonstrate improved robustness on real rainy images.

In short, we make the following contributions. First, we present two realistic rain rendering approaches: the first is a purely physics-based method, and the second combines a GAN-based approach with this physics-based framework. Second, we augment the KITTI (Geiger et al. 2012), Cityscapes (Cordts et al. 2016), and nuScenes (Caesar et al. 2020) datasets with rain. Third, we present a methodology for systematically evaluating the performance of 14 popular algorithms for object detection, semantic segmentation, and depth estimation on rainy images. Our findings indicate that rain affects all algorithms: we observe performance drops of 15% mAP for object detection and 60% AP for semantic segmentation, and a 6-fold increase in depth estimation error. Finally, our augmented databases can also be used to finetune these same algorithms in order to improve their performance in real-world rainy conditions.

This paper significantly extends an earlier version of this work published in Halder et al. (2019) by combining physics-based rendering with learning-based image-to-image translation methods, conducting a novel, more in-depth user study, evaluating depth estimation algorithms, comparing to a deraining approach, and providing a more extensive evaluation of the performance improvement on real images. Our framework is readily usable to augment existing images with realistic rainy conditions. Code and data are available at the following URL: https://team.inria.fr/rits/computer-vision/weather-augment/.

2 Related Work

Rain Modeling In their series of influential papers, Garg and Nayar provided a comprehensive overview of the appearance models required for understanding (Garg and Nayar 2007) and synthesizing (Garg and Nayar 2006) realistic rain. In particular, they propose an image-based rain streak database (Garg and Nayar 2006) modeling the drop oscillations, which we exploit in our physics-based rendering framework. Other streak appearance models were proposed in Weber et al. (2015) and Barnum et al. (2010) using a frequency model. Realistic rendering was also obtained with ray-tracing (Rousseau et al. 2006) or artistic-based techniques (Tatarchuk 2006; Creus and Patow 2013), but only on synthetic data, as these methods require complete 3D knowledge of the scene including accurate light estimation. Numerous works also studied the generation of raindrops on-screen, with 3D modeling and ray-casting (Roser and Geiger 2009; Roser et al. 2010; Halimeh and Roser 2009; Hao et al. 2019) or normal maps (Porav et al. 2019), some also accounting for focus blur.

Rain Removal Due to the problems it creates for computer vision algorithms, rain removal in images has received a lot of attention, initially focusing on photometric models (Garg and Nayar 2004). Several techniques have been proposed, ranging from frequency space analysis (Barnum et al. 2010) to deep networks (Yang et al. 2017). Sparse coding and layer priors were also important axes of research (Li et al. 2016; Luo et al. 2015; Chen and Hsu 2013) due to their ability to encode streak patches. Recently, dual residual networks have been employed (Liu et al. 2019). Alternatively, camera parameters (Garg and Nayar 2005) or programmable light sources (de Charette et al. 2012) can also be adjusted to limit the impact of rain on the image formation process. Additional proposals were made for the specific task of raindrop removal on windows (Eigen et al. 2013) or windshields (Halimeh and Roser 2009; Porav et al. 2019; Hao et al. 2019).

Unpaired Image Translation An interesting solution to weather augmentation is the use of data-driven unpaired image translation frameworks. Zhang et al. (2019) proposed to use conditional GANs for rain removal. By adding a cycle-consistency loss to the learning process, CycleGAN (Zhu et al. 2017) became a landmark work for unpaired image translation, producing interesting results in weather and season translation. DualGAN (Yi et al. 2017) uses similar ideas with differences in the network models. The UNIT (Liu et al. 2017), MUNIT (Huang et al. 2018), and FUNIT (Liu et al. 2019) frameworks all, in one way or another, perform image translation with the common idea that data from different sets share a latent space, and they showed interesting results on adding and removing weather effects in images. Since the information in clear and rainy images is symmetrical, many unsupervised image translation approaches can produce decent visual results. Pizzati et al. (2020) learn to disentangle the scene from lens occlusions such as raindrops, which improves both the realism and the physical accuracy of the translations. Another strategy for better qualitative translations is to rely on semantic consistency (Li et al. 2018; Tasar et al. 2020).

Weather Databases In computer vision, few image databases have precisely labeled weather information. Of note for mobile robotics, BDD100K (Yu et al. 2018), the Oxford dataset (Maddern et al. 2017), and Wilddash (Zendel et al. 2018) provide data recorded in various weather conditions, including rain. Other stationary camera datasets such as AMOS (Jacobs et al. 2007), the transient attributes dataset (Laffont et al. 2014), the Webcam Clip Art dataset (Lalonde et al. 2009), or the WILD dataset (Narasimhan et al. 2002) are sparsely labeled with weather information. The relatively new nuScenes dataset (Caesar et al. 2020) has multiple labeled scenes containing rainy images, but variations in rain intensity are not indicated. Gruber et al. (2019) recently released a dataset with dense depth labels under a variety of real weather conditions produced by a controlled weather chamber, which inherently limits the variety of scenes in the dataset (only four common scenarios). Note that Bijelic et al. (2020) also announced a promising dataset including heavy snow and rain events, though at the time of writing it is not yet fully available. Still, existing datasets with rainy data are too small to train algorithms, and there is no dataset with systematically recorded rainfall rates and object/scene labels. The works closest in spirit (Johnson-Roberson et al. 2016; Khan et al. 2019) evaluated the effect of simulated weather on vision, but did so in purely virtual environments (GTA and CARLA, respectively), whereas we augment real images. Of particular relevance to our work, Sakaridis et al. (2018) propose a framework for rendering fog into images from the Cityscapes (Cordts et al. 2016) dataset. Their approach assumes a homogeneous fog model, which is rendered from the depth estimated from stereo; existing scene segmentation models and object detectors are then adapted to fog. In our work, we employ a similar idea of rendering realistic weather on top of existing images, but we focus on rain rather than fog.

3 Rain Augmentation

Broadly speaking, synthesizing rain on images can be achieved using two seemingly antagonistic methods: (1) physics-based rendering (PBR) methods (Halimeh and Roser 2009; Garg and Nayar 2006), which explicitly model the dynamics and the radiometry of rain drops in images; or (2) learning-based image-to-image translation approaches (Zhu et al. 2017; Huang et al. 2018), which train deep neural networks to “translate” an image into its rainy version. While completely different, we argue both these methods offer complementary advantages. On one hand, physics-based approaches are accurate, controllable, can simulate a wide variety of imaging conditions, and do not require any training data. On the other hand, learning-based approaches can realistically simulate important visual cues such as wetness, cloud cover, and overall gloominess typically associated with rainy images.

In this paper, we propose to first explore the use of both techniques independently, then to combine them into a hybrid approach. This section thus first describes our PBR approach (Sect. 3.1), followed by image-to-image translation with a GAN (Sect. 3.2), and concludes with the hybrid combination of the two, GAN + PBR (Sect. 3.3).

3.1 Physics-Based Rendering (PBR)

Taking inspiration from the vast literature on rain physics (Marshall and Palmer 1948; van Boxel 1997; Garg and Nayar 2007; Narasimhan and Nayar 2002), we simulate the rain appearance in an arbitrary image with the approach summarized in Fig. 2. Based on the estimated scene depth, a fog-like attenuation layer is first generated. Individual rain streaks are subsequently generated and composited with the fog-like layer. The final result is blended into the original image to create realistic, physics-based rain with a controllable rainfall rate.

Fig. 2 Physics-based rendering for rain augmentation. We use a particle simulator (de Charette et al. 2012) together with depth (Godard et al. 2017, 2019) and illumination (Cameron 2005) estimation to render arbitrarily controlled rainfall on clear images. Rain streak appearance is rendered using the rain streak database of Garg and Nayar (2006)

3.1.1 Fog-Like Rain

Following the definition of Garg and Nayar (2007), fog-like rain is the set of drops that are too far away and that project on an area smaller than 1 pixel. In this case, a pixel may even be imaging a large number of drops, which causes optical attenuation (Garg and Nayar 2007). In practice, most drops in a rainfall are actually imaged as fog-like rain (see footnote 1), though their visual effect is less dominant.

We render the volumetric attenuation using the model described in Weber et al. (2015), where the per-pixel attenuation \(I_\text {att}\) is expressed as the sum of the extinction \(L_\text {ext}\) caused by the volume of rain and the airlight scattering \(A_\text {in}\) that results from the environmental lighting. Using the equations from Weber et al. (2015) to model the attenuation image at pixel \(\mathbf {x}\), we get

$$\begin{aligned} I_\text {att}(\mathbf {x}) = I L_\text {ext}(\mathbf {x}) + A_\text {in}(\mathbf {x}) , \end{aligned}$$
(1)

where

$$\begin{aligned} \begin{aligned} L_\text {ext}(\mathbf {x})&= e^{-0.312R^{0.67}d(\mathbf {x})} , \\ A_\text {in}(\mathbf {x})&= \beta _\text {HG}(\theta ) \bar{E}_\text {sun} (1 - L_\text {ext}(\mathbf {x})) . \end{aligned} \end{aligned}$$
(2)

Here, R denotes the rainfall rate (in mm/h), \(d(\mathbf {x})\) the pixel depth, \(\beta _\text {HG}\) the standard Henyey-Greenstein coefficient, and \(\bar{E}_\text {sun}\) the average sun irradiance, which we estimate from the image-radiance relation (Horn et al. 1996).
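
To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of the fog-like layer; the function and variable names are ours (not from the released code), and \(\beta _\text {HG}\) is treated as a precomputed scalar or per-pixel quantity.

import numpy as np

def render_fog_like_rain(image, depth, rain_rate, beta_hg, e_sun):
    """Sketch of the fog-like rain layer of Eqs. (1) and (2).
    image: HxWx3 floats in [0, 1]; depth: HxW metric depth d(x);
    rain_rate: R in mm/h; beta_hg: Henyey-Greenstein term (scalar or HxW);
    e_sun: estimated average sun irradiance."""
    # Extinction caused by the volume of rain (Eq. 2, first line).
    l_ext = np.exp(-0.312 * rain_rate**0.67 * depth)
    # Airlight scattered towards the camera (Eq. 2, second line).
    a_in = beta_hg * e_sun * (1.0 - l_ext)
    # Per-pixel attenuated image (Eq. 1).
    return image * l_ext[..., None] + a_in[..., None]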

3.1.2 Simulating the Physics of Raindrops

We use the particle simulator of de Charette et al. (2012) to compute the position and dynamics of all raindrops larger than 1 mm for a given fall rate (see footnote 2). The simulator outputs the position and dynamics (start and end points of streaks) of all visible raindrops in both world and image space, and accounts for the intrinsic and extrinsic camera calibration.

3.1.3 Rendering the Appearance of a Rain Streak

While ray casting allows exact modeling of drop photometry, it comes at a very high processing cost and is virtually only possible in synthetic scenes where the geometry and surface materials are perfectly known (Rousseau et al. 2006; Halimeh and Roser 2009). What is more, drops oscillate as they fall, which further complicates modeling the light interaction. Instead, we rely on the rain streak appearance database of Garg and Nayar (2006), which contains the radiance of individual rain streaks imaged by a stationary camera. For each drop, the streak database also models 10 oscillations due to airflow, which accounts for much greater realism than Gaussian modeling (Barnum et al. 2010).

To render a raindrop, we first select a rain streak \(S \in \mathcal {S}\) from the streak database \(\mathcal {S}\) of Garg and Nayar (2006), which contains 20 different streaks (each with 10 different oscillations) stored in an image format. We select the streak that best matches the final drop dimensions (computed from the output of the physical simulator), and randomly select an oscillation.

The selected rain streak S is subsequently warped to match the drop dynamics from the physical simulator:

$$\begin{aligned} S' = \mathcal {H}(S) , \end{aligned}$$
(3)

where \(\mathcal {H}(\cdot )\) is the homography computed from the start and end points in image space given by the physical simulator and the corresponding points in the database streak image.
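
For illustration, here is a hedged sketch of this warp: since a homography needs four correspondences, we map the four corners of the database streak image onto a quadrilateral built from the simulator's start/end points and the projected streak width (all names below are ours, not the authors' implementation).

import cv2
import numpy as np

def warp_streak(streak, start, end, width, out_hw):
    """streak: database streak image (rendered upright); start, end: streak
    endpoints in image space from the physical simulator; width: projected
    streak width in pixels; out_hw: (height, width) of the target image."""
    h, w = streak.shape[:2]
    start, end = np.float32(start), np.float32(end)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Half-width vector perpendicular to the streak direction.
    d = end - start
    n = np.float32([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-8) * (width / 2.0)
    dst = np.float32([start - n, start + n, end + n, end - n])
    H = cv2.getPerspectiveTransform(src, dst)  # the homography of Eq. (3)
    return cv2.warpPerspective(streak, H, (out_hw[1], out_hw[0]))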

3.1.4 Computing the Photometry of a Rain Streak

Computing the photometry of a rain streak from a single image is impractical because drops have a much larger field of view than common cameras (\(165^\circ \) vs approx. 70\(^\circ \)–100\(^\circ \)). To render a drop accurately, we must therefore estimate the environment map (a spherical lighting representation) around that drop. Sophisticated methods could be used (Hold-Geoffroy et al. 2017, 2019; Zhang et al. 2019), but we employ Cameron (2005), which approximates the environment map through a series of simple operations on the image.

Fig. 3 Estimation of raindrop photometry. To estimate the photometric radiance of a drop, we integrate the lighting environment map over the 165\(^{\circ }\) drop field of view (a), relying on an estimate of the environment map E shown in (b) obtained with the method of Cameron (2005). The projected field of view (F) of the drop is outlined in red

From each drop's 3D position relative to the camera, we compute the intersection F of the drop field of view with the environment map E, assuming a constant scene distance of 10 m. The process is depicted in Fig. 3, and geometrical details are provided in “Appendix A”. Note that a geometrically exact estimation of the drop field of view requires location-dependent environment maps, centered on each drop. However, we consider the impact negligible since drops are relatively close to the camera center compared to the sphere radius used (see footnote 3).

Since a drop refracts 94% of its field of view radiance and reflects 6% of the entire environment map radiance (Garg and Nayar 2007), we multiply the streak appearance with a per-channel weight:

$$\begin{aligned} S' = S' (0.94 \bar{F} + 0.06 \bar{E}) , \end{aligned}$$
(4)

where \(\bar{F}\) is the mean of the intersection region F, and \(\bar{E}\) is the mean of the environment map E.
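
A minimal sketch of this weighting, assuming the environment map E and the field-of-view region F have already been estimated (variable names are ours):

import numpy as np

def shade_streak(warped_streak, env_map, fov_mask):
    """warped_streak: streak S' (HxWx3); env_map: environment map E (HexWex3);
    fov_mask: boolean HexWe mask of the drop field-of-view region F."""
    f_mean = env_map[fov_mask].mean(axis=0)       # per-channel mean of F
    e_mean = env_map.reshape(-1, 3).mean(axis=0)  # per-channel mean of E
    # Eq. (4): 94% refracted field-of-view radiance, 6% reflected environment.
    return warped_streak * (0.94 * f_mean + 0.06 * e_mean)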

3.1.5 Compositing a Single Rain Streak on the Image

Now that the streak position and photometry have been determined from the physical simulator and the environment map respectively, we can composite the streak onto the original image. First, to account for the camera depth of field, we apply a defocus effect following Potmesil and Chakravarty (1981), convolving the streak image \(S'\) with the circle of confusion C (see footnote 4), that is: \(S' = S' * C\).
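
As an illustration, the defocus step can be approximated by blurring the streak with a disk-shaped kernel whose diameter is the circle of confusion predicted by the thin-lens model; the sketch below is a simple stand-in, not the authors' implementation.

import numpy as np
from scipy.signal import fftconvolve

def defocus_streak(streak, coc_diameter_px):
    """Convolve the streak with a disk kernel of the given circle-of-confusion
    diameter (in pixels)."""
    r = max(coc_diameter_px / 2.0, 0.5)
    size = int(2 * np.ceil(r) + 1)
    y, x = np.mgrid[:size, :size] - size // 2
    kernel = ((x**2 + y**2) <= r**2).astype(np.float32)
    kernel /= kernel.sum()
    return np.stack([fftconvolve(streak[..., c], kernel, mode="same")
                     for c in range(streak.shape[-1])], axis=-1)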

We then blend the rendered drop with the attenuated background image \(I_{\text {att}}\) using the photometric blending model from Garg and Nayar (2007). Because the streak database and the image I are likely to have been imaged with different exposures, we need to correct the exposure to match the imaging system used for I. Let \(\mathbf {x}\) be a pixel of the image I and \(\mathbf {x}'\) the overlapping coordinates in the streak \(S'\); the composite is obtained with

$$\begin{aligned} I_{\text {rain}}(\mathbf {x}) = \frac{T - S'_{\alpha }(\mathbf {x'})\tau _1}{T}I_{\text {att}}(\mathbf {x}) + S'(\mathbf {x'})\frac{\tau _1}{\tau _0}, \end{aligned}$$
(5)

where \(S'_{\alpha }\) is the alpha channel (see footnote 5) of the rendered streak, \(\tau _0 = \sqrt{10^{-3}} / 50\) is the time during which the drop remained on one pixel in the streak database, and \(\tau _1\) is the same measure according to our physical simulator. We refer to “Appendix B” for details.
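
A compact sketch of this blend, taking T to be the camera exposure time (cf. “Appendix B”); names are ours.

import numpy as np

def blend_streak(i_att, streak, streak_alpha, T, tau0, tau1):
    """i_att: attenuated background image (HxWx3); streak, streak_alpha:
    rendered streak S' and its alpha channel, already placed in image
    coordinates; T: exposure time; tau0, tau1: per-pixel drop passage times
    of the database and of our simulator."""
    bg_weight = (T - streak_alpha * tau1) / T   # background term of Eq. (5)
    return bg_weight[..., None] * i_att + streak * (tau1 / tau0)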

3.1.6 Compositing Rainfall on the Image

The rendering of rainfall of arbitrary rates in an image is done in three main steps: 1) the fog-like attenuated image \(I_{\text {att}}\) is rendered (Eq. 1), 2) the drops output by the physical simulator are rendered individually on the image (Eq. 5), and 3) the global luminosity of the rainy image, denoted \(I_{\text {rain}}\), is adjusted. Rain usually occurs in cloudy weather, which decreases the scene radiance, but a typical camera imaging system adjusts its exposure to restore the luminosity. Consequently, we adjust a global luminosity factor so as to restore the mean radiance, preserving the relation \(\bar{I} = \bar{I}_{\text {rain}}\), where the overbar denotes the intensity average.
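
Putting the three steps together, the overall composition can be sketched as follows; composite_drop is a hypothetical wrapper around the warping, shading, defocus, and blending sketches above.

def render_rain(image, depth, drops, rain_rate, beta_hg, e_sun):
    # Step 1: fog-like attenuation layer (Eq. 1).
    i_rain = render_fog_like_rain(image, depth, rain_rate, beta_hg, e_sun)
    # Step 2: composite every simulated drop as a streak (Eqs. 3-5).
    for drop in drops:
        i_rain = composite_drop(i_rain, drop)
    # Step 3: restore the global luminosity so that mean(I_rain) == mean(I).
    i_rain *= image.mean() / max(i_rain.mean(), 1e-8)
    return i_rain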

Fig. 4 Photometric validation of rain. Rain rendering using ground truth illumination or our approximated environment map. From HDR panoramas (Hold-Geoffroy et al. 2019), we first extract limited field of view crops to simulate the point of view of a regular camera. Then, 50 mm/h rain is rendered using either (rows 1, 3) the ground truth HDR environment map or (rows 2, 4) our environment estimation. The environment maps are shown as reference on the left. While our approximated environment maps differ from the ground truth, they are sufficient to generate visually similar rain in images

Photometric Validation A limitation of our physical pipeline is the lighting estimation, which impacts the photometry of the rain. To measure its effect, in Fig. 4 we compare the same rain rendered with either our estimated environment map or the ground truth illumination obtained from high dynamic range panoramas (Hold-Geoffroy et al. 2019). Overall, our estimation differs from the ground truth when the scene is not radially symmetric, but we observe that it produces visually similar rain in images.

3.2 Image-to-Image Translation (GAN)

While our physics-based rendering generates realistic rain streaks and fog-like rain effects, it ignores major rainy characteristics such as wetness, reflections, and cloud cover, and may thus fail to convey the overall look of a rainy scene. Conversely, generative adversarial networks (GANs) excel at learning such visual characteristics, as they constitute strong signals for the discriminator during training.

We therefore learn the \(\text {clear}\mapsto {}\text {rain}\) mapping with CycleGAN (Zhu et al. 2017) from a set of unpaired clear/rain images. We train our model with the \(256\times 256\) architecture from Zhu et al. (2017) on images of input size \(448\times 256\). The generator is similar to Johnson et al. (2016), with 2 downsampling blocks followed by 9 ResNet blocks and 2 upsampling blocks. The discriminator is a simple ConvNet with 3 hidden layers, similar to the one used in PatchGAN (Isola et al. 2017). The model is optimized for 40 epochs with Adam, using a batch size of 1, a learning rate of 0.0002, and \(\beta = \left\{ 0.5, 0.999 \right\} \).
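
For concreteness, the generator topology described above can be sketched in PyTorch as follows; channel widths, normalization, and padding choices follow the common CycleGAN recipe and are assumptions rather than the authors' exact configuration.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(c, c, 3), nn.InstanceNorm2d(c),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(c, c, 3), nn.InstanceNorm2d(c))

    def forward(self, x):
        return x + self.body(x)

def make_generator(c=64):
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(3, c, 7),
              nn.InstanceNorm2d(c), nn.ReLU(inplace=True)]
    for _ in range(2):                                   # 2 downsampling blocks
        layers += [nn.Conv2d(c, 2 * c, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(2 * c), nn.ReLU(inplace=True)]
        c *= 2
    layers += [ResBlock(c) for _ in range(9)]            # 9 ResNet blocks
    for _ in range(2):                                   # 2 upsampling blocks
        layers += [nn.ConvTranspose2d(c, c // 2, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(c // 2), nn.ReLU(inplace=True)]
        c //= 2
    layers += [nn.ReflectionPad2d(3), nn.Conv2d(c, 3, 7), nn.Tanh()]
    return nn.Sequential(*layers)

# Optimized with Adam (lr=2e-4, betas=(0.5, 0.999)), batch size 1, 40 epochs.
opt = torch.optim.Adam(make_generator().parameters(), lr=2e-4, betas=(0.5, 0.999))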

3.3 Combining GAN and PBR

Fig. 5 GAN + PBR rain-augmentation architecture. In this hybrid approach, clear images are first translated into rain with CycleGAN (Zhu et al. 2017) and subsequently augmented with rain streaks with our PBR pipeline (see Fig. 2)

We combine the PBR- and GAN-based rain generation methods by first translating the image to its rainy version with the GAN, then compositing the rain layer onto the resulting image using PBR (see Fig. 5). The sun irradiance estimate \(\bar{E}_\text {sun}\) (Sect. 3.1.1) of the “translated” image is typically darker, which, in turn, makes the fog-like rain more realistic. The estimated environment map is also darker and, consequently, so is its mean value \(\bar{E}\); the appearance of rain streaks thus remains coherent with their environment.

Since rain streaks smaller than 1 px in diameter are ignored by the streak rendering and instead generated as fog-like rain (Sect. 3.1.1), we need to apply the PBR renderer at full resolution, upsampling the output of the GAN to the image's original size. Once the rain rendering is complete, we downsample the augmented image to \(448\times 256\). We further refer to this hybrid rendering as GAN + PBR.
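
A hedged sketch of this resolution handling (gan_translate and render_rain_pbr are hypothetical wrappers around the GAN and the PBR pipeline):

import cv2

def gan_plus_pbr(clear_image, depth, rain_rate):
    h, w = clear_image.shape[:2]
    # Translate the clear image to its "wet" version at the GAN resolution.
    wet = gan_translate(cv2.resize(clear_image, (448, 256)))
    # Upsample back to full resolution so that drops smaller than 1 px are
    # still handled as fog-like rain by the PBR renderer.
    wet_full = cv2.resize(wet, (w, h))
    rainy = render_rain_pbr(wet_full, depth, rain_rate)
    # Downsample the final augmented image to 448x256.
    return cv2.resize(rainy, (448, 256))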

4 Validating Rain Appearance

We now validate the appearance of our synthetic rainy images when using either of our rain augmentation pipelines. We observe visual results and quantify their perceptual realism by comparing them to existing rain augmentation approaches.

Fig. 6 Comparison of real photographs and our renderings. Real photographs (source: web, Luo et al. 2015; Caesar et al. 2020) showing various rain intensities, sample output of our rain rendering (PBR, GAN, and GAN + PBR), and other recent rain rendering methods (rain100H, Yang et al. 2017; rain800, Zhang et al. 2019; did-MDN, Zhang and Patel 2018). Although rain appearance is highly camera-dependent (Garg and Nayar 2005), results show that both real photographs and our rain generation share volume attenuation and sparse visible streaks which correctly vary with the scene background. As opposed to the other rain rendering methods, our pipeline simulates physical rainfall (here, 100 and 200 mm/h) and valid particle photometry

4.1 Qualitative Evaluation

Figure 6 presents real photographs captured under heavy rain, qualitative results of our rain renderings on images from nuScenes (Caesar et al. 2020), and representative results from 3 recent synthetic rain augmentation approaches (Zhang and Patel 2018; Yang et al. 2017; Zhang et al. 2019). From the real rain photographs, it is noticeable that rainy scenes have a complex visual appearance, and that streak visibility is greatly affected by the imaging device and the background.

Our PBR approach is able to reproduce the complex pattern of streaks, with orientation consistent with the camera motion and photometry consistent with the background and depth. As in the real photographs, the streaks are sparse and only visible against darker backgrounds. The veiling effect caused by the rain volume (i.e. fog-like rain) is visible where the scene depth is larger (i.e. image center, sky), and nearby streaks are accurately defocused. Still, the absence of visible wetness arguably detracts from the overall rainy impression.

Conversely, the GAN believably renders the wetness appearance of rainy scenes. While some reflections are geometrically incorrect (e.g., a pole is reflected in the middle of the street in the left column of Fig. 6 yet no pole is present), the overall appearance is visually pleasing and the global illumination matches that of real photographs. A noticeable artifact caused by the GAN is the blurry appearance of images, whereas real rain images are only blurred in the distance. This is explained by the inability of the GAN to disentangle the scene from the lens drops present in the “rainy” training images, which makes blurring the whole image an easy learning optimum, as highlighted in Pizzati et al. (2020). Another limitation, already mentioned, is that the GAN does not allow controlling the amount of rain in the image.

This limitation is circumvented by our GAN + PBR approach, which renders controllable rain streaks while preserving the global wetness appearance learned with image translation. Despite the naive GAN and PBR compositing strategy, the drops blend naturally into the scene.

4.2 User Study

To evaluate the perceptual quality of our rain renderings, we conducted two user studies. In the first, users were shown one image at a time and asked to rate whether the rain looked realistic on a 5-point Likert scale. A total of 42 images were shown, that is, 6 for each of the following: real rainy photographs, ours (PBR, GAN, GAN + PBR), and previous approaches (Yang et al. 2017; Zhang and Patel 2018; Zhang et al. 2019). Answers were obtained from a total of 67 participants, aged 22 to 75 (avg 37.0, std 14.2), 32.8% of whom were female.

From the Mean Opinion Score (MOS) in Fig. 7, all our rain augmentation approaches are judged to be more realistic than any of the previous approaches. Specifically, when converting ratings to the [0, 1] interval, the mean rain realism is 0.77 for real photos, 0.44 for PBR, 0.68 for GAN, 0.52 for GAN + PBR, and 0.30/0.23/0.08 for Zhang and Patel (2018), Zhang et al. (2019), and Yang et al. (2017) respectively. Despite physical and geometrical inconsistencies, users consistently judged GAN images to be more realistic. This speaks in favor of using image-to-image translation rather than physics-based rendering for realism purposes. However, for benchmarking or physical accuracy purposes, GAN + PBR provides arbitrary control over the amount of rain at the cost of slightly lower realism.

In the second study, we asked respondents who participated in the first study to determine, for each of the same images as before (excluding images from previous work), which visual characteristics influenced their decisions. 52 of the original participants responded, aged 22 to 72 (avg 36.6, std 13.9), 28.9% of whom were female. Results are reported in Fig. 8. We note that while falling rain and wetness are the main characteristics of real rain, the GAN, judged the most realistic approach, fails to convey the falling rain appearance but excels at rendering wetness. The opposite is observed with PBR, though a few users indicated wetness (despite its absence). GAN + PBR offers a trade-off balancing all characteristics.

Fig. 7 User study of rainy image realism. The y-axis displays ratings for the statement “Rain in this image looks realistic”. All of our approaches significantly outperform existing techniques

Fig. 8 User study on image characteristics conveying rain. The y-axis displays answers to the question “Which of these qualities help in determining the realism of the rain?”

5 Evaluating the Impact of Real Rain

We now aim at quantifying the impact of rain on three computer vision tasks: object detection, semantic segmentation, and depth estimation. These tasks are critical for any outdoor vision systems such as mobile robotics, autonomous driving, or surveillance. Before we experiment on rain-augmented images with our PBR, GAN, and GAN + PBR approaches, we first experiment on real rainy photographs.

For this, we use the nuScenes dataset (Caesar et al. 2020). Benefiting from coarse frame-wise weather annotations in the dataset, we split nuScenes images (see footnote 6) into two subsets: “nuScenes-clear” (images without rain) and “nuScenes-rain” (rainy images). Due to the noisy weather labels, we cross-validated each frame with a historic weather database (see footnote 7) using GPS location and time, and only kept frames where the nuScenes label agreed with the weather database. This resulted in sets of 24,134 images for nuScenes-clear and 6028 for nuScenes-rain. In this dataset, rain images are dark and gloomy, the sky is heavily overcast, and no falling rain is visible, but unfocused raindrops on the lens create occlusions.
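
As an illustration, the label cross-validation could be implemented along these lines (the weather-database lookup and the record fields are hypothetical):

def split_clear_rain(frames, weather_db):
    """Keep only frames whose nuScenes weather tag agrees with a historic
    weather database queried by GPS position and timestamp."""
    clear, rain = [], []
    for f in frames:
        historic = weather_db.lookup(lat=f.lat, lon=f.lon, utc=f.timestamp)
        if f.nuscenes_weather == "rain" and historic == "rain":
            rain.append(f)
        elif f.nuscenes_weather == "clear" and historic == "clear":
            clear.append(f)
        # Frames with disagreeing labels are discarded.
    return clear, rain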

We experiment on these sets of real images with one algorithm per task: YOLOv2 (Redmon and Farhadi 2017) for object detection, PSPNet (Zhao et al. 2017) for semantic segmentation, and Monodepth2 (Godard et al. 2019) for depth estimation. The nuScenes-clear set is split into train/test subsets of 19,685/4449 images from 491/110 scenes. The nuScenes-rain set is also split into train/test subsets of 5419/609 images from 134/15 scenes. Here, 1000 images from nuScenes-clear(test) and all images from nuScenes-rain(test) are used for evaluation (the train subsets will be used for the GAN in Sect. 6.1). These training and testing sets and subsets are all listed in Table 3 in “Appendix C”. Note that a large number of images was needed to train the CycleGAN for image translation and, to avoid any overlap in image sequences, this unfortunately left a relatively small subset of nuScenes for the evaluation on real rainy images.

Table 1 Vision tasks on real clear/rain images from nuScenes (Caesar et al. 2020)
Fig. 9 Qualitative results on real rainy images from the nuScenes dataset, shown for different tasks: object detection (top row), semantic segmentation (middle row), and depth estimation (bottom row)

For object detection and depth estimation, we first pre-train each algorithm on ImageNet (Darknet53, \(448\times 448\)) and KITTI (monocular, \(1024\times 320\)) respectively, and further finetune them on the nuScenes-clear(train) subset to limit the domain gap between datasets. We then evaluate their performance on the aforementioned test subsets. For segmentation, since semantic labels are not provided with nuScenes, we carefully annotated 25 images from both nuScenes-clear(test) and nuScenes-rain(test). Since we do not have enough labeled data for finetuning, we use our model pretrained on Cityscapes (Cordts et al. 2016), with the caveat that there may be a significant domain gap between training and evaluation.

Table 1 reports the results of this experiment. Compared to clear images, the performance on real rainy images is a mAP of 16.30% instead of 32.53% for object detection, an AP of 18.7% instead of 40.8% for semantic segmentation, and a squared relative error of 3.53% instead of 2.96% for depth estimation. Corresponding qualitative results are displayed in Fig. 9. As expected, real rain deteriorates the performance of all algorithms on all tasks. However, we cannot evaluate how rain intensity affects these algorithms, since this would require an accurate measurement of the rainfall rate at the time of capture.

6 Evaluating the Impact of Synthetic Rain

To study how vision algorithms perform under increasing amounts of rain, we leverage our rain synthesis pipeline and augment popular clear-weather datasets. Specifically, our PBR and GAN + PBR frameworks allow us to measure the performance of these algorithms in controlled rain settings.

6.1 Rain Generation Setup

We augment all three of the KITTI, Cityscapes, and nuScenes-clear datasets with PBR. We generate rainfall rates ranging from light rain to heavy storm, \(R = \{0, 5, 25, 50, 100, 200\}\) mm/h. Only the nuScenes-clear dataset is augmented with GAN and GAN + PBR, since neither KITTI nor Cityscapes contains rainy images to train the GAN. Our PBR and GAN + PBR rain augmentations require some preparation, as they rely on calibration, depth, and camera motion; our GAN and GAN + PBR rain augmentations require the training of a CycleGAN. These preparations are described below.

Calibration For the realistic physical simulator (Sect. 3.1.2) and the rain streak photometric simulation (Sect. 3.1.4), intrinsic and extrinsic calibration are used to replicate the imaging sensor. We use frame-wise or sequence-wise calibration for KITTI and nuScenes. In addition, we use a 6 mm focal length and 2 ms exposure for KITTI (Geiger et al. 2012, 2013), and assume a 5 ms exposure for nuScenes. As Cityscapes does not provide calibration, we use intrinsics from the camera manufacturer with a 5 ms exposure, and assume extrinsics similar to KITTI.

Depth The scene geometry (pixel depth) is also required to accurately model the light-particle interaction and the fog optical extinction. We estimate KITTI depth maps from RGB + Lidar with Jaritz et al. (2018), and Cityscapes/nuScenes depth maps from monocular RGB with Monodepth (Godard et al. 2017, 2019). While absolute depth is not required, we aim to avoid critical artifacts along edges, and thus further align the depth with the RGB image using a guided filter (Barron and Poole 2016).
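
A possible sketch of this alignment step, here using OpenCV's guided filter as a stand-in for the edge-aware filtering cited above (radius and eps are illustrative values):

import cv2
import numpy as np

def align_depth_to_rgb(rgb, depth, radius=20, eps=1e-3):
    """Edge-aware smoothing of the estimated depth, guided by the RGB image,
    to reduce depth artifacts along object boundaries."""
    guide = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    return cv2.ximgproc.guidedFilter(guide, depth.astype(np.float32), radius, eps)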

Camera Motion We mimic the camera ego motion in the physical simulator to ensure realistic rain streak orientation on still images and preserve temporal consistency in sequences. Ego speed is extracted from GPS data when provided (KITTI and nuScenes), or drawn uniformly in the [0, 50] km/h interval for Cityscapes semantics and in the [0, 100] km/h interval for KITTI object to reflect the urban and semi-urban scenarios, respectively.

CycleGAN A CycleGAN is trained for image-to-image rain translation on the train subsets of nuScenes-clear and nuScenes-rain (Sect. 5). To make sure that no image is used to both train and evaluate the GAN, we take the 4449 images from the nuScenes-clear(test) subset, resize them to \(448\times 256\), and perform image-to-image translation to generate GAN-augmented rain images. We dub this new set of images “nuScenes-augment” for clarity, and will also use it for the GAN + PBR rain augmentation.

6.2 Evaluating PBR Rain Augmentation

Fig. 10 Performance using our PBR rain augmentation. a Object detection performance on our weather-augmented KITTI dataset, b pixel-wise semantic segmentation performance on our weather-augmented Cityscapes dataset, and c depth estimation performance on our weather-augmented nuScenes dataset, all as a function of rainfall rate. The object detection plot shows the Coco mAP@[.1:.1:.9] (%) across cars and pedestrians, the semantic segmentation plot shows the AP (%), and the depth estimation plot shows the squared relative error (%). As opposed to object detection, which exhibits some robustness, the segmentation and depth estimation tasks are strongly affected by the rain

Fig. 11 Object detection on our PBR rain augmentation of KITTI. From left to right, the original image (clear) and three PBR augmentations with varying rainfall rates. Images are cropped for visualization

Fig. 12 Qualitative evaluation of semantic segmentation on our PBR rain augmentation of Cityscapes. From left to right, the original image (clear) and three PBR augmentations with varying rainfall rates

Fig. 13 Depth estimation on our PBR rain augmentation of nuScenes. From left to right, the original image (clear) and three PBR augmentations with varying rainfall rates

We compare the performance on PBR-augmented images for 6 object detection algorithms on KITTI (Geiger et al. 2012) (7481 images), 6 segmentation algorithms on Cityscapes (Cordts et al. 2016) (2995 images), and 2 depth estimation algorithms on nuScenes-augment (4449 images). For all algorithms, the performance on the clear version always serves as a baseline against which we compare the performance on synthetic rain augmentations.

Object Detection We evaluate the 6 PBR-augmented weather conditions on KITTI for 6 pre-trained car/pedestrian detection algorithms (with \(\text {IoU}\ge .7\)): DSOD (Shen et al. 2017), Faster R-CNN (Ren et al. 2015), R-FCN (Dai et al. 2016), SSD (Liu et al. 2016), MX-RCNN (Yang et al. 2016), and YOLOv2 (Redmon and Farhadi 2017). Quantitative results for the Coco mAP@[.1:.1:.9] metric across classes are shown in Fig. 10a. Relative to clear-weather performance, 200 mm/h rain always causes a drop of at least 12%, reaching 25–30% for R-FCN, SSD, and MX-RCNN, whereas Faster R-CNN and DSOD are the most robust to changes in fog and rain.

Representative qualitative results on PBR images are shown in Fig. 11 for 4 out of the 6 algorithms to preserve space. All algorithms are strongly affected by the rain; its effect on object detection results is somewhat chaotic because the occlusion level can vary greatly across the objects populating the image. Also, as in real life, far-away objects (which are generally small in the image) are more likely to disappear behind the fog-like rain.

Semantic Segmentation For semantic segmentation, the PBR-augmented Cityscapes is evaluated for: AdaptSegNet (Tsai et al. 2018), ERFNet (Romera et al. 2018), ESPNet (Sachin Mehta et al. 2018), ICNet (Zhao et al. 2018), PSPNet (Zhao et al. 2017), and PSPNet(50) (Zhao et al. 2017). Quantitative results are reported in Fig. 10b. As opposed to object detection algorithms, which demonstrated significant robustness to moderately high rainfall rates, here the algorithms seem to break down in similar conditions. Indeed, all techniques see their performance drop by a minimum of 30% under heavy fog, and almost 60% under strong rain. Interestingly, some curves cross, which indicates that different algorithms behave differently under rain. ESPNet, for example, ranks among the top 3 in clear weather but drops relatively by a staggering 85% and ranks last in stormy conditions (200 mm/h). Corresponding qualitative results are shown in Fig. 12 for 4 out of the 6 algorithms to preserve space. Although the effect of rain may appear minimal visually, it greatly affects the output of all segmentation algorithms evaluated.

Depth Estimation We evaluate the performance of the recent Monodepth2 (Godard et al. 2019) and BTS (Lee et al. 2020) on the nuScenes-augment subset augmented with our PBR method. We report the standard squared relative error in Fig. 10c and note that the error seems to increase linearly with rain. In the extreme 200 mm/h rain conditions, we measure an error 3 times that of clear images. Qualitative results are shown in Fig. 13. It can be observed that performance drops when raindrops block the view or fog-like rain limits the visibility in the image.

6.3 Evaluating GAN and GAN + PBR Rain Augmentations

Next, we ascertain the effect of rain augmentation with our GAN and GAN + PBR strategies. As mentioned above, semantic segmentation algorithms are not evaluated here due to the lack of semantic labels for training. The evaluation is performed on the nuScenes-augment subset.

Object Detection YOLOv2 (Redmon and Farhadi 2017) is evaluated, and the resulting mAP as a function of rainfall rate is reported in Fig. 14a. Qualitative results are shown in Fig. 15. We note that GAN-augmented images lead to a performance similar to PBR 100 mm/h images, and that the performance deterioration is stronger and steeper with GAN + PBR images.

Observe that the physical particle simulation is the same in both PBR and GAN + PBR. Still, it is interesting to note that the decrease is different with GAN + PBR compared to PBR (i.e. the curves are shifted but also exhibit different slopes). This may be the result of the two cumulative domain shifts (i.e. wetness + streaks) leading to a non-linear effect.

Depth Estimation We evaluate the performance of the recent Monodepth2 (Godard et al. 2019) on the nuScenes-augment subset augmented with our GAN and GAN + PBR methods. As reported in Fig. 14b, for the same rain intensity, the error on GAN + PBR images is 80–100% higher than on PBR images. The same behavior was observed on other standard depth estimation metrics (absolute relative error, RMSE, log RMSE), not reported here. Interestingly, the GAN augmentation only slightly affects depth estimation performance, which might be because the GAN translation keeps occlusions in the image to a minimum. Qualitative results are shown in Fig. 16 for GAN and GAN + PBR.

Fig. 14 Performance with varying rain intensities and augmentation techniques. As opposed to PBR, the GAN does not allow controlling the rain intensity; it is reported as a dashed line, as is the clear-weather performance. Increasing rain intensity translates into a performance drop for a object detection with YOLOv2 (Redmon and Farhadi 2017) and b depth estimation with Monodepth2 (Godard et al. 2019)

7 Improving the Robustness to Rain

We now wish to demonstrate the usefulness of our rain rendering pipeline for improving robustness to rain through extensive evaluations on synthetic and real rain databases. For the sake of coherence, the improvements are shown on the same tasks, algorithms, and test data from Sect. 5.

7.1 Training Methodology

While the ultimate goal is to improve robustness to rain, we aim at training a single model which performs well across a wide variety of rainfall rates (including clear weather). Having a single model is beneficial over employing, e.g., intensity-specific encoders (Porav et al. 2019), since it removes the need for determining the rain intensity from the input image. Because rain significantly alters the appearance of the scene, we found that training from scratch with heavy rain or random rainfall rates fails to converge. Instead, we refine our untuned models using curriculum learning (Bengio et al. 2009) on rain intensity in ascending order (25, then 50, and finally 100 mm/h rain). The final model is referred to as finetuned and is evaluated against various weather conditions. Note that, for hybrid augmentation finetuning, the curriculum starts with refinement on GAN images and then goes through the ascending rain intensities. The same images are used for all steps of the curriculum.
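
A minimal sketch of this curriculum schedule (finetune_one_stage is a hypothetical helper wrapping the per-task training settings below):

def curriculum_finetune(model, augmented_sets, hybrid=False):
    """Refine a single model on increasingly heavy rain; for the hybrid
    GAN + PBR variant, a GAN-only stage precedes the rain intensities."""
    stages = (["gan"] if hybrid else []) + [25, 50, 100]  # mm/h
    for stage in stages:
        model = finetune_one_stage(model, augmented_sets[stage])  # e.g. 10 epochs
    return model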

To avoid training and testing on the same set of images, in this section we further divide nuScenes-augment into train/test subsets of 1000 images each (ensuring they are taken from different scenes). Each algorithm is thus refined on the 1000 images from nuScenes-augment(train) and undergoes a specific training process. For object detection, YOLOv2 (Redmon and Farhadi 2017) is trained at each step with a learning rate of 0.0001 and a momentum of 0.9 for 10 epochs, with a burn-in of 5 epochs. For semantic segmentation, PSPNet (Zhao et al. 2017) is trained with a learning rate of 0.0004 and a momentum of 0.9 for 10 epochs. Finally, for depth estimation, Monodepth2 (Godard et al. 2019) is trained on triplets of consecutive images using a learning rate of 0.00001 with the Adam optimizer for 10 epochs, with \(\beta = \left\{ 0.5, 0.999 \right\} \).

7.2 Improvement on Synthetic Rain

The synthetic evaluation is conducted on the set of 1000 images from nuScenes-augment(test), with rain up to 200 mm/h. Note again that, for our hybrid GAN + PBR, 0 mm/h of rain corresponds to the GAN-only augmented results.

Figure 17 shows the performance of our untuned and finetuned models for the three vision tasks on the different augmented datasets. Figure 17a and b report object detection (YOLOv2, Redmon and Farhadi 2017) and depth estimation (Monodepth2, Godard et al. 2019) on augmented nuScenes-clear data, while Fig. 17c reports semantic segmentation (PSPNet, Zhao et al. 2017) on Cityscapes. We observe a significant improvement on all tasks, and an additional increase in robustness even in clear weather, when models are refined using our augmented rain. Of interest, we also improve at the unseen 200 mm/h rain rate even though the network was only trained with rain up to 100 mm/h. The intuition here is that when facing adverse weather, the network learns to focus on the strongest relevant features for each task and thus gains robustness.

PBR For YOLOv2, the finetuned detection performance stays higher than its clear untuned counterpart in the 0–200 mm/h interval. Explicitly, it goes from 34.5 to 31.0%, whereas the untuned model starts at 34.6% and finishes at 20.4%. For PSPNet, the segmentation exhibits a significant improvement when refined, although at 100 mm/h the model cannot fully compensate for the effect of rain and drops to 54.0%, versus 52.0% when untuned. Monodepth2 finetuning helps only at higher rain intensities (25 mm/h and above), and the error differences between 100 and 200 mm/h stay in the same ballpark (~1.2%). This makes sense: since the occlusion created by rain streaks is minimal at low rain intensity, the untuned model is not strongly affected.

Fig. 15 Object detection on our GAN + PBR augmented nuScenes. From left to right, the original image (clear), the GAN-augmented image, and three GAN + PBR images

Fig. 16 Depth estimation on our GAN + PBR augmented nuScenes. From left to right, the original image (clear), the GAN-augmented image, and three GAN + PBR images

GAN + PBR In the case of YOLOv2, we notice a major difference between the untuned and finetuned performances on hybrid GAN + PBR images. Indeed, the hybrid finetuned performance at 100 mm/h is 21.7%, versus a measly 7.5% for the untuned model. The same holds for Monodepth2 on hybrid images, with errors of 5.7 and 8.9% at 100 mm/h for the finetuned and untuned models respectively. It is interesting to note that, for all tasks, the performance of finetuned models on hybrid images decreases more slowly than that of untuned models. This demonstrates again that more robust models are learned when finetuning with our rain translations.

Fig. 17 Original (untuned) and finetuned performance on rain-augmented versions of nuScenes (a)–(b) and Cityscapes (c). Algorithms used for object detection, depth estimation, and semantic segmentation are, respectively, YOLOv2 (Redmon and Farhadi 2017), Monodepth2 (Godard et al. 2019), and PSPNet (Zhao et al. 2017). Not only do the finetuned models significantly outperform the untuned models, they also exhibit a smaller decrease with rain intensity, demonstrating increased robustness in both rainy and clear weather

7.3 Improvement on Real Rain

We evaluate the performance on real rain using our nuScenes-rain(test) subset of images (see Sect. 5). Table 2 shows that our finetuning leads to a performance increase in real rainy scenes compared to the untuned performance in rain: for object detection (PBR: + 20.7%, GAN: + 10.9%, GAN + PBR: + 21.0%), for semantic segmentation (PBR: + 36.9%), and for depth estimation (PBR: 0.0%, GAN: + 3.8%, GAN + PBR: + 8.2%). In clear weather, our finetuned models perform on par with the untuned versions, sometimes even better. This boost in performance could be seen as the network learning to rely on more robust features, somewhat invariant to rain streaks.

The depth estimation underperformance for PBR finetuning can be explained by the training loss of Monodepth2, which is, in short, a reprojection error and does not fare well with rain streaks, as they do not reproject consistently in consecutive frames. Interestingly, this problem does not seem to affect the GAN or GAN + PBR finetuned models, possibly because the GAN is trained on a split of nuScenes, which leads to finetuning images that more closely resemble the test set. These results demonstrate the usefulness of our different rain rendering frameworks for real rain scenarios.

Table 2 Improving performance of computer vision tasks on real nuScenes  (Caesar et al. 2020) images

7.4 De-Raining Comparison

We now compare to the strategy of first de-raining images and then running untuned vision algorithms. To this end, we used the state-of-the-art de-raining method DualResNet (Liu et al. 2019), finetuned on nuScenes-clear augmented with GAN + PBR to account for the domain gap.

During the de-raining fine-tuning process, random batches of \(\{25, 50, 100, 200\}\) mm/h augmented images, paired with their non-augmented counterparts, are generated. Except for a smaller learning rate (\(10^{-5}\)), we used the DualResNet default hyper-parameters (Adam optimizer, batch and crop sizes of 40 and 64 respectively).

Using this finetuned de-raining model as a pre-processing step, we evaluate the performance of our “untuned” object detection (YOLOv2, Redmon and Farhadi 2017) and depth estimation (Monodepth2, Godard et al. 2019) models. Figure 18 shows the performance of the de-raining strategy compared to our “rain-aware” GAN + PBR finetuned models. We observe that the rain-aware models offer improved performance over de-raining for object detection, while the latter improves depth estimation. This is likely because streaks occlude the scene background, while de-raining acts as a prior inpainting step, thus easing depth estimation.

We also applied the same de-raining strategy to the real nuScenes images and report performance in the last row of Table 2. Again, for object detection on rainy images our rain-aware models perform better than de-raining. However, for depth estimation, the de-raining strategy is better for both clear and rainy images. This is consistent with the results obtained on synthetic data.

These experiments illustrate that de-raining is also a valid strategy that may even outperform “rain-aware” algorithms. However, it comes at the cost of having to perform two tasks, which may limit practical applications. In the long term, we believe rain-robust algorithms offer an exciting new research paradigm while avoiding the in-filling of occluded areas.

8 Discussion

In this paper, we presented the first intensity-controlled physical framework for augmenting existing image databases with realistic rain. This allows us to systematically study the impact of rain on existing computer vision algorithms on three important tasks: object detection, semantic segmentation, and depth estimation.

Limitations and Future Work While we demonstrated highly realistic rain rendering results, our approach still has limitations that set the stage for future work.

For our PBR approach, the approximation of the lighting conditions (Sect. 3.1.1) yielded reasonable results (Fig. 3), but it may under- or over-estimate the scene radiance when the sky is insufficiently or overly visible. This approximation is more visible when streaks are imaged against a darker sky. More robust approaches for outdoor lighting estimation could potentially be used (Hold-Geoffroy et al. 2019; Zhang et al. 2019). Second, we make an explicit distinction between fog-like rain and drops imaged on more than 1 pixel, which are individually rendered as streaks. While this distinction is widely used in the literature (Garg and Nayar 2007, 2005; de Charette et al. 2012; Li et al. 2018), it causes an inappropriately sharp transition between fog-like rain and streaks. A possible solution would be to render all drops as streaks, weighting them as a function of their imaged area; however, our experiments show this comes at a prohibitive computational cost. Finally, rain streaks are added to the image irrespective of the scene content. Here, the depth estimate could be used to mask out streaks that appear behind objects or under the ground plane. Another limitation is the computational cost of PBR. While this has no downside for benchmarking purposes, since PBR may be run off-line, the simulation time increases with the rainfall rate. With our current unoptimized implementation, simulating 1/25/50/100 mm/h rainfall rates on Cityscapes requires 0.35/5.65/16.60/20.67 s respectively for the rain physics (de Charette et al. 2012) and an additional 6.71/34.92/62.94/104.76 s for the rendering (times are per image, on a single core, averaged over 100 frames). This restricts the usage of PBR to off-line processing, though significant speed-ups could be obtained at the cost of additional optimization efforts.

The GAN employed also has limitations. First, while PBR is well suited for videos since the rain simulator is temporally consistent, this is not the case for CycleGAN, which does not guarantee temporal smoothness. Existing approaches such as Bansal et al. (2018) are alternatives, but GANs are known for physically unrealistic outcomes (Xie et al. 2018). Second, CycleGAN imposes a limit on the image resolution. Here, super-resolution networks such as SRResNet (Dong et al. 2015) or SRGAN (Ledig et al. 2017) could potentially be used, and large-scale GANs such as BigGAN (Brock et al. 2018) are also an option. More importantly, GANs tend to have difficulty generating rain on images from datasets different from the one they were trained on, since the learning process does not disentangle rain from scene appearance; this demonstrates a strong domain dependence.

Fig. 18 Performance with varying rain intensities on de-rained GAN + PBR synthetic images. The de-raining is performed with Liu et al. (2019). Performance decreases roughly linearly for both a object detection with YOLOv2 (Redmon and Farhadi 2017) and b depth estimation with Monodepth2 (Godard et al. 2019). Compared to models finetuned with GAN + PBR synthetic images (cf. Fig. 17), performance on both tasks is lower at low rain intensities (and the opposite at high rain intensities)

Finally, while our results demonstrate that fine-tuning on synthetically generated rain does improve performance on real rainy images (cf. Sect. 7), the improvements obtained are still quite modest. Further efforts are necessary to develop algorithms that are truly robust to challenging rainy conditions.