Abstract
Semantic segmentation models need a large number of images to be effectively trained, but manual annotation of such images has a high cost. Active domain adaptation addresses this problem by pretraining the model with a synthetically generated dataset and then fine-tuning it with a few selected label annotations (the “budget”) on real images to account for the domain shift. Previous works annotate a percentage of either individual pixels or whole target images. We argue that the first is infeasible in practice, and that the second spends part of the budget on classes that the pretrained model may have already learned well. We propose a method based on the annotation of regions computed by Segment Anything, a recently introduced foundation model for class-agnostic image segmentation. The key idea is to assign a ground truth label to each region of a tiny subset, namely those for which the model is most uncertain. In order to increase the number of annotated regions, we propagate the ground truth labels to the most similar regions according to a hierarchical clustering algorithm that uses the features learned by the pretrained model. Our method outperforms the state of the art on the GTA5 to Cityscapes benchmark while using fewer annotations, almost closing the gap between the synthetically pre-trained model and the one obtained with full supervision of the real images. Furthermore, we present competitive results for budgets below 1% of samples and also for a larger and more challenging target dataset, Mapillary Vistas.
1 Introduction
Semantic segmentation is an important task in computer vision with many applications, notably in the fields of medical imaging [1], robotics [2], remote sensing [3] and autonomous driving [4, 5]. A long-standing problem of the supervised training of semantic segmentation models is the availability of a sufficient number of labeled images. The reason is that labeling has a high cost in time, especially in the multiclass case. Human annotators have to manually outline the contours of potentially many objects of the classes of interest present in an image, which is a labor-intensive process. For instance, the creators of the Cityscapes and Mapillary-Vistas datasets, which are benchmarks of semantic segmentation of driving scenes, report an average annotation time of 1.5 h per image [6, 7]. The authors of the ACDC [8] dataset of the same domain but under adverse visual conditions (fog, nighttime, rain, snow) report an even higher average time of 3.3 h per image.
This well-known problem has been tackled through three main approaches. The first is unsupervised domain adaptation (UDA). The most successful variant of UDA relies on self-training [9,10,11,12,13,14]. It involves training the network using a source synthetic dataset with automatically generated images and labels, along with pseudo-labels from real, target images. Pseudo-labels are labels based on the network predictions deemed the most reliable, hopefully matching the unknown ground truth labels. However, since synthetic and real images always differ somehow, there is a domain gap that impairs the performance of the learned model on the target domain. UDA techniques aim to account for this domain shift in diverse ways, all revolving around how to best obtain numerous, accurate, and balanced pseudo-labels.
The second and third approaches are semi-supervised and active domain adaptation (SSDA, ADA). Instead of self-training solely with pseudo-labels, both rely on the possibility of obtaining a small number of ground truth labels from human annotators. In SSDA the set of all the annotated target samples is given from the start, and the aim is to make the most of them to obtain good pseudo-labels to retrain the model once [15,16,17,18]. Differently, the most distinctive trait of ADA is that the selection of the samples to be annotated is performed iteratively and actively [19,20,21,22,23]: some scheme, named the acquisition function, selects the most valuable samples to annotate, i.e. those for which the chosen semantic segmentation metric has the best chances to increase according to some heuristic. At each iteration, one runs the acquisition function, obtains a fraction of the budget annotations from the human-in-the-loop, and retrains the model. The rationale for iterating is that, as cycles proceed, the model becomes more and more adapted to the target domain and the acquisition function can select better samples towards the final goal of optimizing the metric.
1.1 Motivation
In this work, we present a method to close the domain gap by using active domain adaptation that is conceptually simple and demands a low annotation time. State-of-the-art ADA methods either annotate a budget of whole images [21, 22], or of individual pixels [19, 20, 23], in both cases selected by the acquisition function among the pool of images of the target domain. This difference is key in practical terms. Let’s take Cityscapes as an example. Its training set, from which we annotate samples, has 2975 images of 1024 \(\times\) 2048 pixels. A budget of 1% means, in the first case, to fully annotate 30 images by manually delineating the contours of regions belonging to the classes of interest, a procedure that takes 1.5 h per image on average. However, annotating 1% of individual pixels (62 million) spread all over these images would take much more time: the annotator would have to view much more than 30 images and, in each one, decide the class of a huge number of pixels here and there, most probably near the frontiers of regions of different class where the uncertainty is higher. We argue that methods that require independent pixel annotations are not feasible in practice, and are only possible if fully labeled images of the target domain are already available.
The annotation of whole images, while feasible, has the problem of wasting a fraction of the allocated budget. Most frequent classes like Road, Sky, Vegetation, and Building appear in almost all images and are sufficiently similar in both domains. Consequently, they are quickly learned by the neural network. Annotating them will consume a portion of the budget that would be better spent on other classes that are more difficult to adapt. However, this is not possible because we are constrained to a tiny subset of selected images.
1.2 Our proposal
For these reasons, we propose to annotate not images or individual pixels but a reduced number of regions, which are automatically computed (instead of hand drawn) and may come from any image of the target domain. Moreover, by annotating a region, we mean to ask the human-in-the-loop to provide one class label corresponding to the majority of the pixels in a given region. Depending on the quality of the regions, there are unavoidable label errors in some pixels. By using a good region segmentation algorithm we can minimize them while, at the same time, obtaining a very cost/time-effective annotation strategy.
In this work, we propose a method, which we name RADA (Region-based Active Domain Adaptation), towards closing the domain adaptation gap in semantic segmentation. It follows the classic scheme of active learning: it starts with a model pre-trained on synthetic data and then, for several rounds, an acquisition function selects a budget of samples to be annotated that are used to retrain the model. It has two main novelties with respect to previous works:
-
Samples are Segment Anything (SAM) masks (Fig. 3), each of which receives a single label from the annotator. To the best of our knowledge, we are the first to employ annotated regions to perform semantic segmentation domain adaptation, since past works resort to either annotating whole images or individual pixels.
-
To the regions labeled with the majority ground truth, we add “self-labeled regions”: regions that are similar enough to an annotated one that it is probably safe to propagate its label to them. In this way, we almost double the number of annotated regions at the cost of only a small number of labeling errors.
Fig. 1 Performance versus budget and annotation time of state-of-the-art ADA methods and ours on the GTA5 \(\rightarrow\) Cityscapes benchmark. The model for all methods is DeeplabV3+ except for LabOR (V2), and the source is GTA5 or GSU, which is the union of GTA5, Synscapes and Urbansyn. Zero budget is the baseline of the model, that is, trained with synthetic source only. Equivalent annotation time applies only to methods relying on the annotation of whole images (MADA, MADAv2) or regions (ours)
The results of RADA surpass the state of the art (Fig. 1, blue line) but do not yet close the gap between the model trained with only the synthetic source and the one trained with full supervision of the target domain. We have further improved the result in two ways. The first one is to simply extend the source data to include, besides GTA5, two more synthetic datasets, namely, Synscapes [24] and the recently introduced Urbansyn [25]. Moreover, replacing the DeeplabV3+ semantic segmentation architecture commonly used by all previous works with Segformer, a modern transformer-based one, produces an additional improvement.
Additionally, we present a variant that we name UDA + ADA consisting of applying the region annotation of the first method to the predictions made by a UDA method, which can label the most uncertain pixels as “unknown”, for instance, by simply thresholding the highest class probability. By building on top of a UDA method that already performs quite well, we are able to almost close the domain gap—in the case of Cityscapes and Mapillary Vistas datasets—at a very modest annotation budget (Fig. 1, green line).
1.3 Contributions
The contributions of this work to the problem of active domain adaptation for semantic segmentation are the following:
-
1.
We provide a practical method based on labeling regions, not pixels, with the annotation time in mind. For the sake of comparison with state-of-the-art, we convert the commonly used annotation budgets of 1%, 2.2%, and 5% of target images to time and, from there, to the number of SAM regions that can be annotated.
-
2.
For all the budgets, we obtain similar or better results than the state-of-the-art in the synthetic-to-real adaptation benchmark GTA5 \(\rightarrow\) Cityscapes.
-
3.
Besides this benchmark, we experiment with a combination of three synthetic datasets (GTA5, Synscapes, and Urbansyn) and a larger and more challenging target, Mapillary Vistas. Furthermore, the majority of past works used only one convolutional semantic segmentation model, DeeplabV3+. Recently introduced, transformer-based models for semantic segmentation have shown a boost in accuracy. We also experiment with one such modern model, Segformer, and show it contributes a significant improvement by itself.
-
4.
In addition to an iterative, human-in-the-loop ADA method, we propose a one-step variant to be applied to the output of a highly performing UDA method [26], and show this almost reaches the fully supervised upper bound for Cityscapes and Mapillary Vistas.
-
5.
We obtain competitive results even at a low annotation regime, specifically 0.25% and 0.5% budgets, that we explore for the first time.
2 Related work
In this section, we review the past approaches to domain adaptation. This is a broad subfield in computer vision. Hence, we focus on techniques applied specifically to semantic segmentation.
2.1 Unsupervised domain adaptation
Recent UDA methods are becoming better at narrowing the synthetic-to-real domain gap, as demonstrated in the two most common benchmarks, GTA5 \(\rightarrow\) Cityscapes and SYNTHIA \(\rightarrow\) Cityscapes. The most successful approaches [9,10,11,12,13,14] are model-centric (methods that modify and adapt existing neural network architectures) and use a self-training strategy in combination with the teacher-student paradigm. Among them, the landmark work by Hoyer et al. [11] proposed DAFormer, the first robust UDA method based on visual transformers, a teacher-student self-training approach plus several specialized training strategies. Examples are rare class sampling, which copes with class imbalance in GTA5 and SYNTHIA, and the so-called thing-class ImageNet feature distance, whereby the features of thing classes are kept close to those of the ImageNet-pretrained encoder during training. Subsequent works by the same authors have been directed towards improving DAFormer by adding further components on top of their framework. One is a multi-resolution image cropping module, named HRDA [12], that captures long-range context dependencies. Another is MIC [13], a masked image consistency module to learn spatial context relationships in the target domain. Building on MIC, Chen et al. [14] have improved it by exploring the pixel-to-pixel and patch-to-patch relations for regularizing the segmentation feature space.
All these works based on the landmark DAFormer method are piled on top of each other and tailored to the GTA5 \(\rightarrow\) Cityscapes case. They thus become increasingly complex, suffer from high training times, and rely on ad-hoc strategies to solve the issues of the synthetic data. Hence, generalizing these UDA model-centric methods to new synthetic sources and real targets requires time-consuming hyperparameter exploration, and reliability is not guaranteed.
In order to circumvent this kind of complexity in the UDA + ADA variant, we have resorted to multi-source synthetic data. Several works [26,27,28,29] demonstrate the efficacy and reliability of multi-source data for the UDA problem. In particular, Gómez et al. [26] proposed a data-driven method that treats architectures as black boxes and outputs refined pseudo-labels, making it a good candidate on which to apply our ADA method.
2.2 Active domain adaptation
Motivated by the limitations of the UDA techniques available at the time of its publication, LabOR [19] is perhaps the first attempt at performing active learning tailored to domain adaptation of semantic segmentation models. The key idea of LabOR (label only if required) is to train, along with a regular semantic segmentation model, another model with the sole purpose of selecting the pixels in the target images to be labeled by the human-in-the-loop. This second model has a common backbone and two heads that provide, for an input target image, different predictions, thanks to a loss term that maximizes the discrepancy between the two predictions. The annotator is asked to provide labels for a percentage of the pixels where the two predictions disagree. While the results were much better than those of UDA techniques at the time, and at a modest budget of 2.2% of target pixels, this method defines a total of five different losses to optimize jointly. Additionally, it involves training the two models at the same time. Our method, in contrast, sticks to the regular cross-entropy and trains a single model.
RIPU [20], which stands for region impurity and prediction uncertainty, annotates a small portion of image regions, those most uncertain and diverse. Diversity is measured as the entropy of the normalized frequencies of class predictions within a region. Uncertainty is estimated as the mean value of the entropy of the class probabilities of the predictions within a region. While the idea of combining these two measures on regions is appealing, the regions turn out to be simply 3\(\times\)3 neighborhoods, not regions related to the semantics of the image, like superpixels or SAM masks. In addition, the ground truth is required for each pixel separately, thus there is no reduction in the number of annotations from the use of regions, as in our case.
The key idea of D2ADA [23] is to select the most informative samples for the domain adaptation process by acquiring labels of samples lying in high probability density zones of the target domain yet low density zones of the source domain. This heuristic is complemented with an uncertainty estimation computed as the entropy of the predictions, like in [20]. For the sake of efficiency, D2ADA uses superpixels (as computed by the SLIC algorithm, like in Fig. 2) in order to estimate the uncertainty and the probability densities of source and target domains through a Gaussian mixture model. Once the most interesting regions have been selected, the method asks for the ground truth of each of their pixels. Again, the chance to reduce the number of annotations by assuming that all the pixels in a region share the same label is missed.
The MADAv2 method [22] is a refinement of MADA [21]. The main novelty with respect to the three previously reviewed works is that samples are not pixels anymore but images: the method is based on annotating whole images. Active learning then amounts to selecting the best available target images to be fully labeled. To this end, given a source image, they compute the mean feature vector of all the pixels belonging to a certain ground truth class and then concatenate the vectors of every class as a representation of the source image. Next, they perform clustering by k-means to obtain cluster centroids or anchors. The same is done for the images in the target dataset but using class predictions. Finally, they select the target images to annotate as those at the largest distance to the computed source and target anchors, which means the images are most complementary to the source domain and lie in lower density zones of the embedding of the target domain. MADAv2 achieves a performance similar to D2ADA (71.4% vs. 71.3% mIoU at 5% budget, respectively, see Table 2 and Fig. 1), with the advantage that annotation is carried out over a budget of whole images, not individual pixels.
The most recent work we are aware of is [30]. Similarly to [22], they compute a centroid for each class and domain as the average of the model feature vectors of the pixels belonging to or predicted as this class. The main novelty is that they iteratively align the source and target prototypes, allowing them to compute reliable pseudo-labels. In parallel, the same region impurity measure proposed in [20] is used to select the 3\(\times\)3 pixel neighborhoods to be annotated. The best result achieved is +0.8% mIoU over MADAv2 at a budget of 5% of pixels in the target dataset. But again, the annotations are pixel-wise.
2.3 Vision foundation models
In addition, we leverage the vast knowledge that vision foundation models bring [31,32,33,34]. These models, based on ViT [35] and Swin Transformer [36], are trained with millions of samples to generalize across tasks and data distributions. Several works use text-to-image foundation models such as CLIP [31] and Stable Diffusion [37] to create synthetic data from real-data semantic masks in order to improve segmentation performance. Xie et al. [38] propose an automatic diffusion-based data augmentation pipeline to expand existing instance segmentation datasets. Yang et al. [39] enhance fully supervised semantic segmentation by generating densely annotated synthetic images with generative models and propose a robust filtering criterion to suppress noisy synthetic samples at the pixel and class levels.
In this work, we leverage the Segment Anything Model (SAM) [33]. It is a recently introduced segmentation foundation model with an exceptional capacity to produce regions that lack semantic labels but have accurate boundaries, mostly coincident with the scene object/class regions, as desired and as demonstrated by Fig. 3. Trained on a large dataset with more than one billion masks in 11 million images, it also has a remarkable zero-shot segmentation performance. Several recent methods leverage the SAM foundation model, improving it and adding new capabilities. Xiong et al. [40] propose a lightweight ViT based on SAM to reduce complexity and improve real-time applicability on visual tasks. More in detail, they leverage the MAE [32] pretraining method with the SAM model to obtain high-quality pretrained ViT encoders. Similarly, Yuan et al. [41] combine the CLIP [31] encoder and the SAM decoder to propose a distillation method that upgrades SAM with the capability to provide semantic labels. Zhou et al. [42] optimize SAM to be real-time, even on edge devices, in a similar fashion to [40]. They propose a CNN-based backbone and a distillation method with a strategic selection of prompts. Li et al. [43] propose an improved SAM that responds better to different granularity levels and introduces semantic labels. In our work, we use SAM as a promptable model because it has been designed for interactive use, which suits our active learning proposal.
3 Method
In this section, we explain in detail the two methods we propose, which are based on cycles of region selection under a budget, region pseudo-labeling, and model retraining. We also present a variant consisting of a UDA step followed by just one of such cycles. But first, we address two important practical aspects: which regions do we use (i.e. how are they computed), and how do we define the budget so that it is comparable to previous works?
3.1 Annotatable regions
The kind of regions we are looking for are
-
Semantically homogeneous, that is, most if not all of the pixels in a region should belong to one same class, thus reducing the number of noisy labels when assigning a single label to all pixels within a region
-
Their contours correspond as much as possible to class frontiers
-
They do not over-segment a region of the same class, but adapt to the object’s shape to allow for an efficient use of the annotation budget.
Superpixels computed by algorithms like SLIC [44] or SEEDS [45] fail to satisfy the second and third conditions to a great extent, as shown in Fig. 2. In this work, we leverage the aforementioned Segment Anything Model (SAM) [33] and its promptability. Prompts can be a set of foreground/background points, a box, a binary mask, or free-form text. We have opted for the most generic prompt, a regular grid of 32 \(\times\) 32 foreground points for both Cityscapes and Mapillary datasets. Furthermore, we have not tuned the rest of the parameters of SAM (IoU threshold, stability score) but maintained the default values, except for the minimum area of a region, which we have set to 1000 pixels to avoid an excessive number of regions that would only increase the subsequent processing time. At an image size of 1024 \(\times\) 2048 in Cityscapes and 1216 \(\times\) 1632 pixels in the resized frames of Mapillary Vistas, we obtain, on average, 148 and 106 regions per image, respectively.
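For reference, the following sketch shows how such regions can be produced with the publicly released segment_anything package. The checkpoint file name and the example image path are hypothetical; the 32 \(\times\) 32 point grid and the 1000-pixel area threshold follow the values stated above, with the explicit per-mask area filter being one possible interpretation of the minimum-area setting.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint and image file names are placeholders (assumptions).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to("cuda")
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,         # regular grid of 32 x 32 foreground point prompts
    min_mask_region_area=1000,  # removes tiny disconnected parts and holes in masks
)

image = cv2.cvtColor(cv2.imread("some_cityscapes_frame.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, one per region
# Keep only regions of at least 1000 pixels (one reading of the minimum-area setting).
regions = [m["segmentation"] for m in masks if m["area"] >= 1000]  # boolean HxW masks
```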
We intend to annotate a region with the majority ground truth label of its pixels, as judged by a human annotator. In doing so, even if the label is right, we may be introducing errors since not all the pixels may belong to the same class. However, we have found this error is quite small: the mean per class accuracy of this kind of labeling is 98% both in Cityscapes and Mapillary Vistas, with a mean (again per class) precision of 96 ± 3%, 93 ± 7%, and recall 96 ± 3%, 96 ± 4% in these two datasets.
3.2 Annotation budget
In the context of ADA, the budget is the total number of samples that an oracle will annotate, expressed as a percentage of the size of the training split in the target domain. As discussed before, in our opinion, the only practical ADA methods for domain adaptation in semantic segmentation are, until now, those that take a whole image as a sample. Common budgets for GTA5 to Cityscapes adaptation are 1, 2.2, and 5%, thus 30, 66, and 150 fully annotated images. Since our samples are regions instead, we have to convert these budgets into an equivalent number of regions. But equivalent how? Since what matters is the cost of annotations, the answer is in terms of annotation time. Given that annotating a whole image acquired in normal lighting conditions takes on average 1.5 h [6, 7], we only need to estimate how much time it takes, on average, to annotate a single region.
To do so, we have built the simple interface of Fig. 4 to annotate randomly selected regions and record the individual annotation times. Three participants were asked to provide the majority class for the pixels within a region. Horizontal and vertical yellow lines help to quickly locate the mask in the image when it is small. The annotator has only to click on the palette of classes to indicate the chosen label for the outlined region. He/she learns the meaning of the 19 classes during a “warm-up” phase, in which we show the right class at the top left corner of the image as feedback. After that, we record the mean annotation time for an equal fixed number of regions per class (20), to eliminate the influence of class imbalance and also to account for the varying degree of difficulty depending on the class (users tend to confuse certain classes more, such as pole and traffic light, or fence and wall). Overall, we have obtained a mean and standard deviation of 3.4 ± 1.2 and 3.3 ± 1.0 s per region in Cityscapes and Mapillary Vistas, respectively, with an accuracy of 98%. Given this result, we have conservatively rounded up the time per mask to 5 s, which yields the number of regions and equivalent annotation times in Table 1.
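As a worked example of this conversion (a minimal sketch; the exact figures in Table 1 may differ due to rounding), the following snippet turns an image-level budget into an equivalent number of regions using the 1.5 h per image and 5 s per region estimates above.

```python
# Budget conversion: fraction of fully annotated images -> number of SAM regions
# annotatable in the same total time (values taken from the text above).
TRAIN_IMAGES = 2975          # Cityscapes training split
HOURS_PER_IMAGE = 1.5        # average full-image annotation time [6, 7]
SECONDS_PER_REGION = 5       # rounded-up per-region annotation time

def regions_for_budget(budget_fraction: float) -> int:
    """Number of regions annotatable in the time needed to fully label
    `budget_fraction` of the training images."""
    images = budget_fraction * TRAIN_IMAGES
    total_seconds = images * HOURS_PER_IMAGE * 3600
    return int(total_seconds / SECONDS_PER_REGION)

for b in (0.0025, 0.005, 0.01, 0.022, 0.05):
    print(f"{b:.2%} budget -> {regions_for_budget(b):,} regions")
# e.g. 1% of Cityscapes (~30 images, ~45 h) corresponds to roughly 32,000 regions.
```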
Note that in addition to the budgets of 1, 2.2, and 5% of samples used by previous works, we consider budgets less than 1% for the first time. The 2.2% budget was introduced in [19], one of the first works in ADA for semantic segmentation, and has been kept in subsequent works for the sake of comparison. 1.7% and 1.56% are special budgets related to UDA + ADA in which all the regions in the target training split with at least 50% pixels predicted as “unknown” are selected to be annotated.
3.3 RADA: region-based active domain adaptation
The first method we propose for ADA in semantic segmentation follows the common scheme of several iterations of sample selection, annotation, and model retraining with all the target samples annotated in this and previous iterations. The main differences are (1) that the samples are SAM regions, as discussed above, (2) the way we select them, and (3) the addition of self-labeled regions. In this section, we delve into the second and third points and justify the main design decisions we have made. In the following, we employ the notation of Algorithms 1 and 2.
The starting point is a model \(\mathcal {M}(W_0)\) trained with synthetic source images and automatically generated labels. There is also a split of images I in the target domain, available for annotation. We have previously computed S, the SAM regions of each image in I. Now we have to select a certain number of regions \(b_r\) to annotate, and like previous works, we choose those for which the model prediction is most uncertain. We measure the uncertainty of a region as one minus the maximum of its mean class probabilities, where the mean class probabilities are obtained by averaging the predicted class probabilities over its pixels. To do so, we must perform inference with the model \(\mathcal {M}(W_0)\) if we are in the first round or with the model learned in the previous round \(\mathcal {M}(W_{r-1})\) (block 2 in Algorithm 1).
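The following minimal sketch illustrates this uncertainty measure, assuming the per-pixel softmax output of the current model is available as a NumPy array.

```python
import numpy as np

def region_uncertainty(probs: np.ndarray, mask: np.ndarray) -> float:
    """Uncertainty of one region: 1 - max of its mean class probabilities.

    probs: (C, H, W) softmax output of the current model for the image.
    mask:  (H, W) boolean SAM mask of the region.
    """
    mean_class_probs = probs[:, mask].mean(axis=1)  # average over the region's pixels
    return 1.0 - float(mean_class_probs.max())
```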
Sorting the regions by decreasing uncertainty and annotating the first \(b_r\) ones would already make an ADA method. However, we leverage the fact that regions of the same class in the same or different images may have similar representations according to the learned model, even though the cross-entropy loss does not explicitly enforce it. We have seen experimentally that the most similar regions to a given one, in terms of its mean feature vector, have a high probability of belonging to the same class. Accordingly, one could compute the k-nearest neighbor regions to each annotated one and propagate its label to them. This naïve approach suffers from a “chain effect”: soon, after a few “hops”, the propagated label becomes a false positive. Instead, we have found that propagating ground truth labels after hierarchically clustering the regions produces fewer false positives, even compared to 1-nearest neighbor.
Specifically, we perform a type of hierarchical clustering known as agglomerative clustering [46, 47]. It produces a binary tree whose leaves are individual samples (i.e. regions) and whose parent nodes represent clusters of samples. It starts with every sample being a different cluster and then, in a bottom-up direction, keeps finding the pair of closest clusters to be joined into a new parent cluster until all the samples become descendants of this new parent, that is, the root of the tree. There are different criteria to measure the similarity of pairs of clusters, complete linkage being the one that produces the most balanced tree. Complete linkage simply means that the distance between two clusters is the maximum of the distances between a sample of the first cluster and one in the second.
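Formally, complete linkage defines the distance between two clusters \(A\) and \(B\) as \(d(A,B) = \max_{a \in A,\, b \in B} d(a,b)\), where \(d(a,b)\) is the distance between the mean feature vectors of regions \(a\) and \(b\).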
Once we have the tree of clusters, we assign an uncertainty score to each one by averaging, from bottom to top, the uncertainty of the two children of each parent (block 4 of Algorithm 1). Now we are ready to annotate regions. We start with parents of exactly two leaves (line 1 of Algorithm 2), sort them by descending uncertainty, and for each one, we annotate one of the descendant leaves with the most frequent ground truth label of its region. We propagate this label to the rest of the descendant leaves of this parent (one in this case), thus adding a self-labeled region. We proceed in the same way with parents with three, four, etc., descendant leaves until a certain maximum number of descendant leaves is reached. Figure 5 illustrates this process.
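The sketch below captures the essence of this annotation-and-propagation step over the cluster tree, using the merge structure returned by scikit-learn (see Sect. 3.4). It is a simplification: the class-balanced budget sharing described next and the final random sampling are omitted, and annotating the most uncertain leaf of each cluster is an assumption, since Algorithm 2 only states that one descendant leaf is annotated.

```python
import numpy as np

def propagate_labels(children, uncertainties, budget, max_leaves, oracle_label):
    """Sketch of the ground-truth annotation plus label-propagation step.

    children:      (n-1, 2) merge array from scikit-learn's AgglomerativeClustering
                   (nodes 0..n-1 are leaves/regions, node n+j merges children[j]).
    uncertainties: (n,) per-region uncertainty scores (see previous sketch).
    budget:        number of oracle annotations allowed in this round.
    max_leaves:    largest cluster size to which one label is propagated.
    oracle_label:  callable standing in for the human annotator; returns the
                   majority ground-truth label of a region index.
    """
    n = len(uncertainties)
    leaves = {i: [i] for i in range(n)}                 # descendant leaves per node
    score = {i: float(uncertainties[i]) for i in range(n)}
    for j, (a, b) in enumerate(children):               # bottom-up pass over the tree
        node = n + j
        leaves[node] = leaves[a] + leaves[b]
        score[node] = 0.5 * (score[a] + score[b])       # parent = mean of its children

    labels = {}                                         # region index -> assigned label
    for size in range(2, max_leaves + 1):               # parents with 2, 3, ... leaves
        nodes = [v for v in leaves if v >= n and len(leaves[v]) == size]
        for node in sorted(nodes, key=lambda v: -score[v]):   # most uncertain first
            if budget == 0:
                return labels
            members = leaves[node]
            if any(m in labels for m in members):       # cluster already (partly) covered
                continue
            anchor = max(members, key=lambda m: score[m])  # leaf to annotate (assumed choice)
            gt = oracle_label(anchor)                   # one oracle query
            budget -= 1
            for m in members:                           # propagate to the sibling leaves:
                labels[m] = gt                          # these become self-labeled regions
    return labels
```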
Because of the problem of class imbalance, simply sorting the parents by uncertainty does not ensure that all the classes are represented in the annotated regions, either with ground truth or pseudo-labels, and this will most probably cause a problem for the domain adaptation process. Some action is needed in this regard to favor the annotation of rare classes. To this end, we rely on the majority voting of the predicted classes for the pixels in a region, because using the pixels' ground truth is not an option given the budget limit. When labeling a descendant leaf of a parent, we assign an equal share of the budget to each predicted class (block 2, Algorithm 2). Once this share is used for all classes, in the event of not yet having exhausted the budget of one round, the remainder is spent annotating with ground truth randomly sampled regions among those not yet annotated (block 3, Algorithm 2).
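For completeness, the predicted class that determines which budget share a region consumes can be obtained by a simple majority vote over the model's pixel predictions, as in this short sketch.

```python
import numpy as np

def predicted_region_class(pred: np.ndarray, mask: np.ndarray) -> int:
    """Majority vote of the *predicted* pixel classes inside a SAM mask,
    used to decide which per-class budget share a region counts against
    (the pixels' ground truth cannot be used here).

    pred: (H, W) argmax prediction of the current model.
    mask: (H, W) boolean SAM mask.
    """
    values, counts = np.unique(pred[mask], return_counts=True)
    return int(values[np.argmax(counts)])
```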
Figure 6 shows the annotations on one Cityscapes frame for three increasing budgets. Each image includes regions annotated with the majority ground truth label and also self-labeled regions, i.e. where the ground truth label of another region has been propagated.
3.4 Implementation details
The algorithm requires that, at each round r, one runs inference with the model \(\mathcal {M}(W_{r-1})\) learned in the previous round for each image in the training split of the target domain, denoted by I. This is a bottleneck, since inference takes around 2 s per image and I is potentially large, in our case 2975 and 14,600 frames for Cityscapes and Mapillary Vistas, respectively. A second practical problem is that at small budgets the number of annotated regions per image can be very low. For instance, at 1% only 11 regions on average get a ground truth annotation, plus a similar number of self-labeled ones. At 0.25% budget, an image may not have any region annotated at all. We have observed that this causes a problem for training the model because, with a minibatch of two images as in our setting, the gradient becomes too small. We avoid these two undesirable effects by randomly sampling a subset of images in I to be annotated, 1000 and 7500 images in these two datasets, respectively, when the budget is less than 5%.
The agglomerative clustering algorithm has complexity \(O(n^3)\) [47], which makes it infeasible in practice if the number of regions to cluster surpasses 100K. A simple solution to this problem has been to shuffle the set of candidate regions to annotate and split them into batches of at most 90K regions. Then, the time to cluster a batch amounts to just a few minutes. In total, clustering the approximately 147K regions in 1000 images of Cityscapes takes around 20 min, and the 440K regions in the 2975 images of the whole Cityscapes train split takes 1 h 10 min. Also, to avoid having to compute the distance between every pair of regions, we compute the graph of the 10 nearest neighbors to each region and pass it to the clustering algorithm. The implementation used is that provided in Scikit-learn [48].
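A compact sketch of this batched clustering step with Scikit-learn is given below; the batch size of 90K and the 10-nearest-neighbor connectivity follow the description above, while the feature matrix is assumed to hold one mean feature vector per candidate region and the shuffling seed is arbitrary.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

def cluster_regions(features: np.ndarray, batch_size: int = 90_000):
    """Complete-linkage agglomerative clustering of region feature vectors,
    batched to keep the O(n^3) algorithm tractable.

    features: (n_regions, d) mean feature vectors of the candidate regions.
    Returns one fitted clustering (with its `children_` merge tree) and the
    corresponding region indices per batch.
    """
    rng = np.random.default_rng(0)                  # shuffle before batching (seed assumed)
    order = rng.permutation(len(features))
    trees = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = features[idx]
        # sparse connectivity over the 10 nearest neighbors of each region
        connectivity = kneighbors_graph(batch, n_neighbors=10, include_self=False)
        clustering = AgglomerativeClustering(
            n_clusters=None, distance_threshold=0.0,  # keep the full merge tree
            linkage="complete", connectivity=connectivity,
        )
        trees.append((clustering.fit(batch), idx))
    return trees
```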
3.5 Combining UDA with ADA
The motivation for this variant lies in the fact that the best unsupervised domain adaptation methods are near the fully supervised upper bound, at least on the heavily studied benchmark of GTA5 to Cityscapes. For instance, the transformer-based architectures of DAFormer [11] and HRDA [12] attain 68.7% and 73.8% mIoU, respectively, versus an upper bound of 81.5%. The combination of the co-training based UDA method [26] and the three synthetic datasets GTA5, Synscapes, and Urbansyn raises the mIoU to 76.7% versus an upper bound of 79.8% with the convolutional network DeeplabV3+, thus reducing the gap to a mere 3.1%. Yet, they do not close it. To try to cover the “last mile”, we take as a starting point for active domain adaptation the model resulting from a UDA technique and see if this can close the gap.
Specifically, we selected the UDA method proposed in [26]. While many UDA methods provide the model weights as the only final output and use pseudo-labels internally, with this method we can also obtain reliable pseudo-labels as output, along with an uncertainty score. It is a purely data-driven co-training procedure using a DeeplabV3+ architecture, where two networks improve iteratively by sharing the most relevant pseudo-labels between them. After several iterations, the method combines the knowledge of these two networks to generate, as final output, pseudo-labels for the target domain where pixels with lower confidence are labeled as the “Unknown” class. The baselines in Table 5 come from training end-to-end the DeeplabV3+ and Segformer models with synthetic data combined with the pseudo-labels generated by [26].
We selected as candidate regions to be annotated those SAM masks with at least half of their pixels predicted as “Unknown”, because we consider them the most uncertain. This set of regions then takes the role of \(S_r\) in Algorithm 1. Finally, we train our UDA + ADA model with the renewed pseudo-labels in the same fashion as the final step in [26]. Training is done from scratch: we discard the baseline learned previously during the UDA phase with the synthetic datasets, which is used just to compute the predictions that include pixels labeled as “unknown”. We use ImageNet weights on DeeplabV3+ and Segformer B5 and combine, at batch level, synthetic and real images with the corresponding generated and annotated ground truth. Figure 7 shows the kind of annotations we obtain from the predictions generated by the UDA model.
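The selection of these candidate regions reduces to a simple ratio test per SAM mask, sketched below for one image (the numeric identifier of the “Unknown” class is left as a parameter, since it depends on the label mapping used).

```python
import numpy as np

def unknown_regions(pred: np.ndarray, masks: list, unknown_id: int,
                    min_ratio: float = 0.5) -> list:
    """Indices of SAM masks whose pixels are mostly pseudo-labeled 'Unknown'
    by the UDA method [26]; these form the set S_r of Algorithm 1.

    pred:  (H, W) pseudo-label map produced by the UDA method.
    masks: list of (H, W) boolean SAM masks for the same image.
    """
    selected = []
    for i, m in enumerate(masks):
        if (pred[m] == unknown_id).mean() >= min_ratio:
            selected.append(i)
    return selected
```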
4 Experiments
4.1 Datasets
Our experiments rely on two well-known synthetic datasets used for UDA semantic segmentation as source data, namely, GTA5 [49] and Synscapes [24]. GTA5 is composed of 24,904 images with a resolution of 1914 \(\times\) 1052 pixels directly obtained from the render engine of the video game GTA5. Synscapes is composed of 25,000 images with a resolution of 1440 \(\times\) 720 pixels of urban scenes, obtained by using a physics-based rendering pipeline. In addition, we combine them with UrbanSyn [25], which is composed of 7539 images with a resolution of 2048 \(\times\) 1024 pixels obtained from semi-procedurally generated synthetic urban driving scenarios.
As real-world datasets we have chosen Cityscapes [6] and Mapillary Vistas [7]. Cityscapes is a popular dataset composed of on-board images of 2048 \(\times\) 1024 pixels acquired in different cities in Germany, comprising 2975 images for training and 500 images for validation, which we use for testing as is common practice in UDA works. Mapillary Vistas is composed of high-resolution images of street views around the world. These images exhibit many different dimensions and aspect ratios because they have been acquired by diverse camera devices such as smartphones, tablets, professional cameras, etc. We only consider those images with an aspect ratio of 4:3. This is a constraint imposed by the DeeplabV3+ architecture, which needs the input images to have the same aspect ratio. In consequence, we have chosen the 4:3 ratio, which makes up more than 75% of the dataset. Accordingly, we used a total of 14,716 images for training and 1617 for testing. Cityscapes, like the synthetic datasets, considers 19 classes (see, for example, Table 3). Mapillary Vistas was originally labeled with more than 60 classes. However, we restrict ourselves to the subset of 19 Cityscapes classes and map the labels of the other classes to the ignore label.
4.2 Experimental settings
All the experiments of RADA were implemented in the semantic segmentation framework MMsegmentation of Openmmlab [50]. The learning rate, scheduler, and minibatch size were kept fixed for the two target datasets but were different depending on the network. For DeeplabV3+, the backbone is a Resnet101, the learning rate is 2e−03, and we used a batch size of 8 images. The optimizer is a simple stochastic gradient descent with momentum 0.9 and the learning rate scheduler is polynomial with \(\gamma =0.9\). For Segformer we have chosen the B5 variant, the batch size is 2, and the learning rate is 2e−05. The optimizer is ADAMW and the learning rate scheduler is polynomial with \(\gamma =1.0\). All these parameters were the same or just slightly different from those used in the MMsegmentation framework for the two target datasets and networks. All images of Mapillary Vistas have the same horizontal-to-vertical ratio of 1.33 but different dimensions, so we resized them to 1216 \(\times\) 1632.
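For illustration, a hypothetical excerpt of an MMsegmentation (0.x-style) configuration matching these schedules could look as follows; we interpret \(\gamma\) as the power of the polynomial decay, use samples_per_gpu as a stand-in for the total batch size, and omit fields not stated in the text (weight decay, minimum learning rate, number of iterations per round).

```python
# DeeplabV3+ (ResNet-101 backbone) schedule, as described above.
optimizer = dict(type='SGD', lr=2e-3, momentum=0.9)
lr_config = dict(policy='poly', power=0.9, by_epoch=False)  # polynomial decay
data = dict(samples_per_gpu=8)                              # batch of 8 images

# Segformer B5 schedule (uncomment to use instead).
# optimizer = dict(type='AdamW', lr=2e-5)
# lr_config = dict(policy='poly', power=1.0, by_epoch=False)
# data = dict(samples_per_gpu=2)                            # batch of 2 images
```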
Regarding UDA + ADA, in the experiments we perform synth-to-real LAB space alignment [26, 29] to reduce visual domain gaps and replicate the mini-batch training strategy to ensure the same probability of selection per image, regardless of the image source. DeeplabV3+ models are trained for 90K iterations with the SGD optimizer, with a starting learning rate of 0.002 and momentum of 0.9. For Cityscapes and Mapillary Vistas, we crop the training images to 1024 \(\times\) 512 and 1280 \(\times\) 720 pixels, and batch sizes are set to 8 and 16 images, respectively. SegFormer models are trained for 180K iterations with a batch size of 2 in all configurations, using the same hyper-parameters, optimizer, and scheduler reported in [51], with the same cropping adjustments as for the DeeplabV3+ baselines.
4.3 Results
4.3.1 Comparison with state of the art
All previous works on ADA for semantic segmentation present results exclusively for the benchmark GTA5 to Cityscapes using the Deeplab model. For a fair comparison, we have computed the results of RADA in this same scenario, which are summarized in Table 2 and Fig. 1 (blue line). We can observe that ours is better than the best competitor by 0.5% at an equivalent budget of 5%. Moreover, with a budget of just 3% we almost reach the best result of previous works at 5% budget, falling short by just 0.3% mIoU. We defer the discussion of the results with UDA + ADA, which are better by a large margin, to a paragraph below.
Table 3 details the per class IoU of RADA and UDA + ADA versus state-of-the-art. For most classes, the performance of RADA is similar to the closest competitor Prototype-guided [30], within a margin of ± 3%. Exceptions are traffic sign (− 6%) and several classes for which the difference is much higher: terrain + 11%, bus + 6%, truck + 4%, which are the reason for the higher mean IoU. To a lesser extent, this also happens at a budget of only 3%.
4.3.2 Further improvements
In order to further advance the state of the art in domain adaptation for semantic segmentation, we have followed two directions. The first is to add to the source GTA5 two more synthetic datasets, namely, Synscapes [24] and Urbansyn [25], to form a new dataset that we name GSU. As a consequence, not only do we have more data to train with, but also its diversity and degree of realism are higher than with GTA5 alone. As we can appreciate in Table 4, this change boosts the mean IoU of the baseline model from 34 to 63.1%, thus 29 points, which is a large increment. The second is to replace the convolutional model DeeplabV3+ with SegformerB5, a higher capacity model based on visual image transformers. It adds +13% to the DeeplabV3+/GTA5 baseline, and +8.8% in the case of the extended dataset GSU. The total contribution of the two improvements is +38%, more than doubling the GTA5/DeeplabV3+ baseline.
We still have to add to the mIoU of the old (34%) and new baselines (63.1, 47.2, 71.9%) the contribution of RADA for the different budgets. Again, we refer the reader to Table 4, where two main results arise. The first is that for a budget of 5%, the gap to the fully supervised upper bound narrows down to just 4.6% for DeeplabV3+ and 3.0% mIoU for Segformer. The second is that we close a large part of the gap between baseline and upper bounds: 69–85% of the gap gets covered depending on the data source and model employed. The results for all budgets are shown graphically in Fig. 1 for DeeplabV3+ and Fig. 8 for Segformer.
4.3.3 UDA + ADA
This variant is not directly comparable to the state-of-the-art because its UDA part takes GSU as source. However, this is the only difference. We can appreciate in Table 2 and Fig. 1 (green line) that it obtains a much better result than the best state-of-the-art method at 5%, and this with a tiny budget of 0.25%. We hypothesize that this is due to the sophisticated UDA step, which, through several self-training cycles, provides us with a very good baseline, with more than 76% mIoU in GSU to Cityscapes. Moreover, we obtain a reliable selection of the most uncertain pixels from which we decide the corresponding regions to annotate. UDA + ADA is the best for the target Cityscapes and almost closes the gap in the case of DeeplabV3+, being at a distance of only 0.8 points from the upper bound (Table 5). It surpasses Prototype-guided [30] by +6.7% mean IoU with just one-third of its budget (Table 2). Finally, Table 3 contains the IoU for each individual class, where we can appreciate that UDA + ADA achieves the highest IoU in every class but Road.
Fig. 8 Performance of RADA versus the state-of-the-art UDA method HRDA and the upper bound. The source is GTA5 only in “Ours” and GTA5 plus Synscapes and Urbansyn in “Ours GSU”. In all cases the target is Cityscapes and the model is SegformerB5. As in Fig. 1, zero budget is the baseline of the model
4.3.4 A new target
Fig. 9 Performance of RADA and UDA + ADA on Mapillary Vistas. As in Fig. 1, the zero-budget baseline is, for RADA, the model trained with source only and, for UDA + ADA, the performance of the UDA model
Having assessed the two methods on Cityscapes, we wanted to check if they can also attain good results on another, more difficult target. After all, Cityscapes has only about 3000 frames for training, and testing is based only on the 500 frames of its validation split. Moreover, these images are mostly from urban zones in three German cities, which surely limits the variability of content and appearance. Among the several public datasets of driving sequences, we have chosen Mapillary Vistas for its larger size and, more importantly, its diversity, because it contains images acquired worldwide with a variety of cameras, making it a challenging dataset when adapting from a synthetic source.
The results for the two models are shown in Fig. 9 and numerically in Tables 4 and 5. RADA with Segformer at 5% budget closes most of the 13-point gap between the baseline and the fully supervised upper bound, reaching 76.4% mIoU, thus only 3% below it (Table 4). Furthermore, with less than half this budget, at 2.2%, the mIoU is practically the same, 76%. In contrast, the gap of 8.7% with DeeplabV3+ is almost three times bigger. We believe that the large amount of training data in this dataset exploits the higher capacity of Segformer compared to that of DeeplabV3+.
4.3.5 Which is better, RADA or UDA + ADA ?
There is no categorical answer. The results obtained show that it depends on the model employed and the characteristics of the dataset. For DeeplabV3+ it is clear that UDA + ADA is better than RADA in the two target datasets. However, with Segformer, RADA obtains similar results to UDA + ADA on Cityscapes and better ones on Mapillary. We hypothesize this is due to the availability of more training images, i.e. the larger size of Mapillary, together with budgets higher than 1 or 2.2%.
One puzzling observation is that in UDA + ADA, Segformer underperforms DeeplabV3+ for both datasets, when the former should normally be better. As this happens in all the UDA + ADA experiments, we believe that the pseudo-labels from the UDA method and the training step (with default hyper-parameters) are not the most suitable for the Segformer network. In fact, UDA methods from the literature do not use Segformer as is but always customize the attention head, like DAFormer [11] does.
4.3.6 SAM regions versus SLIC superpixels
In Sect. 3, we have discussed the merits of SAM regions with respect to SLIC superpixels as regions to annotate. But how much better are they quantitatively? Table 6 compares the mean and per-class IoU of RADA employing SAM regions versus SLIC superpixels. For all classes, SAM regions outperform SLIC superpixels. Classes like Road, Sidewalk, Wall, Vegetation, and Terrain, whose shape is rather simple, have single-digit IoU differences. But classes corresponding to thin, elongated shapes like Traffic sign and light, Person, and Rider, or with complex shapes like Bicycle and Motorcycle, have a two-digit difference. In the selected setting, the difference in mean IoU is quite large, more than 16%.
4.3.7 Qualitative results
Figure 10 shows qualitative results for RADA and UDA + ADA on DeepLabV3+ and Segformer, with a budget of 1%, on Cityscapes and Mapillary Vistas. Overall, UDA + ADA provides a better adaptation to object shapes at farther distances than RADA because the UDA pseudo-labels are a stronger starting point. This difference is observable in the error maps in the sixth row of Fig. 10, where green denotes UDA + ADA being correct and RADA being wrong, and red the opposite. We observe green between boundaries of different objects at far distances, meaning that UDA + ADA provides a better shape segmentation. In fact, the SAM region selection step cannot focus on far-distance regions, and with only 1% of the budget, we do not compensate for the initial knowledge brought by the UDA pseudo-labels. Moreover, UDA + ADA responds better in larger classes such as Sidewalk and Truck for a similar reason, which can be appreciated in the error maps of the first and third images in Fig. 10. Baselines trained on synthetic data without any UDA applied have problems in these classes, and only 1% of the budget is insufficient.
Fig. 10 Qualitative results on two frames of Cityscapes (columns 1 and 2) and Mapillary Vistas (columns 3, 4) at a budget of 1%. White pixels in ground truth mean “ignore” or “void” class. The top and bottom rows of the Mapillary images have been cropped for the sake of better visualization. The sixth row provides a direct image difference where green means UDA + ADA is correct and RADA is wrong, red means the opposite, and yellow where both methods are wrong
4.4 Limitations and errors
Several sources of errors have an influence on our method. Firstly, SAM regions are largely but not completely semantically homogeneous, meaning that not all pixels of a region always belong to a unique class. Secondly, the pseudo-labels at the region level produced by label propagation through hierarchical clustering also contain a fraction of errors (see appendix D). Lastly, we assume that the human oracle produces perfect annotations when labeling a region with the majority class of the pixels he/she sees.
A practical limitation of our implementation is the cost of computing the SAM regions when datasets are large. Indeed, calculating the regions takes some tens of seconds with the largest (ViT-H) SAM model. However, since the original publication [33], faster variants have been proposed [40, 43] that can speed up this step.
5 Conclusions
In this work, we present two variants of a method for active domain adaptation aimed at semantic segmentation. They have in common the novelty of annotating regions segmented with SAM, a foundation model, with one label per region, instead of pixels or entire images as in previous works. This allows us to increase the number of annotated regions by propagating these labels to similar regions, as determined by a clustering algorithm, thus producing pseudo-labeled regions with high accuracy. Thanks to this we have set a new state of the art on the GTA5 to Cityscapes benchmark. Also, we show that the addition of more synthetic data and the use of a higher capacity model can boost the result. For the target Cityscapes, with an equivalent budget of 5%, we reduce the gap with the model trained with full supervision to a mere 0.8% mIoU. We believe this work may facilitate the development of stronger methods of active domain adaptation for semantic segmentation, for example by taking better UDA methods as a starting point. Our experiments with the union of several synthetic datasets as source and with more challenging target datasets like Mapillary Vistas may also contribute to the advancement of the field.
Data availability
All the employed synthetic and real datasets are public. Cityscapes is freely available to academic and non-academic entities for non-commercial purposes at https://www.cityscapes-dataset.com/. Mapillary Vistas and Urbansyn are publicly available under license Creative Commons Attribution NonCommercial Share Alike (CC BY-NC-SA) at https://www.mapillary.com/dataset/vistas and https://www.urbansyn.org. GTA5 is publicly available at https://download.visinf.tu-darmstadt.de/data/from_games/. The Synscapes dataset is provided free of charge to academic and non-academic entities for research purposes at https://synscapes.on.liu.se/.
References
Qureshi, I., Yan, J., Abbas, Q., Shaheed, K., Riaz, A.B., Wahid, A., Khan, M.W.J., Szczuko, P.: Medical image segmentation using deep semantic-based methods: a review of techniques, applications and emerging trends. Inf. Fus. 90, 316–352 (2023)
Hurtado, J.V., Valada, A.: Semantic scene segmentation for robotics. In: Iosifidis, A., Tefas, A. (eds.) Deep Learning for Robot Perception and Cognition, pp. 279–311. Academic Press, London, UK (2022)
Kemker, R., Salvaggio, C., Kanan, C.: Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS J. Photogramm. Remote Sens. 145, 60–77 (2018). https://doi.org/10.1016/B978-0-32-385787-1.00017-8
Muhammad, K., Hussain, T., Ullah, H., Ser, J.D., Rezaei, M., Kumar, N., Hijji, M., Bellavista, P., Albuquerque, V.H.C.: Vision-based semantic segmentation in scene understanding for autonomous driving: recent achievements, challenges, and outlooks. IEEE Trans. Intell. Transp. Syst. 23(12), 22694–22715 (2022). https://doi.org/10.1109/TITS.2022.3207665
Janai, J., Güney, F., Behl, A., Geiger, A.: Computer vision for autonomous vehicles: problems, datasets and state of the art. Found. Trends Comput. Graph. Vis. 12(1–3), 1–308 (2020). https://doi.org/10.1561/0600000079
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016). https://doi.org/10.1109/CVPR.2016.350
Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4990–4999 (2017). https://doi.org/10.1109/ICCV.2017.534
Sakaridis, C., Dai, D., Van Gool, L.: ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10765–10775 (2021). https://doi.org/10.1109/ICCV48922.2021.01059
Tranheden, W., Olsson, V., Pinto, J., Svensson, L.: DACS: Domain adaptation via cross-domain mixed sampling. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021). https://doi.org/10.1109/WACV48630.2021.00142
Gao, L., Zhang, J., Zhang, L., Tao, D.: DSP: Dual soft-paste for unsupervised domain adaptive semantic segmentation. In: ACM Multimedia (2021). https://doi.org/10.1145/3474085.3475186
Hoyer, L., Dai, D., Van Gool, L.: DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9914–9925 (2022). https://doi.org/10.1109/CVPR52688.2022.00969
Hoyer, L., Dai, D., Van Gool, L.: HRDA: Context-aware high-resolution domain-adaptive semantic segmentation. In: European Conference on Computer Vision (2022). https://doi.org/10.1007/978-3-031-20056-4_22
Hoyer, L., Dai, D., Wang, H., Van Gool, L.: MIC: Masked image consistency for context-enhanced domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023). https://doi.org/10.1109/CVPR52729.2023.01128
Chen, M., Zheng, Z., Yang, Y., Chua, T.-S.: PiPa: Pixel-and patch-wise self-supervised learning for domain adaptative semantic segmentation. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 1905–1914 (2023). https://doi.org/10.1145/3581783.3611708
Wang, Z., Wei, Y., Feris, R.S., Xiong, J., Hwu, W.-M.W., Huang, T.S., Shi, H.: Alleviating semantic-level shift: a semi-supervised domain adaptation method for semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 4043–4047 (2020). https://doi.org/10.1109/CVPRW50498.2020.00476
Alonso, I., Sabater, A., Ferstl, D., Montesano, L., Murillo, A.C.: Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8199–8208 (2021). https://doi.org/10.1109/ICCV48922.2021.00811
Chen, S., Jia, X., He, J., Shi, Y., Liu, J.: Semi-supervised domain adaptation based on dual-level domain mixing for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11013–11022 (2021). https://doi.org/10.1109/CVPR46437.2021.01087
Kechaou, M., Alaya, M.Z., Hérault, R., Gasso, G.: Adversarial semi-supervised domain adaptation for semantic segmentation: a new role for labeled target samples (2023). arXiv:2312.07370
Shin, I., Kim, D.-J., Cho, J.W., Woo, S., Park, K., Kweon, I.S.: Labor: Labeling only if required for domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8568–8578 (2021). https://doi.org/10.1109/ICCV48922.2021.00847
Xie, B., Yuan, L., Li, S., Liu, C., Cheng, X.: Towards fewer annotations: Active learning via region impurity and prediction uncertainty for domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer vision and Pattern Recognition, pp. 8058–8068 (2022). https://doi.org/10.1109/CVPR52688.2022.00790
Ning, M., Lu, D., Wei, D., Bian, C., Yuan, C., Yu, S., Ma, K., Zheng, Y.: Multi-anchor active domain adaptation for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9112–9122 (2021). https://doi.org/10.1109/ICCV48922.2021.00898
Ning, M., Lu, D., Xie, Y., Chen, D., Wei, D., Zheng, Y., Tian, Y., Yan, S., Yuan, L.: MADAv2: advanced multi-anchor based active domain adaptation segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 45(11), 13553–13566 (2023). https://doi.org/10.1109/TPAMI.2023.3293893
Wu, T.-H., Liou, Y.-S., Yuan, S.-J., Lee, H.-Y., Chen, T.-I., Huang, K.-C., Hsu, W.H.: D2ADA: Dynamic density-aware active domain adaptation for semantic segmentation. In: European Conference on Computer Vision, pp. 449–467 (2022). https://doi.org/10.1007/978-3-031-19818-2_26
Wrenninge, M., Unger, J.: Synscapes: a photorealistic synthetic dataset for street scene parsing (2018). arXiv:1810.08705
Gómez, J.L., Silva, M., Seoane, A., Borrás, A., Noriega, M., Ros, G., Iglesias-Guitian, J.A., López, A.M.: All for one, and one for all: UrbanSyn dataset, the third musketeer of synthetic driving scenes (2023). arXiv:2312.12176
Gómez, J.L., Villalonga, G., López, A.M.: Co-training for unsupervised domain adaptation of semantic segmentation models. Sensors (2023). https://doi.org/10.3390/s23020621
Gong, R., Dai, D., Chen, Y., Li, W., Van Gool, L.: mDALU: Multi-source domain adaptation and label unification with partial datasets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021). https://doi.org/10.1109/ICCV48922.2021.00875
Zhao, S., Li, B., Yue, X., Gu, Y., Xu, P., Hu, R., Chai, H., Keutzer, K.: Multi-source domain adaptation for semantic segmentation. Adv. Neural Inform. Process. Syst. (2019). https://doi.org/10.5555/3454287.3454942
He, J., Jia, X., Chen, S., Liu, J.: Multi-source domain adaptation with collaborative learning for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021). https://doi.org/10.1109/CVPR46437.2021.01086
Peng, J., Sun, M., Lim, E.G., Wang, Q., Xiao, J.: Prototype guided pseudo labeling and perturbation-based active learning for domain adaptive semantic segmentation. Patt. Recognit. 148, 110203 (2024). https://doi.org/10.1016/j.patcog.2023.110203
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, vol. 139, pp. 8748–8763 (2021)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15979–15988 (2022). https://doi.org/10.1109/CVPR52688.2022.01553
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R.B.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3992–4003 (2023). https://doi.org/10.1109/ICCV51070.2023.00371
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision (2023). arXiv:2304.07193
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale (2020). arXiv:2010.11929
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9992–10002 (2021). https://doi.org/10.1109/ICCV48922.2021.00986
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10674–10685 (2022). https://doi.org/10.1109/CVPR52688.2022.01042
Xie, J., Li, W., Li, X., Liu, Z., Ong, Y.S., Loy, C.C.: MosaicFusion: diffusion models as data augmenters for large vocabulary instance segmentation. Int. J. Comput. Vis. (2024). https://doi.org/10.1007/s11263-024-02223-3
Yang, L., Xu, X., Kang, B., Shi, Y., Zhao, H.: FreeMask: synthetic images with dense annotations make stronger segmentation models. Adv. Neural Inform. Process. Syst. 36, 18659–18675 (2024)
Xiong, Y., Varadarajan, B., Wu, L., Xiang, X., Xiao, F., Zhu, C., Dai, X., Wang, D., Sun, F., Iandola, F., et al.: EfficientSAM: Leveraged masked image pretraining for efficient segment anything. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16111–16121 (2024). https://doi.org/10.1109/CVPR52733.2024.01525
Yuan, H., Li, X., Zhou, C., Li, Y., Chen, K., Loy, C.C.: Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively. In: European Conference on Computer Vision (2024). https://doi.org/10.1007/978-3-031-72775-7_24
Zhou, C., Li, X., Loy, C.C., Dai, B.: EdgeSAM: prompt-in-the-loop distillation for on-device deployment of SAM (2023). arXiv:2312.06660
Li, F., Zhang, H., Sun, P., Zou, X., Liu, S., Yang, J., Li, C., Zhang, L., Gao, J.: Semantic-SAM: segment and recognize anything at any granularity. In: European Conference on Computer Vision (2024)
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Patt. Anal. Mach. Intell. 34(11), 2274–2282 (2012). https://doi.org/10.1109/TPAMI.2012.120
Van den Bergh, M., Boix, X., Roig, G., Capitani, B., Van Gool, L.: SEEDS: Superpixels extracted via energy-driven sampling. In: European Conference on Computer Vision, pp. 13–26 (2012). https://doi.org/10.1007/s11263-014-0744-2
Nielsen, F.: Introduction to HPC with MPI for Data Science. Undergraduate Topics in Computer Science, pp. 195–211. Springer, Switzerland (2016). Chap. 8 Hierarchical clustering. https://doi.org/10.1007/978-3-319-21903-5
Tan, P.-N., Steinbach, M., Karpatne, A., Kumar, V.: Introduction to Data Mining, 2nd edn. Pearson, UK (2018). Chap. 7 Cluster Analysis: Basic Concepts and Algorithms. https://doi.org/10.5555/3208440
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: ground truth from computer games. In: European Conference on Computer Vision, pp. 102–118 (2016). https://doi.org/10.1007/978-3-319-46475-6_7
MMSegmentation Contributors: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation (2020)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: simple and efficient design for semantic segmentation with transformers. Adv. Neural Inform. Process. Syst. 34, 12077–12090 (2021)
Hu, H., Wei, F., Hu, H., Ye, Q., Cui, J., Wang, L.: Semi-supervised semantic segmentation via adaptive equalization learning. Adv. Neural Inform. Process. Syst. 34, 22106–22118 (2021)
Yang, L., Qi, L., Feng, L., Zhang, W., Shi, Y.: Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7236–7246 (2023). https://doi.org/10.1109/CVPR52729.2023.00699
Na, J., Ha, J.-W., Chang, H.J., Han, D., Hwang, W.: Switching temporary teachers for semi-supervised semantic segmentation. Adv. Neural Inform. Process. Syst. 36, 40367–40380 (2024)
Shin, W., Park, H.J., Kim, J.S., Han, S.W.: Revisiting and maximizing temporal knowledge in semi-supervised semantic segmentation (2024). arXiv:2405.20610
Hoyer, L., Tan, D.J., Naeem, M.F., Van Gool, L., Tombari, F.: SemiVL: semi-supervised semantic segmentation with vision-language guidance. In: European Conference on Computer Vision (2024). https://doi.org/10.1007/978-3-031-72933-1_15
Acknowledgements
Antonio M. López acknowledges the financial support to his general research activities given by ICREA under the ICREA Academia Program. All authors acknowledge the support of the Generalitat de Catalunya CERCA Program and its ACCIO agency to CVC’s general activities.
Funding
Open Access Funding provided by Universitat Autonoma de Barcelona. This research has been supported by Grant PID2020-115734RB-C21 (ADA/SSL-ADA subproject) funded by program MCIN/AEI/10.13039/501100011033 of Ministerio de Ciencia e Innovación.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Detailed results
Tables 7 and 8 show the per-class IoU obtained by RADA for {GTA5, Synscapes, UrbanSyn} \(\rightarrow\) Cityscapes and Mapillary Vistas, corresponding to the experiments in Table 4. The first table is for DeepLabV3+ and the second for SegformerB5. Analogously, Tables 9 and 10 contain the detailed IoUs for UDA + ADA, corresponding to the experiments in Table 5.
Appendix B Analysis of variance
The results in Tables 4 and 5 correspond to a single run of each experiment. A natural question is how much these results vary across several runs of the same experiment. To assess this, we have selected the setting of a 1% budget with source {GTA5, Synscapes, UrbanSyn}, and the four combinations of the two targets, Cityscapes and Mapillary Vistas, with the two models, DeepLabV3+ and SegformerB5.
Table 11 shows the per-class and global mean and standard deviation of the IoU over three runs of each of these four cases, for both RADA and UDA + ADA. The standard deviation is low in all cases, at most 0.22 for RADA and 0.39 for UDA + ADA.
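As a reference for how the numbers in Table 11 are aggregated, the short NumPy sketch below computes per-class and global mean and standard deviation of IoU over several runs. The array contents are hypothetical placeholders, and the use of the sample standard deviation (ddof=1) is our assumption here, not necessarily the estimator used for the table.

```python
import numpy as np

# Hypothetical per-class IoU of three runs (rows) over the 19 Cityscapes classes (columns).
ious = np.random.rand(3, 19) * 100

per_class_mean = ious.mean(axis=0)          # mean IoU of each class across runs
per_class_std = ious.std(axis=0, ddof=1)    # per-class sample standard deviation
miou_per_run = ious.mean(axis=1)            # global mIoU of each run
print(miou_per_run.mean(), miou_per_run.std(ddof=1))
```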
Appendix C Comparison with semi-supervised methods
The results in Table 12 compare our method with the current state-of-the-art semi-supervised methods on the Cityscapes dataset. RADA outperforms these methods by a notable margin while using a smaller equivalent budget of labeled target data (5% vs. 6.25%). Furthermore, our UDA + ADA variant reaches state-of-the-art performance, achieving 79.0 mIoU with a budget of only 1.7%.
Appendix D Clustering versus k-nearest neighbors for region label propagation
In Sect. 3.3 we mentioned k-nearest neighbors as a possible choice for propagating labels from regions annotated with ground truth to other regions that are similar in their representation according to the learned model (i.e., their mean feature vector). The problem is that the larger k is, the higher the chance that the propagated pseudolabels are wrong. A pseudolabel for a region is incorrect when it does not coincide with the majority ground truth label of the region's pixels.
Here we quantify the accuracy of the region labels and their effect on the mIoU as a function of k, and compare them with the adopted solution of label propagation by hierarchical clustering, for the case of source {GTA5, Synscapes, UrbanSyn}, target Cityscapes, a 1% budget and DeepLabV3+. Table 13 shows that our choice outperforms k-NN by at least 1% mIoU, owing to the lower accuracy of the region labels produced by the latter.
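To make the two propagation schemes concrete, the following sketch implements them with scikit-learn on the mean feature vectors of the regions. It is only a minimal illustration under our own assumptions (Euclidean distance, average linkage, clusters adopting the majority label of their annotated members); names and defaults are illustrative and do not necessarily match the implementation evaluated in Table 13.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors


def propagate_by_clustering(feats, labels, n_clusters):
    """feats: (N, D) mean region features; labels: (N,) ints, -1 = unlabeled."""
    clusters = AgglomerativeClustering(n_clusters=n_clusters,
                                       linkage="average").fit_predict(feats)
    pseudo = labels.copy()
    for c in np.unique(clusters):
        members = np.where(clusters == c)[0]
        member_labels = labels[members]
        annotated = member_labels[member_labels >= 0]
        if annotated.size:  # cluster adopts the majority label of its annotated members
            majority = np.bincount(annotated).argmax()
            pseudo[members[member_labels < 0]] = majority
    return pseudo


def propagate_by_knn(feats, labels, k):
    """Each annotated region passes its label to its k nearest unlabeled regions."""
    pseudo = labels.copy()
    lab_idx = np.where(labels >= 0)[0]
    unl_idx = np.where(labels < 0)[0]
    nn = NearestNeighbors(n_neighbors=min(k, len(unl_idx))).fit(feats[unl_idx])
    _, neighbors = nn.kneighbors(feats[lab_idx])
    for src, neigh in zip(lab_idx, neighbors):
        pseudo[unl_idx[neigh]] = labels[src]  # larger k reaches less similar regions
    return pseudo
```

With these two functions, the comparison in Table 13 amounts to running propagate_by_knn for increasing k against propagate_by_clustering, and measuring how often the resulting region pseudolabels coincide with the majority ground truth label of the corresponding pixels.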
Figure 11 compares the inference results of models trained with label propagation by clustering and by k-nearest neighbors. Overall, with clustering the regions are better delineated and there are fewer errors.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Serrat, J., Gómez, J.L. & López, A.M. Closing the gap in domain adaptation for semantic segmentation: a time-aware method. Machine Vision and Applications 36, 13 (2025). https://doi.org/10.1007/s00138-024-01626-z