1. Introduction
Fisheye cameras, characterized by their expansive field of view, provide a broad hemispherical view [1,2]. Their unique optical properties provide significant advantages, including capturing wide areas within a single frame and reducing the number of cameras required for panoramic views. Owing to these features, they have been extensively adopted across various fields. From autonomous vehicles [3] that rely on comprehensive environmental perception, to simultaneous localization and mapping (SLAM) [4] systems targeting efficient mapping, to virtual reality (VR) [5] applications seeking immersive experiences, the applications of fisheye cameras are diverse and wide-ranging.
With the progression of deep neural networks and the expansion of datasets, performance in object detection tasks has significantly improved [6]. However, the inherent characteristics of fisheye lenses induce significant radial distortion into these datasets [7]. The shape of objects becomes notably distorted, and the distortion grows more exaggerated toward the edge of the image, i.e., away from the lens’s center. As a result, the same object can have varying appearances and shapes depending on its position.
To address these distortion issues, numerous research initiatives have been undertaken. Some studies proposed methods that first rectify the fisheye distortion and then apply conventional object detection methodologies [3,8]. These approaches, however, require knowledge of the fisheye camera’s intrinsic parameters and lose edge information from the raw image during the undistortion process. Another line of work designed convolution operations based on the spherical geometry of the image [9]. Because it targets omnidirectional images, this method relies heavily on position-based convolution kernels and fails to fully exploit the boundary continuity inherent to fisheye images. Furthermore, certain studies focused on applying object detection techniques directly to fisheye datasets [10,11]. Due to the distortion in fisheye datasets, the shape of an object changes based on its position within the image. This phenomenon can be observed in Figure 1, where it is evident that as objects move toward the edges of the image, their shapes appear increasingly distorted, exhibiting changes in orientation and size. Consequently, a substantial volume of fisheye image data is required to train models that detect objects reliably across different positions in fisheye images.
It has been recognized that the rectangular bounding boxes typically used in object detection are not optimal for fisheye images. Conventional bounding boxes provide only an approximate location of an object and carry no information about its shape. To address the distortion issue in fisheye images, recent studies have modeled object representations using 24 points [10,12,13]. This representation yields clear boundaries and minimizes environmental noise, making it robust against distortion. These properties make the representation effectively distortion-invariant, which is particularly suitable for objects in fisheye images.
The recently prominent pretraining–fine-tuning paradigm [14,15,16,17] is widely used as a training strategy, in which large-scale models are pretrained on extensive datasets and then fine-tuned for adaptation to various downstream tasks [18,19]. By providing task-specific information to the pretrained model through parameter-efficient tuning methods [20,21,22,23,24], promising results have been demonstrated across a multitude of downstream tasks. Among these parameter-efficient tuning methods, visual prompting (VP) [24] learns perturbations in the form of padding at the pixel level and applies them to the input image. Inspired by this paradigm, we used models pretrained on large-scale standard datasets, fine-tuned them on the fisheye dataset, and generated and applied VPs.
To effectively conduct object detection in fisheye images, we propose ‘VP-aided fine-tuning’, as illustrated in Figure 2. The two key elements of this approach are as follows: (1) employing the pretraining–fine-tuning paradigm effectively through VP, and (2) regressing the object to 24 points. With VP, we bridge the domain gap between the standard dataset used to train the pretrained model and the downstream task dataset of fisheye images, enabling efficient adaptation. By regressing the object to 24 points, we achieve robustness against image distortion, clearly delineate the object boundaries, and reduce environmental noise.
The main contributions of this paper are as follows:
We propose a new approach called ‘VP-aided fine-tuning’, which applies VP to various fine-tuning strategies, thereby enhancing the performance of 24-point object detection.
We attempt to enhance performance by reducing the distribution gap between the pretraining dataset and the fisheye dataset through a learnable VP.
The experimental results demonstrated that our proposed approach with VP enhanced performance across both traditional and recent tuning techniques. Furthermore, the performance gains held when using either convolution-based or transformer-based models as the backbone.
3. Methods
3.1. Dataset Generation
Representing objects in fisheye images, which are characterized by radial distortion, with traditional rectangular bounding boxes is not ideal. Therefore, we opted for a more generic shape representation of objects to diminish the background’s influence, choosing to depict them as polygons. This approach aligns with the method introduced by Rashed et al. [10], which employs a 24-point annotation system. We applied this technique to represent objects as polygons in the WoodScape fisheye dataset.
First, we identified the centroid of the object utilizing the provided rectangular bounding box annotations. We then transformed the instance segmentation annotations into object masks and stored them accordingly. Given that the number of segmentation annotations and classes surpassed that of the bounding box annotations, we aligned them based on the bounding box criteria. The objects were categorized into five distinct classes: pedestrians, vehicles, bicycles, traffic lights, and traffic signs. Next, as shown in Figure 3, we emitted 24 rays at 15-degree intervals from the center point and annotated their intersections with the mask’s boundary as the 24 points, forming a 24-point annotation for each object. Finally, results labeled with 24 points are shown in Figure 4. Furthermore, annotating objects with 24 points offers several advantages over segmentation. First, it has a lower computational cost, since it does not require storing a class label for every pixel, which is advantageous for real-time processing. Moreover, it generally delivers better generalization: segmentation operates at the pixel level and tends to underperform when object boundaries are ambiguous or when objects overlap.
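The ray-casting step can be sketched in Python as follows. This is a minimal illustration under our own assumptions (a binary object mask and its centroid are already available, and the marching step size is arbitrary); the function and variable names are illustrative rather than taken from the annotation tooling.

```python
import numpy as np

def ray_cast_points(mask: np.ndarray, cx: float, cy: float,
                    num_rays: int = 24, step: float = 0.5) -> np.ndarray:
    """Cast `num_rays` equally spaced rays from (cx, cy) and return, for each ray,
    the last point that still lies inside the binary object `mask`."""
    h, w = mask.shape
    points = []
    for k in range(num_rays):
        theta = 2.0 * np.pi * k / num_rays            # 15-degree spacing for 24 rays
        dx, dy = np.cos(theta), np.sin(theta)
        x, y, boundary = cx, cy, (cx, cy)
        # March outward until the ray leaves the mask or the image.
        while 0 <= x < w and 0 <= y < h and mask[int(y), int(x)] > 0:
            boundary = (x, y)
            x += dx * step
            y += dy * step
        points.append(boundary)
    return np.asarray(points)                          # shape (24, 2): polygon vertices

# Toy usage: a filled disk as the object mask.
yy, xx = np.mgrid[0:100, 0:100]
toy_mask = ((xx - 50) ** 2 + (yy - 50) ** 2 <= 30 ** 2).astype(np.uint8)
polygon = ray_cast_points(toy_mask, cx=50.0, cy=50.0)
print(polygon.shape)  # (24, 2)
```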
However, representing objects in polygonal form makes the training process more challenging than the conventional bounding box approach. The increase in the number of regression targets slows down convergence and compromises accuracy. Additionally, the computation of the intersection over union (IOU) becomes more intricate. Xu et al. [12] considered irregular bounding boxes and employed an overarching framework based on the anchor-free YOLOX [35]. They proposed a loss function based on the concentric circle distribution, termed GIOU, which eliminates the need for direct IOU computation for irregular bounding boxes. The loss function is defined as
$$\mathcal{L}_{\mathrm{GIOU}} = 1 - \mathrm{GIOU}(C_p, C_g), \tag{1}$$

where $C_p$ and $C_g$ represent the predicted and ground truth concentric circles, respectively. The GIOU is defined as

$$\mathrm{GIOU} = \mathrm{CIOU} - \frac{A_c - \left(A_p + A_g - \mathrm{SIOU}\right)}{A_c}, \tag{2}$$

where $A_p$ and $A_g$ denote the areas of the predicted circle and label circle, respectively, and $A_c$ is the maximum circumcircle area of the predicted circle and label circle. The circular IOU (CIOU) is computed as

$$\mathrm{CIOU} = \frac{\mathrm{SIOU}}{A_p + A_g - \mathrm{SIOU}}. \tag{3}$$
In Equation (3), SIOU is calculated as

$$\mathrm{SIOU} = S_1 + S_2, \tag{4}$$

where $S_1$ and $S_2$ are the circular-segment areas cut off in each circle by the common chord, given by

$$S_1 = R^2 \cos^{-1}\!\left(\frac{d_1}{R}\right) - d_1\sqrt{R^2 - d_1^2}, \qquad S_2 = r^2 \cos^{-1}\!\left(\frac{d_2}{r}\right) - d_2\sqrt{r^2 - d_2^2},$$

with $d_1 = \frac{d^2 + R^2 - r^2}{2d}$ and $d_2 = d - d_1$. Here, $d$ is the Euclidean distance between the centers of the two circles, $R$ represents the radius of the larger circle, and $r$ denotes the radius of the smaller circle. SIOU represents the area of the intersecting part between the two circles.
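As a concrete numerical illustration, the following Python sketch evaluates the circle-based loss above. It follows our reconstruction of the formulas (standard circle–circle intersection geometry); the helper names and edge-case handling are our own and may differ from the implementation of Xu et al. [12].

```python
import math

def siou(d, r_big, r_small):
    """Intersection area (SIOU) of two circles whose centers are `d` apart."""
    if d >= r_big + r_small:                 # disjoint circles
        return 0.0
    if d <= r_big - r_small:                 # smaller circle fully contained
        return math.pi * r_small ** 2
    d1 = (d ** 2 + r_big ** 2 - r_small ** 2) / (2.0 * d)   # distance from the big center to the chord
    d2 = d - d1
    s1 = r_big ** 2 * math.acos(d1 / r_big) - d1 * math.sqrt(r_big ** 2 - d1 ** 2)
    s2 = r_small ** 2 * math.acos(d2 / r_small) - d2 * math.sqrt(r_small ** 2 - d2 ** 2)
    return s1 + s2

def giou_loss(pred, gt):
    """pred, gt: (cx, cy, r) parameters of the predicted / ground-truth circles."""
    (px, py, pr), (gx, gy, gr) = pred, gt
    d = math.hypot(px - gx, py - gy)
    a_p, a_g = math.pi * pr ** 2, math.pi * gr ** 2
    inter = siou(d, max(pr, gr), min(pr, gr))
    union = a_p + a_g - inter
    ciou = inter / union
    # Radius of the smallest circle enclosing both circles (the circumcircle).
    r_c = max(pr, gr) if d + min(pr, gr) <= max(pr, gr) else (d + pr + gr) / 2.0
    a_c = math.pi * r_c ** 2
    giou = ciou - (a_c - union) / a_c
    return 1.0 - giou

print(giou_loss((0.0, 0.0, 5.0), (1.0, 0.0, 5.0)))   # small loss for heavily overlapping circles
```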
3.2. Overview of Visual Prompting
Inspired by the success of the prompt paradigm in the field of NLP, Bahng et al. [24] introduced the concept of VP for pretrained vision and vision-language models for the first time. VP aims to learn a task-specific visual prompt, denoted as $v_{\phi}$ and parameterized by $\phi$, which is a set of pixel-level modifications directly added to the input image. This method is designed to assist the adaptation of the pretrained model to downstream tasks while the model parameters remain fixed. For a given input image $x$ and its corresponding label $y$, the learning objective of VP is defined as follows:

$$\max_{\phi} \; P_{\theta;\phi}\!\left(y \mid x + v_{\phi}\right)$$

Here, $\theta$ and $\phi$ represent the parameters of the pretrained model and the visual prompt, respectively. During the training phase, the likelihood of the correct label $y$ is maximized. In the inference phase, the trained prompt is added to test images, which are then fed through the pretrained model for classification or detection. This addition of the prompt does not require substantial computational resources, making VP a resource-efficient alternative to full model fine-tuning.
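In a PyTorch-style sketch, this objective amounts to optimizing only the prompt while the pretrained parameters stay frozen. The model, data loader, and names below are illustrative placeholders, not the setup of Bahng et al. [24].

```python
import torch

def train_visual_prompt(pretrained_model, loader, image_size=(3, 32, 32), lr=0.1, epochs=1):
    """Learn a pixel-level prompt v_phi that is added to every input, with theta frozen."""
    prompt = torch.zeros(1, *image_size, requires_grad=True)     # v_phi
    optimizer = torch.optim.SGD([prompt], lr=lr)
    for p in pretrained_model.parameters():
        p.requires_grad_(False)                                   # theta remains fixed
    criterion = torch.nn.CrossEntropyLoss()                       # maximizes P(y | x + v_phi)
    for _ in range(epochs):
        for x, y in loader:
            loss = criterion(pretrained_model(x + prompt), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return prompt.detach()

# Toy usage with a random "pretrained" classifier and synthetic data.
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 5))
toy_loader = [(torch.rand(4, 3, 32, 32), torch.randint(0, 5, (4,)))]
learned_prompt = train_visual_prompt(toy_model, toy_loader)
```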
3.3. VP-Aided Fine-Tuning
Our research, as illustrated in Figure 5, employed the pretraining–fine-tuning paradigm for fisheye object detection and explored methodologies to enhance fine-tuning performance using VP. The pretrained model was trained on a standard dataset and then fine-tuned for the downstream task on a fisheye dataset. By adding VPs to the input image, we provided downstream task-specific information to the model, making the adaptation more efficient. The visual prompt is designed to match the resolution of the original fisheye image, ensuring seamless integration when it is added to the image. The prompt, consisting of learned padding values, is applied directly at the pixel level to the fisheye image. During training, the padding is learned and placed only along the edges of the image. This placement is important because it helps to counteract the typical distortions found in fisheye images while leaving the central part of the image unchanged, preserving essential visual information. The visual prompt is incorporated into the original fisheye image and continuously updated as follows:
$$v \leftarrow v - \eta \, \nabla_{v} L(\theta;\, x + v,\, y)$$

Here, $v$ represents the visual prompt, $\eta$ the learning rate, $\nabla_{v} L$ the gradient of the loss $L$ as defined in Equation (1), $\theta$ the model parameters, $x$ the input, and $y$ the true label. After updating the prompt, the resultant image $x + v$ is clipped to the valid pixel range [0, 255], which is represented by the following equation:

$$\tilde{x} = \mathrm{clip}(x + v,\, 0,\, 255)$$
Consequently, both the prompt and the model are updated together through the loss function. In fine-tuning tasks like ours, where the distributions of the pretraining data and the input images differ, deploying prompts can alleviate this difference and enable effective learning. We found that, even with a reduced number of learnable parameters, applying VP to fine-tuning methods can achieve enhanced detection performance. Moreover, VP can be easily applied to any fine-tuning method, since it simply involves adding the prompt to the input image.
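The joint update of the padding-style prompt and the tunable model parameters can be sketched as follows. This is a simplified illustration under our own assumptions: a toy regression head stands in for the detector, a mean-squared error stands in for the actual detection loss, and the clipping to [0, 255] is folded into the prompt's forward pass.

```python
import torch

class BorderPrompt(torch.nn.Module):
    """Learnable padding-style prompt: nonzero only along the image border."""
    def __init__(self, image_size=(3, 128, 128), pad=16):
        super().__init__()
        c, h, w = image_size
        self.delta = torch.nn.Parameter(torch.zeros(1, c, h, w))
        mask = torch.zeros(1, 1, h, w)
        mask[..., :pad, :] = 1.0
        mask[..., -pad:, :] = 1.0
        mask[..., :, :pad] = 1.0
        mask[..., :, -pad:] = 1.0
        self.register_buffer("mask", mask)            # keeps the image center untouched

    def forward(self, x):
        # Add the prompt along the border only, then clip to the valid pixel range.
        return torch.clamp(x + self.delta * self.mask, 0.0, 255.0)

# Prompt and (partially) tunable detector are optimized together by the same loss.
prompt = BorderPrompt()
detector = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 128 * 128, 24 * 2))
optimizer = torch.optim.SGD(list(prompt.parameters()) + list(detector.parameters()), lr=1e-3)

x = torch.rand(2, 3, 128, 128) * 255.0                # batch of fisheye images in [0, 255]
target = torch.rand(2, 24 * 2)                        # 24 (x, y) boundary points per object
loss = torch.nn.functional.mse_loss(detector(prompt(x)), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```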
We applied VP to standard fine-tuning methods, including partial fine-tuning (PT), full fine-tuning (FT), and bias tuning [31]. PT, inspired by the head-tuning setup of VPT, involves freezing the backbone and tuning only the neck and head. Additionally, we adopted recently proposed fine-tuning enhancement techniques such as PT-FT and L2-SP [32]. These fine-tuning methods are illustrated in Figure 6. PT-FT was inspired by the LP-FT [33] setup, conducting FT after PT instead of after linear probing (LP). The L2-SP approach regularizes the fine-tuned weights toward the pretrained weights, employing a regularization term defined as follows:
$$\Omega(w) = \frac{\alpha}{2} \left\lVert w_{S} - w_{S}^{0} \right\rVert_{2}^{2} + \frac{\beta}{2} \left\lVert w_{\bar{S}} \right\rVert_{2}^{2}$$

Here, $w_{S}^{0}$ denotes the weights pretrained on the source data (standard images), acting as the starting point. $w_{S}$ represents the weights of the part of the network shared with the source network, and $w_{\bar{S}}$ pertains to the weights of the remaining tunable parameters. $\alpha$ and $\beta$ are regularization parameters that determine the strength of the penalty.
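A minimal sketch of this regularizer in PyTorch is given below. It assumes that the backbone parameters of the fine-tuned model share names with the pretrained checkpoint; the function name and the way novel parameters are identified are our own choices.

```python
import torch

def l2_sp_penalty(model, pretrained_state, alpha=0.1, beta=0.01):
    """L2-SP regularizer: pull weights shared with the source network toward their
    pretrained values w^0, and apply plain L2 decay to parameters without a counterpart."""
    reg_sp, reg_new = 0.0, 0.0
    for name, w in model.named_parameters():
        if not w.requires_grad:
            continue
        if name in pretrained_state:                  # w_S: shared with the source network
            reg_sp = reg_sp + torch.sum((w - pretrained_state[name]) ** 2)
        else:                                         # w_bar{S}: new (e.g., head) weights
            reg_new = reg_new + torch.sum(w ** 2)
    return 0.5 * alpha * reg_sp + 0.5 * beta * reg_new

# Usage: total_loss = task_loss + l2_sp_penalty(model, pretrained_model.state_dict())
```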
3.4. Theoretical Analysis of Visual Prompting
To elaborate on why the visual prompting method enhances performance in fisheye object detection, we draw a parallel to adversarial reprogramming [36], a concept traditionally associated with adversarial attacks. Adversarial attacks, such as the fast gradient sign method (FGSM) [37] and projected gradient descent (PGD) [38], utilize input perturbations to deceive a model. These attacks are characterized by updating the perturbation in the direction opposite to gradient descent, typically termed gradient ascent on the loss landscape, so as to degrade model performance on its task:
$$\delta \leftarrow \delta + \eta \, \nabla_{\delta} L(\theta;\, x + \delta,\, y)$$

Here, $\delta$ represents the perturbation, $\eta$ the learning rate, $\nabla_{\delta} L$ the gradient of the loss $L$ with respect to the perturbation, $\theta$ the model parameters, $x$ the input, and $y$ the true label. In contrast, VP employs gradient descent to iteratively refine these perturbations, referred to as ‘prompts’, with the aim of enhancing rather than degrading model performance:

$$v \leftarrow v - \eta \, \nabla_{v} L(\theta;\, x + v,\, y)$$

Here, $\eta$ is the learning rate used for gradient descent, promoting an optimization that enhances model performance. Both adversarial reprogramming and visual prompting effectively modify the input to adapt the model to a new or refined task. However, while adversarial reprogramming seeks to subvert or repurpose the model, visual prompting through methods such as ours harnesses this adaptability to improve accuracy and robustness in fisheye object detection. By slightly modifying the input with learned perturbations (prompts), our method consistently outperformed standard training approaches without prompts, confirming the practical utility and theoretical soundness of our approach.
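The formal difference between the two regimes is only the sign of the update, as the following toy sketch illustrates (the function is purely illustrative).

```python
import torch

def perturbation_step(delta, grad, lr, adversarial: bool):
    """One update of an input perturbation.

    Adversarial attacks ascend the loss (delta <- delta + lr * grad),
    while visual prompting descends it (delta <- delta - lr * grad)."""
    return delta + lr * grad if adversarial else delta - lr * grad

grad = torch.tensor([0.3, -0.1])
print(perturbation_step(torch.zeros(2), grad, 0.5, adversarial=True))   # moves up the loss
print(perturbation_step(torch.zeros(2), grad, 0.5, adversarial=False))  # moves down the loss
```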
5. Conclusions
This paper addressed the unique challenges posed by fisheye images, particularly the severe radial distortion affecting object detection tasks. We proposed a ‘VP-aided fine-tuning’ approach that leverages the strengths of the pretraining–fine-tuning paradigm, augmented with VP and a 24-point object regression method, to enhance object detection in fisheye images. Our method effectively bridges the domain gap between standard and fisheye datasets, enabling the pretrained model to adapt more efficiently to the distortion-specific characteristics of fisheye images. The 24-point representation for object delineation proved significantly advantageous in terms of robustness against image distortion, precision of boundary delineation, and reduction of environmental noise. Recent research has actively explored regression methods for efficient object detection in fisheye images, including approaches that use 24-point regression; applying our method within the pretraining–fine-tuning paradigm to these studies could further enhance their performance. Our approach integrates a VP consisting of only a small number of parameters into the input image, so it can be easily applied across various fine-tuning methods and diverse backbone models. In addition to being simple to implement, this approach enhances the efficacy of object detection models. This adaptability and improved performance underscore its potential as a valuable tool for advancing research in object detection.