1 Introduction

Automated medical image diagnosis and analysis have attracted attention in medical imaging research because manual labeling is tedious and prone to error [1]. Accurate and reliable diagnosis and analysis are necessary for effective decision-making, so information from examinations and measurements must be extracted automatically. Polyp detection is challenging because polyps vary widely in visual observation and visual characteristics: the same polyp may change in color and appearance under different lighting conditions, and polyps differ in texture, shape, and size. Therefore, no single handcrafted feature performs best for detecting all polyps. Most recognition methods apply conventional machine-learning techniques, generally starting with an extraction step for handcrafted features, such as color, texture, and shape, followed by a separate training process for detection or segmentation [2]. These designated features encode only part of the polyps' information and easily neglect the intrinsic information in colonic images. As a result, these methods are not sensitive enough to differentiate polyps from normal (non-polyp) images.

Deep learning is a data-driven approach that has significantly improved the state of the art in computer vision. Through multiple sequential nonlinear transformations, deep models can be sensitive to minute details while remaining insensitive to large irrelevant variations. This property makes it possible to distinguish hyperplastic polyps from adenomas.

With advances in deep learning, physician-level performance can be achieved in automated medical image analysis tasks. High representation power, fast inference, and weight sharing have made CNNs the standard for image classification and segmentation [3]. The lack of sufficient labeled medical samples or images remains a challenge for medical image applications that involve deep learning [4]. Existing applications heavily rely on cascaded CNNs when the targets to be detected are subject to large inter-patient variations in shape and size. CNN-based frameworks extract a region of interest (RoI) and make dense predictions for that RoI. However, this approach leads to excessive and redundant computational operations and model parameters [5].

This study proposes a simple and effective solution, named the deduce-induce network (DIN), to address this general problem. The proposed model uses a dual convolutional encoder–decoder (codecs) architecture [6][7] and incorporates an attention model [8] and a region proposal network [9] for the detection (localization and classification) of intestinal polyps. Model parameters and intermediate feature maps are used more efficiently, and computation is minimized.

In a convolutional encoder–decoder structure, the encoder learns features from the input through convolutional layers and compresses them into a code vector whose dimension is lower than that of the input. The decoder learns to reconstruct, from this reduced code, a representation close to the original input. In addition to being realized as an auto-encoder, this encoder–decoder structure can be adapted to hetero-association tasks using supervised learning: the encoder encodes pixel intensities as a low-dimensional discriminative feature vector, while the decoder acts as a hetero-encoder that reconstructs the pixel intensities of the desired output image in another domain. Such a mapping treats each image individually and ignores the diverse information shared across images; the learned image features should still be akin to those in the training images [10]. Studies show that the more similar the features, the greater the probability that they appear in similar images. Therefore, local image information is essential for extracting discriminative features, especially for colonoscopic images, where the differences in intensity are small and the colors of target and background are often hard to distinguish.

Recently, region-based convolutional neural networks (R-CNN) [11] and fast R-CNN [12] have increased the performance of general object detection by combining selective search and deep learning. Region proposals in these detectors replace the sliding-window approach. The region proposal network (RPN), the backbone of faster R-CNN [9], has a classifier and a regressor and generates bounding box proposals for regions where the target object may reside. A small network slides over the feature map produced by the last convolutional layer. Candidate object regions (RoIs) are selected greedily using class-agnostic non-maximal suppression (NMS), which removes redundant bounding boxes covering the same object. The RoI layer extracts the features of each candidate region from the feature map of the entire image, so training and test speeds remain sufficiently fast. The region proposal determines the possible locations of targets in the image, so a high recall rate is maintained even when fewer windows are selected.
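To make the selection step concrete, the following is a minimal NumPy sketch of class-agnostic NMS as commonly used in RPN-style detectors; the array layout, the greedy score ordering, and the 0.7 threshold default are illustrative assumptions rather than the exact procedure used here.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Class-agnostic non-maximal suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    Returns the indices of the boxes that are kept, highest score first.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # candidates sorted by score
    keep = []
    while order.size > 0:
        i = order[0]                        # highest-scoring remaining box
        keep.append(i)
        # intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box above the threshold
        order = order[1:][iou <= iou_threshold]
    return keep
```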

Region proposal generation is integrated into the network, so candidate windows are of higher quality than those produced by a typical sliding-window algorithm. In contrast to conventional RPNs, which use a greedy process over the entire image to select an RoI, a sequential deep reinforcement learning model increases computational efficiency by producing search trajectories that generate region proposals for a partial image and automatically determines when to terminate the search process.

The proposed deduce-induce network (DIN) is equipped with a modified RPN that focuses only on the local region of the segmented targets predicted by the dual CNNs. In addition to the dual codecs, the modified RPN provides localization with high prediction accuracy and computational efficiency because it uses fewer model parameters than RPN-based algorithms. These RPN-based algorithms require pre-training, whereas the dual-codec model and its auxiliary modules can be trained from scratch in a standard way.

Attention models are inspired by how humans pay visual attention to different portions of a scene or an image [6]. Visual attention focuses on a specific high-resolution area and perceives its surroundings in low resolution. Soft and hard attention are the two major attention models. The distribution of soft attention is calculated using sequential scans over the entire image. The probability (attention score) reflects the importance of a representative feature vector and produces a weighted encoding feature vector. The attention mechanism is used in conjunction with a cascaded CNN, so the mechanism can be learned along with backbone model training. A soft attention model is fully differentiable and easily trained using end-to-end error backpropagation.

The softmax function is used for normalization. It assigns small but non-zero probabilities to trivial features, so attention is directed to the more significant elements of an image. Hard attention selects a limited subset of features from a sequence of limited-sight glimpses, so attention is concentrated on more important subsets and diverted from less significant ones. Hard attention is suited to tasks in which only sparse, worthwhile subsets exist within a large search space. The attention model automatically learns to focus on the target structure and highlight salient features useful for a specific task when there is sufficient contrast between the target and the background. Therefore, the proposed model is more sensitive and accurate because it performs local but dense inspections and ignores irrelevant regions.
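As an illustration of the soft variant described above, the sketch below computes softmax-normalized attention scores over a set of feature vectors and returns the weighted encoding; the scoring function (a single learnable vector) and the tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def soft_attention(features, query_weights):
    """Soft attention over a set of feature vectors.

    features: (N, D) feature vectors extracted from an image.
    query_weights: (D,) learnable scoring vector (stand-in for a small scoring net).
    Returns the attention-weighted encoding (D,) and the attention scores (N,).
    """
    scores = features @ query_weights                      # one scalar score per feature vector
    alpha = F.softmax(scores, dim=0)                       # small but non-zero weight for every feature
    context = (alpha.unsqueeze(1) * features).sum(dim=0)   # weighted encoding vector
    return context, alpha
```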

This work aims to design a computer-aided detection system (CADs) for intestinal polyp detection in colonoscopy images. The proposed model, which applies a relatively shallow structure and is trained with a limited number of binary-masked labeled polyps, is expected to perform similarly to methods with much deeper polyp detection and classification structures. The contributions of this paper are:

  1. Labeled training data for polyps are masked as binary images, so the proposed detection method segments RoI images using encoder–decoders and a region proposal for localization. A hard attention model is then used for fine-grained classification of the cropped regional images.

  2. The hetero-encoder composition generates plausible intermediate images, and the auto-encoder auto-associates the ground truth and its variants, which compensates for the lack of labeled medical images.

  3. The experiments demonstrate that hard attention applied to images bounded using region proposals is more accurate across different imaging data sets and training sizes.

  4. The proposed mechanism produces fine-scale attention maps that can be visualized with minimal computational overhead, so the predictions are more easily interpreted.

  5. This work is one of the first studies to incorporate a hard-attention interface with a local RPN in a feed-forward CNN-based model for a medical imaging task. It is end-to-end trainable and eliminates the need to process the entire feature map.

  6. Using an attention model to classify a cropped image allows dense sequential prediction by focusing on specific, fine-grained features in segmented images [13][14]. The model performs better than methods that use the entire image.

2 Deduce-induce network

The proposed model uses dual codecs as the backbone, a local RPN for accurate box-bounding, and an attention model to improve the classification network, as shown in Fig. 1a. The dual codecs are implemented as two sub-networks [15]. The front module, the deducer, is a hetero-encoder that produces intermediate images for the rear codec. The rear codec, the inducer, is an auto-encoder that uses a self-coding architecture to de-noise the intermediate images, which compensates for the lack of sufficient labeled data [16, 17]. The output of the auto-encoder is a binary segmented image that contains the detected targets. Figure 1b gives a brief description of the data flow in the proposed architecture.

Fig. 1
figure 1

a The proposed architecture comprises three modules: a DIN (a.k.a. the generation and collaboration network, GCN) for feature extraction, a hard attention model for classification, and a local RPN for precise object bounding. b A brief description of the data flow of the proposed architecture

The local RPN maps the local region from the feature map generated by the last convolutional layer of the front codec's encoder and uses this local feature map to generate anchor points with various bounding boxes. The modified RPN generates potential region proposals from the local feature map corresponding to the autoencoder's cropped images and shares the maps with the subsequent object detection networks. The region proposal network increases detection accuracy.

The hard attention model is used for fine-grained classification; it performs a sequential glimpse over the cropped image to detect false positives or to distinguish different types of polyps: hyperplastic polyps and adenomas [11].

2.1 Image segmentation and cropping

At the front of the dual structure, the hetero-encoder maps a colonoscopic image to a set of grey-level images that correspond to the labeled image, producing an intermediate training set. The module generates training samples that are morphologically similar to the labeled images, so this design does not require a large amount of training data for deep learning. Figure 2 shows the front codec's architecture: the hetero-encoder is composed of four convolutional layers with max-pooling operations and three fully connected layers in the middle. A leaky rectified-linear activation function is used in the convolutional layers, and batch normalization is used in the fully connected layers. The output of the final convolutional layer passes through a sigmoid function that maps the predicted probability to the pixels’ grey-level intensity in the segmented image. The dual modules use the same architecture, except for the input layer. This arrangement differs from designs that use a single encoder–decoder to render the image representation descriptive.
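The following PyTorch sketch illustrates one possible realization of the deducer described above (four convolutional stages with max-pooling, three fully connected layers in the middle, leaky ReLU activations, batch normalization, and a sigmoid output). The channel widths, the mirrored up-sampling path, and the bottleneck sizes are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class Deducer(nn.Module):
    """Hetero-encoder sketch: colonoscopic RGB image -> grey-level segmentation-like image.

    The layer counts follow the text (four conv/pool stages, three fully connected
    layers in the middle); channel widths and the mirrored decoder are assumptions.
    """
    def __init__(self, in_channels=3, widths=(16, 32, 64, 64)):
        super().__init__()
        enc, c_prev = [], in_channels
        for c in widths:                                   # four conv + max-pool stages
            enc += [nn.Conv2d(c_prev, c, 3, padding=1), nn.LeakyReLU(0.1), nn.MaxPool2d(2)]
            c_prev = c
        self.encoder = nn.Sequential(*enc)
        flat = widths[-1] * 8 * 8                          # 128x128 input -> 8x8 feature map
        self.bottleneck = nn.Sequential(                   # three fully connected layers with batch norm
            nn.Linear(flat, 512), nn.BatchNorm1d(512), nn.LeakyReLU(0.1),
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.LeakyReLU(0.1),
            nn.Linear(128, flat), nn.BatchNorm1d(flat), nn.LeakyReLU(0.1),
        )
        dec, c_prev = [], widths[-1]
        for c in reversed(widths):                         # mirrored up-sampling path (assumed)
            dec += [nn.ConvTranspose2d(c_prev, c, 2, stride=2), nn.LeakyReLU(0.1)]
            c_prev = c
        dec += [nn.Conv2d(c_prev, 1, 3, padding=1), nn.Sigmoid()]   # grey-level output image
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.encoder(x)                                # (B, 64, 8, 8)
        b = self.bottleneck(z.flatten(1)).view_as(z)
        return self.decoder(b)                             # (B, 1, 128, 128)
```

As noted in the text, the rear codec (the inducer) would use the same architecture with a single-channel input, e.g. `Deducer(in_channels=1)` in this sketch.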

Fig. 2
figure 2

Framework for the deducer network. Xa: a colonoscopic image that is input to the hetero-encoder; H(Xa): the output is a corrupted labeled image that is a variation of the ground truth image for the auto-encoder; Xd: the ground truth image; A(H(Xa)): predicting the corrupted label image using the auto-encoder; A(Xd): the associated image for the ground truth that is predicted by the auto-encoder

The front end has three channels for the colored colonoscopic images, but the rear has only one channel for grey-level images. The encoder–decoder at the front initially predicts the binary-labeled polyp images for the corresponding colonoscopy images and produces variations during training to support the denoising learning of the successive autoencoder. Therefore, in the training stage, a set of grey-level images that are supposedly similar to the ground truth data is generated, and these are used as corrupted training samples to train the autoencoder, increasing its plasticity and generalization. The downstream encoder–decoder module is trained as an auto-associator to predict the labeled image when presented with an approximation generated by its predecessor. During training, the first module generates a set of reliable candidates, with a certain degree of noise, that are morphologically similar to the labeled image. The advantages of this arrangement are twofold: the auto-encoder yields a more plastic classification, so a huge training data set is not required, and the auto-encoder has increased discriminative power, so predictions can have high accuracy. It can update the upstream module's effective weights by transferring its own experience while learning the association between the corrupted and the exact labeled data sets. The loss function for the front codec depicted in Fig. 3 is defined as:

Fig. 3
figure 3

The flowchart for the deducer and inducer networks training

$$\Vert \mathbf{H}(\mathbf{X}_{a})-\mathbf{X}_{d}\Vert +\Vert \mathbf{A}(\mathbf{H}(\mathbf{X}_{a}))-\mathbf{A}(\mathbf{X}_{d})\Vert$$
(1)

The loss function's first term reduces the residuals between the predicted image \(\mathbf{H}(\mathbf{X}_{a})\) and the ground truth \(\mathbf{X}_{d}\). This term also allows the hetero-encoder to learn how to generate candidate images that are as similar as possible to the target image. \(\mathbf{A}(\mathbf{H}(\mathbf{X}_{a}))\) and \(\mathbf{A}(\mathbf{X}_{d})\) in the second term are the predicted outputs from the auto-encoder that correspond to the inputs \(\mathbf{H}(\mathbf{X}_{a})\) and \(\mathbf{X}_{d}\). This term acts as a regularization term that slows down the convergence rate of the supervised learning conducted by the first term and maintains the learning pace of the successive term, so the similarity in the semi-produced samples is not solely due to the labeled data. In other words, this term synchronizes the learning speeds of two adjacent modules to ensure that the training samples generated by the front module can be discriminated effectively by the rear module.

The auto-encoder learns the auto-association with the labeled image and its variants, which are generated by minimizing the reconstruction error using a loss function between the ground truth images and the corresponding intermediate images. The auto-encoder determines whether each element of data that it reviews belongs to the actual training data set. The mean squared loss function for the auto-encoder is defined as:

$$\alpha \Vert \mathbf{X}_{d}-\mathbf{A}(\mathbf{X}_{d})\Vert +(1-\alpha )\Vert \mathbf{X}_{d}-\mathbf{A}(\mathbf{H}(\mathbf{X}_{a}))\Vert$$
(2)

where α is the parameter that regulates the degree of tolerance for variation within the distribution for a labeled image class.
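A minimal sketch of how losses (1) and (2) might be computed and applied, assuming both norms are realized as mean squared errors and that each module is updated with its own loss on the same batch; the gradient routing between the two modules is an assumption, since the text does not fully specify it.

```python
import torch
import torch.nn.functional as F

def deducer_loss(H_Xa, Xd, A_H_Xa, A_Xd):
    """Loss (1): ||H(Xa) - Xd|| + ||A(H(Xa)) - A(Xd)||, here realized with MSE."""
    return F.mse_loss(H_Xa, Xd) + F.mse_loss(A_H_Xa, A_Xd)

def inducer_loss(Xd, A_Xd, A_H_Xa, alpha=0.5):
    """Loss (2): alpha * ||Xd - A(Xd)|| + (1 - alpha) * ||Xd - A(H(Xa))||."""
    return alpha * F.mse_loss(A_Xd, Xd) + (1.0 - alpha) * F.mse_loss(A_H_Xa, Xd)

def train_step(H, A, Xa, Xd, opt_H, opt_A, alpha=0.5):
    """One assumed training step: Xa are colonoscopic images, Xd the binary masks."""
    # update the inducer A with loss (2); H(Xa) is treated as a fixed corrupted sample
    with torch.no_grad():
        H_Xa_fixed = H(Xa)
    loss_A = inducer_loss(Xd, A(Xd), A(H_Xa_fixed), alpha)
    opt_A.zero_grad(); loss_A.backward(); opt_A.step()

    # update the deducer H with loss (1); the target A(Xd) is detached here
    H_Xa = H(Xa)
    loss_H = deducer_loss(H_Xa, Xd, A(H_Xa), A(Xd).detach())
    opt_H.zero_grad(); loss_H.backward(); opt_H.step()
    return loss_H.item(), loss_A.item()
```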

The hetero-encoder’s predicted (deduced) image is further processed for image cropping by a sequence of OpenCV operations [18] that determine the frames of the cropped images containing the predicted targets. In the resultant images from the autoencoder’s output layer, the pixels with intensity in [50, 150] are selected, and a minimal rectangular frame is drawn to enclose them.
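A short OpenCV sketch of this cropping step, thresholding the codec output to the [50, 150] band and taking the minimal enclosing rectangle; the optional margin parameter is an assumption added for illustration.

```python
import cv2
import numpy as np

def crop_predicted_region(grey_image, low=50, high=150, margin=4):
    """Crop the frame that encloses pixels whose intensity lies in [low, high].

    grey_image: uint8 single-channel prediction from the codec output layer.
    Returns (crop, (x, y, w, h)), or (None, None) if no pixel falls in the band.
    """
    mask = cv2.inRange(grey_image, low, high)          # binary mask of in-band pixels
    coords = cv2.findNonZero(mask)
    if coords is None or len(coords) == 0:
        return None, None
    x, y, w, h = cv2.boundingRect(coords)              # minimal enclosing rectangle
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    x1 = min(x + w + margin, grey_image.shape[1])
    y1 = min(y + h + margin, grey_image.shape[0])
    return grey_image[y0:y1, x0:x1], (x0, y0, x1 - x0, y1 - y0)
```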

During initialization, the network groups all of the original images and their ground-truth labels into batch data sets with a similar number of data points. The training parameters are supplied in sequence during the training phase. The detailed process for the dual codecs is shown in Table 1.

Table 1 The process for the dual codecs, deducer and inducer

3 Proposed region proposal network

The input to the local RPN is the feature map from the last convolutional layer of the front codec's encoder. Forward propagation generates a feature map with higher dimensions. The RPN produces suggested regions and regional scores from this feature map, applies NMS with a threshold of 0.7 on the regional scores, and outputs the top-100 proposals to the RoI pooling layer. The features of the recommended regions are extracted directly from the input feature map and pass through a fully connected layer. The RPN then outputs bounding boxes after regression. Figure 4 shows the functional diagram of the region proposal network.

Fig. 4
figure 4

The diagram of the ordinary region proposal network

Using the cropped feature map, the local RPN performs a convolutional operation with a 3 × 3 kernel and then splits into two branches: one for classification and one for position regression. The first branch performs a foreground/background classification using a 1 × 1 kernel, normalized by a softmax. The positional accuracy of an anchor frame initially generated from the center of a feature point is low, so an initial positional regression adjustment is made in the RPN. The operation at the bottom of Fig. 4 traces the regression route: the offset of the center point (x, y) and the scaling ratio of (h, w) are calculated after 1 × 1 filtering. The outputs combine the classification (foreground and background), the positioning regression, and the original map information. The candidate regions are proposed, and anchor boxes are positioned for the first time in the local RPN.
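The sketch below shows one way to realize this two-branch head (a 3 × 3 convolution followed by 1 × 1 classification and regression branches); the channel widths and the number of anchors per location are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalRPNHead(nn.Module):
    """Head of the local RPN: a 3x3 conv followed by two 1x1 branches.

    One branch scores foreground/background per anchor (softmax-normalized);
    the other regresses the centre offset (dx, dy) and the scale of (h, w).
    The channel widths are assumptions; k is the number of anchors per location.
    """
    def __init__(self, in_channels=64, mid_channels=256, k=15):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)    # foreground/background per anchor
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)    # (dx, dy, dh, dw) per anchor
        self.k = k

    def forward(self, feature_map):
        h = F.relu(self.conv(feature_map))
        logits = self.cls(h)                            # (B, 2k, H, W)
        b, _, fh, fw = logits.shape
        scores = F.softmax(logits.view(b, 2, self.k, fh, fw), dim=1)  # softmax over fg/bg
        deltas = self.reg(h).view(b, self.k, 4, fh, fw)
        return scores, deltas
```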

Two loss functions train the local RPN: the classification loss function and the positioning regression loss function. The operations of the local region proposal network and the detection network differ between the training and recall stages. The foreground predicted in anchor frames by the local region proposal network is unreliable if the network has not yet learned classification and location regression at the initial stage. Therefore, during training, the network uses the anchor frames and the training metric, IoU, to select training samples. It identifies the anchor boxes among the top 2000 foreground predictions, then labels anchor boxes with an IoU > 0.7 as positive samples and those with an IoU < 0.3 as negative samples; the remainder are omitted. 128 boxes each from the positive and negative samples participate in training, and the positive boxes are used for positioning regression. Equations (3) and (4) define the loss function for classification using the cross-entropy function. \({p}_{i}\) is the probability that the training data belongs to one of the foreground classes, and \({p}_{i}^{*}\) is the label: \({p}_{i}^{*}\) is 1 for positive samples and 0 otherwise. \({N}_{cls}\) is the number of training samples.

$$\text{classification loss of RPN}= \frac{1}{{N}_{cls}}\sum_{i}{L}_{cls}({p}_{i},{p}_{i}^{*})$$
(3)
$${L}_{cls}({p}_{i},{p}_{i}^{*})= -({p}_{i}^{*}\log {p}_{i}+(1-{p}_{i}^{*})\log (1-{p}_{i}))$$
(4)

The positioning regression loss function is defined in (5) and (6), where \({t}_{i}^{*}\) is a location vector that records the displacement of the coordinates of the center point of the anchor box predicted by the network and the scale of the length and width of the frame, and \({t}_{i}\) is the labeled location of the anchor box to be shifted and scaled. \({N}_{reg}\) is the number of training samples. A smoothing function [9] is used for the loss function.

$$\text{regression loss of RPN}: \frac{1}{{N}_{reg}}\sum_{i}{p}_{i}^{*}{L}_{reg}({t}_{i},{t}_{i}^{*})$$
(5)
$${L}_{reg}\left({t}_{i},{t}_{i}^{*}\right)=R({t}_{i}-{t}_{i}^{*})$$
(6)
$$R\left(x\right)={\mathrm{Smooth}}_{L1}\left(x\right)=\begin{cases}0.5{x}^{2} & \text{if } \left|x\right|\le 1\\ \left|x\right|-0.5 & \text{otherwise}\end{cases}$$
(7)
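A brief sketch of how Eqs. (3)–(7) might be implemented; the normalization of the regression loss by the number of positive anchors is an assumption.

```python
import torch
import torch.nn.functional as F

def rpn_classification_loss(p, p_star):
    """Eqs. (3)-(4): binary cross-entropy averaged over the N_cls sampled anchors.

    p: predicted foreground probabilities for the sampled anchors, shape (N,);
    p_star: float anchor labels, 1.0 for positive and 0.0 for negative samples.
    """
    return F.binary_cross_entropy(p, p_star)

def rpn_regression_loss(t, t_star, p_star):
    """Eqs. (5)-(7): smooth-L1 loss on box offsets, applied only to positive anchors.

    t, t_star: predicted and target location vectors, shape (N, 4); p_star: labels (N,).
    Normalizing by the number of positive anchors is an assumption.
    """
    diff = (t - t_star).abs()
    smooth_l1 = torch.where(diff <= 1.0, 0.5 * diff ** 2, diff - 0.5)   # Eq. (7)
    per_anchor = smooth_l1.sum(dim=1) * p_star                          # only positives count
    return per_anchor.sum() / p_star.sum().clamp(min=1.0)
```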

The loss function for the detection network is also defined in terms of classification and regression. An IoU \(\ge\) 0.5 indicates a positive sample, and 0.1 \(\le\) IoU \(<\) 0.5 indicates a negative sample. At this stage, the candidates are classified as either polyps or false positives (intestinal wall), so T, the number of classes, is 2. The classification loss function for the detection network is the same as that for the local region proposal network. The positioning regression function also uses the smooth function. Table 2 shows the pseudo-code for the position regression process for the modified RPN.

Table 2 The proposed local RPN algorithm
$$\text{classification loss of Detector}: \frac{1}{{N}_{cls}}\sum_{i}{L}_{cls}\left({S}_{i},{y}_{i}\right)$$
(8)
$${L}_{cls}= -\sum_{j=1}^{T}{y}_{j}\log {S}_{j}$$
(9)

4 Hard attention model for instance segmentation

In the second stage of the proposed system, a hard attention model classifies the cropped polyp image segmented by the dual codecs as an adenoma, a hyperplastic polyp, or intestinal wall (non-polyp). The hard attention model comprises four components: the glimpse network, the core network, the action network, and the location network. Each of these networks is constructed from a few neural network layers.
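The sketch below shows one possible wiring of these four components; the glimpse size, layer widths, the use of a GRU for the core network, and the Gaussian location policy are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class HardAttentionClassifier(nn.Module):
    """Wiring of the four components named in the text; layer sizes are assumptions.

    glimpse network -> core network (recurrent) -> action network (class scores)
                                                -> location network (next glimpse position)
    """
    def __init__(self, glimpse_size=16, feat=128, hidden=256, n_classes=3):
        super().__init__()
        self.glimpse_size = glimpse_size
        self.glimpse_net = nn.Sequential(                 # encodes a patch plus its location
            nn.Linear(glimpse_size * glimpse_size + 2, feat), nn.ReLU())
        self.core = nn.GRU(feat, hidden, num_layers=3, batch_first=True)
        self.action_net = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                        nn.Linear(64, n_classes))      # two-layer classifier
        self.location_net = nn.Linear(hidden, 2)          # mean of the next glimpse location

    def glimpse(self, image, loc):
        """Extract a square patch whose top-left corner is derived from loc in [-1, 1]^2."""
        b, _, h, w = image.shape
        g = self.glimpse_size
        cx = ((loc[:, 0] + 1) / 2 * (w - g)).long()
        cy = ((loc[:, 1] + 1) / 2 * (h - g)).long()
        patches = torch.stack([image[i, 0, cy[i]:cy[i] + g, cx[i]:cx[i] + g]
                               for i in range(b)])
        return patches.flatten(1)

    def forward(self, image, n_glimpses=6):
        b = image.size(0)
        loc = torch.zeros(b, 2, device=image.device)      # start near the patch centre
        h_state, log_probs = None, []
        for _ in range(n_glimpses):
            g = self.glimpse_net(torch.cat([self.glimpse(image, loc), loc], dim=1))
            out, h_state = self.core(g.unsqueeze(1), h_state)
            mean = torch.tanh(self.location_net(out[:, -1]))
            dist = torch.distributions.Normal(mean, 0.1)  # stochastic location policy
            sample = dist.sample()
            log_probs.append(dist.log_prob(sample).sum(dim=1))
            loc = sample.clamp(-1, 1)
        return self.action_net(out[:, -1]), torch.stack(log_probs, dim=1)
```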

In the glimpse network, an image-capturing window is placed over the image using suggestions from the location network, which is trained using RL. The local image patch in the window is sent to the core network, a three-layer recurrent neural network, for fine-grained feature extraction. The core network output contains previous experiences recorded in the hidden state and is used as the state and feature vector for the location and classification networks. The classification is performed by a simple two-layer network, which recognizes the polyps in the cropped image. The classification accuracy serves as the reward for the RL module. The location network uses RL with an actor-critic structure [19] and is a simple neural network. Training the classification also uses a cross-entropy function, as shown in (10), where \({Pred}_{b}\) is the probability of the \({b}^{th}\) sample in the batch, \({T}_{n}\), as calculated using a softmax function.

$$\text{classification loss}=-\sum_{b=1}^{B}\log {Pred}_{b}$$
(10)

RL in the localization network uses the classification outcomes as rewards to calculate the policy gradient [20], as shown in (11). The critic network learns the baseline BL by error backpropagation, for which the loss function is defined in (13). The algorithm for the classification network is listed in Table 3.

Table 3 The algorithm for the hard attention model for classification and localization
$$\text{Policy gradient}={\nabla \overline{R}}_{\theta }\approx \frac{1}{N\times {T}_{n}}\sum_{n=1}^{N}\sum_{t=1}^{{T}_{n}}\left(R\left({\tau }^{n}\right)-BL\right)\nabla \log {p}_{\theta }\left({a}_{t}^{n}|{s}_{t}^{n}\right)$$
(11)
$$\text{Location loss}={\overline{R}}_{\theta }=\frac{1}{N\times {T}_{n}}\sum_{n=1}^{N}\sum_{t=1}^{{T}_{n}}R\left({\tau }^{n}\right)\nabla \log {p}_{\theta }\left({a}_{t}^{n}|{s}_{t}^{n}\right)$$
(12)
$$\text{Baseline loss}=\frac{1}{N\times {T}_{n}}\sum_{n=1}^{N}\sum_{t=1}^{{T}_{n}}{\left({R}_{n,t}-{BL}_{n,t}\right)}^{2}$$
(13)
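A compact sketch of how Eqs. (11)–(13) might be computed, treating the classification outcome as a terminal reward per trajectory; the tensor shapes and the use of a mean over trajectories and glimpse steps are assumptions.

```python
import torch

def reinforce_losses(log_probs, rewards, baselines):
    """Sketch of Eqs. (11)-(13): REINFORCE with a learned baseline.

    log_probs: (N, T) log pi_theta(a_t | s_t) per trajectory and glimpse step;
    rewards:   (N,) terminal reward per trajectory (e.g. 1 if classified correctly);
    baselines: (N, T) critic predictions BL_{n,t}.
    Returns the location-network loss and the baseline (critic) loss.
    """
    R = rewards.unsqueeze(1).expand_as(log_probs)               # broadcast R(tau^n) over steps
    advantage = (R - baselines).detach()                        # (R(tau^n) - BL), no grad to critic
    location_loss = -(advantage * log_probs).mean()             # gradient ascent on Eq. (11)
    baseline_loss = torch.nn.functional.mse_loss(baselines, R)  # Eq. (13)
    return location_loss, baseline_loss
```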

5 Experiments

The proposed DIN uses features learned from open medical data sets: CVC [21, 22], Geana [23], and a data set provided by Zhejiang University (ZJU). Images in the first two contain polyps under various lighting conditions, zooming, and optical magnification, and are labeled in binary. The data are augmented by applying rotation and reflection operations to the original images. All images are resized to a resolution of 128 × 128. Images that contain no polyps are randomly sampled from the data set collected at the hospitals of Zhejiang University (ZJU). The experiments verify the performance of the DIN in terms of positioning and classification. The positioning and classification experiments use five-fold cross-validation [24]: all data are scaled uniformly to 256 × 256 before training, and the samples are randomly divided into five parts, four of which are used for training and the remaining one as test data to verify the effectiveness of the trained model.

The parameter α in the DIN has a value of 0.5, and the batch size is 32. The learning rates for the codecs are 0.0001. For the local region proposal network, to accommodate the different lesion sizes in colonoscopic images, the anchor frame sizes are 1 × 1, 2 × 2, 4 × 4, 8 × 8, and 16 × 16, and the aspect ratios are 0.5, 1, and 2, so 15 anchor frames are generated for each unit feature on the feature map. The network uses only 128 positive and 128 negative samples per iteration for training, so each round of training uses 256 samples. The learning rate is 0.001 for the first 80,000 iterations and 0.0001 for the next 30,000 iterations.
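For illustration, the following sketch generates the 15 anchor shapes (five scales times three aspect ratios) and places them on a feature map; keeping the anchor area fixed at each scale while varying the ratio is one common convention and an assumption here, as is the stride-based placement.

```python
import numpy as np

def generate_anchors(scales=(1, 2, 4, 8, 16), ratios=(0.5, 1.0, 2.0)):
    """Generate the 15 base anchor shapes (5 scales x 3 aspect ratios) per feature point.

    Each anchor is returned as (w, h); the area of a scale-s anchor is kept at s*s
    while the ratio h/w is varied (a common convention, assumed here).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            anchors.append((s / np.sqrt(r), s * np.sqrt(r)))
    return np.array(anchors)                 # shape (15, 2)

def anchors_for_map(fh, fw, stride, base=generate_anchors()):
    """Centre every base anchor on each location of an (fh, fw) feature map."""
    ys, xs = np.meshgrid(np.arange(fh), np.arange(fw), indexing="ij")
    centres = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)  # (fh, fw, 2)
    wh = base.reshape(1, 1, -1, 2)
    c = centres[:, :, None, :]
    boxes = np.concatenate([c - wh / 2, c + wh / 2], axis=-1)  # (fh, fw, 15, 4) as x1,y1,x2,y2
    return boxes.reshape(-1, 4)
```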

The results are the average of the five-fold validations. Table 4 compares the IoU values for the results produced directly by the dual codec and those after adjustment using the local RPN. As shown in Table 4, the IoU values for cropped frames produced by the RPN are greater than 0.5 and are 4–6% more precise than those without the adjustment. Figure 5 shows images from the experiments. The first column shows the original images, and the second shows the corresponding labeled images. The middle column shows the grey-level result from the rear codec, and the next shows the bounding box that OpenCV draws. The result of using the RPN is shown in the last column. Figure 6 shows noisy frames that appear in the results taken directly from the codec without RPN adjustment. The improvement when the RPN is used is seen in the image in the fifth column.

Table 4 IoU value with/without local RPN adjustment
Fig. 5
figure 5

Visual results before/after the local RPN is used in the experiments

Fig. 6
figure 6

Noisy frames appear in the results directly taken from the codec without local RPN adjustments

5.1 Comparison of the classification of cropped images with IoU \(\ge\) 0.5

Comparisons with two well-known models, U-Net [25] and LeNet [26], are made in the five-fold validation test. Frames with an IoU \(\ge\) 0.5 are fed to the hard attention model, LeNet-5, and U-Net. The U-Net architecture extends a fully convolutional network and is modified to allow better segmentation for medical imaging by adding skip connections between the down-sampling and up-sampling paths using a concatenation operator. The skip connections add local information to global information during upsampling, so location information from the down-sampling path is combined with contextual information from the up-sampling path. LeNet-5 comprises two convolutional and average pooling layers, a flattening convolutional layer, two fully connected layers, and a softmax function. Table 5 shows the accuracy with which adenoma, hyperplastic polyp, and intestinal wall are classified using the attention model with a sparse reward and a dense reward, and using LeNet-5. The attention model with either a sparse or a dense reward function performs significantly better than LeNet-5.

Table 5 Comparison with the state-of-the-art methods for classification in cropped images with IoU ≥ 0.5

This experiment verifies the hard attention model's classification ability compared with state-of-the-art methods. Since all cropped images have sufficiently high IoU values, the prediction falls into two categories: hyperplastic polyp and adenoma. The ratio of training samples for hyperplastic polyp to adenoma is 3:1. Table 5 shows the experimental results. This experiment demonstrates that the DIN better recognizes targets in surroundings with low intensity contrast (95.78%, 94.28%, 94.51%).

5.2 Comparison of the classification of cropped images with a probability of polyp \(\ge\) 0.5

This experiment determines the effect of precise cropping on the accuracy of classification by the DIN. Table 6 shows the classification results of the DIN for two different sources of cropped images. The column labeled w/o local RPN shows the classification accuracy for images extracted from the bounding boxes calculated using OpenCV on the gray images produced by the dual codec. The last column, w. local RPN, shows the results for images that the RPN regards as foreground.

Table 6 Classification for the DIN w. and w/o local RPN for the cropped image for which the probability of polyp ≥ 0.5

Figures 7 and 8 show the trace of the glimpse for the DIN. The examples in Fig. 7 are accurately classified. If the initial location of the glimpse falls in a low-contrast region, the DIN directs the next glimpse to a more distinguishable region, such as a light-reflecting spot or the margin of a lesion. The failure cases in Fig. 8 indicate that some improvements still need to be made. Some failures are caused by incomplete cropping of polyp images, but most are due to low contrast, which confuses the DIN.

Fig. 7
figure 7

The trace of the glimpse for accurate classification using the DIN

Fig. 8
figure 8

The trace of the glimpse for inaccurate classification by the DIN

6 Conclusions

Models for biomedical image segmentation are often trained with heavy supervision, relying on pairs of images and corresponding labels. Nevertheless, acquiring segmentations of anatomical regions is tremendously expensive in many medical settings. In this paper, we present a hetero-encoder segmentation strategy that is flexible and adaptive, so training the model does not require heavy supervision and can proceed continuously. This study proposes a compositional model that uses dual encoder–decoders for image segmentation, an attention model working with these dual encoders, and a simplified RPN to bound the polyps (the objects of interest) more accurately. The proposed model leverages an available autoencoder that provides a set of segmentation priors from paired segmentation images. In addition, the model uses a dual network to map a colonoscopic image to an aggregated set of exemplified images as data augmentation, and it focuses only on the region of segmented targets predicted by the dual CNN for region proposals.

The performance of the proposed model is demonstrated in real-time intestinal polyp detection. The task is challenging because it is difficult to interpret and distinguish tissue contrast in colonoscopic images. In computer-aided diagnosis, precise localization of the object of interest is key to successfully classifying hyperplastic polyps and adenomas of various shapes and sizes. The validation test results show that accurate localization increases the classification performance of the proposed system. The outcomes also demonstrate that the system is more sensitive to polyps and more accurate than state-of-the-art methods.

Furthermore, the proposed model consistently increases prediction accuracy for different data sets and sizes of training sets while achieving state-of-the-art performance without requiring deep CNN models. Although generic RPN-based algorithms often perform better, pre-training the feature extractor is an essential procedure that consumes considerable computation time. Future work will study a fuzzy partition process that extracts candidate RoIs from different levels of discretized images, instead of from the cropped segmentation images, to address this problem.