1 Introduction

Ultrasound is the most commonly used imaging modality in prenatal examination and disease diagnosis owing to its low cost, non-invasiveness, real-time imaging and freedom from radiation. Recently, three-dimensional ultrasound (3DUS) has been used in many clinical applications, for instance, fetal intracranial volume segmentation for assessing brain development [1] and prostate segmentation in transrectal ultrasound (TRUS) for the diagnosis of prostate cancer [2]. However, image segmentation in 3DUS is challenging because acoustic shadows, speckle noise and low tissue contrast can cause missing boundaries and structures [3], leading to inaccurate results.

With the development of 3D convolutional neural networks (CNNs), several 3D models have been proposed for image segmentation in 3DUS [4, 5]. Although 3D CNNs can generate plausible results by combining contextual and spatial information, they still have critical limitations in real clinical use for ultrasound. Firstly, the lack of pre-trained 3D models means that training requires much more volume data, which is usually difficult to acquire. Secondly, to reduce memory consumption, 3D CNNs often have to reduce the batch size or divide a volume into several cubes, which may decrease the robustness of the model. Thirdly, deploying 3D CNNs in many ultrasound imaging systems is impractical because of limited computational units and memory (e.g., no GPU).

On the other hand, 2D networks generally have lower computational complexity and more options for pre-trained models. Sharing the same merits, 2.5D networks can additionally exploit more contextual and spatial information, which makes them well suited to volume segmentation. Wang et al. [6] segmented lung nodules in CT by voxel-by-voxel classification with three orthogonal image patches as input. Mortazi et al. [7] applied a 2D fully convolutional network (FCN) to segment the left atrium and proximal pulmonary veins in MRI on cross-sectional images along the axial, sagittal and coronal directions separately, and then fused the 2D segmentation results into the final 3D output. Both [6] and [7] require three individual CNNs to extract features or segment objects, which may result in parameter redundancy and inefficient deployment.

In this paper, we propose an efficient and universal method for single organ segmentation in 3DUS. Our contributions are twofold. (i) We propose a novel 2.5D volume segmentation framework that achieves high accuracy with low complexity. To the best of our knowledge, this is the first successful attempt to employ a 2.5D end-to-end segmentation network for this problem. (ii) To further improve performance in regions of low signal-to-noise ratio (SNR), we incorporate a new mechanism of adaptively rectified supervision (ARS) at the training stage. Specifically, a pixel-wise reweighted dice (PRD) loss is calculated to improve the sensitivity of segmentation, and an image-wise shape regularization (ISR) loss is calculated to provide domain knowledge of shape for more plausible results. The proposed method is extensively evaluated on two different tasks: fetal intracranial volume segmentation (132 volumes) and TRUS prostate volume segmentation (18 volumes).

2 Methods

Figure 1 illustrates the proposed 2.5D segmentation framework with ARS, which employs a pixel-wise reweighted dice loss and an image-wise shape regularization loss at the training stage to improve the performance of the segmentation network.

Fig. 1. Illustration of the proposed ARS-Net

2.1 Data Preprocessing

The task of 3D segmentation is converted to 2.5D by radially resampling an input volume into multiple planes (coronal) together with their orthogonal planes (sagittal) and 45° diagonal planes. For the 2.5D input, the 45° diagonal plane is adopted rather than the axial plane for two reasons: (i) in radial sampling all target planes share the same axial plane, which therefore cannot provide additional contextual and spatial information; (ii) empirically, the 45° diagonal plane in 3DUS is often less sensitive to acoustic shadow. In addition, all 2D planes are sampled radially around a central axis that is automatically detected by an STN [8], in order to reduce the variability of image positioning.
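The radial resampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the central axis is the first array axis passing through the volume centre (the STN-based axis detection is not reproduced), that planes are spaced uniformly over 180°, and that nearest-neighbour sampling is sufficient. The names `radial_plane` and `make_25d_inputs` are hypothetical.

```python
import numpy as np

def radial_plane(volume, theta, size=None):
    """Sample the plane that contains the central axis at in-plane angle theta.

    Assumption: the central axis is axis 0 through the volume centre.
    Nearest-neighbour sampling is used to keep the sketch short.
    """
    nz, ny, nx = volume.shape
    cy, cx = (ny - 1) / 2.0, (nx - 1) / 2.0
    size = size or min(ny, nx)
    r = np.arange(size) - (size - 1) / 2.0  # signed distance from the axis
    ys = np.clip(np.rint(cy + r * np.sin(theta)).astype(int), 0, ny - 1)
    xs = np.clip(np.rint(cx + r * np.cos(theta)).astype(int), 0, nx - 1)
    return volume[:, ys, xs]  # shape (nz, size)

def make_25d_inputs(volume, n_planes=60):
    """Stack each target plane with its orthogonal and 45-degree planes."""
    thetas = np.arange(n_planes) * np.pi / n_planes  # assumed uniform spacing
    samples = []
    for t in thetas:
        channels = [radial_plane(volume, t + off)
                    for off in (0.0, np.pi / 2, np.pi / 4)]
        samples.append(np.stack(channels, axis=-1))
    return np.stack(samples)  # shape (n_planes, nz, size, 3)
```

Each returned sample is a 3-channel image whose first channel is the target plane; in this sketch the segmentation label would be taken from the same target plane.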

2.2 Network Architecture

We choose a 2D FCN [9] as our basic segmentation network, but any other 2D segmentation network, such as U-Net [13], is also applicable. Three modifications of the basic segmentation network are made. Firstly, a pre-trained VGG-16 [10] is used as the backbone architecture. Secondly, the PRD loss is calculated with an adaptive weight map (AWM) to improve performance in low SNR regions. The AWM is generated by a specific attention mechanism based on: (i) the ground truth segmentation, (ii) the probability map of the predicted segmentation from the previous epoch, and (iii) the AWM from the previous epoch. Thirdly, the ISR loss is adopted to avoid shape distortion by adding an auxiliary discriminator network (DN), which is similar to [11] except that its backbone is replaced by the same pre-trained VGG-16 [10]. The proposed ARS mechanism consists of the PRD loss and the ISR loss.

2.3 Adaptively Rectified Supervision

There are already many powerful CNN-based segmentation methods for natural images. However, segmentation in 3DUS is challenging because ambiguous or missing boundaries (low SNR regions) often lead to irregular and erroneous segmentation results. To overcome this issue, we introduce a new ARS loss function to replace the regular dice loss function of the FCN. Specifically, the PRD loss \( L_{prd} \) and the ISR loss \( L_{isr} \) are combined to jointly supervise the generation of the segmentation probability map. The ARS loss function is defined as:

$$ L_{ars} \left( {X,W,G,Y} \right) = w_{prd} L_{prd} \left( {X,W,G} \right) + w_{isr} L_{isr} \left( {X,G,Y} \right) $$
(1)

where \( X \) is an input image, \( G \) and \( Y \) denote the corresponding ground truth of pixel-wise label (segmentation) and image-wise shape authenticity (real or fake) respectively, \( W \) is the corresponding AWM, \( w_{prd} \) and \( w_{isr} \) are the weights of \( L_{prd} \) and \( L_{isr} \). The details of \( L_{prd} \) and \( L_{isr} \) will be explained in the following paragraphs.

Pixel-Wise Reweighted Dice Loss.

Since the ratio of low SNR regions to the whole volume is usually low (<10%), most segmentation networks are not sensitive to those regions and may suffer from overfitting. Inspired by focal loss [12], the PRD loss is designed to address this imbalance. Specifically, based on the AWM (see Sect. 2.4 for details), the dice loss is recalculated as the PRD loss, which is defined as:

$$ L_{prd} \left( {X,W,G} \right) = \frac{{2\mathop \sum \nolimits_{j} (W_{j,k} *G_{j} )*\left( {W_{j,k} *P_{j,k} } \right)}}{{\mathop \sum \nolimits_{j} (W_{j,k} *G_{j} ) + \mathop \sum \nolimits_{j} \left( {W_{j,k} *P_{j,k} } \right)}} $$
(2)

where \( G_{j} \in \left\{ {0,1} \right\} \) is the ground truth label at location \( j \). \( W_{j,k} \in \left( {0,1} \right) \) and \( P_{j,k} = \frac{1}{{1 + e^{{ - z_{j,k} }} }} \in \left( {0,1} \right) \) represent the value of AWM and the probability of predicted segmentation at location \( j \) in epoch \( k \), respectively. \( z \) denotes the output of the last convolutional layer in FCN.
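A minimal NumPy sketch of Eq. (2) is given below for illustration. The expression in Eq. (2) is an overlap score that grows as the prediction improves, so the sketch assumes that the quantity actually minimized is one minus this score; that convention, the small constant `eps` and the function names are assumptions rather than details stated in the text.

```python
import numpy as np

def prd_score(W, G, P, eps=1e-7):
    """Weighted dice overlap, following the right-hand side of Eq. (2)."""
    wg = W * G  # reweighted ground truth
    wp = W * P  # reweighted predicted probability
    return 2.0 * np.sum(wg * wp) / (np.sum(wg) + np.sum(wp) + eps)

def prd_loss(W, G, P):
    """Assumed minimization target: 1 - weighted dice overlap."""
    return 1.0 - prd_score(W, G, P)
```

With \( W \) equal to one everywhere the score reduces to the ordinary soft dice coefficient, which is the situation in the first epoch according to Eq. (4).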

Image-Wise Shape Regularization Loss.

To further improve the specificity of segmentation, the discriminator network mentioned in Sect. 2.2 is applied to calculate ISR loss which can help generate robust results with plausible shapes. A training image and its output prediction from FCN are used as inputs for the discriminator network to identify real or fake shape based on binary cross entropy. The ISR loss function is defined as:

$$ L_{isr} \left( {X,G,Y} \right) = Y\log P + \left( {1 - Y} \right)\log \left( {1 - P} \right) $$
(3)

where \( {\text{Y}} \in \left\{ {0,1} \right\} \) and \( {\text{P}} \in \left( {0,1} \right) \) denote the ground truth label and the prediction of real or fake shape, respectively.
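For illustration, Eq. (3) and the combination in Eq. (1) can be sketched in NumPy as below. Eq. (3) is printed without a leading minus sign; the sketch assumes the standard negated binary cross entropy so that a lower value is better. The clipping constant `eps`, the default weights and the function names are likewise assumptions.

```python
import numpy as np

def isr_loss(Y, P, eps=1e-7):
    """Binary cross entropy between shape labels Y and discriminator outputs P.

    Assumption: the standard negative sign is applied to Eq. (3).
    """
    P = np.clip(P, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(Y * np.log(P) + (1.0 - Y) * np.log(1.0 - P))))

def ars_loss(l_prd, l_isr, w_prd=1.0, w_isr=1.0):
    """Weighted sum of the two loss terms, as in Eq. (1)."""
    return w_prd * l_prd + w_isr * l_isr
```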

2.4 Adaptive Weight Map

The AWM is generated by a specific attention mechanism that adaptively decreases the weights in regions of high accuracy while maintaining the weights in regions of low accuracy (low SNR regions). In detail, the AWM \( W_{j,k} \) is iteratively updated from the ground truth segmentation \( G_{j} \), the probability map of the predicted segmentation from the previous epoch \( P_{j,k - 1} \) and the AWM from the previous epoch \( W_{j,k - 1} \). The AWM is defined as:

$$ W_{j,k} = \left\{ {\begin{array}{*{20}l} 1 \hfill & {if\; k = 1,} \hfill \\ {\left( {1 - E_{j,k - 1} } \right) + \alpha E_{j,k - 1} W_{j,k - 1} } \hfill & {otherwise.} \hfill \\ \end{array} } \right. $$
(4)
$$ E_{j,k - 1} = \left\{ {\begin{array}{*{20}l} {P_{j,k - 1} } \hfill & {if\;G_{j} = 1,} \hfill \\ {1 - P_{j,k - 1} } \hfill & {otherwise.} \hfill \\ \end{array} } \right. $$
(5)

The modulation factor \( \upalpha \) is empirically set to 0.8. When \( E_{j,k - 1} \to 1 \), the prediction \( P_{j,k - 1} \) agrees with the ground truth \( G_{j} \), which means that pixel location \( j \) is well segmented. Similarly, when \( E_{j,k - 1} \to 0 \), pixel location \( j \) is erroneously segmented. Furthermore, when \( E_{j,k - 1} \to 1 \), \( W_{j,k} \to \alpha E_{j,k - 1} W_{j,k - 1} < W_{j,k - 1} \), which implies that the weight for the PRD loss is lower in well segmented regions. On the other hand, when \( E_{j,k - 1} \to 0 \), \( W_{j,k} \to \left( {1 - E_{j,k - 1} } \right) \approx 1 \), which implies that the weight for the PRD loss is higher in poorly segmented regions. With this adaptive reweighting mechanism, regions of low accuracy contribute more to the dice loss and regions of high accuracy contribute less.
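The update in Eqs. (4) and (5) can be sketched in a few lines of NumPy. The function names are hypothetical, and the example values in the comments are chosen only to show the direction of the update.

```python
import numpy as np

def agreement(G, P_prev):
    """Eq. (5): probability assigned to the correct class at each pixel."""
    return np.where(G == 1, P_prev, 1.0 - P_prev)

def init_awm(shape):
    """Eq. (4), case k = 1: all weights start at one."""
    return np.ones(shape)

def update_awm(G, P_prev, W_prev, alpha=0.8):
    """Eq. (4), case k > 1: shrink weights where the last prediction was right."""
    E = agreement(G, P_prev)
    return (1.0 - E) + alpha * E * W_prev

# Well segmented pixel:   G = 1, P = 0.9, W = 1 -> 0.1 + 0.8 * 0.9 * 1 = 0.82
# Poorly segmented pixel: G = 1, P = 0.1, W = 1 -> 0.9 + 0.8 * 0.1 * 1 = 0.98
```

Repeating the update on a pixel that stays well segmented keeps lowering its weight (0.82 becomes about 0.69 after one more epoch with the same prediction), whereas a poorly segmented pixel stays close to one.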

2.5 Postprocessing for Final Result

For the final volume segmentation, the 2D segmentation results are combined and reconstructed by cubic-spline interpolation. An additional 3D Gaussian filter can also be applied to smooth the volume segmentation, which reduces the discontinuity of the multi-plane reconstruction.
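The optional smoothing step can be illustrated with a separable 3D Gaussian filter written in plain NumPy. This sketch covers only the smoothing, not the cubic-spline reconstruction, and the kernel width `sigma`, the truncation at three standard deviations, the zero padding at the borders and the 0.5 threshold are all assumptions; each volume dimension is assumed to be at least as long as the kernel.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_volume(prob, sigma=1.0):
    """Apply the 1D kernel along each of the three axes in turn."""
    k = gaussian_kernel1d(sigma)
    out = prob.astype(float)
    for axis in range(3):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out

def smooth_mask(prob, sigma=1.0, threshold=0.5):
    """Smooth a reconstructed probability volume and binarize it."""
    return smooth_volume(prob, sigma) >= threshold
```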

3 Experiments

Materials.

Experiments were carried out on two representative and challenging 3DUS datasets acquired with DC-8 and Resona 7 ultrasound imaging systems (Mindray, Shenzhen, China). The first dataset consists of 132 fetal brain volumes (94 for training, 38 for testing) with gestational age (GA) ranging from 20 to 32 weeks, scanned with a curved array volume probe. The second dataset consists of 18 TRUS prostate volumes (10 for training, 8 for testing), scanned with an endocavity volume probe. These two types of probe are the most commonly used for volume analysis in 3DUS. For data preprocessing, all images were standardized by resampling to the same resolution of 0.5 × 0.5 × 0.5 mm, and all 2.5D images were resized to 224 × 224.

Implementation Details.

The proposed network was implemented with Keras on TensorFlow, and both training and testing were performed on an NVIDIA V100 GPU with 16 GB of memory. Each volume was radially resampled into 60 planes (each plane combined with its orthogonal and 45° views as a 3-channel input for ARS-Net), which significantly increases the number of training samples and thus alleviates overfitting caused by limited volume data. We further adopted data augmentation (flipping, cropping, rotation and translation) at the training stage. An initial FCN was pre-trained with the PRD loss using a batch size of 32 and a learning rate of 1e-3 (decreased by a factor of 0.95 every epoch). After that, the pre-trained 2.5D FCN and a DN were trained alternately. In every epoch, the DN (ISR loss only, lr = 1e-4) was trained with 3 batches, followed by the 2.5D FCN (PRD + ISR loss, lr = 1e-5) trained with 1 batch. The optimizer was Adam with momentum set to 0.9, and \( L_{prd} \) and \( L_{isr} \) were equally weighted. The total training time was about 6–7 h.
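The learning-rate decay and the alternating update order can be sketched in plain Python. This is one reading of the description, not the authors' training script: it assumes epochs are counted from zero and that the 3:1 pattern repeats as a cycle over batches. Both function names are hypothetical.

```python
def pretrain_lr(epoch, base_lr=1e-3, decay=0.95):
    """Learning rate of the pre-training stage after `epoch` epochs."""
    return base_lr * decay ** epoch

def alternating_schedule(n_steps, dn_batches=3, fcn_batches=1):
    """Which network is updated at each batch: 3 DN batches, then 1 FCN batch."""
    cycle = ["DN"] * dn_batches + ["FCN"] * fcn_batches
    return [cycle[i % len(cycle)] for i in range(n_steps)]

# Learning rates assumed from the text for the alternating stage.
STAGE_LR = {"DN": 1e-4, "FCN": 1e-5}
```

For example, `alternating_schedule(8)` yields three discriminator updates, one FCN update, and then the same pattern again.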

Segmentation Performance.

We compared the proposed method with several advanced methods, including 3D FCN [4], 2D FCN [9], U-Net [13], DAF [2] and a two-stage framework of FCN + LSTM [14]. We also demonstrate the efficacy of the proposed method with different loss functions including PRD alone and PRD + ISR. The evaluation metrics included Dice Similarity Coefficient (DSC), Hausdorff distance (HD, in mm), Conformity Coefficient (CC) and Jaccard Index.
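For reference, the four metrics can be computed from binary masks as sketched below. The text does not give their formulas, so the sketch assumes the common definitions: DSC = 2TP/(2TP + FP + FN), Jaccard = TP/(TP + FP + FN), and the conformity coefficient as 1 − (FP + FN)/TP, which equals (3·DSC − 2)/DSC. The Hausdorff distance is computed by brute force over all foreground voxels, a simplification of a surface-based computation that is practical only for small masks. The 0.5 mm spacing follows the resampling resolution stated in the Materials paragraph; the function names are hypothetical.

```python
import numpy as np

def overlap_metrics(pred, gt):
    """Return (DSC, Jaccard, conformity coefficient) for two binary masks.

    Assumes the masks overlap (TP > 0).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = float(np.sum(pred & gt))
    fp = float(np.sum(pred & ~gt))
    fn = float(np.sum(~pred & gt))
    dsc = 2 * tp / (2 * tp + fp + fn)
    jaccard = tp / (tp + fp + fn)
    cc = 1.0 - (fp + fn) / tp
    return dsc, jaccard, cc

def hausdorff_mm(pred, gt, spacing=0.5):
    """Symmetric Hausdorff distance in mm between the foreground voxel sets."""
    a = np.argwhere(pred) * spacing
    b = np.argwhere(gt) * spacing
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```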

Figure 2 shows the results of fetal intracranial volume segmentation with different methods and loss functions. With the PRD loss, the proposed ARS-Net is robust to blurry boundaries and low SNR regions, while the ISR loss corrects results with irregular shapes. Overall, the PRD loss improves the sensitivity of segmentation, and the ISR loss further improves the specificity.

Fig. 2. Comparison of segmentation results in two fetal brain volumes. Top row: target planes with manual labels (green), 2D FCN (blue), FCN + PRD (ours, yellow) and FCN + PRD + ISR (ours, ARS-Net, red). Bottom row: evaluation of ARS-Net based on Hausdorff distance [mm] (Color figure online).

Table 1 lists the quantitative comparison results of fetal intracranial volume segmentation. The proposed ARS-Net (PRD + ISR) improves DSC by 3.01% and 9.37% compared with the 2D and 3D FCN, respectively, and achieves accuracy comparable to FCN + LSTM while being 9 times faster, as shown in Table 3. ARS-Net also reaches the lowest mean Hausdorff distance (1.31 mm). Since the bi-parietal diameter (BPD) of a normal fetus (GA 20–32 weeks) ranges from 46 to 80 mm, this implies that the accuracy of the proposed method is acceptable for clinical use. Table 2 lists the quantitative comparison results of TRUS prostate volume segmentation. The proposed method is slightly more accurate than FCN + LSTM [14] and DAF [2], and it is significantly faster and smaller, as shown in Table 3. It is worth noting that the accuracy of the 3D FCN is lower than that of the 2D FCN, mainly because of limited training samples, small batch size and the lack of suitable pre-trained models. The proposed method shows advantages in accuracy, speed, model size and memory occupation, and it can be an ideal solution for deployment in most ultrasound imaging systems.
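As a rough worked check of the scale argument, the mean Hausdorff distance can be expressed as a fraction of the BPD range quoted above. These ratios are derived here only for illustration and are not figures reported in the tables:

$$ \frac{1.31\;{\text{mm}}}{80\;{\text{mm}}} \approx 1.6\% ,\quad \frac{1.31\;{\text{mm}}}{46\;{\text{mm}}} \approx 2.8\% $$

That is, the mean worst-case boundary error is below about 3% of the BPD across the stated gestational age range.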

Table 1. Quantitative comparison of fetal intracranial volume segmentation
Table 2. Quantitative comparison of TRUS prostate volume segmentation
Table 3. Runtime of different algorithms (volume size 224 × 224 × 224)

4 Conclusion

In this paper, we propose an efficient 2.5D framework that enables single organ segmentation in 3DUS images with high accuracy and low complexity. To the best of our knowledge, we are the first to use a 2.5D end-to-end segmentation network for this problem. In the proposed ARS-Net, a novel attention mechanism is introduced to reweight the dice loss at each pixel, which improves the sensitivity of segmentation in low SNR regions. Furthermore, a discriminator network is used to constrain results to plausible shapes, which improves the specificity of segmentation. Because the additional modules are used only at the training stage, the complexity of ARS-Net at the inference stage is as low as that of a regular 2D FCN. Compared with the 3D FCN, ARS-Net is more robust on small datasets. Validation on fetal brain (132 volumes) and TRUS prostate (18 volumes) shows that ARS-Net achieves DSC of 97.64% and 95.30%, respectively. Our method provides an accurate and fast volume segmentation tool for 3DUS, and it also has the potential to be applied to other imaging modalities.