
1 Introduction

Accurately segmenting subcortical structures from brain magnetic resonance images (MRIs) is a critical step in various clinical applications such as biomarker identification [1]. Manual segmentation is considered the gold standard, but it is extremely time-consuming and poorly reproducible, motivating fully-automated segmentation methods.

Deep learning based segmentation has become one of the mainstream approaches. In particular, the fully convolutional network (FCN) is the tool of choice for many image segmentation tasks in medical imaging [2, 3]. Most FCN architectures train and test on 3D or 2D patches. Typically, 3D-patch-based FCNs outperform 2D ones because they capture additional context in the direction orthogonal to the 2D patches. However, it is difficult to identify a patch size for 3D FCNs that ensures sufficient learning of all features, especially boundary features. To address this issue, in this work, we propose a novel method in a joint 3D+2D fully convolutional framework for segmenting subcortical structures.

In the past decade, owing to their ability to incorporate prior expert knowledge, multi-atlas based segmentation methods have shown great performance [4, 5]. When combined with FCNs, they perform even better [6]. A key component of multi-atlas based methods is image registration, for which non-linear algorithms usually deliver more accurate results; we employ one such algorithm, the large deformation diffeomorphic metric mapping (LDDMM) [7], in our proposed segmentation framework. We develop a multi-atlas guided 3D FCN (MF-net) which consists of a multi-atlas based encoding block and two encoding-decoding blocks. To introduce prior information into the proposed 3D MF-net, each training and testing sample consists of a 3D image patch and the several most similar atlas patches (both image patches and the corresponding label patches) obtained from the transformed atlases.

To further extract structure-specific boundary features and improve segmentation accuracy, all 2D slices of the to-be-segmented image and the corresponding structure-specific 2D probability maps obtained from the 3D MF-net at three orthogonal views (axial, sagittal, coronal) are fed into three trained 2D networks to obtain the final probability maps of all subcortical structures; a fusion operation then yields the final segmentation map. In previous work on pancreas segmentation [8], an architectural component called the Attention block was introduced into an FCN architecture and achieved promising segmentation results. An Attention block improves model sensitivity and accuracy for dense label prediction by suppressing feature activations in irrelevant regions. As such, we integrate an Attention block into U-net [2], forming a new network, the 2D AU-net.

2 Method

The proposed joint 3D+2D fully convolutional segmentation pipeline, as shown in Fig. 1, consists of a preprocessing stage, a training stage and a testing stage. In the preprocessing stage, we first perform histogram matching between a preselected image, denoted as \(I_t\), and all remaining images of the training dataset. Then a number of atlases are randomly selected from the training dataset as template atlases, and the remaining atlases of the training dataset serve as target atlases. After that, each template image is aligned to each of the target images using affine registration followed by a fast version of LDDMM [7].
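For concreteness, the histogram-matching step can be sketched with SimpleITK as below; the file names are illustrative placeholders, and the subsequent registration steps (affine followed by fast LDDMM [7]) are performed with dedicated tools not shown here.

```python
import SimpleITK as sitk

# Match each remaining training image to the preselected reference I_t.
# File names are illustrative placeholders.
reference = sitk.ReadImage("I_t.nii.gz")
for path in ["subj01.nii.gz", "subj02.nii.gz"]:
    moving = sitk.ReadImage(path)
    matched = sitk.HistogramMatching(moving, reference)  # default histogram levels / match points
    sitk.WriteImage(matched, path.replace(".nii.gz", "_matched.nii.gz"))
```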

Fig. 1. Illustration of the proposed joint 3D+2D segmentation pipeline

At the training stage, a dense training strategy is adopted. Specifically, there are N training epochs, each of which consists of n subepochs. In each subepoch, \(\mathcal {N}\) 3D target patches, whose centers are located at each subcortical structure of interest and the background, are sampled with equal probability from every target image. For each target patch \(P_I\) extracted from the target image I, as depicted in Fig. 1, the cross-correlation (CC) between \(P_I\) and each of the 3D neighboring patches \(P_J\) in the transformed template image J within a 3D cubic searching region S is calculated as

$$\begin{aligned} CC=\frac{\left( \sum _y\left( P_I(y)-\mu _{P_I}\right) \left( P_J(y)-\mu _{P_J}\right) \right) ^2}{\sum _y\left( P_I(y)-\mu _{P_I}\right) ^2\sum _y\left( P_J(y)-\mu _{P_J}\right) ^2} \end{aligned}$$
(1)

where \(\mu _{P_I}\) and \(\mu _{P_J}\) denote the mean image intensities of patches \(P_I\) and \(P_J\), respectively, and y indexes the voxels of the corresponding patch. The larger the CC, the more similar the two patches. For each target patch, we then select the K most similar atlas patches according to the CC values.
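A minimal NumPy sketch of Eq. (1) and the top-K patch selection is given below, shown for a single transformed template image; the search-region parameterization and function names are our own illustration, not the authors' implementation.

```python
import numpy as np

def cross_correlation(p_i: np.ndarray, p_j: np.ndarray) -> float:
    """Squared normalized cross-correlation between two patches, Eq. (1)."""
    a = p_i - p_i.mean()
    b = p_j - p_j.mean()
    denom = (a ** 2).sum() * (b ** 2).sum()
    return float((a * b).sum() ** 2 / denom) if denom > 0 else 0.0

def select_k_most_similar(target_patch, template, center, search_r, patch_r, k):
    """Scan a cubic search region S of radius `search_r` around `center` in one
    transformed template image; return the centers of the k best-matching patches."""
    scores = []
    cz0, cy0, cx0 = center
    for dz in range(-search_r, search_r + 1):
        for dy in range(-search_r, search_r + 1):
            for dx in range(-search_r, search_r + 1):
                cz, cy, cx = cz0 + dz, cy0 + dy, cx0 + dx
                cand = template[cz - patch_r:cz + patch_r + 1,
                                cy - patch_r:cy + patch_r + 1,
                                cx - patch_r:cx + patch_r + 1]
                scores.append((cross_correlation(target_patch, cand), (cz, cy, cx)))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scores[:k]]
```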

Each target patch and the selected K atlas patches together form a single sample. All such samples are then used to train the 3D MF-net. After that, both the 12 structure-specific probability maps of each target image obtained from the trained 3D MF-net and the target images themselves are sliced into 2D image sets at the three orthogonal views (axial, sagittal, coronal) and fed into three 2D AU-nets for training.

At the testing stage, the same preprocessing procedures are conducted, including histogram matching between \(I_t\) and the testing image, registration between the testing image and all template images, as well as testing sample creation (identifying the K most similar atlas patches for each testing patch). As shown in Fig. 1, each to-be-segmented patch and the corresponding K atlas patches of the testing image are fed into the trained 3D MF-net to obtain the 12 structure-specific probability maps. After the 3D-to-2D slicing operation, the 2D image sets at each of the three orthogonal views are input to the three trained 2D AU-nets to get the final 36 probability maps (12 structure-specific probability maps \(\times \) 3 AU-nets). After the fusion step, the final segmentation map of the testing image is obtained.
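The 3D-to-2D slicing and the final fusion can be sketched as follows. The paper does not spell out the fusion rule, so voxel-wise averaging of the three view-specific probability volumes followed by an argmax over the structure channels is assumed here (a background channel would be included in practice).

```python
import numpy as np

def slice_views(volume: np.ndarray):
    """Slice a 3D volume (Z, Y, X) into 2D image sets at the three orthogonal views."""
    axial    = [volume[z]       for z in range(volume.shape[0])]
    coronal  = [volume[:, y]    for y in range(volume.shape[1])]
    sagittal = [volume[:, :, x] for x in range(volume.shape[2])]
    return axial, coronal, sagittal

def fuse_views(p_axial, p_coronal, p_sagittal):
    """Fuse three view-specific probability volumes of shape (C, Z, Y, X),
    one channel per structure, into a single label map.
    Voxel-wise averaging + argmax is an assumed fusion rule."""
    fused = (p_axial + p_coronal + p_sagittal) / 3.0
    return np.argmax(fused, axis=0)
```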

2.1 3D MF-net

The architecture of the proposed 3D MF-net is depicted in Fig. 2(a). It consists of a multi-atlas based encoding block (MA-block) and two encoding-decoding blocks. In the MA-block, the training image patch and the K most similar atlas patches are learned separately. Specifically, each transformed template image patch and the corresponding label patch are concatenated in a multi-channel manner. Then, after two convolutional layers and one max-pooling layer, all \(K+1\) patches are concatenated and fed to the first encoding-decoding block, while the K atlas patches are concatenated and fed to the second encoding-decoding block. This multi-atlas guided strategy has three important advantages over single-patch learning approaches: (1) it introduces each patch's label information into the network; (2) concatenating \(K+1\) patches extracts more relevant context information around the specific structure of interest, which improves segmentation robustness; (3) concatenating the K atlas patches accelerates convergence of the training process.
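A hypothetical PyTorch sketch of the MA-block follows; the channel widths, kernel sizes, and whether the atlas branches share weights are assumptions, as the text only fixes the two-convolution-plus-pooling structure and the two concatenations.

```python
import torch
import torch.nn as nn

class MABlock(nn.Module):
    """Multi-atlas based encoding block (sketch); see Fig. 2(a) for the real layout."""
    def __init__(self, k: int = 3, width: int = 16):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(width, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool3d(2))
        self.target_branch = branch(1)                      # target image patch (1 channel)
        self.atlas_branches = nn.ModuleList(
            [branch(2) for _ in range(k)])                  # image + label patch per atlas

    def forward(self, target_patch, atlas_patches):
        # target_patch: (B, 1, D, H, W); atlas_patches: list of k tensors (B, 2, D, H, W)
        f_target = self.target_branch(target_patch)
        f_atlas = [b(p) for b, p in zip(self.atlas_branches, atlas_patches)]
        all_feats = torch.cat([f_target] + f_atlas, dim=1)  # K+1 -> first encoding-decoding block
        atlas_feats = torch.cat(f_atlas, dim=1)             # K   -> second encoding-decoding block
        return all_feats, atlas_feats
```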

Fig. 2. (a) Architecture of the proposed 3D MF-net. (b) The Attention block. (c) The proposed 2D AU-nets.

As shown in Fig. 2, there are two encoding-decoding blocks in the proposed 3D MF-net. In the top encoding-decoding block, a 3D FCN comprising seven convolutional layers and eight deconvolutional layers is employed. Each convolutional layer contains several 3D convolutional filters (kernels). In the bottom block, the construction of the convolutional layers is the same as in the top encoding-decoding block. To incorporate both local and global contextual information, we adopt long skip concatenations in the top encoding-decoding block. To further alleviate the vanishing-gradient issue and accelerate the convergence of the 3D MF-net, we adopt cross-block summation between the two encoding-decoding blocks.
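The two combination schemes can be illustrated with a minimal PyTorch module; shapes and channel widths are assumed for illustration only.

```python
import torch
import torch.nn as nn

class CrossBlockStep(nn.Module):
    """One decoding step (sketch): long skip concatenation within the top
    block, and element-wise cross-block summation with the bottom block."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.top_conv = nn.Conv3d(2 * ch, ch, kernel_size=3, padding=1)  # after concat
        self.bottom_conv = nn.Conv3d(ch, ch, kernel_size=3, padding=1)

    def forward(self, top_decoder, top_encoder_skip, bottom):
        top = torch.cat([top_decoder, top_encoder_skip], dim=1)  # long skip concatenation
        return self.top_conv(top) + self.bottom_conv(bottom)     # cross-block summation
```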

2.2 2D AU-net

The architectures of the three 2D AU-nets are depicted in Fig. 2(c). All three adopt the same architecture, each consisting of eight convolutional layers and six deconvolutional layers. The closer a voxel is to the subcortical structure of interest in a probability map, the greater its probability. The probability maps thus provide additional location information, so we slice all 12 structure-specific probability maps and the to-be-segmented image at each of the three orthogonal views to obtain three 2D image sets, which are then fed into the three 2D AU-nets to improve the segmentation accuracy.

We build our Attention block on top of the U-net architecture, as shown in Fig. 2(b). It consists of four convolutional layers. Given the activations of convolutional layer \(F_{l-1}\) and deconvolutional layer \(G_{l^{'}-1}\) (Fig. 2(b)), the first two convolutional layers \(Conv_l\) and \(Conv_{l^{'}}\) apply \(1\times 1\) filters to all channels of \(F_{l-1}\) and \(G_{l^{'}-1}\) and produce \(F_l\) and \(G_{l^{'}}\). After a subtraction operation and an addition operation between these two layers, the second two convolutional layers \(Conv_{l+1}\) and \(Conv_{l^{'}+1}\) apply \(1\times 1\) filters to all channels of \(F_l-G_{l^{'}}\) and \(F_l+G_{l^{'}}\). They are followed by two element-wise ReLUs \(\mathcal {L}\) and an addition operation, after which an element-wise sigmoid \(\sigma \) normalizes the attention coefficients of each voxel to the range (0, 1). The normalized output \(\omega _l=[\omega _l^1, \omega _l^2,\dots ,\omega _l^D]\in {\mathbb {R}^{W\times {H}\times {D}}}\) is obtained as

$$\begin{aligned} \begin{aligned} \omega _l=\sigma (&\mathcal {L}(Conv_{l+1}(Conv_l(F_{l-1})-Conv_{l^{'}}(G_{l^{'}-1})))+ \\&\mathcal {L}(Conv_{l^{'}+1}(Conv_l(F_{l-1})+Conv_{l^{'}}(G_{l^{'}-1})))), \end{aligned} \end{aligned}$$
(2)

where \(W\times {H}\) denotes the size of each channel (feature map) and D denotes the number of channels. The final output of the Attention block is computed as \(\hat{F}_{l-1}=\omega _l\cdot {F_{l-1}}\), where \(\cdot \) denotes element-wise multiplication.
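A sketch of the Attention block following Eq. (2); since the AU-nets operate on 2D slices, \(1\times 1\) convolutions are used here, and the channel width is left as a parameter.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Attention block of the 2D AU-net, implementing Eq. (2) (sketch)."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv_l   = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_l   on F_{l-1}
        self.conv_lp  = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_l'  on G_{l'-1}
        self.conv_l1  = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_{l+1}  on the difference
        self.conv_lp1 = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_{l'+1} on the sum
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f, g):
        # f: F_{l-1} from the encoder path, g: G_{l'-1} from the decoder path
        fl, gl = self.conv_l(f), self.conv_lp(g)
        w = torch.sigmoid(self.relu(self.conv_l1(fl - gl)) +
                          self.relu(self.conv_lp1(fl + gl)))  # omega_l, Eq. (2)
        return w * f                                          # \hat{F}_{l-1}
```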

2.3 Implementation Details

In this work, to balance network performance against the total number of parameters, we set \(K=3\) for atlas patch selection. The parameters of LDDMM are set as proposed in [7]. Parameter optimization in the 3D MF-net and the 2D AU-nets is performed using stochastic gradient descent, with cross-entropy as the cost function. In the dense training process, we empirically set \(N=20\), \(n=15\) and \(\mathcal {N}=100{n_t}\) (where \(n_t\) denotes the total number of target atlases) for training the 3D MF-net. The size of the training patches is set to \(31\times 31\times 31\).
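The optimizer and training schedule described above might look as follows; `MFNet3D` and `sample_patches` are hypothetical placeholders, and the learning rate, momentum, and batch handling are assumptions not given in the paper.

```python
import torch

# `MFNet3D` and `sample_patches` are hypothetical placeholders for the
# network defined in Sect. 2.1 and the dense patch sampler of Sect. 2.
model = MFNet3D()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # lr/momentum assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(20):            # N = 20 training epochs
    for subepoch in range(15):     # n = 15 subepochs per epoch
        for target_patch, atlas_patches, label in sample_patches():
            loss = criterion(model(target_patch, atlas_patches), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```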

3 Experimental Results

Datasets: We use two datasets to evaluate the performance of the proposed method. The first dataset comes from the PREDICT-HD study. It is employed to compare the proposed method with two state-of-the-art segmentation methods, joint label fusion (JLF) [4] and LiviaNET [9], as well as with a variant involving only the proposed 3D MF-net. This dataset includes 16 T1-weighted images (T1-WIs) with an image size of \(190\times 230\times 180\) voxels and a voxel resolution of \(1.0\times 1.0\times 1.0\) mm\(^3\). For each of the 16 images, 12 subcortical structures (bilateral caudate, pallidum, putamen, thalamus, amygdala, and hippocampus) have been manually delineated. A 4-fold cross-validation strategy is employed, wherein each fold consists of 10 training images (4 randomly selected as templates, the other 6 as targets), 2 validation images and 4 testing images.

The second dataset is the Internet Brain Segmentation Repository (IBSR). A set of 18 T1-WIs (image size: \(256\times 256\times 128\) voxels; voxel resolutions: \(0.8\times 0.8\times 1.5\) mm\(^3\) to \(1.0\times 1.0\times 1.5\) mm\(^3\)) is employed to compare the proposed method against JLF and two other published FCN approaches for subcortical segmentation [10, 11]. This dataset originally has 14 manually defined subcortical structures, namely the aforementioned 12 plus the bilateral accumbens, which we do not consider in our segmentation task. A 6-fold cross-validation strategy is employed, wherein each fold consists of 12 training images (4 randomly selected as templates, the other 8 as targets), 3 validation images and 3 testing images. We use the Dice Similarity Coefficient (DSC) for performance evaluation.
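For reference, a minimal implementation of the per-structure DSC used for evaluation:

```python
import numpy as np

def dsc(seg: np.ndarray, ref: np.ndarray, label: int) -> float:
    """DSC = 2|A intersect B| / (|A| + |B|) for one structure label."""
    a, b = seg == label, ref == label
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```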

Evaluation Results: For the first dataset, the means and standard deviations of the DSC values for each of the 12 subcortical structures, produced by JLF, LiviaNET, 3D MF-net and the proposed method, are listed in Table 1. Evidently, our proposed method is the best at segmenting all 12 structures. For the left putamen, the bilateral thalamus, the left amygdala and the right hippocampus, the DSC values of the proposed method are significantly higher than those of all three other methods (Wilcoxon signed-rank test, \(p<0.05\)). Comparing 3D MF-net and LiviaNET, 3D MF-net performs statistically significantly better for all structures except the right pallidum (\(p<0.05\)), especially for the right thalamus and the left hippocampus (\(p<8.5\times 10^{-4}\)).

A visual comparison of the segmentation results of the four aforementioned automated methods and the manual delineation is shown in Fig. 3. Clearly, the segmentation results of the proposed method are the closest to the manually delineated results. Unlike LiviaNET, the segmentation results of 3D MF-net and the proposed method contain no isolated regions, suggesting the capability of the proposed network architecture to maintain spatial consistency. Compared to 3D MF-net, the proposed method delivers smoother segmentation results, suggesting that the proposed 2D AU-net plays a positive role in improving segmentation performance.

Table 1. Means and standard deviations of the DSC values for each of the 12 subcortical structures on the first dataset, obtained from the different methods. Bold typesetting indicates the best method(s) (\(p<0.05\)).
Fig. 3. A comparison of the segmentation results of the 12 subcortical structures, obtained from four automated methods and manual delineation, for one representative image from the first dataset.

Table 2 shows the mean DSC values on the second dataset for each of the evaluated methods. Compared to JLF, both 3D MF-net and the proposed method perform better in segmenting all subcortical structures. The overall DSC of the proposed method is significantly higher than that of JLF (\(p=1.2\times 10^{-5}\)), with mean DSCs of 0.865 and 0.827, respectively, clearly demonstrating the robustness of the proposed approach. Compared to the method proposed in [10], our method performs better on all 12 structures. Furthermore, the performance of the proposed approach is also superior to that of the method proposed in [11] on all structures other than the right amygdala. The overall segmentation performance of the proposed method is much better than that of either [10] or [11], with mean DSCs of 0.865, 0.841 and 0.859, respectively. For both datasets, the proposed approach outperforms 3D MF-net, clearly demonstrating the effectiveness of the proposed 2D AU-net.

Table 2. Means and standard deviations of the DSC values for each of the 12 subcortical structures on the second dataset, obtained from the different methods. Bold typesetting indicates the highest DSC.

4 Conclusion

We proposed a novel, fully-automatic segmentation method for subcortical structures that combines a 2D Attention U-net with a multi-atlas guided 3D FCN. To introduce expert prior information into the FCN, we designed a multi-atlas based encoding block in the 3D MF-net. In the 2D AU-net, a novel Attention block was constructed on top of U-net to further improve the segmentation accuracy. We demonstrated the superior segmentation performance of the proposed approach on two different datasets and confirmed the importance of the 2D AU-net component. This 3D+2D joint approach also offers new insight for other medical image segmentation tasks.