
1 Introduction

Accurately segmenting subcortical structures from brain magnetic resonance images (MRIs) is a critical step in various clinical applications such as biomarker identification [1]. Manual segmentation is considered the gold standard, but it is extremely time-consuming and poorly reproducible, motivating fully-automated segmentation methods.

Deep learning based segmentation has become one of the mainstream approaches. In particular, the fully convolutional network (FCN) is the tool of choice for many image segmentation tasks in medical imaging [2, 3]. Most FCN architectures train and test on 3D or 2D patches. Typically, 3D-patch-based FCNs outperform 2D ones because they capture additional context in the direction orthogonal to the 2D patches. However, it is difficult to identify a patch size for 3D FCNs that ensures sufficient learning of all features, especially boundary features. To address this issue, in this work, we propose a novel method in a joint 3D+2D fully convolutional framework for segmenting subcortical structures.

In the past decade, owing to their ability to incorporate prior expert knowledge, multi-atlas based segmentation methods have shown great performance [4, 5]. When combined with FCNs, they perform even better [6]. A key component of multi-atlas based methods is image registration, for which non-linear algorithms usually deliver more accurate results; we employ one such algorithm, the large deformation diffeomorphic metric mapping (LDDMM) [7], in our proposed segmentation framework. We develop a multi-atlas guided 3D FCN (MF-net) which consists of a multi-atlas based encoding block and two encoding-decoding blocks. To introduce prior information into the proposed 3D MF-net, each training and testing sample consists of a 3D image patch and the several most similar atlas patches (both image patches and the corresponding label patches) obtained from the transformed atlases.

To further extract structure-specific boundary features and improve segmentation accuracy, all 2D slices of the to-be-segmented image and the corresponding structure-specific 2D probability maps obtained from the 3D MF-net at three orthogonal views (axial, sagittal, coronal) are fed into three trained 2D networks to obtain the final probability maps of all subcortical structures; a fusion operation then yields the final segmentation map. In previous work on pancreas segmentation [8], an architectural component called the Attention block was introduced into an FCN architecture and achieved promising segmentation results. An Attention block improves model sensitivity and accuracy for dense label prediction by suppressing feature activations in irrelevant regions. As such, we integrate an Attention block into U-net [2], forming a new network, the 2D AU-net.

2 Method

The proposed joint 3D+2D fully convolutional segmentation pipeline, as shown in Fig. 1, consists of a preprocessing stage, a training stage and a testing stage. In the preprocessing stage, we first perform histogram matching between a preselected image, denoted as \(I_t\), and all remaining images of the training dataset. Then a number of atlases are randomly selected from the training dataset as template atlases, and the remaining atlases of the training dataset serve as target atlases. After that, each template image is aligned to each of the target images using affine registration followed by a fast version of LDDMM [7].
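For concreteness, the histogram-matching step can be sketched with SimpleITK as below; the file names are illustrative placeholders, and the subsequent registration steps (affine followed by fast LDDMM [7]) are performed with dedicated tools not shown here.

```python
import SimpleITK as sitk

# Match each remaining training image to the preselected reference I_t.
# File names are illustrative placeholders.
reference = sitk.ReadImage("I_t.nii.gz")
for path in ["subj01.nii.gz", "subj02.nii.gz"]:
    moving = sitk.ReadImage(path)
    matched = sitk.HistogramMatching(moving, reference)  # default histogram levels / match points
    sitk.WriteImage(matched, path.replace(".nii.gz", "_matched.nii.gz"))
```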

Fig. 1. Illustration of the proposed joint 3D+2D segmentation pipeline

At the training stage, a dense training strategy is adopted. Specifically, there are N training epochs, each of which consists of n subepochs. In each subepoch, \(\mathcal {N}\) 3D target patches, whose centers are located at each subcortical structure of interest and the background, are sampled with equal probability from every target image. For each target patch \(P_I\) extracted from the target image I, as depicted in Fig. 1, the cross-correlation (CC) between \(P_I\) and each of the 3D neighboring patches \(P_J\) in the transformed template image J within a 3D cubic searching region S is calculated as

$$\begin{aligned} CC=\frac{\left( \sum _y\left( P_I(y)-\mu _{P_I}\right) \left( P_J(y)-\mu _{P_J}\right) \right) ^2}{\sum _y\left( P_I(y)-\mu _{P_I}\right) ^2\sum _y\left( P_J(y)-\mu _{P_J}\right) ^2} \end{aligned}$$
(1)

where \(\mu _{P_I}\) and \(\mu _{P_J}\) denote the mean image intensities of patches \(P_I\) and \(P_J\), respectively, and y indexes the voxels of the corresponding patch. The larger the CC, the more similar the two patches. For each target patch, we then select the K most similar atlas patches according to the CC values.
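A minimal NumPy sketch of Eq. (1) and the top-K patch selection is given below, shown for a single transformed template image; the search-region parameterization and function names are our own illustration, not the authors' implementation.

```python
import numpy as np

def cross_correlation(p_i: np.ndarray, p_j: np.ndarray) -> float:
    """Squared normalized cross-correlation between two patches, Eq. (1)."""
    a = p_i - p_i.mean()
    b = p_j - p_j.mean()
    denom = (a ** 2).sum() * (b ** 2).sum()
    return float((a * b).sum() ** 2 / denom) if denom > 0 else 0.0

def select_k_most_similar(target_patch, template, center, search_r, patch_r, k):
    """Scan a cubic search region S of radius `search_r` around `center` in one
    transformed template image; return the centers of the k best-matching patches."""
    scores = []
    cz0, cy0, cx0 = center
    for dz in range(-search_r, search_r + 1):
        for dy in range(-search_r, search_r + 1):
            for dx in range(-search_r, search_r + 1):
                cz, cy, cx = cz0 + dz, cy0 + dy, cx0 + dx
                cand = template[cz - patch_r:cz + patch_r + 1,
                                cy - patch_r:cy + patch_r + 1,
                                cx - patch_r:cx + patch_r + 1]
                scores.append((cross_correlation(target_patch, cand), (cz, cy, cx)))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scores[:k]]
```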

Each target patch and the selected K atlas patches together form a single sample. All such samples are then used to train the 3D MF-net. After that, both the 12 structure-specific probability maps of each target image obtained from the trained 3D MF-net and the target images themselves are sliced into 2D image sets at the three orthogonal views (axial, sagittal, coronal) and fed into three 2D AU-nets for training.

At the testing stage, the same preprocessing procedures are conducted, including histogram matching between \(I_t\) and the testing image, registration between the testing image and all template images, as well as testing sample creation (identifying the K most similar atlas patches for each testing patch). As shown in Fig. 1, each to-be-segmented patch and the corresponding K atlas patches of the testing image are fed into the trained 3D MF-net to obtain the 12 structure-specific probability maps. After the 3D-to-2D slicing operation, the 2D image sets at each of the three orthogonal views are input to the three trained 2D AU-nets to get the final 36 probability maps (12 structure-specific probability maps \(\times \) 3 AU-nets). After the fusion step, the final segmentation map of the testing image is obtained.
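The 3D-to-2D slicing and the final fusion can be sketched as follows. The paper does not spell out the fusion rule, so voxel-wise averaging of the three view-specific probability volumes followed by an argmax over the structure channels is assumed here (a background channel would be included in practice).

```python
import numpy as np

def slice_views(volume: np.ndarray):
    """Slice a 3D volume (Z, Y, X) into 2D image sets at the three orthogonal views."""
    axial    = [volume[z]       for z in range(volume.shape[0])]
    coronal  = [volume[:, y]    for y in range(volume.shape[1])]
    sagittal = [volume[:, :, x] for x in range(volume.shape[2])]
    return axial, coronal, sagittal

def fuse_views(p_axial, p_coronal, p_sagittal):
    """Fuse three view-specific probability volumes of shape (C, Z, Y, X),
    one channel per structure, into a single label map.
    Voxel-wise averaging + argmax is an assumed fusion rule."""
    fused = (p_axial + p_coronal + p_sagittal) / 3.0
    return np.argmax(fused, axis=0)
```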

2.1 3D MF-net

The architecture of the proposed 3D MF-net is depicted in Fig. 2(a). It consists of a multi-atlas based encoding block (MA-block) and two encoding-decoding blocks. In the MA-block, the training image patch and the K most similar atlas patches are learned separately. Specifically, each transformed template image patch and the corresponding label patch are concatenated in a multi-channel manner. Then, after two convolutional layers and one max-pooling layer, all \(K+1\) patches are concatenated and fed to the first encoding-decoding block, while the K atlas patches are concatenated and fed to the second encoding-decoding block. This multi-atlas guided strategy has three important advantages over single-patch learning approaches: (1) it introduces each patch's label information into the network; (2) concatenating \(K+1\) patches extracts more relevant context information around the specific structure of interest, which improves segmentation robustness; (3) concatenating the K atlas patches accelerates convergence of the training process.
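A hypothetical PyTorch sketch of the MA-block follows; the channel widths, kernel sizes, and whether the atlas branches share weights are assumptions, as the text only fixes the two-convolution-plus-pooling structure and the two concatenations.

```python
import torch
import torch.nn as nn

class MABlock(nn.Module):
    """Multi-atlas based encoding block (sketch); see Fig. 2(a) for the real layout."""
    def __init__(self, k: int = 3, width: int = 16):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(width, width, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool3d(2))
        self.target_branch = branch(1)                      # target image patch (1 channel)
        self.atlas_branches = nn.ModuleList(
            [branch(2) for _ in range(k)])                  # image + label patch per atlas

    def forward(self, target_patch, atlas_patches):
        # target_patch: (B, 1, D, H, W); atlas_patches: list of k tensors (B, 2, D, H, W)
        f_target = self.target_branch(target_patch)
        f_atlas = [b(p) for b, p in zip(self.atlas_branches, atlas_patches)]
        all_feats = torch.cat([f_target] + f_atlas, dim=1)  # K+1 -> first encoding-decoding block
        atlas_feats = torch.cat(f_atlas, dim=1)             # K   -> second encoding-decoding block
        return all_feats, atlas_feats
```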

Fig. 2. (a) Architecture of the proposed 3D MF-net. (b) The Attention block. (c) The proposed 2D AU-nets.

As shown in Fig. 2, there are two encoding-decoding blocks in the proposed 3D MF-net. In the top encoding-decoding block, a 3D FCN comprising seven convolutional layers and eight deconvolutional layers is employed. Each convolutional layer contains several 3D convolutional filters (kernels). In the bottom block, the construction of the convolutional layers is the same as in the top encoding-decoding block. To incorporate both local and global contextual information, we adopt long skip concatenations in the top encoding-decoding block. To further alleviate the vanishing-gradient issue and accelerate the convergence of the 3D MF-net, we adopt cross-block summation between the two encoding-decoding blocks.
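The two combination schemes can be illustrated with a minimal PyTorch module; shapes and channel widths are assumed for illustration only.

```python
import torch
import torch.nn as nn

class CrossBlockStep(nn.Module):
    """One decoding step (sketch): long skip concatenation within the top
    block, and element-wise cross-block summation with the bottom block."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.top_conv = nn.Conv3d(2 * ch, ch, kernel_size=3, padding=1)  # after concat
        self.bottom_conv = nn.Conv3d(ch, ch, kernel_size=3, padding=1)

    def forward(self, top_decoder, top_encoder_skip, bottom):
        top = torch.cat([top_decoder, top_encoder_skip], dim=1)  # long skip concatenation
        return self.top_conv(top) + self.bottom_conv(bottom)     # cross-block summation
```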

2.2 2D AU-net

The architectures of the three 2D AU-nets are depicted in Fig. 2(c). All three adopt the same architecture, each consisting of eight convolutional layers and six deconvolutional layers. The closer a voxel is to the subcortical structure of interest in a probability map, the greater its probability. The probability maps thus provide additional location information, so we slice all 12 structure-specific probability maps and the to-be-segmented image at each of the three orthogonal views to obtain three 2D image sets, which are then fed into the three 2D AU-nets to improve the segmentation accuracy.

We build our Attention block on top of the U-net architecture, as shown in Fig. 2(b). It consists of four convolutional layers. Given the activations of convolutional layer \(F_{l-1}\) and deconvolutional layer \(G_{l^{'}-1}\) (Fig. 2(b)), the first two convolutional layers \(Conv_l\) and \(Conv_{l^{'}}\) apply \(1\times 1\) filters to all channels of \(F_{l-1}\) and \(G_{l^{'}-1}\) and produce \(F_l\) and \(G_{l^{'}}\). After a subtraction operation and an addition operation between these two layers, the second two convolutional layers \(Conv_{l+1}\) and \(Conv_{l^{'}+1}\) apply \(1\times 1\) filters to all channels of \(F_l-G_{l^{'}}\) and \(F_l+G_{l^{'}}\). They are followed by two element-wise ReLUs \(\mathcal {L}\) and an addition operation, after which an element-wise sigmoid \(\sigma \) normalizes the attention coefficients of each voxel to the range (0, 1). The normalized output \(\omega _l=[\omega _l^1, \omega _l^2,\dots ,\omega _l^D]\in {\mathbb {R}^{W\times {H}\times {D}}}\) is obtained as

$$\begin{aligned} \begin{aligned} \omega _l=\sigma (&\mathcal {L}(Conv_{l+1}(Conv_l(F_{l-1})-Conv_{l^{'}}(G_{l^{'}-1})))+ \\&\mathcal {L}(Conv_{l^{'}+1}(Conv_l(F_{l-1})+Conv_{l^{'}}(G_{l^{'}-1})))), \end{aligned} \end{aligned}$$
(2)

where \(W\times {H}\) denotes the size of each channel (feature map) and D denotes the number of channels. The final output of the Attention block is computed as \(\hat{F}_{l-1}=\omega _l\cdot {F_{l-1}}\), where \(\cdot \) denotes element-wise multiplication.
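A sketch of the Attention block following Eq. (2); since the AU-nets operate on 2D slices, \(1\times 1\) convolutions are used here, and the channel width is left as a parameter.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Attention block of the 2D AU-net, implementing Eq. (2) (sketch)."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv_l   = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_l   on F_{l-1}
        self.conv_lp  = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_l'  on G_{l'-1}
        self.conv_l1  = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_{l+1}  on the difference
        self.conv_lp1 = nn.Conv2d(ch, ch, kernel_size=1)  # Conv_{l'+1} on the sum
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f, g):
        # f: F_{l-1} from the encoder path, g: G_{l'-1} from the decoder path
        fl, gl = self.conv_l(f), self.conv_lp(g)
        w = torch.sigmoid(self.relu(self.conv_l1(fl - gl)) +
                          self.relu(self.conv_lp1(fl + gl)))  # omega_l, Eq. (2)
        return w * f                                          # \hat{F}_{l-1}
```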

2.3 Implementation Details

In this work, to balance network performance against the total number of parameters, we set \(K=3\) for atlas patch selection. The parameters of LDDMM are set as proposed in [7]. Parameter optimization in the 3D MF-net and the 2D AU-nets is performed using stochastic gradient descent, with cross-entropy as the cost function. In the dense training process, we empirically set \(N=20\), \(n=15\) and \(\mathcal {N}=100{n_t}\) (where \(n_t\) denotes the total number of target atlases) for training the 3D MF-net. The size of the training patches is set to \(31\times 31\times 31\).
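The optimizer and training schedule described above might look as follows; `MFNet3D` and `sample_patches` are hypothetical placeholders, and the learning rate, momentum, and batch handling are assumptions not given in the paper.

```python
import torch

# `MFNet3D` and `sample_patches` are hypothetical placeholders for the
# network defined in Sect. 2.1 and the dense patch sampler of Sect. 2.
model = MFNet3D()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # lr/momentum assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(20):            # N = 20 training epochs
    for subepoch in range(15):     # n = 15 subepochs per epoch
        for target_patch, atlas_patches, label in sample_patches():
            loss = criterion(model(target_patch, atlas_patches), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```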

3 Experimental Results

Datasets: We use two datasets to evaluate the performance of the proposed method. The first dataset comes from the PREDICT-HD study. It is employed to compare the proposed method with two state-of-the-art segmentation methods, joint label fusion (JLF) [4] and LiviaNET [9], as well as with a variant involving only the proposed 3D MF-net. This dataset includes 16 T1-weighted images (T1-WIs) with an image size of \(190\times 230\times 180\) voxels and a voxel resolution of \(1.0\times 1.0\times 1.0\) mm\(^3\). For each of the 16 images, 12 subcortical structures (bilateral caudate, pallidum, putamen, thalamus, amygdala, and hippocampus) have been manually delineated. A 4-fold cross-validation strategy is employed, wherein each fold consists of 10 training images (4 randomly selected as templates, the other 6 as targets), 2 validation images and 4 testing images.

The second dataset is the Internet Brain Segmentation Repository (IBSR). A set of 18 T1-WIs (image size: \(256\times 256\times 128\) voxels; voxel resolutions: \(0.8\times 0.8\times 1.5\) mm\(^3\) to \(1.0\times 1.0\times 1.5\) mm\(^3\)) is employed to compare the proposed method against JLF and two other published FCN approaches for subcortical segmentation [10, 11]. This dataset originally has 14 manually defined subcortical structures, namely the aforementioned 12 plus the bilateral accumbens, which we do not consider in our segmentation task. A 6-fold cross-validation strategy is employed, wherein each fold consists of 12 training images (4 randomly selected as templates, the other 8 as targets), 3 validation images and 3 testing images. We use the Dice Similarity Coefficient (DSC) for performance evaluation.
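For reference, a minimal implementation of the per-structure DSC used for evaluation:

```python
import numpy as np

def dsc(seg: np.ndarray, ref: np.ndarray, label: int) -> float:
    """DSC = 2|A intersect B| / (|A| + |B|) for one structure label."""
    a, b = seg == label, ref == label
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```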

Evaluation Results: For the first dataset, the means and standard deviations of the DSC values for each of the 12 subcortical structures, produced by JLF, LiviaNET, 3D MF-net and the proposed method, are listed in Table 1. Evidently, our proposed method is the best at segmenting all 12 structures. For the left putamen, the bilateral thalamus, the left amygdala and the right hippocampus, the DSC values of the proposed method are significantly higher than those of all three other methods (Wilcoxon signed-rank test, \(p<0.05\)). Comparing 3D MF-net and LiviaNET, 3D MF-net performs statistically significantly better for all structures except the right pallidum (\(p<0.05\)), especially for the right thalamus and the left hippocampus (\(p<8.5\times 10^{-4}\)).

A visual comparison of the segmentation results of the four aforementioned automated methods and the manual delineation is shown in Fig. 3. Clearly, the segmentation results of the proposed method are the closest to the manually delineated results. Unlike LiviaNET, the segmentation results of 3D MF-net and the proposed method contain no isolated regions, suggesting the capability of the proposed network architecture to maintain spatial consistency. Compared to 3D MF-net, the proposed method delivers smoother segmentation results, suggesting that the proposed 2D AU-net plays a positive role in improving segmentation performance.

Table 1. Means and standard deviations of the DSC values for each of the 12 subcortical structures on the first dataset, obtained from the different methods. Bold typesetting indicates the best method(s) (\(p<0.05\)).
Fig. 3. A comparison of the segmentation results of the 12 subcortical structures, obtained from four automated methods and manual delineation, for one representative image from the first dataset.

Table 2 shows the mean DSC values on the second dataset for each of the evaluated methods. Compared to JLF, both 3D MF-net and the proposed method perform better in segmenting all subcortical structures. The overall DSC of the proposed method is significantly higher than that of JLF (\(p=1.2\times 10^{-5}\)), with mean DSCs of 0.865 and 0.827, respectively, clearly demonstrating the robustness of the proposed approach. Compared to the method proposed in [10], our method performs better on all 12 structures. Furthermore, the performance of the proposed approach is also superior to that of the method proposed in [11] on all structures other than the right amygdala. The overall segmentation performance of the proposed method is much better than that of either [10] or [11], with mean DSCs of 0.865, 0.841 and 0.859, respectively. For both datasets, the proposed approach outperforms 3D MF-net, clearly demonstrating the effectiveness of the proposed 2D AU-net.

Table 2. Means and standard deviations of the DSC values for each of the 12 subcortical structures on the second dataset, obtained from the different methods. Bold typesetting indicates the highest DSC.

4 Conclusion

We proposed a novel, fully-automatic segmentation method for subcortical structures that combines a 2D Attention U-net with a multi-atlas guided 3D FCN. To introduce expert prior information into the FCN, we designed a multi-atlas based encoding block in the 3D MF-net. In the 2D AU-net, a novel Attention block was constructed on top of U-net to further improve the segmentation accuracy. We demonstrated the superior segmentation performance of the proposed approach on two different datasets and confirmed the importance of the 2D AU-net component. This 3D+2D joint approach also offers new insight for other medical image segmentation tasks.