1 Introduction

Accurate segmentation of cardiac magnetic resonance (CMR) images is fundamental for assessing cardiac morphology and diagnosing heart conditions [10]. Manual segmentation of the anatomical structures is tedious, time-consuming and prone to subjective errors, which makes it unsuitable for large-scale studies such as the UK Biobank [1]. Therefore, it is essential to develop automated, fast and accurate CMR segmentation techniques.

Recently, convolutional neural network (CNN) based methods have achieved very good performance for cardiac image segmentation in terms of both speed and accuracy [1, 2, 12]. However, they may still produce sub-optimal segmentation results in some circumstances. For example, in the Automatic Cardiac Diagnosis Challenge (ACDC) [2], the top segmentation methods (all CNN-based) achieve high overall segmentation scores for mid-ventricular short-axis (SA) slices. However, they sometimes produce poor results or even fail to locate the myocardium in basal slices (due to its more complex shape) and apical slices (due to its small size). This problem is not uncommon and has been reported in the related literature [2, 7, 15]. Methods based on 2D networks, trained in a slice-by-slice fashion, are particularly affected by this problem since they do not incorporate spatial context from neighbouring SA slices or long-axis (LA) views. On the other hand, 3D networks are capable of incorporating 3D spatial information to perform the segmentation task. Yet the 3D spatial context can be affected by potential inter-slice motion artefacts [13] and the low through-plane spatial resolution of cardiac SA stacks, which limits their segmentation performance. Compared to 2D networks, 3D networks usually contain more parameters and are prone to over-fitting, especially when the training set is limited in size, since they use 3D volumes rather than 2D slices as input, which significantly reduces the number of training samples.

Experienced clinicians are able to assess the cardiac morphology and function from multiple standard views, using both SA and LA images to form an understanding of the cardiac anatomy. Inspired by this, we propose a method which learns the anatomical prior knowledge across four standard views and leverages this to perform segmentation on 2D SA images. The intuition behind our work is that the representation learnt from multiple standard views is beneficial for the segmentation task on the SA slices as different views should share the same representation of the 3D anatomy if they are from the same subject.

The main contributions of this paper are the following: (a) we develop a novel autoencoder architecture (Shape MAE) which learns latent representations of cardiac shapes from multiple standard views; (b) we develop a segmentation network (multi-view U-Net, adapted from [11]) which is capable of incorporating the anatomical shape priors learned from multi-view images to guide the segmentation of SA images; (c) we assess the segmentation accuracy and the data efficiency of the proposed method against common 2D and 3D segmentation baselines by limiting the number of training images, demonstrating that the proposed method is more robust and less dependent on the size of the training set.

Related Literature. A large number of methods have been developed to improve the robustness of cardiac segmentation. One approach is to learn an ensemble model where the predictions of a 2D and a 3D network are combined [6]. This method is capable of producing accurate results, but has a relatively high computational cost and requires an extra post-processing step to merge the predictions from the two networks. Another approach is to incorporate cardiac anatomical prior knowledge into segmentation networks [5, 9]. In [9], a learned representation of the 3D cardiac shape is employed to constrain the segmentation model to predict anatomically plausible shapes. The main bottleneck of this method is the requirement of fully annotated 3D high-resolution CMR images which are free from inter-slice motion artefacts and have high through-plane spatial resolution. However, compared to the standard 2D imaging protocol, the 3D one requires subjects to hold their breath for a relatively long time and is therefore often not feasible for patients with cardiovascular diseases. Instead of using 3D images, we exploit routinely acquired 2D standard views to learn the shape representation of the cardiac structures. The learned representation is then injected into a segmentation network to improve its performance on SA CMR images. Of note, the approach in [8] also injects shape priors produced by an autoencoder into a segmentation network. However, the aim of that approach is to generate multiple segmentation hypotheses for ambiguous images, and it cannot be readily employed to learn shape priors from different views to enhance cardiac segmentation.

2 Methods

The proposed method consists of two novel architectures: (1) a shape-aware multi-view autoencoder (Shape MAE), which learns anatomical shape priors from standard cardiac acquisition planes, including short-axis and long-axis views, and (2) a multi-view U-Net, which performs cardiac short-axis image segmentation by incorporating the anatomical priors learned by Shape MAE into a modified U-Net architecture.

Fig. 1. (a) Overview of Shape MAE. (b) Detailed architecture of each encoder and each decoder. Each rectangle represents one or a series of convolutional (Conv) or transposed convolutional (Deconv) layers, where the number in the square box represents the number of filters per layer. A ‘Res_block’ (pink rectangles) consists of two convolutional layers (\(3\times 3\)) with a residual connection which adds its input to the features from the second layer. Instance normalisation and leaky ReLU activations are applied throughout the network. A sigmoid function is applied to the latent code z to bound its range.

Shape MAE: Shape-Aware Multi-view Autoencoder. As illustrated in Fig. 1, we first present a novel architecture named shape-aware multi-view autoencoder (Shape MAE), which learns anatomical shape priors from standard cardiac views through multi-task learning. Given a source view \(X_i\), the network learns the low-dimensional representation \(z_i\) of \(X_i\) that best reconstructs the segmentations \(Y_j\) of all target views. In this work, we employ four source views \(X_i \; (i=1,\dots , 4)\): three LA views - the two-chamber view (LA1), the three-chamber view (LA2) and the four-chamber view (LA3) - and one mid-ventricular SA slice (Mid-V). The target view segmentations \(Y_j\) (\(j=1,\dots , 6\)) correspond to the four source views plus two additional SA slices: an apical one and a basal one. All encoders \(E_i: z_i=E_i(X_i)\) and all decoders \(D_j: \hat{Y}_{i\rightarrow j}=D_j(z_i)\) in Shape MAE share the same architecture (see Fig. 1b).
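For illustration, the following PyTorch sketch shows one possible encoder/decoder pair consistent with the description above and with Fig. 1b (Res_blocks of two \(3\times 3\) convolutions with a residual connection, instance normalisation, leaky ReLU, and a sigmoid on the latent code). The exact filter counts, the number of down-sampling stages and the \(4\times 8\times 8\) latent shape are assumptions made for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the block input."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))


class Encoder(nn.Module):
    """Maps a 1x128x128 view X_i to a latent shape code z_i (assumed 4x8x8)."""
    def __init__(self, in_ch=1, latent_ch=4):
        super().__init__()
        layers, ch = [nn.Conv2d(in_ch, 16, 3, padding=1),
                      nn.InstanceNorm2d(16), nn.LeakyReLU(0.2, inplace=True)], 16
        for _ in range(4):  # 128 -> 64 -> 32 -> 16 -> 8
            layers += [ResBlock(ch),
                       nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers += [nn.Conv2d(ch, latent_ch, 3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # z_i in (0, 1), shape (B, 4, 8, 8)


class Decoder(nn.Module):
    """Reconstructs a target-view myocardium segmentation from a latent code."""
    def __init__(self, latent_ch=4, out_ch=1):
        super().__init__()
        layers, ch = [nn.Conv2d(latent_ch, 128, 3, padding=1),
                      nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, inplace=True)], 128
        for _ in range(4):  # 8 -> 16 -> 32 -> 64 -> 128
            layers += [nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch // 2), nn.LeakyReLU(0.2, inplace=True),
                       ResBlock(ch // 2)]
            ch //= 2
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1)]  # logits; loss applied separately
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)
```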

The loss function \(\mathcal {L}_\text {Shape~MAE}\) for the whole network is defined as follows:

$$\begin{aligned} \mathcal {L}_\text {Shape~MAE}= \mathcal {L}_{intra} + \alpha \mathcal {L}_{inter} + \beta \mathcal {L}_{reg} \end{aligned}$$
(1)

The first two terms of Eq. 1 are defined via the cross entropy loss \(\mathcal {F}\) between the predicted myocardium segmentation \(\hat{Y}_{i\rightarrow j}=D_j(E_i(X_i))\) of target view j, given a source image \(X_i\) of the same subject, and the corresponding ground truth segmentation \(Y_j\). \(\mathcal {L}_{intra}\) denotes the segmentation loss when the source view \(X_i\) and the target view \(Y_j\) correspond to the same view: \(\mathcal {L}_{intra}=\sum _{i=1, i=j}^{4}\mathcal {F}(Y_{j},\hat{Y}_{i\rightarrow j})\), whereas \(\mathcal {L}_{inter}\) denotes the loss when the two views are different: \(\mathcal {L}_{inter}=\sum _{i=1}^{4}\sum _{j=1, j \ne i}^{6} {\mathcal {F}}(Y_{j},\hat{Y}_{i\rightarrow j})\). The third term is a regularisation term on the latent representations \(z_i \in Z\): \(\mathcal {L}_{reg}= \frac{1}{|Z|} \sum _{i=1}^{4}{\left| \left| z_{i} -\bar{z} \right| \right| ^2}\), which penalises the squared L2 distance between \(z_i\) and \(\bar{z}\), where \(\bar{z} = \frac{1}{|Z|}\sum _{i=1}^{4}{z_i}\) is the average latent code for a subject. Although the latent shape codes obtained from different views of the same subject are not directly shared, this regularisation term forces them to be close to each other. The coefficients \(\alpha \) and \(\beta \) control the relative importance of \(\mathcal {L}_{inter}\) and \(\mathcal {L}_{reg}\).
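To make Eq. 1 concrete, the following PyTorch sketch assembles the three terms for one mini-batch. The helper names (`encoders`, `decoders`, `shape_mae_loss`) and the use of binary cross entropy with logits for the single-structure myocardium masks are assumptions for illustration; indexing follows the convention that target view j equals source view i for the four source views.

```python
import torch
import torch.nn.functional as F


def shape_mae_loss(encoders, decoders, sources, targets, alpha=0.5, beta=0.001):
    """sources: 4 source-view images X_i; targets: 6 target-view masks Y_j.
    By convention, target j == i corresponds to the same view as source i."""
    codes = [enc(x) for enc, x in zip(encoders, sources)]  # z_1 .. z_4

    l_intra, l_inter = 0.0, 0.0
    for i, z in enumerate(codes):
        for j, dec in enumerate(decoders):
            y_hat = dec(z)  # prediction \hat{Y}_{i -> j}
            ce = F.binary_cross_entropy_with_logits(y_hat, targets[j])
            if i == j:
                l_intra = l_intra + ce  # source and target are the same view
            else:
                l_inter = l_inter + ce  # cross-view reconstruction

    # L_reg: penalise the squared distance of each z_i to the subject average
    z_stack = torch.stack(codes)               # (4, B, C, 8, 8)
    z_bar = z_stack.mean(dim=0, keepdim=True)  # \bar{z}
    l_reg = ((z_stack - z_bar) ** 2).sum(dim=(2, 3, 4)).mean()

    return l_intra + alpha * l_inter + beta * l_reg
```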

The principle behind the proposed network is that each view requires an independent function to map it to a latent space that describes global shape characteristics, while translating this latent space back to a particular view or plane also requires a view-specific projection function. Predicting the shape of the myocardium in all six target views, instead of a single view, encourages the network to learn and exploit correlations between different views, resulting in a global, view-invariant shape representation rather than a local representation tied to a particular view. All the encoders and decoders in this framework are trained jointly in a multi-task learning fashion, which helps to avoid over-fitting and encourages model generalisation [3].

Fig. 2. (a) Overview of the proposed MV U-Net. (b) Architecture of the ‘Fuse Block’. The number of feature map blocks shown for the U-Net is reduced for clarity of presentation. Batch normalisation and ReLU activations are applied throughout the network. For each subject, the shape code of each view is reshaped to \(1\times 4\times 8\times 8\) and then concatenated with the other three along the second axis to form an input of \(1\times 32\times 8\times 8\) to the Fuse Block.

MV U-Net: Multi-view U-Net. As shown in Fig. 2, we propose a segmentation network called multi-view U-Net (MV U-Net), based on the original U-Net [11], for cardiac SA image segmentation. The proposed network is capable of incorporating the anatomical shape priors learned by Shape MAE. Similar to the original architecture, it comprises 4 down-sampling blocks and 4 up-sampling blocks to learn multi-scale features. Differently from the original U-Net, we reduce the number of filters at each level by a factor of four, to account for the fact that cardiac segmentation is a simpler task than the lesion segmentation (with multiple candidate structures) to which the original U-Net was applied. In addition, a module called ‘Fuse Block’ is introduced in the bottleneck of the network (see Fig. 2b) to inject the latent codes into the segmentation network. This fusing approach differs from that in [8], where the latent codes are simply concatenated with the U-Net activations. The proposed module consists of two convolutional layers (\(3\times 3\) kernels) and a residual connection, combining the shape representations from different views through learnable weights. Thanks to this module, given an arbitrary short-axis slice \(I^p\) of a subject p and its corresponding shape representations \(\{z_1^p, z_2^p, z_3^p, z_4^p\}\) obtained by Shape MAE (one for each of the four standard views), the network can predict a segmentation by distilling the prior knowledge into its high-level features, allowing it to efficiently refine the segmentation using multi-view information. The network is trained with a standard procedure, using a cross entropy loss to optimise the parameters \(\theta \) of the MV U-Net.
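A minimal sketch of how such a Fuse Block could be implemented is given below, assuming the concatenated shape codes form a \(32\times 8\times 8\) tensor (as in Fig. 2) and the U-Net bottleneck has 256 channels at \(8\times 8\) resolution (an assumption consistent with 16 first-layer filters and 4 down-sampling blocks on \(128\times 128\) inputs). How exactly the residual connection is wired inside the authors' Fuse Block may differ.

```python
import torch
import torch.nn as nn


class FuseBlock(nn.Module):
    """Injects the concatenated multi-view shape codes into the U-Net bottleneck."""
    def __init__(self, code_ch=32, bottleneck_ch=256):
        super().__init__()
        # two 3x3 convolutions mapping the shape codes to the bottleneck width
        self.fuse = nn.Sequential(
            nn.Conv2d(code_ch, bottleneck_ch, 3, padding=1),
            nn.BatchNorm2d(bottleneck_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, bottleneck_ch, 3, padding=1),
            nn.BatchNorm2d(bottleneck_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, bottleneck_feat, shape_codes):
        """bottleneck_feat: (B, 256, 8, 8) U-Net bottleneck features.
        shape_codes: list of four per-view codes from Shape MAE, each (B, C, 8, 8)."""
        codes = torch.cat(shape_codes, dim=1)                 # (B, 32, 8, 8) in total
        return self.act(bottleneck_feat + self.fuse(codes))   # residual injection
```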

3 Experiments and Results

Cardiac Multi-view Image Dataset. Experiments were performed on a dataset acquired from 734 subjects. For each subject, a stack of 2D SA slices and three orthogonal 2D LA images are available. The left ventricular (LV) myocardium was annotated on both the SA and the LA images at the end-diastolic (ED) frame, using an automated method followed by manual quality control. All images were acquired on the same scanner, with a spatial resolution of \(1.8 \times 1.8 \times 10\) mm.

In our experiments, the dataset was randomly split into two subsets: a training set (570 cases) and a test set (164 cases). All LA images were registered to a template subject using a rigid transformation with the MIRTK toolkit. All 2D SA slices were cropped to a size of \(128 \times 128\) pixels, with the left ventricle roughly at the center of every image. Benefiting from view planning (a standard step during cardiac image acquisition), we simply use the intersection point of the three orthogonal LA images with every SA slice to determine the center of its region of interest. All the networks were trained for 200 epochs on an NVIDIA GeForce 2080 Ti, using an Adam optimizer with a batch size of 10. The learning rate for Shape MAE was set to 0.0001, whereas the learning rate for the segmentation network was set to 0.001. In our experiments, \(\alpha \) was empirically set to 0.5 and \(\beta \) to 0.001 in \(\mathcal {L}_\text {{Shape~MAE}}\). The proposed algorithm was implemented in PyTorch.
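The two-stage schedule described above could be driven by a generic loop such as the one sketched below (Adam, batch size 10, 200 epochs, learning rates of 0.0001 for Shape MAE and 0.001 for the MV U-Net). The dataset objects and loss callables are placeholders, not part of the authors' code.

```python
import torch
from torch.utils.data import DataLoader


def train(model, dataset, loss_fn, lr, epochs=200, batch_size=10, device="cuda"):
    """Generic training loop; loss_fn computes the loss for one mini-batch."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model, batch, device)
            loss.backward()
            optimizer.step()
    return model

# Stage 1: train Shape MAE (lr = 1e-4) with the multi-view loss of Eq. 1, then
# freeze it and use its encoders to produce the four per-view shape codes.
# Stage 2: train the MV U-Net (lr = 1e-3) with a cross-entropy segmentation
# loss, feeding each SA slice together with the subject's shape codes.
```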

Segmentation Results. To evaluate segmentation accuracy, we use two measurements: the Dice score and the Hausdorff distance (HD). The proposed method is compared against a 2D U-Net [11], a state-of-the-art 2D FCN for cardiac MR image segmentation [1], and a 3D U-Net [4]. For fairness and ease of comparison, all models were configured with the same number of filters at each level (starting with 16 filters in the first layer) and trained with the same pre-processing and training schedule. For the 3D network, we resampled the SA images to a voxel size of \(1.8 \times 1.8 \times 1.8\) mm and cropped each to a size of \(128 \times 128 \times 64\) during pre-processing. We trained the MV U-Net and the baseline networks in two settings: one using 10% of the training set and the other using 100%. Of note, in each setting we first trained Shape MAE and then trained the MV U-Net, with the shape priors of the four standard views obtained from the corresponding Shape MAE encoders.
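For reference, a minimal implementation of the two metrics on binary myocardium masks could look as follows, using SciPy's Euclidean distance transform and the in-plane voxel spacing; the authors' exact evaluation code may differ (e.g. in how contours versus full masks are handled).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def dice_score(pred, gt):
    """Dice overlap between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom > 0 else 1.0


def hausdorff_distance(pred, gt, spacing=(1.8, 1.8)):
    """Symmetric Hausdorff distance (in mm) between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not pred.any() or not gt.any():
        return np.nan
    # distance (in mm) from every pixel to the nearest foreground pixel of each mask
    dist_to_gt = distance_transform_edt(~gt, sampling=spacing)
    dist_to_pred = distance_transform_edt(~pred, sampling=spacing)
    return max(dist_to_gt[pred].max(), dist_to_pred[gt].max())
```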

Results on the test set are shown in Table 1. It can be observed that the proposed method outperforms the baseline models in both the low-data and the high-data setting, with improved Dice scores at the apex, middle and base of the left ventricular myocardium. In particular, when only 10% of the training data was used, the proposed method reduces the mean HD from 3.24 to 2.49 mm on the apical slices, from 2.34 to 2.09 mm on the mid-ventricular slices and from 3.62 to 2.76 mm on the basal slices, compared to the 2D U-Net. Figure 3 shows examples of the segmentation results from all the networks: the proposed method not only produces more robust segmentations across slices than the 2D networks, but also achieves more anatomically plausible results than the 3D one (see the red arrows in the figure). Visualisation results of the segmentation networks trained in the high-data setting and of Shape MAE are provided in the supplementary material.

Table 1. Comparison of the myocardium segmentation accuracy of the baseline models and the proposed method, in terms of the mean and standard deviation of the Dice score and the Hausdorff distance (HD, mm) on the test set (n = 164). The comparison is carried out separately for apical, mid-ventricular and basal slices.
Fig. 3. Visualisation of the predicted segmentations and the corresponding ground truth (GT) from the baseline models and the MV U-Net (all trained with 10% of the training subjects) on an apical, a mid-ventricular and a basal slice from one patient. Compared to the baseline models, the MV U-Net produces more accurate segmentations with stronger spatial coherence.

4 Discussion and Conclusion

In this work, we presented a shape-aware multi-view autoencoder, a neural network capable of learning anatomical shape priors from multiple standard views, as well as a multi-view U-Net, a modification of the original U-Net architecture that incorporates the learned shape priors to improve the robustness of cardiac segmentation. In contrast to existing works which treat long-axis and short-axis CMR segmentation as two separate tasks [1, 14], our approach is, to the best of our knowledge, the first to exploit the spatial context of the long-axis images to guide the segmentation of the short-axis images. The reported experimental results show that the proposed method not only demonstrates superior segmentation accuracy over state-of-the-art 2D baselines [1, 11], but also outperforms a 3D U-Net [4]. This improvement is particularly evident on the basal and apical slices in the low-data setting, as expected: when training data is limited, segmenting these challenging slices benefits most from the additional anatomical information extracted from the LA views and injected into the segmentation network. Of note, our approach does not require a dedicated acquisition protocol, since LA images are routinely acquired in most CMR imaging protocols. Moreover, the proposed MV U-Net maintains the computational advantage of a 2D network, using fewer parameters (\(\sim \)1.2 million weights) than the 3D U-Net (\(\sim \)2.5 million weights). This also contributes to the data efficiency of our method, which achieves high segmentation performance with limited training data. Importantly, our method could be extended in the future to multi-structure cardiac segmentation, and the proposed approach could potentially be adapted to other medical image segmentation tasks.