
1 Introduction

Continuous and accurate detection and tracking of atherosclerotic plaque is critical for the diagnosis and treatment of coronary heart disease [1]. Intravascular Optical Coherence Tomography (IVOCT) offers higher resolution and feasibility than intravascular ultrasound, and is considered the gold standard for intravascular plaque analysis in clinical applications [2]. Hence, plaque detection and tracking based on IVOCT is an important and valuable task in the field of computer-aided coronary heart disease treatment.

Fig. 1. A sequence of actions taken by the proposed framework to localize a plaque. The red sector is the ground truth region of a plaque. The blue sector is the detected region of the plaque after every action. The algorithm transforms the sector region to achieve accurate detection based on RL. (Color figure online)

However, continuous and accurate plaque detection and tracking is very challenging because: (1) Heavy speckle noise in IVOCT images and the low contrast of plaque edges make it difficult to identify plaques without expert guidance. (2) The complex and varied intravascular morphology makes it difficult to achieve continuous and accurate plaque detection frame-by-frame. (3) A single pullback yields hundreds of IVOCT images, so analysing them one-by-one during clinical routine is impractical without an efficient analysis method.

Several methods have been proposed for plaque analysis on IVOCT images. In [1], Gessert et al. used convolutional neural networks (CNNs) for automatic plaque detection on IVOCT images. In [3], Ughi et al. proposed an algorithm for automated characterization of plaque tissue based on textural features. In [4], Wang et al. used a gradient-based level-set model for semi-automatic segmentation and quantification of calcified plaques. In [5], Soest et al. proposed a method for automatic classification of plaque constituents based on the optical attenuation coefficient. In [6], Abdolmanafi et al. used CNNs as feature extractors for automated plaque tissue classification. In [7], He et al. used CNNs for automatic plaque characterization in IVOCT images. In [8], Oliveira et al. used CNNs to identify coronary calcification.

In summary, computer-aided plaque analysis based on IVOCT images is an emerging research area with considerable potential. Recent works [1, 6, 7] have shown that CNNs are promising for computer-aided plaque analysis. However, to the best of our knowledge, most existing methods focus on plaque classification or identification. Few methods address region-level plaque localization and scale-level quantification simultaneously. Moreover, no existing method achieves continuous plaque localization and scale quantification frame-by-frame (i.e., plaque tracking) with high accuracy, even though a plaque generally extends across consecutive frames. Plaque tracking is nevertheless a fundamental problem in computer-aided plaque analysis. Continuous and accurate plaque detection (including accurate localization and fine scale quantification) would significantly benefit further plaque analysis tasks and has important clinical value for the quantification, diagnosis, and treatment of coronary heart disease.

In this paper, inspired by the good performance of reinforcement learning (RL) on continuous object tracking and detection tasks [9, 10], we propose a newly designed RL-based framework for accurate and continuous plaque tracking on large-scale IVOCT images. As shown in Fig. 1, the actions designed for IVOCT images progressively refine the detected region until the plaque is accurately detected. The main contributions and characteristics of the proposed method are fourfold: (a) For the first time, we propose an RL-based framework for accurate plaque tracking on IVOCT images. (b) The proposed framework models the spatio-temporal information of adjacent frames to achieve continuous and accurate plaque detection frame-by-frame, avoiding potential omissions. (c) The proposed method is flexible, because both fully-automated and semi-automated tracking are supported to fit clinical practice. (d) On the collected large-scale IVOCT data, the proposed method achieves high tracking accuracy.

Fig. 2. Architecture of the proposed framework. The proposed deep reinforcement framework can utilize the spatio-temporal location correlation information to achieve continuous and accurate plaque tracking.

2 Architecture of the Proposed Framework

The architecture of the proposed framework is illustrated in Fig. 2. The framework includes three modules, i.e., the feature encoding module, the spatio-temporal correlation RL module (see Sect. 2.1), and the aided plaque localization and identification module (see Sect. 2.2). To achieve high computational efficiency, we design a simple CNN structure as shown in Fig. 2 (newer network structures, such as ResNet and DenseNet, can also be used if sufficient computation power is available). Specifically, in the feature encoding module, five convolution layers and one fully connected (FC) layer (i.e., the FC1 layer) are used for frame-by-frame feature encoding of IVOCT images. In an IVOCT image, the center coordinate is the location of the probe in physical space. Hence, we denote the detected plaque section as a sector with unified radius. The sector is represented as a two-tuple \(d=(\varTheta _S, \varTheta )\), where \(\varTheta \) denotes the scale (included angle) of the detected sector and \(\varTheta _S\in [0,2\pi )\) denotes its location (starting angle in polar coordinate space). In the following sections, we introduce the details of the proposed framework.

2.1 Spatio-Temporal Correlation RL Module

To achieve continuous and accurate plaque tracking frame-by-frame without sampling or omissions, we formulate the plaque tracking task as an RL problem. This formulation provides a continuous spatio-temporal location correlation with which to model an agent that takes a sequence of actions to achieve accurate tracking. Our RL module is built on the FC2 layer (states are the input of the FC2 layer and actions are its output). It treats a sequence of continuous IVOCT images as the environment, in which the agent transforms a sector using a set of actions based on spatio-temporal location correlation information. The goal of the agent is to fit a tight sector to a plaque object to achieve precise localization and scale quantification. The state representation fed to the FC2 layer carries spatio-temporal correlation information on the history of locations, scales, and actions, and the agent receives a positive or negative reward for each action in order to learn a proper policy.

(1) Actions: The action set A is composed of eight transformations that are used to transform the sector flexibly, and one stop action that terminates the transformation process on the current frame and starts a new series of actions on the next frame. Specifically, the eight transform actions are Bidirectional Expansion (BE), Bidirectional Contraction (BC), Contra Rotation (COR), Clockwise Rotation (CLR), Contra Unilateral Expansion (COUE), Clockwise Unilateral Expansion (CLUE), Clockwise Unilateral Contraction (CLUC), and Contra Unilateral Contraction (COUC). BE is denoted as \((\varTheta _S-\varDelta \varTheta , \varTheta +2\varDelta \varTheta )\), BC is denoted as \((\varTheta _S+\varDelta \varTheta , \varTheta -2\varDelta \varTheta )\), COR is denoted as \((\varTheta _S+\varDelta \varTheta , \varTheta )\), CLR is denoted as \((\varTheta _S-\varDelta \varTheta , \varTheta )\), COUE is denoted as \((\varTheta _S, \varTheta +\varDelta \varTheta )\), CLUE is denoted as \((\varTheta _S-\varDelta \varTheta , \varTheta +\varDelta \varTheta )\), CLUC is denoted as \((\varTheta _S, \varTheta -\varDelta \varTheta )\), and COUC is denoted as \((\varTheta _S+\varDelta \varTheta , \varTheta -\varDelta \varTheta )\). After every action, \(\varTheta _S\) is wrapped back into \([0,2\pi )\): if \(\varTheta _S \ge 2\pi \), \(\varTheta _S = \varTheta _S \% 2\pi \); if \(\varTheta _S <0\), \(\varTheta _S = \varTheta _S+2\pi \). We set \(\varDelta \varTheta =\frac{\pi }{12}\) in all our experiments, which gave a good trade-off between speed and localization accuracy in extensive experiments. These eight transformations are designed for IVOCT images to cover any possible change of the sector’s location and scale along a sequence of IVOCT frames.
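The action set can be written compactly as a table of offsets applied to \((\varTheta _S, \varTheta )\). The following is a minimal Python sketch under stated assumptions: the names `ACTIONS` and `apply_action` are illustrative and do not come from the paper, and the text gives no bounds for \(\varTheta \), so none are enforced here.

```python
import math

DELTA = math.pi / 12      # step size stated in the text
TWO_PI = 2 * math.pi

# Each entry: (change of start angle, change of included angle), in units of DELTA.
ACTIONS = {
    "BE":   (-1, +2),  # Bidirectional Expansion
    "BC":   (+1, -2),  # Bidirectional Contraction
    "COR":  (+1,  0),  # Contra Rotation
    "CLR":  (-1,  0),  # Clockwise Rotation
    "COUE": ( 0, +1),  # Contra Unilateral Expansion
    "CLUE": (-1, +1),  # Clockwise Unilateral Expansion
    "CLUC": ( 0, -1),  # Clockwise Unilateral Contraction
    "COUC": (+1, -1),  # Contra Unilateral Contraction
    "STOP": ( 0,  0),  # leaves the sector unchanged
}

def apply_action(sector, action):
    """Return the sector (theta_s, theta) after one action.

    Hypothetical helper: the start angle is wrapped back into [0, 2*pi);
    the included angle is not clamped because the text gives no bounds.
    """
    theta_s, theta = sector
    d_start, d_scale = ACTIONS[action]
    # Python's % with a positive modulus maps negative angles into [0, 2*pi).
    theta_s = (theta_s + d_start * DELTA) % TWO_PI
    theta = theta + d_scale * DELTA
    return theta_s, theta
```

Applying BE to a sector starting at angle 0 moves the start angle to \(2\pi -\varDelta \varTheta \), which illustrates the wrap-around described above.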

(2) State: To model spatio-temporal location correlation information well, the state is represented as a three-tuple, i.e., \(S=(E,HL,HA)\), where S denotes the state, E denotes the 1024 features from the FC1 layer encoded from the current frame, HL denotes the most recent location and scale of the detected sector region (\(HL\in R^{2}\)), and HA denotes the 10 most recent actions (\(HA\in R^{90}\)). Every history action is represented by a 9-dimensional one-hot vector. The spatio-temporal location correlation information is modeled in HL and HA based on the fact that past locations, scales, and actions are always related to future actions, whether within a frame or across frames. In particular, the location and scale of a plaque across adjacent frames are spatially continuous. Hence, based on this state representation S, the FC2 layer can learn a policy that generates proper actions to achieve accurate plaque detection in the current frame, as well as accurate tracking in continuous IVOCT sequences.
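To make the dimensions concrete, the state can be assembled as a flat vector of length 1024 + 2 + 90 = 1116. The sketch below is illustrative only: `build_state` and `one_hot` are hypothetical names, and zero-padding the action history when fewer than 10 actions have been taken is an assumption, since the text does not say how the history is initialized.

```python
from collections import deque

NUM_ACTIONS = 9      # eight transformations plus the stop action
HISTORY_LEN = 10     # number of recent actions kept in HA
FEATURE_DIM = 1024   # size of the FC1 output E

def one_hot(index, size=NUM_ACTIONS):
    """9-dimensional one-hot encoding of an action index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def build_state(features, sector, action_history):
    """Concatenate E (1024), HL (2) and HA (90) into one flat list."""
    if len(features) != FEATURE_DIM:
        raise ValueError("expected 1024 FC1 features")
    recent = list(action_history)[-HISTORY_LEN:]
    ha = []
    for action_index in recent:
        ha.extend(one_hot(action_index))
    # Assumption: left-pad with zeros until 10 actions are available.
    ha = [0.0] * (NUM_ACTIONS * HISTORY_LEN - len(ha)) + ha
    return list(features) + list(sector) + ha

# Example: two actions taken so far (indices 0 and 8 are arbitrary).
history = deque(maxlen=HISTORY_LEN)
history.append(0)
history.append(8)
state = build_state([0.0] * FEATURE_DIM, (0.5, 1.0), history)
```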

(3) Reward Function: To provide accurate and timely feedback for every action, we design a reward function based on the change in the intersection-over-union (IOU) index, which quantifies whether an action improves tracking. Specifically, the reward function is:

$$ \begin{aligned} R=\left\{ \begin{array}{cc} 1, &{} IOU(d^{a},g)-IOU(d,g)>0\\ -1,&{}IOU(d^{a},g)-IOU(d,g)<0\\ 1, &{}IOU(d^{a},g)-IOU(d,g)=0 \& IOU(d^{a},g)>0.95\\ -1, &{} IOU(d^{a},g)-IOU(d,g)=0 \& IOU(d^{a},g)<0.95 \end{array}\right. \end{aligned}$$
(1)

where g is the ground truth sector region from the experts’ labels, d denotes the current detected sector (CDS) region, and \(d^{a}\) is the next detected sector (NDS) produced by the currently selected action. \(IOU(d^{a},g)-IOU(d,g)=0\) only happens when the stop action is selected. The threshold 0.95 is set according to the clinical application standard to define a proper stop condition and to avoid unnecessary actions.
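Because all sectors share a unified radius, the IOU of two sectors reduces to the IOU of their angular extents. The following Python sketch computes it with wrap-around at \(2\pi \) and then evaluates the reward in (1). The function names are hypothetical, sector widths are assumed to be at most \(2\pi \), and the case \(IOU(d^{a},g)=0.95\) exactly, which (1) leaves undefined, is assigned a reward of -1 here purely as an assumption.

```python
import math

TWO_PI = 2 * math.pi

def arc_overlap(a, b):
    """Angular overlap of two sectors given as (start, width).

    Assumes starts in [0, 2*pi) and widths in [0, 2*pi]; the second
    sector is shifted by -2*pi, 0 and +2*pi to handle wrap-around.
    """
    (s1, w1), (s2, w2) = a, b
    total = 0.0
    for k in (-1, 0, 1):
        lo = max(s1, s2 + k * TWO_PI)
        hi = min(s1 + w1, s2 + w2 + k * TWO_PI)
        total += max(0.0, hi - lo)
    return total

def sector_iou(a, b):
    """IOU of two equal-radius sectors, which equals the IOU of their angles."""
    inter = arc_overlap(a, b)
    union = a[1] + b[1] - inter
    return inter / union if union > 0 else 0.0

def reward(d, d_next, g, threshold=0.95):
    """Reward from Eq. (1): d is the current sector, d_next the sector after the action."""
    diff = sector_iou(d_next, g) - sector_iou(d, g)
    if diff > 0:
        return 1
    if diff < 0:
        return -1
    # diff == 0: per the text, this only happens for the stop action.
    # Assumption: an IOU of exactly 0.95 is treated like IOU < 0.95.
    return 1 if sector_iou(d_next, g) > threshold else -1
```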

2.2 Aided Plaque Localization and Identification Module

The aided localization and identification module not only provides the initial plaque location and scale for the initial plaque frame (IPF), but also avoids over-tracking on images without plaque through a dedicated gate. In an IVOCT sequence, a plaque emerges over a run of consecutive frames and is then absent over a run of consecutive frames. Along an IVOCT image sequence, we denote the frame in which the plaque first emerges as the IPF, and the frame in which the plaque first disappears as the stop plaque frame (SPF).

(1) Localization and Identification: We formulate localization and identification as a multi-task problem. A multi-task loss is designed to guide the network to generate an initial plaque location and scale for the IPF (based on the output of FC3) and an identification of whether a plaque object exists in the current frame (based on the output of FC4). The multi-task loss function is

$$\begin{aligned} L=\frac{1}{m}\sum _{i=1}^{m}L_{r}(d_i,g_i)+\frac{1}{m}\sum _{i=1}^{m}L_{c}(c_i,cg_i) \end{aligned}$$
(2)

where i is the index of a frame, m denotes the batch size, \(d_i\) is the predicted plaque sector on a frame (denoted as the two-dimensional vector \((\varTheta _S, \varTheta )\)), \(g_i\) denotes the ground truth plaque sector, \(c_i\) is the predicted probability that a plaque exists in the current frame, \(cg_i\) is the ground truth label indicating whether a plaque exists in the current frame (1 if the plaque exists, 0 otherwise), \(L_{r}\) is the L2 regression loss, and \(L_{c}\) is the two-class Softmax loss.
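As a rough illustration of (2), the sketch below evaluates the two terms on a small batch in plain Python. It rests on assumptions: \(L_{r}\) is taken to be the squared L2 distance between predicted and ground truth \((\varTheta _S, \varTheta )\), \(L_{c}\) is written as binary cross-entropy on \(c_i\) (equivalent to a two-class Softmax cross-entropy), angular wrap-around in \(\varTheta _S\) is ignored, and `multitask_loss` is a hypothetical name.

```python
import math

def multitask_loss(pred_sectors, gt_sectors, pred_probs, gt_labels, eps=1e-12):
    """Batch-averaged regression term plus classification term, as in Eq. (2)."""
    m = len(pred_sectors)
    # L_r: squared L2 distance between predicted and ground truth sectors.
    l_r = sum(
        (p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2
        for p, g in zip(pred_sectors, gt_sectors)
    ) / m
    # L_c: two-class cross-entropy on the plaque probability c_i.
    # eps guards against log(0) when a probability saturates.
    l_c = -sum(
        y * math.log(max(c, eps)) + (1 - y) * math.log(max(1 - c, eps))
        for c, y in zip(pred_probs, gt_labels)
    ) / m
    return l_r + l_c
```

In the actual framework both terms would be computed on network outputs inside the training graph; this version only shows how the two averages combine.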

(2) Gate Design: A gate is designed to improve tracking accuracy by avoiding over-tracking (i.e., tracking on frames without plaque), and to transfer the spatio-temporal information to the currently selected action. When spatio-temporal information is transferred within a frame, \(G=d_{h}\), where G denotes the output of the gate and \(d_h\) denotes the most recent location and scale of the sector. When information is transferred between adjacent frames, the output of the gate is:

$$ \begin{aligned} G=\left\{ \begin{matrix} d_h,&{} I_i=1 \& I_{i-1}=1\\ d_i,&{} I_i=1 \& I_{i-1}=0\\ NULL,&{} I_i=0 \end{matrix},I_0=0,i\ge 1,\right. \end{aligned}$$
(3)

where \(d_h\) is used to transfer spatio-temporal information across adjacent frames, \(d_i\) denotes the predicted sector from the FC3 layer, \(I_i\) denotes the identification of the current frame, and \(I_{i-1}\) denotes the identification of the previous frame. When \( I_i=1 \& I_{i-1}=0\), the IPF appears, and \(d_i\) is used as the initial sector on which the selected action operates. When \(G=NULL\), RL-based tracking is stopped in the current frame to avoid over-tracking.
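The inter-frame gate in (3) amounts to a three-way branch on the identifications of the current and previous frames. A minimal Python sketch follows; `gate`, `track_sequence`, and the `refine` callback (a stand-in for the RL-based sector refinement) are hypothetical names, and `None` plays the role of NULL.

```python
def gate(i_curr, i_prev, d_h, d_i):
    """Eq. (3): pick the sector handed to the RL module for the current frame."""
    if i_curr == 0:
        return None   # NULL: no plaque in this frame, so tracking is skipped
    if i_prev == 0:
        return d_i    # initial plaque frame: start from the FC3 prediction
    return d_h        # plaque continues: carry over the last tracked sector

def track_sequence(identifications, fc3_sectors, refine):
    """Run the gate frame by frame, starting from I_0 = 0.

    refine is a placeholder for the RL refinement of a sector.
    """
    i_prev, d_h, tracked = 0, None, []
    for i_curr, d_i in zip(identifications, fc3_sectors):
        start = gate(i_curr, i_prev, d_h, d_i)
        d_h = refine(start) if start is not None else None
        tracked.append(d_h)
        i_prev = i_curr
    return tracked
```

With identifications `[0, 1, 1, 0]`, the second frame is the IPF and is initialized from FC3, the third frame reuses the tracked sector, and the first and last frames are skipped.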

2.3 Implementation and Application Process

Implementation and application process includes two aspects, i.e., training and clinical application:

Training and Optimization Process: The parameters in all layers are initialized with a Gaussian distribution. The proposed framework is trained in an alternating pattern using TensorFlow on a Titan X GPU, with a learning rate of 0.0001. Specifically, we first fix the FC2 layer and train the other layers with the loss function in (2) using stochastic gradient descent for 10 epochs, and then train the FC2 layer (RL module) with the other layers fixed, based on the action-reward function in (1), using policy gradient [10] for 10 epochs. In this way, the proposed framework accommodates both training modes (i.e., traditional regression and RL).

Clinical Application Strategy: Fully-automated tracking (FAT) follows directly from the framework described above. Moreover, the framework is flexible and also supports semi-automated tracking (SAT). To balance accuracy and efficiency in clinical use, a physician can manually label the sector region in the IPF and specify the SPF; the proposed method then tracks automatically between the IPF and the SPF. Note that in the semi-automated mode, the gate’s output is simplified to \(G=d_h\), so that only the most recent location and scale are transferred to the selected action.

3 Experiment and Analysis

We selected IVOCT images acquired with the ILUMIEN OPTIS system from 120 patients with 132 pullbacks. In total, 10000 continuous frames (including 154 plaques with experts’ labels for plaque location, scale, and identification) are used to evaluate the proposed framework, of which 2000 images are used for training and 8000 images for final testing. All images are converted into Cartesian coordinates and resized to a unified size of 150 × 150. Data augmentation is performed by randomly rotating images during training. Following the widely used metrics (i.e., accuracy, sensitivity, specificity) in [1], we evaluated the tracking performance of the proposed framework at frame level (i.e., accuracy on every independent frame) and plaque level (i.e., accuracy on a continuous frame sequence covering a whole plaque, where a plaque counts as accurately detected only if it is detected accurately and continuously on all of its frames), and compared it with the state-of-the-art method [1] (to the best of our knowledge, [1] is the only prior plaque detection method). Accurate plaque detection on an IVOCT image is defined as \(IOU>0.95\). However, \(IOU>0.95\) is not applicable to [1], because [1] can only achieve binary-level plaque detection. Hence, such a comparison is strict for the proposed framework.
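The difference between the two accuracy levels can be illustrated with a short Python sketch. It covers accuracy only, takes per-frame IOU values grouped by plaque as input, and uses hypothetical function names; how frames without plaque enter the frame-level count is not specified in the text and is left out here.

```python
def frame_level_accuracy(ious, threshold=0.95):
    """Fraction of frames whose IOU exceeds the threshold."""
    return sum(1 for v in ious if v > threshold) / len(ious)

def plaque_level_accuracy(plaques, threshold=0.95):
    """Fraction of plaques that are accurately detected on all of their frames.

    plaques is a list with one inner list of frame IOUs per plaque.
    """
    hits = sum(1 for frames in plaques if all(v > threshold for v in frames))
    return hits / len(plaques)

# Illustrative values only, not results from the paper.
example = [[0.97, 0.99], [0.96, 0.90]]
flat = [v for frames in example for v in frames]
```

In this toy input three of four frames pass the threshold, but only one of the two plaques passes on every frame, which is why plaque-level accuracy is the stricter measure.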

Table 1 shows that the proposed framework (whether FAT or SAT) achieves better tracking performance at frame level and plaque level than the state-of-the-art method and the ablation model (i.e., FAT-RL), which demonstrates the superiority of the proposed framework. Specifically, FAT performs better than FAT-RL, especially in plaque-level accuracy, which indicates that the RL module enhances the precision of plaque tracking under a strict standard (\(IOU>0.9\)). Additionally, although FAT achieves somewhat lower accuracy than SAT, FAT tracks 10 times faster than SAT (100 frames per second on average). Hence, FAT and SAT each have advantages in clinical practice.

Table 1. Tracking performance comparison among the proposed method, the state-of-the-art method, and the ablation model. (FAT-RL denotes FAT with the RL module removed; its output comes directly from the FC3 layer.)

4 Conclusion

For the first time, we propose a novel RL-based framework for accurate and continuous frame-by-frame plaque tracking on IVOCT images. The proposed framework models the spatio-temporal information of adjacent frames to achieve continuous and accurate plaque detection, avoiding potential omissions. In addition, the proposed method is flexible, because both fully-automated and semi-automated tracking are supported to fit clinical practice. Experiments on large-scale IVOCT data demonstrate the high tracking accuracy of the proposed method. Hence, the proposed method has considerable potential for future application.