Abstract
The success of artificial intelligence in medicine relies on large amounts of high-quality training data. Sharing of medical image data, however, is often restricted by laws such as doctor-patient confidentiality. Although there are publicly available medical datasets, their quality and quantity are often low. Moreover, datasets are often imbalanced, represent only a fraction of the images generated in hospitals or clinics, and can thus usually only serve as training data for specific problems. The introduction of generative adversarial networks (GANs) provides a means to generate artificial images by training two convolutional networks against each other. This paper proposes a method that uses GANs trained on medical images to generate a large number of artificial images that could be used to train other artificial intelligence algorithms. This work is a first step towards alleviating data privacy concerns and enabling the public sharing of data that still contains a substantial amount of the information in the original private data. The method has been evaluated on several public datasets, and quantitative and qualitative tests show promising results.
1 Introduction
The diversity of images created within a hospital is enormous, varying with respect to the patient, the depicted content, and the modality used. Using artificial intelligence to perform classification, registration, or segmentation on these images can reduce doctors' workload and improve the overall efficiency of hospitals and clinics. A successful implementation such as the skin melanoma classifier in [1] uses more than 100,000 images to achieve accurate results. However, access to such amounts of images is often only possible for large corporations and hospitals, as this information is sensitive and cannot be shared with the public due to doctor-patient confidentiality. Generative adversarial networks (GANs) could be applied to large datasets within hospitals to create synthetic data that can be shared with the public. The following enumeration, describing Fig. 1, introduces a possible solution to overcome the limitations of data privacy:
1. Medical images (X-rays, MRIs, CTs) are generated in the hospital. The images are then used for their original purpose (patient diagnosis, surgery planning, etc.).
2. The images cannot be uploaded onto a cloud to be shared with the public due to data privacy regulations.
3. The same images are saved on the hospital server (not accessible by the public). Any authorized employee can start a user application to train a model that represents the statistical information of the images. A neural network is depicted in order to represent any deep learning model that may be used in this step; it converts the input data into statistical information.
4. The model generates images that show a great variety of the features present in its input images while still depicting the same basic object. It can be used to create any number of artificial images.
5. The generated images or the trained model can then be uploaded into the cloud and shared with the public without concern for data privacy issues. In order to avoid sharing generated data too similar to any of the original images, only a subset of the generated data with a certain distance to the originals could be shared.
Broader availability of such images would make it possible to gather more ideas and research results on solving complicated and common medical problems.
2 Background and Related Work
GANs were first introduced in [5] and are based on an adversarial principle. There are two players: the generator (G) and the discriminator (D). G is a neural network that generates images from a noise vector, whereas D is a neural network that tries to distinguish between original and generated images. The goal is to find the Nash equilibrium, at which neither G nor D can unilaterally change to improve its result. The process can be written as a minimax game:
\[ \min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log \left(1 - D(G(z))\right)\right] \quad (1) \]
GANs are of great interest in medicine because labeled data is rare and time consuming to produce, making unsupervised learning a necessity. Examples of successful GAN applications to medical images include [6], which improves segmentation performance by introducing data variety into a brain-tumor dataset; [7], which shows great promise with a progressive GAN that generates realistic mammography images at resolutions of up to \(1280\times 1024\); and [8], which synthesizes realistic-looking retinal images from a small dataset of as few as 10 annotated binary vessel training examples by applying style transfer.
3 Proposed Model
WGAN with gradient penalty (WGAN-GP) [9] provides a method to create sharp, realistic-looking images without being prone to mode collapse or the training instability caused by an unbalanced D and G. WGAN-GP is an improvement over the prior WGAN model [10] on which it is based. The Earth Mover's distance (or Wasserstein-1 distance) measures the cost of transporting the estimated data distribution onto the real data distribution and is computed via the Kantorovich-Rubinstein duality. For this dual formulation to hold, the functions considered must be 1-Lipschitz. WGAN-GP achieves this by introducing a gradient penalty that penalizes the critic whenever its gradient norm strays from the target value of 1. The discriminator loss can be written as:
\[ L = \mathbb{E}_{\tilde{x} \sim P_{g}}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_{r}}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\left\Vert \nabla_{\hat{x}} D(\hat{x}) \right\Vert_{2} - 1\right)^{2}\right] \quad (2) \]
where the first term represents the regular discriminator loss and the second the gradient penalty, multiplied by a fixed penalty coefficient \(\lambda\). \(\hat{x} \sim P_{\hat{x}}\) denotes data points sampled randomly between the data distribution \(P_{r}\) and the generator distribution \(P_{g}\).
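A minimal PyTorch sketch of this loss (function and variable names are ours; the default penalty coefficient \(\lambda = 10\) follows [9]):

```python
# Hypothetical sketch of the WGAN-GP critic loss in Eq. (2).
# D is the critic network; `real` and `fake` are image batches.
import torch

def critic_loss(D, real, fake, lambda_gp=10.0):  # lambda = 10 as in [9]
    # Regular WGAN critic loss: E[D(x_fake)] - E[D(x_real)]
    wasserstein = D(fake).mean() - D(real).mean()

    # Sample x_hat uniformly along straight lines between real and fake points
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)

    # Gradient penalty: E[(||grad_x_hat D(x_hat)||_2 - 1)^2]
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return wasserstein + lambda_gp * penalty
```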
The WGAN-GP implemented for this paper operates on images of size \(128\times 128\times 3\). G and D are both convolutional neural networks with characteristics as specified in [9]. The input to G is a random noise vector of size 100, which is upsampled by six transposed convolutional layers with a stride of 2, using 512, 256, 128, 64, and 32 filters before the final layer maps to the three output channels. The input to D is a real or generated image of size \(128\times 128\times 3\), which is downsampled by four convolutional layers with 64, 128, 256, and 512 filters, respectively.
Hyperparameters (learning rate, batch size, and kernel size) were chosen on a trial-and-error basis, since there are currently no standard values that work well for all datasets. Overall good visual results were achieved with a learning rate of 0.0002, a batch size of 32, and a kernel size of 5.
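For illustration, a hypothetical PyTorch sketch of G and D with the layer widths, stride of 2, and kernel size of 5 described above; padding, normalization, and activation choices are our assumptions:

```python
# Hypothetical sketch of the described G and D; layer widths follow the
# text, all other details (padding, normalization, activations) are assumed.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        def up(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=5, stride=2,
                                   padding=2, output_padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        self.net = nn.Sequential(
            # 1x1 noise -> 4x4 feature map, then five stride-2 upsamplings
            nn.ConvTranspose2d(z_dim, 512, kernel_size=4, stride=1),  # 4x4
            nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            up(512, 256),   # 8x8
            up(256, 128),   # 16x16
            up(128, 64),    # 32x32
            up(64, 32),     # 64x64
            nn.ConvTranspose2d(32, 3, kernel_size=5, stride=2,
                               padding=2, output_padding=1),          # 128x128
            nn.Tanh(),
        )

    def forward(self, z):  # z: (B, 100)
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        def down(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=2),
                nn.LeakyReLU(0.2, inplace=True),  # no batch norm in the critic, per [9]
            )
        self.features = nn.Sequential(
            down(3, 64),     # 64x64
            down(64, 128),   # 32x32
            down(128, 256),  # 16x16
            down(256, 512),  # 8x8
        )
        self.head = nn.Linear(512 * 8 * 8, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))
```

Training would then pair these networks with the critic loss sketched above, using the reported learning rate of 0.0002 and batch size of 32 (the choice of the Adam optimizer is our assumption, following [9]).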
The models were trained for 500 epochs (chest X-rays with all images, pneumonia, non-pneumonia) and 10,000 epochs (chest with 115 images, AMD, non-AMD), respectively; see Sect. 4 for the dataset names. The number of epochs could still be increased, but the chosen amounts showed a stabilization of the discriminator and generator losses.
4 Data
The evaluation is based on WGAN-GP models trained on three different two-dimensional datasets. All images are resized to RGB images of size \(128\times 128\times 3\).
Two datasets were created from the NIH chest X-ray dataset [11]. The first one (referred to as chest-all) consists of 5606 chest X-rays of size \(1024\times 1024\) from multiple patients (Fig. 2a). The second one consists of 115 randomly chosen images (referred to as chest-115).
The Kaggle Pneumonia dataset [4] shows chest X-rays of pediatric patients and can be split into 3,875 images of patients with pneumonia (referred to as pneumonia, Fig. 2b) and 1,341 images of patients without pathological findings (referred to as non-pneumonia, Fig. 2c). The significant difference from the previous dataset is that the images are less heterogeneous. However, the images come in different sizes, so the resizing process introduces slight distortions. We ignore this here, as the evaluation is based on the resized originals; in real-life scenarios, such distortions should be avoided.
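For illustration, a minimal version of this resizing step using Pillow (the interpolation filter is our assumption):

```python
# Hypothetical preprocessing sketch: load an image, force three RGB
# channels, and resize to the 128x128 input size used by the models.
from PIL import Image

def load_rgb_128(path):
    return Image.open(path).convert("RGB").resize((128, 128), Image.BILINEAR)
```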
Lastly, the age-related macular degeneration (AMD) dataset [3] shows fundus photographs (photographs of the rear of the eye) of size \(2124\times 2056\) pixels. It can be further divided into 89 images with an AMD diagnosis (referred to as AMD, Fig. 2d) and 311 images showing normal maculae (referred to as non-AMD, Fig. 2e).
5 Evaluation Methods
The methods below are used to compare the generated images with the original data and to examine how the characteristics of the datasets influence the results.
5.1 Fréchet Inception Distance
The Fréchet Inception Distance (FID) [12] uses an Inception Net trained on ImageNet to measure the distance between the real and generated data distributions with the Fréchet distance. A low FID value corresponds to a small distance between the data distributions, which should correspond to good visual results.
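Concretely, if \(\mu_{r}, \Sigma_{r}\) and \(\mu_{g}, \Sigma_{g}\) are the mean and covariance of the Inception activations for the real and generated images, the FID is \(\Vert \mu_{r}-\mu_{g}\Vert^{2} + \mathrm{Tr}\big(\Sigma_{r}+\Sigma_{g}-2(\Sigma_{r}\Sigma_{g})^{1/2}\big)\) [12]. A minimal sketch of this final computation (extracting the activations with an Inception network is omitted):

```python
# Hypothetical sketch of the Frechet distance between two Gaussians fitted
# to Inception activations of real and generated images.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```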
5.2 Structural Similarity and Mean Squared Error
SSIM [13] is a metric for comparing two images that is based on the human visual system and commonly used to evaluate image compression. It combines three local comparisons: luminance, contrast, and structure. The combination of the three features returns a value between \(-1\) and 1, where 1 indicates very similar images. In Sect. 5.3 it is used as a measure of distance between generated and original images. Another way to measure this distance is the mean squared error (MSE), which computes the difference between two images as the squared error of each pixel pair; the optimal value between two images is 0.
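Both measures are available in scikit-image; a small sketch, assuming a recent scikit-image version and \(H\times W\times C\) uint8 arrays:

```python
# Hypothetical sketch: SSIM and MSE between two images with scikit-image.
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

def image_distances(a: np.ndarray, b: np.ndarray):
    ssim = structural_similarity(a, b, channel_axis=-1, data_range=255)
    mse = mean_squared_error(a, b)
    return ssim, mse
```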
5.3 Specificity and Generalization Ability
The final model should generate only valid samples of what it has learned from the training images (specificity), and it should be able to represent each of the original images by a generated image (generalization ability) [14]. Specificity is calculated by summing the distance from each generated sample to its closest original image and dividing by the number of generated samples M, as shown in (3), where \(Y^{(M)} = \{y_{a} \in \mathrm{I\!R}^{n}: a = 1,\dots,M\}\) is the set of generated samples and \(x_{i}\) an original image:
\[ \mathrm{Spec}(M) = \frac{1}{M} \sum_{a=1}^{M} \min_{i} \, d\left(y_{a}, x_{i}\right) \quad (3) \]
The generalization ability is calculated by summing the distance from each original image to its closest (most similar) generated sample and dividing by the number of original images N, as shown in (4):
\[ \mathrm{Gen}(N) = \frac{1}{N} \sum_{i=1}^{N} \min_{a} \, d\left(x_{i}, y_{a}\right) \quad (4) \]
SSIM and MSE are used as the distance measure \(d\).
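A direct, if memory-hungry, sketch of (3) and (4) using MSE as the distance \(d\); the array names are ours, and each row is assumed to be a flattened image:

```python
# Hypothetical sketch of specificity (3) and generalization ability (4)
# with MSE as distance; rows of `samples`/`originals` are flattened images.
import numpy as np

def pairwise_mse(A, B):
    # (len(A), len(B)) matrix of mean squared errors between all pairs
    return ((A[:, None, :] - B[None, :, :]) ** 2).mean(axis=2)

def specificity(samples, originals):
    # average distance from each generated sample to its closest original
    return pairwise_mse(samples, originals).min(axis=1).mean()

def generalization(samples, originals):
    # average distance from each original to its closest generated sample
    return pairwise_mse(originals, samples).min(axis=1).mean()
```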
6 Evaluation
Our approach is evaluated qualitatively and quantitatively. The qualitative evaluation is based on visual impression. For each dataset, it was possible to generate images that look realistic to a non-expert at first sight. Figure 3 provides samples of the generated images using the hyperparameters specified in Sect. 3. The chosen hyperparameters do not lead to optimal results for every dataset, but show that realistic image synthesis is possible even with sub-optimal hyperparameters. The quantitative evaluation results can be found in Table 1.
Results generated with the GAN trained on only 115 images outperformed the GAN trained on all chest X-rays with respect to the specificity and generalization ability of the model. The FID score, however, is worse for the chest-115 model than for the model using all data. Combining these two observations with visual analysis, the worse FID score for the chest-115 model might be due to blurrier images, while the worse specificity and generalization ability of the chest-all model may be due to multiple almost completely unrecognizable generated images. Examples of unsuccessful image generations from the two datasets are shown in Fig. 4. In this case, the chest-115 dataset served as the better training set, as its results show more anatomical correctness.
The overall FID values achieved for the generated images correlate with the quality of the visual results. However, the values are still relatively high: state-of-the-art models for natural images [15] have achieved FID values as low as 7.4, whereas our lowest value is 77.08, achieved for the non-AMD dataset (see Table 1). This large difference shows that there is still much room for improvement when generating such images.
Observing the specificity and generalization ability of all trained models, the (non-)AMD images achieve better values than the other datasets. This underlines the principle of quality over quantity in the data. Still, quantity should not be neglected: the pneumonia and non-pneumonia images are of similar quality, but the better values are achieved by the pneumonia model, which has almost three times as many training images.
Principal component analysis (PCA) has been applied to the image data to visualize the original data alongside the generated data in one graph, in order to assess differences in the data and detect outliers (see Fig. 5). The evaluation of the PCA component diagrams leads to the conclusion that, although a large group of the generated images blend in with the original images, there is still a significant number of outliers. When visually assessing such images, it is obvious why they differ from the others: they are usually distorted and do not show a high level of detail. This could most likely be improved by training the model longer or with a different architecture.
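A sketch of such a visualization with scikit-learn (function name and plotting details are ours):

```python
# Hypothetical sketch: project flattened original and generated images onto
# the first two principal components and plot them together.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca(originals, generated):
    X = np.concatenate([originals, generated])  # rows are flattened images
    Z = PCA(n_components=2).fit_transform(X)
    n = len(originals)
    plt.scatter(Z[:n, 0], Z[:n, 1], alpha=0.5, label="original")
    plt.scatter(Z[n:, 0], Z[n:, 1], alpha=0.5, label="generated")
    plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend()
    plt.show()
```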
7 Discussion
The presented approach shows that it is generally possible to create artificial, realistic-looking images from large sets of medical training data. Our assumption is that such methods could be used in the future to alleviate privacy problems and to allow for greater public sharing of data. At the current point in time, however, there are unresolved questions that need to be addressed first. Primarily, the legal aspects of data privacy vary from country to country, and often any use of personal information, even if only used implicitly as in the proposed method, requires consent from the patient. In our view, obtaining such consent could be eased by explaining to the patient that reconstructing the original individual data is not possible. One straightforward way to ensure this could be to share only generated images that keep a certain similarity distance to all original images, making it hard to reconstruct the originals. The question that remains is what constitutes an acceptable visual difference. A possibility could be a manual check by a doctor, focused on excluding unsuccessfully generated images as well as images that resemble original data too closely.
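As a sketch of such a release filter (the SSIM threshold is an assumption on our part and would have to be agreed upon with domain and legal experts):

```python
# Hypothetical release filter: keep a generated image only if no original
# is too similar to it under SSIM.
from skimage.metrics import structural_similarity

def shareable(generated_img, originals, max_ssim=0.9):
    return all(
        structural_similarity(generated_img, o, channel_axis=-1) < max_ssim
        for o in originals
    )
```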
Another point of discussion is the medical and anatomical correctness of the generated data. In our view this can only be judged by a medical expert prior to sharing the data. For some use cases such as assessing the anatomical correctness of organ shapes or boundaries, this can be done quite easily by an expert. Training an artificial intelligence algorithm on such artificial and medically approved data should benefit most from the presented concept. Other use cases such as tumours or other pathologies might be more difficult to assess even by an expert and generated artificial images could be too risky to train other algorithms on.
8 Conclusion
The presented approach shows that it is generally possible to use GANs to generate realistic-looking medical images that do not reveal a patient's personal information. Here, the quality of the training images is of great importance. Implementing such an image-generating functionality would require a preprocessing step that ensures uniformity of the training data, as this heavily impacts the generated results. The collected values for the images, such as FID and SSIM, can give some guidance when assessing the accuracy and level of realism of the images, but cannot be used as a standalone method of assessment. Before reaching final conclusions, it is necessary to consult medical specialists to analyze the images regarding anatomical correctness and agreement with the diagnosis the images are intended to show. This unsupervised learning method needs to undergo additional training in order to achieve a higher level of realism and to minimize the number of flawed generated images. The presented approach is seen as a first step towards generating synthetic training data in order to reduce the lack of large medical databases available to the public, and thereby to improve data accessibility for researchers and developers to create better and more reliable AI in the medical domain.
References
1. Esteva, A., et al.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017)
2. Designed by Freepik
3. iChallenge-AMD. https://amd.grand-challenge.org/. Accessed 15 July 2019
4. Kaggle Chest X-Ray Images (Pneumonia). https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia. Accessed 15 July 2019
5. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (2014)
6. Shin, H.-C., et al.: Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In: Gooya, A., Goksel, O., Oguz, I., Burgos, N. (eds.) SASHIMI 2018. LNCS, vol. 11037, pp. 1–11. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00536-8_1
7. Korkinof, D., et al.: High-resolution mammogram synthesis using progressive generative adversarial networks. arXiv preprint (2018)
8. Zhao, H., et al.: Synthesizing retinal and neuronal images with generative adversarial nets. Med. Image Anal. 49, 14–26 (2018)
9. Gulrajani, I., et al.: Improved training of Wasserstein GANs. arXiv preprint (2017)
10. Arjovsky, M., et al.: Wasserstein GAN. arXiv preprint (2017)
11. Kaggle Random Sample of NIH Chest X-ray Dataset. https://www.kaggle.com/nih-chest-xrays/sample/version/4. Accessed 15 July 2019
12. Heusel, M., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems (2017)
13. Wang, Z., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
14. Davies, R., Twining, C., Taylor, C.: Statistical Models of Shape: Optimisation and Evaluation, pp. 78–79. Springer, Cham (2008). https://doi.org/10.1007/978-1-84800-138-1
15. Brock, A., et al.: Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (2019)