Abstract
CompreHensive Digital ArchiVe of Cancer Imaging - Radiation Oncology (CHAVI-RO) is a multi-tier WEB-based medical image databank. It supports archiving de-identified radiological and clinical datasets in a relational database. A semantic relational database model is designed to accommodate the imaging and treatment data of cancer patients. The databank aims to provide key datasets for investigating and modeling the use of radiological imaging data in assessing response to radiation. This research domain addresses the modeling and analysis of the complete treatment data of oncology patients. A DICOM viewer is integrated for reviewing the uploaded de-identified DICOM datasets. Using a prototype system, we carried out a pilot study with cancer data from four disease sites, namely breast, head and neck, brain, and lung. The representative dataset is used to estimate the per-patient data size. A role-based access control module is integrated with the image databank to restrict user access. We also perform different types of load tests to analyze and quantify the performance of the CHAVI databank.
Keywords: Radiomics, Oncology, Biobank, PHI, WEB, Databank, Software, DICOM, CHAVI
Introduction
High-quality medical data are curated over time by various organizations. Collections of medical imaging and textual data in digital format are being reused across multiple research domains. We propose a framework and architecture for a comprehensive medical image databank, developed for storing and handling de-identified radiological images and clinical data. We aim to build extensive dataset collections of oncology patients and make them publicly accessible. The quality of the data must be sufficient for extracting medical knowledge. We therefore describe not only data quality but also the entire procedure of dataset curation, which increases its scientific value and usability. Presently, a prototype system is functional for testing and performance analysis. A GUI is provided as the user query interface, and the system also allows viewing the imaging and clinical data. The imaging modalities include CT, CBCT, PET-CT, X-ray mammography, MR, and ultrasound. In its present version, we consider datasets of four types of cancers, namely breast, head and neck, brain, and lung cancers.
Prior Works Related to Image Biobank
There are several major image biobank projects in the world, and a recent review noted that 70% of these are located in Europe [1]. Most large population-based epidemiological studies investigating health conditions have associated image biobanks. Examples include the Study of Health in Pomerania (SHIP) [2], the UK Biobank study [3], and the German National Cohort study [4]. All of these population-based studies aim to use imaging to identify phenotypic patterns in healthy populations and correlate them with genotypic variations. All three studies are cross-sectional and use a single MRI examination per participant. A recent survey of imaging biobanks in Europe found that only 30% had more than 2000 images available, and half of them were restricted to the institution where the biobank was located [1]. Cancer care is distinctive in that diagnosis, treatment planning, treatment delivery, and response assessment depend heavily on imaging features. This has led countries to develop cancer imaging archives with a view to combining engineering and technological expertise with medical advances to improve treatment outcomes. The National Cancer Institute of the United States of America has put in a major drive under the guidance of its National Biomedical Imaging Archive. The resulting project, “The Cancer Imaging Archive” (TCIA), provides a rich database of cancer imaging data for education and research [5].
Capturing and curating high-quality imaging data from retrospective datasets is time consuming. As reported by Gillies et al., nearly half of the patient data may not be of sufficient quality to include in radiomic studies [4]. Review and filtering of the data require extensive processing. Publicly available data repositories with appropriate modeling and flexible query interfaces can accelerate research in medical imaging.
Software Architecture and Technologies
A WEB-based multi-tier system is developed for the archival and retrieval of cancer patients’ data. As shown in Fig. 1, the system comprises the following layers: the presentation layer (PL), the business logic layer (BLL), and the data access layer (DAL) [6]. The system runs on an Apache Tomcat® server [7]. The image databank has role-based access control, which restricts access to authorized users. The workflow of the system is shown in Fig. 3. The input DICOM dataset is de-identified through the CompreHensive Digital ArchiVe of Cancer Imaging de-identification system (CHAVI-DDIS) [8]. The clinical data of the patients are de-identified and harmonized. The de-identified patient health records (PHR) are kept in JavaScript Object Notation (JSON). The databank uses MySQL, a relational database management system (RDBMS), for archiving the dataset. Clinical details, PHR, and radiological image references are stored in the MySQL RDBMS. The DICOM objects (images, structure sets, plans, doses) are stored as files in the file system. A semantic data model is designed in a relational database schema. A brief description of the different layers of the software architecture is provided below.
Presentation Layer (PL)
The presentation layer provides interfaces to the users of the system. The end users access this tier through a graphical user interface (GUI). The presentation layer interacts directly with the business logic layer. Its primary function is to translate user tasks into requests and present the results in an organized form that the user can understand. We use HTML, CSS, JavaScript, and jQuery to design the GUI. Java Server Pages (JSP) communicate with the BLL.
Business Logic Layer (BLL)
This layer implements the processing and transactions of data using domain knowledge. It acts as an intermediary between the PL and the DAL. The BLL coordinates the users’ input commands and makes logical decisions. The BLL resides strictly on the server side and is isolated from the other layers in the system. A client application’s business logic module tends to be interlinked with the user interface for handling user queries. If logic in the BLL requires regular modification during client application development, the development of the client and server parts becomes tightly coupled. The BLL should therefore be designed so that it can run independently. We build this layer with Java remote method invocation (RMI), which runs on the server side and facilitates remote communication within the application. It allows an object to invoke methods on another object running in a different Java virtual machine (JVM) through two Java objects, the stub and the skeleton. In the databank, there are three types of users, namely data providers, data viewers, and administrators. The BLL forwards user requests to the DAL, which processes the queries and generates the results. The business data are then formed for presentation; these data include textual or graphical representations of business entities.
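To make the PL-BLL boundary concrete, the following is a minimal Java RMI sketch; the interface and method names (QueryService, searchStudies) are hypothetical and only illustrate how a remote object is exported and bound so that the presentation layer can call it through a stub.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.Arrays;
import java.util.List;

// Hypothetical remote interface exposed by the BLL; the PL only sees this contract.
interface QueryService extends Remote {
    List<String> searchStudies(String diagnosisSite, String modality) throws RemoteException;
}

// Server-side implementation running in its own JVM; the stub is created when the object is exported.
class QueryServiceImpl extends UnicastRemoteObject implements QueryService {
    protected QueryServiceImpl() throws RemoteException { super(); }

    @Override
    public List<String> searchStudies(String diagnosisSite, String modality) throws RemoteException {
        // In the real system this would delegate to the DAL; here we return a placeholder result.
        return Arrays.asList("StudyInstanceUID-placeholder");
    }
}

public class BllServer {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.createRegistry(1099);   // default RMI registry port
        registry.rebind("QueryService", new QueryServiceImpl());   // publish the remote object
        System.out.println("BLL remote object bound and waiting for PL calls");
    }
}
```

On the client side, the JSP layer would obtain the stub with `(QueryService) LocateRegistry.getRegistry("bll-host", 1099).lookup("QueryService")` and never open a database connection itself.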
Data Access Layer (DAL)
The DAL manages the different types of datasets in the RDBMS. Storage and preservation of the business dataset are delegated to the DAL, which interfaces with the database server. The information is passed back to the BLL for processing and eventually to the end user. The Java RMI feature facilitates database server isolation. If required, the WEB and database servers can be kept in different locations and interact with each other through remote communication. The presentation layer does not have direct access to the data access layer: queries are submitted to the business logic layer, which forwards them to the data access layer, and the data access layer does not accept any relational queries from the presentation layer. Every data transaction occurs in a single session.
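A minimal sketch of a DAL helper is shown below, assuming plain JDBC against the MySQL server; the column names are illustrative, and only the table name dcmstudyinfo follows the schema described in the next section.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Sketch of a DAL helper: the BLL never builds SQL itself, it only calls methods like this one.
public class StudyDao {
    private final String url;      // e.g. "jdbc:mysql://db-host:3306/chavi" (illustrative)
    private final String user;
    private final String password;

    public StudyDao(String url, String user, String password) {
        this.url = url; this.user = user; this.password = password;
    }

    // Returns the file-system paths of all studies of a given modality.
    // Column names (study_path, modality) are assumptions; the table name follows the schema description.
    public List<String> findStudyPathsByModality(String modality) throws Exception {
        List<String> paths = new ArrayList<>();
        String sql = "SELECT study_path FROM dcmstudyinfo WHERE modality = ?";
        try (Connection con = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, modality);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    paths.add(rs.getString("study_path"));
                }
            }
        }
        return paths;
    }
}
```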
Data Mapping and Uploading
DICOM Dataset Uploading
The de-identified DICOM objects maintain a semantic structure based on the DICOM hierarchical model. They are kept in a directory of the CHAVI-DDIS. The designated folder also contains the associated study-related information in a JSON file. As shown in Fig. 2, the database schema of the CHAVI databank follows a predefined structure so that de-identified radiological and clinical data map one to one onto the database tables. The radiological data are stored in the DICOM study information (dcmstudyinfo), DICOM series information (dcmseriesinfo), and DICOM metadata (dcmmetadata) tables. The project and patient information tables are common to radiological and clinical data. All remaining tables store the clinical data. The details of the tables are described in [9]. All database tables are summarized in Table 1.
Table 1.
Table Name | Description | Table Name | Description |
---|---|---|---|
General Records | |||
Project Information | Project details are stored in this table. A project is a specific disease site. | Patient Information | This table stores the patient unidentifiable demographic information. |
DICOM Records | |||
DICOM Study Information | It stores the DICOM study related information. | DICOM Series Information | This table contains the DICOM series information. |
DICOM Metadata | A few metadata are extracted from the de-identified DICOM and stored in this table. | ||
Clinical Records | |||
Diagnosis | It contains diagnosis details and anatomic site. | Stage Information | All the staging details are stored in the table. |
Habits | This table includes present and past substance abuse-related habits. | Family History | It contains a few specific details such as the relationship of the affected family member with malignancy, survival status, type of cancer the family member had, and the age of the family member at the diagnosis of his/her cancer. |
Personal History | It contains personal history details such as marital and immunization status. | Menstrual History | This table stores menstrual history details. |
Comorbidity | The table includes the comorbidity details of the patient. | Survival | It tracks the survival status of the patient (dead or alive). |
Nodal Status | It stores nodal involvement noted in pathology. | Recurrence | This table allows storing data on tumor responses to therapy also. |
Tumor Pathology | This table contains pathological information on the tumor. | Tumor Margin | It stores the information on margin status of the tumor. |
Tumor Description | This table stores information on the gross description of the tumor. | ||
Genomics Information | |||
Expression Profile | It contains genetic information when expression changes. | Fish Profile | It includes the genetic details with FISH. |
Mutation profile | It consists of mutation type and genetic details. | IHC profile | This table contains the name of the protein tested for using IHC and results. |
Imaging Information | |||
Imaging Data | It holds the image type and date. | Imaging Features | It contains the imaging details like timing, site abnormality, edema etc. |
Treatment Information | |||
Immunotherapy | It stores immunotherapy treatment details. | Chemotherapy | It holds the intent of chemotherapy treatment, regimen, date, etc. |
Radiotherapy | This table stores the treatment details of the radiotherapy. | Drug Information | It includes the drug related information such as drug name, doses, units etc. |
Targeted Therapy | The table contains the treatment details of targeted therapy. | Endocrine Therapy | It stores the endocrine therapy treatment details. |
Surgery | This table holds the surgery date, details, sentinel node biopsy status, etc. | Other Attributes | A few de-identified clinical data elements do not fit into the tables listed above. This table accommodates that information as key-value pairs. |
The de-identified dataset is exported to an external media drive and then uploaded directly to the databank. The uploading interface also accepts the de-identified dataset in a folder. The system first extracts the metadata from the DICOM objects. The unique identifiers at each level are Patient ID, Study Instance UID, Series Instance UID, and SOP Instance UID. These de-identified UIDs and related information are stored in three different tables. The DICOM object contains the patient ID, age, sex, and date of birth, which are stored in the patient information table. Study-related information such as the Study Instance UID, Frame of Reference UID, image type, modality, study date, and the reference path of the DICOM study is stored in the study information table. The series information table holds the associated Patient ID, Study Instance UID, and Series Instance UID. This relationship is also maintained in the database schema through key referencing, so the referential integrity of the de-identified DICOM data is preserved after uploading. After processing the DICOM files, the JSON file from the same directory is parsed. The databank extracts the following information: anatomic site, image type, and laterality. These data are linked to the corresponding DICOM study. The databank also stores a few selected metadata in a single table. The DICOM objects are stored in the file system while the path references are kept in the MySQL database.
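The upload step can be sketched as follows, assuming the dcm4che3 library for reading DICOM attributes (the toolkit actually used is not named here) and illustrative column names; only the table name dcmstudyinfo follows the schema description.

```java
import org.dcm4che3.data.Attributes;
import org.dcm4che3.data.Tag;
import org.dcm4che3.io.DicomInputStream;

import java.io.File;
import java.sql.Connection;
import java.sql.PreparedStatement;

// Sketch of the upload step: read the hierarchy UIDs from a de-identified DICOM file,
// register the study in the study information table, and keep only the file path in the database.
public class DicomRegistrar {

    public void register(File dicomFile, Connection con) throws Exception {
        Attributes attrs;
        try (DicomInputStream din = new DicomInputStream(dicomFile)) {
            attrs = din.readDataset(-1, -1);   // read the complete de-identified data set
        }
        String patientId = attrs.getString(Tag.PatientID);
        String studyUid  = attrs.getString(Tag.StudyInstanceUID);
        String modality  = attrs.getString(Tag.Modality);
        String studyDate = attrs.getString(Tag.StudyDate);

        // Column names are assumptions; only the table name dcmstudyinfo follows the schema description.
        String sql = "INSERT INTO dcmstudyinfo (patient_id, study_uid, modality, study_date, study_path) "
                   + "VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, patientId);
            ps.setString(2, studyUid);
            ps.setString(3, modality);
            ps.setString(4, studyDate);
            ps.setString(5, dicomFile.getAbsolutePath());  // the DICOM object itself stays on the file system
            ps.executeUpdate();
        }
    }
}
```

The Series Instance UID and SOP Instance UID would be registered analogously in the dcmseriesinfo and dcmmetadata tables, preserving the key references described above.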
Clinical Data Archiving Process
The de-identified clinical data are accumulated in a JSON file with the associated Project ID. A project is defined for a particular cancer type (e.g., glioblastoma, head and neck cancer). A JSON file may contain several patients’ medical records. Each group of similar data is collected in a JSON array of objects. A single record is associated with the corresponding table and attribute, and this association is maintained in the JSON. A JSON object is distinguished by the Patient ID and the Object ID; the Object ID indicates the instance number of the record.
Once the JSON file is imported into the databank, the system first reads the Project ID from the JSON and maps all the data to the corresponding project. The database has four levels of hierarchy, as shown in Fig. 2. All tables that store treatment and therapy details are connected to the stage information. Stage information is functionally dependent on diagnosis, and diagnosis records are dependent on patient information. The dependent data may appear before their parents in the first pass over the JSON; for example, if diagnosis data are fetched before patient information, or stage information is extracted before diagnosis, the system does not allow the data to be inserted into the database. To overcome this, the system sorts the JSON objects according to the functional dependency and then uploads the data sequentially to the databank. The process of data insertion into the CHAVI databank is shown in Algorithm 1.
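Algorithm 1 itself is not reproduced here; the sketch below only illustrates the dependency-ordering idea with hypothetical class and field names: clinical JSON objects are sorted by their level in the patient → diagnosis → stage → treatment hierarchy before insertion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the dependency-ordered upload of clinical JSON objects.
// The level assignment encodes the four-level hierarchy patient -> diagnosis -> stage -> treatment.
public class ClinicalUploadOrder {

    private static final Map<String, Integer> LEVEL = new HashMap<>();
    static {
        LEVEL.put("patientinformation", 0);
        LEVEL.put("diagnosis", 1);
        LEVEL.put("stageinformation", 2);
        // every treatment/therapy table depends on stage information and is treated as level 3
    }

    static int levelOf(String tableName) {
        Integer level = LEVEL.get(tableName.toLowerCase());
        return level != null ? level : 3;
    }

    // A clinical JSON object carries its target table name, as maintained in the JSON file.
    static class ClinicalRecord {
        final String tableName;
        final String patientId;
        final String payload;   // the raw JSON object as text (placeholder)
        ClinicalRecord(String tableName, String patientId, String payload) {
            this.tableName = tableName; this.patientId = patientId; this.payload = payload;
        }
    }

    // Sort records so that parent records are always inserted before their dependents.
    static List<ClinicalRecord> orderForInsert(List<ClinicalRecord> records) {
        List<ClinicalRecord> ordered = new ArrayList<>(records);
        ordered.sort(Comparator.comparingInt(r -> levelOf(r.tableName)));
        return ordered;
    }

    public static void main(String[] args) {
        List<ClinicalRecord> incoming = Arrays.asList(
                new ClinicalRecord("Radiotherapy", "P001", "{...}"),
                new ClinicalRecord("Diagnosis", "P001", "{...}"),
                new ClinicalRecord("PatientInformation", "P001", "{...}"),
                new ClinicalRecord("StageInformation", "P001", "{...}"));
        // Prints: PatientInformation, Diagnosis, StageInformation, Radiotherapy
        for (ClinicalRecord r : orderForInsert(incoming)) {
            System.out.println(r.tableName);
        }
    }
}
```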
Relationship Between Clinical and DICOM Data
The patient health records and DICOM data are uploaded independently to the system. In DICOM, the patient and study have a one-to-many relationship; similarly, patient and diagnosis have a one-to-many relationship in the patient health records. The clinical data contain the age, gender, performance status, and registration date of the patient. A few records, such as age, date of birth, and gender, are obtained from both the DICOM and the clinical data. The databank gives priority to the clinical data over the DICOM metadata: the unidentifiable demographic information extracted from the DICOM object is replaced with the demographic details from the clinical data. As shown in Fig. 4, the patient ID is obtained from both the DICOM metadata and the clinical record. Similarly, the anatomic site, image type, and laterality are taken from the user and associated with the DICOM study details; the same information is extracted from the clinical data and stored in the diagnosis table. This builds a one-to-one attribute mapping between clinical and imaging data, which helps retrieve radiological studies based on clinical queries.
Image and Clinical Data Storage
The image databank mainly focuses on storing and retrieving data efficiently. Both clinical and radiological information are stored in the RDBMS and internally connected. The CHAVI-DDIS is the only source of the de-identified data that is uploaded to the databank. The CHAVI databank first recognizes the structure of the data and then starts the uploading process. The system supports uploading the DICOM studies of multiple patients at a time. The DICOM slices, RT structure set, RT plan, and RT dose are kept in a single frame of reference, as maintained in the original DICOM, and the unique key reference is retained for each instance. The system enforces data integrity for incremental data uploads of the patients. A user interface is provided to visualize the cross-sectional view of the DICOM data and the corresponding metadata, enabling users to manually review the de-identified data after uploading. The system also provides a viewer for reviewing the DICOM volume and checking the de-identified metadata, as shown in Fig. 5. The DICOM dataset can be segmented by disease site and easily extracted. We also build a search dictionary over the DICOM metadata and index each DICOM file to make data retrieval fast.
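The search dictionary can be pictured as an inverted index from metadata values to DICOM file paths; the structure below is an illustrative in-memory sketch under that assumption, not the CHAVI implementation.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative inverted index over DICOM metadata: each metadata value points to
// the files that contain it, so site- or modality-wise retrieval avoids scanning every file.
public class MetadataIndex {

    private final Map<String, List<Path>> index = new HashMap<>();

    public void add(String metadataValue, Path dicomFile) {
        index.computeIfAbsent(metadataValue.toLowerCase(), k -> new ArrayList<>()).add(dicomFile);
    }

    public List<Path> lookup(String metadataValue) {
        return index.getOrDefault(metadataValue.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        MetadataIndex idx = new MetadataIndex();
        idx.add("CT",   Paths.get("/data/chavi/P001/study1/series1/img001.dcm"));   // hypothetical paths
        idx.add("Lung", Paths.get("/data/chavi/P001/study1/series1/img001.dcm"));
        System.out.println(idx.lookup("lung"));   // site-wise retrieval of indexed files
    }
}
```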
Incremental Data Uploading
Users are not required to upload all of a patient’s data at a time. As the de-identification system maintains referential integrity in the schema, subsequently uploaded data are mapped to the corresponding patient. This enables flexible access to the CHAVI databank.
Prevention of Errors and Misalignment of Data
Single source of truth (SSOT) is the practice of aggregating the data from the various systems of an organization into a single schema. The databank follows the SSOT principle, as the system does not allow uploading de-identified data from any de-identification tool other than CHAVI-DDIS. All data elements are archived once in a system of record (SOR) that is updated to the database and synchronized with the associated dataset. As the data are already filtered and harmonized, there is little chance of data misalignment.
Flexible Query Development Interface
A query interface is developed and integrated into the databank. The interface provides user-friendly access for retrieving oncology data from the system through multiple queries, as shown in Fig. 6. A query can be composed from the diagnosis site, study date range, patient age, DICOM modality, image type, and treatment. The diagnosis site is drawn from a predefined list of sites stored in a database table; every time de-identified clinical data are uploaded, the diagnosis site is validated against the existing list in the databank, which ensures consistent data storage. Similarly, image types are enlisted in a separate lookup table. The data can also be viewed by disease site. Researchers can download the required DICOM volumes and associated clinical data for further analysis. The databank mainly focuses on storing the complete study of the patient, covering diagnosis, treatment planning, treatment response, etc.
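A composed query of this kind can be sketched as a dynamically built, parameterized SQL statement; the join and column names below are assumptions, and only the table names follow the schema description.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of composing a flexible query from optional filters (diagnosis site, modality,
// study date range, ...). Only non-empty filters contribute a WHERE clause, and all
// values are bound as parameters to keep the query safe.
public class StudyQueryBuilder {

    private final List<String> clauses = new ArrayList<>();
    private final List<Object> params = new ArrayList<>();

    public StudyQueryBuilder diagnosisSite(String site) {
        if (site != null && !site.isEmpty()) { clauses.add("d.anatomic_site = ?"); params.add(site); }
        return this;
    }

    public StudyQueryBuilder modality(String modality) {
        if (modality != null && !modality.isEmpty()) { clauses.add("s.modality = ?"); params.add(modality); }
        return this;
    }

    public StudyQueryBuilder studyDateBetween(String from, String to) {
        if (from != null && to != null) { clauses.add("s.study_date BETWEEN ? AND ?"); params.add(from); params.add(to); }
        return this;
    }

    // Column names are assumptions; the table names dcmstudyinfo and diagnosis follow the schema description.
    public String build() {
        StringBuilder sql = new StringBuilder(
                "SELECT s.study_path FROM dcmstudyinfo s JOIN diagnosis d ON s.patient_id = d.patient_id");
        for (int i = 0; i < clauses.size(); i++) {
            sql.append(i == 0 ? " WHERE " : " AND ").append(clauses.get(i));
        }
        return sql.toString();
    }

    public List<Object> parameters() { return params; }

    public static void main(String[] args) {
        StudyQueryBuilder qb = new StudyQueryBuilder()
                .diagnosisSite("Lung").modality("CT").studyDateBetween("2018-01-01", "2019-12-31");
        System.out.println(qb.build());        // the composed SQL with placeholders
        System.out.println(qb.parameters());   // the bound filter values
    }
}
```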
Free Text Search
Non-medical professionals may not be familiar with medical terms or their lexicons and may have difficulty with the correct spelling. As shown in Algorithm 2, we use the longest contiguous subsequence algorithm [10] to suggest the exact search parameters. It is mainly developed to find correctly spelled nearby medical terms. A misspelling tolerance is set on the input string. The misspelling tolerance (MT) is the maximum number of letters that can be misplaced or replaced within an input string. If the user is searching for “Glioblastoma” and mistakenly types “Globlastama”, “Gliblastam”, “Glabloastama”, “glayoblastom”, etc., the system will suggest glioblastoma when the MT is set to two (2). Four threads run in parallel to make the search fast: the whole text data are divided equally into four parts, and each thread searches one part, reducing the search response time by roughly a factor of four. The number of threads and the division can be changed based on the size of the dataset. In our analysis, we run this search algorithm over a list of 370,104 distinct dictionary words. As shown in Table 2, we observe that the technique is very efficient for long strings (Glioblastoma, Radiotherapy, etc.). When we search for a short string (Brain, Lung, etc.) and increase the MT, the search time increases and extraneous results are produced. The MT should therefore be chosen based on the input string length, and the number of threads can be adjusted to the size of the data dictionary. A simplified sketch of the tolerance-based search is given after Table 2.
Table 2.
Correct Lexicon | Input Lexicon | Number of Output | MT | Search Time (ms) | Does Include Correct Lexicon |
---|---|---|---|---|---|
Glioblastoma | Gliblastam | 0 | 0 | 434 | No |
Glioblastoma | Gliblastam | 0 | 1 | 569 | No |
Glioblastoma | Gliblastam | 4 | 2 | 624 | Yes |
Radiotherapy | radiotharapy | 0 | 0 | 788 | No |
Radiotherapy | radiotharapy | 3 | 1 | 941 | Yes |
Radiotherapy | radiotharapy | 17 | 2 | 811 | Yes |
Brain | Brain | 869 | 0 | 1039 | Yes |
Brain | Brin | 14559 | 1 | 9350 | Yes |
Brain | Breyn | 93397 | 2 | 60736 | Yes |
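The misspelling-tolerance idea can be sketched as follows. For illustration, the sketch uses a plain Levenshtein edit distance in place of the longest-contiguous-subsequence scoring of Algorithm 2 and partitions the dictionary across four threads as described; because of this substitution, the tolerance needed to recover a term can differ slightly from the MT values in Table 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of tolerance-based term suggestion over a partitioned dictionary.
// Levenshtein distance stands in for the longest-contiguous-subsequence scoring of Algorithm 2.
public class FuzzyTermSearch {

    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Each thread scans one partition of the dictionary for terms within the misspelling tolerance.
    public static List<String> suggest(List<String> dictionary, String query, int tolerance, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (dictionary.size() + threads - 1) / threads;
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final List<String> part = dictionary.subList(
                    Math.min(t * chunk, dictionary.size()),
                    Math.min((t + 1) * chunk, dictionary.size()));
            futures.add(pool.submit((Callable<List<String>>) () -> {
                List<String> hits = new ArrayList<>();
                for (String term : part) {
                    if (editDistance(term.toLowerCase(), query.toLowerCase()) <= tolerance) hits.add(term);
                }
                return hits;
            }));
        }
        List<String> result = new ArrayList<>();
        for (Future<List<String>> f : futures) result.addAll(f.get());
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> dict = Arrays.asList("Glioblastoma", "Radiotherapy", "Brain", "Lung");
        // With the edit-distance stand-in, a tolerance of 3 recovers Glioblastoma for this input.
        System.out.println(suggest(dict, "Gliblastam", 3, 4));
    }
}
```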
Data Security and Confidentiality
The databank has role-based access control with three roles that define user-specific access limits. The data provider can upload and review the de-identified data. The data viewer can search and download the desired dataset. Administrative functions such as user management, center management, and project details are handled by the system administrator. Dataset access is possible only after user authentication. The data provider can upload the DICOM and clinical datasets of the patient and has access only to the data uploaded by them. The data provider can review the dataset after uploading it to the databank: the cross-sectional (axial) view of the DICOM volume can be visualized through the browser while the de-identified metadata are displayed alongside. The BLL of the system can be installed on a different server for the communication, which allows the database to be kept separately. Users do not have direct access to the database. All the security features are highlighted in the list below, followed by a minimal sketch of the role check:
Java RMI restricts the direct access to the data access layer.
Hypertext Transfer Protocol Secure (HTTPS) is used for secure communication over the network. In HTTPS, the communication of messages is encrypted using Secured Socket Layer (SSL) protocol.
Role-based access control is used to restrict the user access.
Users’ confidential data are kept encrypted.
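A minimal sketch of the role check behind this access control is given below; the role names follow the three user types described above, while the permission names are illustrative.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative role-based access check for the three CHAVI user types.
public class AccessControl {

    enum Permission { UPLOAD_DATA, REVIEW_OWN_DATA, SEARCH_DATA, DOWNLOAD_DATA,
                      MANAGE_USERS, MANAGE_CENTERS, MANAGE_PROJECTS }

    enum Role {
        DATA_PROVIDER(EnumSet.of(Permission.UPLOAD_DATA, Permission.REVIEW_OWN_DATA)),
        DATA_VIEWER(EnumSet.of(Permission.SEARCH_DATA, Permission.DOWNLOAD_DATA)),
        ADMINISTRATOR(EnumSet.of(Permission.MANAGE_USERS, Permission.MANAGE_CENTERS, Permission.MANAGE_PROJECTS));

        private final Set<Permission> permissions;
        Role(Set<Permission> permissions) { this.permissions = permissions; }
        boolean can(Permission p) { return permissions.contains(p); }
    }

    public static void main(String[] args) {
        Role provider = Role.DATA_PROVIDER;
        System.out.println(provider.can(Permission.UPLOAD_DATA));   // true: providers upload and review
        System.out.println(provider.can(Permission.DOWNLOAD_DATA)); // false: only viewers download
    }
}
```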
Testing and Performance Analysis
The system must be checked for how it performs under different kinds of workload in terms of stability and responsiveness; the primary goal is to identify performance bottlenecks. We perform these tests to check whether the image databank meets the expected requirements for scalability, speed, and stability. The test runs are executed on a server with 16 GB RAM, a 1 TB hard disk, and an Intel Core i7 processor; the base configuration used in our experiments is shown in Table 4. The entire system infrastructure should be analyzed thoroughly to achieve accurate results, so we outline all associated tools and software used in the testing and experimentation process in Table 4. The WEB server runs on Apache Tomcat, installed on a Linux operating system and using Java Development Kit (JDK) 1.8.0_131-b11. The data access layer uses the MySQL RDBMS. The performance of the system is measured using Apache JMeter™ [11], which also runs on JDK 1.8. Data from a total of 120 patients were uploaded to the image databank for the testing process. The disease-sitewise distribution is given in Table 3.
Table 4.
Components | System Details | Description |
---|---|---|
OS Name | Linux/4.15.0-1091-oem | Operating system on which the WEB server is installed |
Configuration | Server Configuration | 16GB Random access memory (RAM), 1TB Hard disk drive (HDD), and Intel Core i7 8th Gen processor. |
WEB Server | Apache Tomcat/7.0.82 | Tomcat provides HTTP web server environment to run the Java code. |
JVM Version | JDK 1.8.0_131-b11 | The Java Development Kit (JDK) is a software development environment, which is used for developing java applications. |
Database | MySQL (RDBMS) | MySQL is an open-source relational RDBMS based on SQL – Structured Query Language. |
WEB Browser | Google Chrome 83.0.4103.116 | Browser for accessing the system over internet/intranet. |
Load tester | Apache JMeter 5.3 | It is an open-source Java application used for load testing, measure performance, and functional behavior. |
Table 3.
Project Name | No of Study |
---|---|
Brain | 30 |
Head and Neck | 30 |
Breast | 30 |
Lung | 30 |
As shown in Table 8, a test analysis of the image databank is conducted using Apache JMeter to evaluate the performance of the system under different load conditions. Four modules are taken for this test, as summarized in Table 5: normal access to the server (home page), the user login module, de-identified DICOM/clinical data uploading, and the data retrieval module. As shown in Table 6, we summarize all the parameters measured during the tests. Throughout the testing process we compute the average response time, throughput, the standard deviation of these measures, and the fraction of failed requests (FFR). All tests are performed with an average network speed of 512 Mbps. The different types of tests executed throughout the performance analysis of the CHAVI databank are summarized in Table 7. The Tomcat server runs with its default configuration. The ramp-up period is the total amount of time taken to add all the threads to the test execution: if n threads are used for testing and the ramp-up period is set to k seconds, JMeter adds all n threads for execution within k seconds. When a bulk of requests hits the server concurrently, some requests may fail with exceptions such as NoRouteToHostException, ConnectTimeoutException, or HttpHostConnectException.
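As a worked example of the ramp-up arithmetic used in the load tests below: with $n = 5000$ threads and a ramp-up period of $k = 100$ seconds, JMeter starts $n/k = 5000/100 = 50$ new threads per second, i.e., 50 hits/second; likewise, 20000 threads over the same 100-second ramp-up corresponds to 200 hits/second.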
Table 8.
Threads Count (Users) | 10 | 50 | 100 | 200 | 500 | 1000 | 5000 | 10000 | 15000 | 20000 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Ramp-Up Period (seconds) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
Home Page | Average response time (ms) | 1 | 1 | 0.82 | 0.735 | 0.668 | 0.543 | 0.484 | 0.2494 | 2.203 | 1.092 |
Standard Deviation (ms) | 0.7 | 0.96 | 0.46 | 0.6 | 0.52 | 0.58 | 0.57 | 0.65 | 11.68 | 5.38 | |
FFR% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
Login | Average response time (ms) | 4 | 4.28 | 4.29 | 4.335 | 4.196 | 3.821 | 3.546 | 1981.51 | 542.53 | 54956.76 |
Standard Deviation (ms) | 0.45 | 1.6 | 1.85 | 3.56 | 1.67 | 1.68 | 2.49 | 5509.77 | 1475.16 | 57480.9 | |
FFR% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0 | 37.23 | |
File Upload | Average response time (ms) | 5.6 | 5.86 | 5.91 | 6.72 | 5.932 | 5.755 | 6.1534 | 7.066 | 26.57 | 232.26 |
Standard Deviation (ms) | 0.66 | 1.06 | 2.04 | 2.59 | 3.83 | 4.65 | 6.7 | 12.39 | 38.97 | 251.03 | |
FFR% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
Data Retrieval | Average response time (ms) | 10.8 | 10.4 | 11.02 | 10.88 | 10.606 | 11.14 | 10.028 | 9.56 | 12.072 | 16.845 |
Standard Deviation (ms) | 1.33 | 0.72 | 4.72 | 4.14 | 4.34 | 4.9 | 3.1 | 1.7 | 1.7 | 57.99 | |
FFR% | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 5.
Name | Details |
---|---|
Accessing the Home Page | Concurrent requests from multiple clients to the home page of the server. |
Login | Multiple users login to the system concurrently. |
File Upload | Uploading DICOM files to the system. |
Data Searching | Execution of multiple queries to fetch the desired data. |
Table 6.
Components | Description |
---|---|
Threads count | The number of concurrent users, where each thread corresponds to a user. |
Ramp-up Period | The total amount of time over which all the threads are started. |
Loop count | It defines the number of times the test plan is executed. |
Input Load/Second | Number of requests sent to the server in one second. |
Response time | Time between making a request by a client and getting a response from the server. |
Average response time | It is the average time (ms) taken between request sent and getting reply from the server. |
Throughput/Minute | The number of successful requests served by the server per minute. |
Standard Deviation | In JMeter, this is the population standard deviation (STDEVP). The standard deviation (σ) shows the mean distance of the sample response times from their average (μ) value. |
Fraction of failed requests (FFR)% | It shows the percentage of failed requests (NoRouteToHostException, ConnectTimeoutException, HttpHostConnectException, etc.) from total samples. |
Table 7.
Test Name | Description |
---|---|
Load Test | This test is performed to check whether the server is able to fulfil its specification when it runs at full capacity. |
Stress Test | It measures the behavior of the system under peak bursts of user requests. This test evaluates the limit of the specified requirements and quantifies the stress point where the server behaves abnormally. |
Spike Test | It checks the system stability during the bursts of concurrent users. |
Capacity Test | This test is conducted to find the overall capacity of the software and measures an end-point where response time and throughput become intolerable. |
Initially, the load test is performed with different input loads reflecting real-time scenarios. The number of threads starts at ten and is gradually increased to 20000. The ramp-up period is set to 100 seconds for all test cases, so we measure the system performance by increasing the number of user requests while keeping the ramp-up period constant. As shown in Fig. 7, graphs are plotted where each color represents the average response time (Fig. 7a and c) and the standard deviation (Fig. 7b) of the four tasks. Blue indicates the performance of accessing the home page; red, logging in to the system; green, the file uploading module; and violet, data searching. As shown in Fig. 7a and b, all four tasks execute successfully without stress when the user request rate is 50 hits/second. A graph is plotted in Fig. 7c to show the performance of the system when the number of threads starts at ten and increases to 5000 (50 hits/second). A few login requests fail (0.07%) when the rate reaches 100 hits/second, and the fraction of failed requests increases to 37.23% at 200 hits/second; accordingly, the average response time and standard deviation become very high, as shown in Fig. 7a and b. However, the remaining three tasks, namely browsing the WEB page, uploading multiple DICOM files, and dataset searching, run without stress up to a rate of 150 hits/second. When 20000 requests are sent within a 100-second ramp-up period, the WEB page of the image databank server can barely be browsed. We further perform a spike test and a case study of maximum load capacity on different server configurations for browsing the server page.
Stress and Spike Test
Stress and spike tests are performed to assess the system’s fluctuations and stability when bursts of concurrent user requests hit the server at the same time. The testing process starts with 100 concurrent user requests and increases successively until a non-zero FFR is observed. As shown in Fig. 8a, a graph is plotted to show the variance of the system while accessing the WEB page. We observe a 5.91% FFR when 30000 concurrent user requests are sent to the server, with a standard deviation of the average response of 69.92 ms. The system is then tested with concurrent user logins to find the stress point. As shown in Fig. 8b, the standard deviation becomes very high at the stress point, where the FFR is 0.24% and the number of users reaches 5000. Similar patterns of linear increase in standard deviation are observed for file uploading and data searching. As shown in Fig. 8c, the standard deviation is 1076.96 ms when the number of concurrent users is 5000, with an FFR of 2.24%. The image databank can process a comparatively high number of concurrent user queries when searching for a dataset, as shown in Fig. 8d.
Maximum Load Capacity with Different RAM Configuration
Load Capacity with 16 GB RAM
The maximum load capacity is tested by increasing the number of threads in constant time. The ramp-up period is kept at 10 seconds, while the loop count is 5. As shown in Fig. 9, an initial load of 1000 threads is applied and gradually increased to 30000. The average response time is 9 ms and 5.19% of requests fail to access the CHAVI databank when the number of threads is 30000. When the number of threads is decreased to 29000, the error decreases to 2.67%. When the user requests drop to 28000, the error becomes 0.00% and the throughput is 12538.4/second. However, we find the optimum throughput (14379.4/second) when the number of threads is 25000.
Load Capacity with 8 GB RAM
The complete process is then executed on an 8 GB RAM server. The FFR is almost the same as with 16 GB RAM: 5.92% when the number of threads is 30000, decreasing to 2.68% when the number of threads is decreased to 29000. There is, however, a noticeable difference in throughput and standard deviation. The maximum throughput is around 14000 requests/second with 16 GB RAM at 26000 threads, whereas it is 4002.9 requests/second on the 8 GB RAM server at 20000 threads. The 16 GB RAM server performance remains consistent as long as the number of threads is below 28000. The average response time is 24 ms with 8 GB RAM, whereas it is 13 ms with 16 GB RAM under the same load.
Load Capacity with 4 GB RAM
The same test process is executed on a server with 4 GB RAM. Initially, the test is run with 1000 threads, with the ramp-up period kept at 10 seconds and the loop count at 5. The number of threads is then increased to test the maximum load capacity, keeping the ramp-up period and loop count constant. As shown in Fig. 10, when the thread count reaches 10000, we get a fraction of failed requests of 0.71% and a response time of 3415 ms. When the number of threads is decreased to 9000, the error decreases to 0.15%. When the thread count is reduced further to 6500, the FFR becomes 0.00%; the throughput increases from 2481.2 requests/second to 2498.8 requests/second, and the average response time is 10 ms. We observe that the system can handle a maximum of around 8000 user requests; however, the server is stable and consistent only when the thread count is around 6500.
Minimum Configuration Requirements of Server
The maximum and optimum load capacities are summarized in Table 9. Based on the test runs on three different RAM configurations, we observe that the system with the lower configuration (4 GB RAM) can manage fewer than 8000 user requests in 10 seconds with zero FFR; most requests at higher thread counts fail because of connection timeouts. At that load, however, the system performance is not satisfactory. The system’s performance remains consistent only up to around 6500 user requests in ten seconds.
Table 9.
Threads Count (Users) | |||
---|---|---|---|
RAM configuration | Parameters | At Optimum Load | At Maximum Load |
4GB | Throughput/second | 2498.8 | 762.6 |
Threads count | 6500 | 8000 | |
Avg. Response Time (ms) | 10.73 | 1344.06 | |
8GB | Throughput/second | 4920.9 | 3665.5 |
Threads count | 20000 | 28000 | |
Avg. Response Time (ms) | 59.39 | 1573.28 | |
16GB | Throughput/second | 14379.4 | 12538.4 |
Threads count | 25000 | 28000 | |
Avg. Response Time (ms) | 25.58 | 37.35 |
Similarly, the 8 GB RAM configuration is stable when the thread count is below 20000, although the server can handle a maximum load similar to that of the 16 GB RAM server. When the CHAVI databank is installed on a server with 16 GB RAM, its performance remains stable even when we send 26000 requests in 10 seconds, and the FFR is still 0% when 28000 requests are sent within the same ramp-up period. Empirically, the required server configuration can thus be estimated from the expected user load on the image databank. We also observe that the Tomcat server needs to be tuned [12] when installed on a high-end server, as its default configuration does not utilize all of the available memory.
Estimation of Data Size
A few patients’ radiological and clinical data are taken from multiple disease sites. The radiological data contain diagnostic images and therapy images (treatment planning, treatment verification, and treatment response images). A statistical analysis is performed to obtain an estimated size of the complete treatment data. A histogram is plotted to visualize the data size distribution over the different types of images, as shown in Fig. 11. The total data size is measured for fifty patients. The average data size obtained from the histogram is 2.96 GB per patient (Fig. 11). The data size of the clinical records is measured in kilobytes (KB) and is kept constant in the estimation. The standard deviation (σ) of the radiological data size is 1.49 GB.
$$S(N) = N \times (\mu + 2\sigma) \qquad (1)$$

where $S(N)$ is the estimated storage requirement for $N$ patients, and $\mu$ and $\sigma$ are the mean and standard deviation of the per-patient radiological data size.
We provide a tolerance of twice the standard deviation for computing the storage requirement. The data size estimate can then be calculated from Eq. 1. As an example, we provide a typical figure for storing the data of 1000 patients.
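As an illustrative calculation, assuming Eq. 1 allocates one mean plus two standard deviations per patient, the per-patient allowance is $2.96 + 2 \times 1.49 = 5.94$ GB, so storing the data of 1000 patients would require roughly $1000 \times 5.94 \text{ GB} \approx 5.94$ TB, with the clinical records adding only a negligible number of kilobytes per patient.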
Conclusion
The CHAVI image databank is developed to make radiomics data available for research and analysis. The proposed system accommodates annotated image data of cancer patients in the databank. The system provides a comprehensive platform to upload diagnostic and annotated images with a patient’s corresponding clinical information. It supports flexible queries from end users through an appropriate choice from a set of attributes, and a GUI is provided for the various query filters. We perform different types of testing and performance analysis on the CHAVI databank, and we provide an estimate of the data size for N patients. Currently, the prototype system is installed at Tata Medical Center, Kolkata. We plan to make the server publicly accessible so that the de-identified data can be accessed.
Acknowledgements
We would like to extend our gratitude to all the patients who gave consent for the use of their data in this study. This project is funded under the National Digital Library of India (NDLI), sponsored by the Ministry of Human Resource Development (MHRD), Govt. of India.
Funding Information
This study has been funded by the Ministry of Education, formerly the Ministry of Human Resource Development, India (IIT/SRIC/CS/NDM/2018-19/096).
Declarations
Compliance with Ethical Standards
The CHAVI databank complies with the principles of data being findable, accessible, and interoperable [13].
Findable: Each data item stored in the database corresponds to a unique identifier. We also maintain referential data integrity among dependent data, which ensures data findability.
Accessible: The data viewer is a specific user role in the databank. Users can browse data and view them, and the data can be accessed with proper authentication and authorization.
Interoperable: The structure of the DICOM data is kept intact even after de-identification, so these data can be incorporated into other applications. The clinical data of the patients are kept in a relational database, so both DICOM and clinical data can interoperate with other similar applications. The reusable aspect is not yet specified; we are in discussion with the stakeholders to define the data reuse terms and conditions.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. Journal of Digital Imaging. 2013;26(6):1045-1057. doi: 10.1007/s10278-013-9622-7.
- 2. Apache Software Foundation. The HTTP Connector, 2020 (accessed July 26, 2020). https://tomcat.apache.org/tomcat-7.0-doc/config/http.html.
- 3. Gatidis S, Heber SD, Storz C, Bamberg F. Population-based imaging biobanks as source of big data. La Radiologia Medica. 2017;122(6):430-436. doi: 10.1007/s11547-016-0684-8.
- 4. Gillies RJ, Kinahan PE, Hricak H. Radiomics: images are more than pictures, they are data. Radiology. 2016;278(2):563-577. doi: 10.1148/radiol.2015151169.
- 5. Halili EH. Apache JMeter: A Practical Beginner's Guide to Automated Testing and Performance Measurement for Your Websites. Packt Publishing Ltd, 2008.
- 6. Karlin S, Ost F. Counts of long aligned word matches among random letter sequences. Advances in Applied Probability. 1987;19(2):293-351. doi: 10.2307/1427422.
- 7. Kundu S, Chakraborty S, Chatterjee S, Achari RB, Mukhopadhyay J, Das PP, Mallick I, Arunsingh M, Bhattacharyyaa T, et al. De-identification of radiomics data retaining longitudinal temporal information. Journal of Medical Systems. 2020;44(5):99. doi: 10.1007/s10916-020-01563-0.
- 8. Kundu S, Chakraborty S, Mukhopadhyay J, Das S, Chatterjee S, Basu Achari R, Mallick I, Pratim Das P, Arunsingh M, Bhattacharyya T, et al. Research goal-driven data model and harmonization for de-identifying patient data in radiomics. Journal of Digital Imaging. 2021;34(4):986-1004.
- 9. Maji AK, Mukhoty A, Majumdar AK, Mukhopadhyay J, Sural S, Paul S, Majumdar B. Security analysis and implementation of web-based telemedicine services with a four-tier architecture. In: 2008 Second International Conference on Pervasive Computing Technologies for Healthcare, pages 46-54. IEEE, 2008.
- 10. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine. 2015;12(3).
- 11. Volzke H. Study of Health in Pomerania (SHIP): concept, design and selected results. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz. 2012;55(6-7):790-794.
- 12. Vukotic A, Goodwill J. Apache Tomcat 7. Springer, 2011.
- 13. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016;3(1):1-9.