Description:
Advancements in data acquisition technologies across different domains, from genome sequencing to satellite and telescope imaging to large-scale physics simulations, are leading to an exponential growth in dataset sizes. Extracting knowledge from this wealth of data enables scientific discoveries at unprecedented scales. However, the sheer volume of the gathered datasets is a bottleneck for knowledge discovery. High-performance computing (HPC) provides a scalable infrastructure to extract knowledge from these massive datasets. However, multiple data management performance gaps exist between big data analytics software and HPC systems. These gaps arise from multiple factors, including the tradeoff between performance and programming productivity, data growth at a faster rate than memory capacity, and the high storage footprints of data analytics workflows. This dissertation bridges these gaps by combining productive data management interfaces with application-specific optimizations of data parallelism, memory operation, and storage management. First, we address the performance-productivity tradeoff by leveraging Spark and optimizing input data partitioning. Our solution optimizes programming productivity while achieving comparable performance to the Message Passing Interface (MPI) for scalable bioinformatics. Second, we address the operating system's kernel limitations for out-of-core data processing by autotuning memory management parameters in userspace. Finally, we address I/O and storage efficiency bottlenecks in data analytics workflows that iteratively and incrementally create and reuse persistent data structures such as graphs, data frames, and key-value datastores. ; Doctor of Philosophy ; Advancements in various fields, like genetics, satellite imaging, and physics simulations, are generating massive amounts of data. Analyzing this data can lead to groundbreaking scientific discoveries. However, the sheer size of these datasets presents a challenge. High-performance computing (HPC) offers a solution to ...
Publisher:
Virginia Tech
Contributors:
Computer Science and Applications ; Feng, Wu-chun ; Pearce, Roger Allen ; Butt, Ali ; Nikolopoulos, Dimitrios S. ; Raghvendra, Sharath
Year of Publication:
2023-11-07
Document Type:
Dissertation ; [Doctoral and postdoctoral thesis]
Language:
en
Subjects:
high-performance computing (HPC) ; big data ; performance ; productivity ; storage efficiency
DDC:
004 Data processing & computer science (computed)
Rights:
In Copyright ; http://rightsstatements.org/vocab/InC/1.0/
Relations:
vt_gsexam:38428
;
http://hdl.handle.net/10919/116640
Content Provider:
VTechWorks (VirginiaTech)
- URL: https://vtechworks.lib.vt.edu/
- Continent: North America
- Country: us
- Latitude / Longitude: 37.225344 / -80.429649
- Number of documents: 104,132
- Open Access: 104,132 (100%)
- Type: Digital collection
- System: DSpace
- Content provider indexed in BASE since:
- BASE URL: https://www.base-search.net/Search/Results?q=coll:ftvirginiatec