Algorithmic Distribution of Applied Learning on Big Data

Author:

Shukla, Manu

Description:

Machine learning and graph techniques are complex and challenging to distribute. Generally, they are distributed by modeling the problem much as a single-node sequential technique would, except that the technique is applied to smaller chunks of data and compute and the results are then combined. These approaches focus on stitching together the results from the smaller chunks so that the outcome is as close as possible to the sequential result on the entire data. This approach is not feasible for numerous kernel, matrix, optimization, graph, and other techniques in which the algorithm needs access to all the data during execution. In this work, we propose key-value pair based distribution techniques that are widely applicable to statistical machine learning techniques as well as to matrix, graph, and time series based algorithms. The crucial difference from previously proposed techniques is that all operations are modeled as key-value pair based fine- or coarse-grained steps. This allows flexibility in distribution with no compounding error in each step. The distribution is applicable not only to robust disk-based frameworks but also to in-memory systems without significant changes. Key-value pair based techniques can also generate the same result as sequential techniques, with no edge or overlap effects to resolve in structures such as graphs or matrices. This thesis focuses on key-value pair based distribution of applied machine learning techniques on a variety of problems. For the first method, key-value pair distribution is used for storytelling at scale. Storytelling connects entities (people, organizations) using their observed relationships to establish meaningful storylines. When performed sequentially, these computations become a bottleneck because the massive number of entities makes space and time complexity untenable.
We present DISCRN, or DIstributed Spatio-temporal ConceptseaRch based StorytelliNg, a distributed framework for performing spatio-temporal storytelling. The framework extracts entities from ...

Publisher:

Virginia Tech

Contributors:

Computer Science ; Lu, Chang-Tien ; Ramakrishnan, Naren ; Chen, Ing-Ray ; Xuan, Jianhua ; Zhang, Jianping

Year of Publication:

2020-10-16

Document Type:

Dissertation ; [Doctoral and postdoctoral thesis]

Subjects:

Big Data ; Distributed Machine Learning ; In-Memory Distribution ; Graph Distribution

DDC:

004 Data processing & computer science (computed)