Research Projects


    A big data workflow is the computerized modeling and automation of a process consisting of a set of computational tasks and their data dependencies, used to process and analyze data of ever-increasing scale, complexity, and rate of acquisition. DATAVIEW is a big data workflow management system. It uses Dropbox as the data cloud and Amazon EC2 as the compute cloud. Current research focuses on performance and cost optimization for running workflows in clouds.

    Big Data Workflow Scheduling

    Workflow scheduling remains a critical function of modern data-centric workflow management systems. Cloud computing, which provides practically unlimited computing and storage resources, has enabled a new generation of data-centric workflows, called big data workflows. New big data workflow scheduling algorithms should exploit the characteristics of cloud computing, such as heterogeneous virtual machines, the elastic resource provisioning model, and the pay-as-you-go pricing model, while accounting for the time and monetary cost of transferring large amounts of data between cloud virtual machines. Because the scheduling problem is NP-hard, existing algorithms rely on heuristics, which leaves room for improvement.
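
    To make the trade-offs above concrete, here is a minimal sketch of a greedy list scheduler. It is not DATAVIEW's scheduler; it is an illustrative heuristic that walks the tasks in dependency order and assigns each one to the virtual machine that gives the earliest finish time, charging per second of use. The function name, the task and VM names, and the simplified model (one shared bandwidth figure, per-second billing, no VM start-up time) are all assumptions made for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def greedy_schedule(work, deps, data_mb, vms, bandwidth_mb_s):
    """Assign each task to the VM that gives the earliest finish time.

    work:    task -> amount of computation (abstract work units)
    deps:    task -> set of predecessor tasks
    data_mb: (pred, task) -> megabytes sent from pred to task
    vms:     VM name -> (speed in work units per second, price per second)
    All of these structures are hypothetical and exist only for this sketch.
    """
    vm_free = {vm: 0.0 for vm in vms}  # time at which each VM becomes idle
    placed, finish = {}, {}
    cost = 0.0
    for task in TopologicalSorter(deps).static_order():
        best = None
        for vm, (speed, price) in vms.items():
            # Inputs are ready once every predecessor has finished and,
            # if it ran on a different VM, its output has been transferred.
            ready = 0.0
            for pred in deps.get(task, ()):
                if placed[pred] == vm:
                    transfer = 0.0
                else:
                    transfer = data_mb[(pred, task)] / bandwidth_mb_s
                ready = max(ready, finish[pred] + transfer)
            start = max(ready, vm_free[vm])
            runtime = work[task] / speed
            end = start + runtime
            if best is None or end < best[0]:
                best = (end, vm, runtime * price)
        end, vm, task_cost = best
        placed[task], finish[task] = vm, end
        vm_free[vm] = end
        cost += task_cost  # pay-as-you-go: pay only for the time used
    return placed, max(finish.values()), cost


# Hypothetical three-task workflow: "a" feeds both "b" and "c".
work = {"a": 10, "b": 20, "c": 10}
deps = {"a": set(), "b": {"a"}, "c": {"a"}}
data_mb = {("a", "b"): 100, ("a", "c"): 100}
vms = {"small": (1.0, 0.01), "large": (2.0, 0.03)}

placed, makespan, cost = greedy_schedule(work, deps, data_mb, vms, bandwidth_mb_s=50)
print(placed, makespan, cost)
```

    A greedy choice of this kind is fast but not optimal: it ignores the price when picking a VM and never revisits an earlier assignment, which is the sort of gap that cost-aware scheduling research tries to close.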

    Data Placement in the Cloud

    In this new era of Big Data, there is a growing need to enable scientific workflows to perform computations at a scale far exceeding a single workstation's capabilities. When such data-intensive workflows run in a Cloud distributed across several physical locations, the execution time and the resource utilization efficiency depend heavily on the initial placement and distribution of the input datasets across the virtual machines in the Cloud.
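
    As a rough illustration of why initial placement matters, the sketch below places each input dataset on the virtual machine that hosts most of the tasks reading it, then reports how much data must still cross between machines. This is a deliberately simple heuristic, not the project's placement strategy. The function name, the dataset and task names, and the assumption that every remote read transfers the whole dataset once are all hypothetical.

```python
from collections import Counter


def place_datasets(sizes_gb, reads, task_vm):
    """Place each dataset on the VM that hosts most of its consumer tasks.

    sizes_gb: dataset -> size in GB
    reads:    task -> list of datasets the task reads
    task_vm:  task -> VM on which the task runs
    Returns (placement, total GB that remote tasks must still fetch).
    """
    # Count, per dataset, how many reading tasks live on each VM.
    consumers = {d: Counter() for d in sizes_gb}
    for task, datasets in reads.items():
        for d in datasets:
            consumers[d][task_vm[task]] += 1

    placement, moved_gb = {}, 0.0
    for d, counts in consumers.items():
        if not counts:
            continue  # no task reads this dataset, so leave it unplaced
        home = counts.most_common(1)[0][0]
        placement[d] = home
        # Assumption: each task on another VM fetches the full dataset once.
        remote_reads = sum(counts.values()) - counts[home]
        moved_gb += remote_reads * sizes_gb[d]
    return placement, moved_gb


# Hypothetical example: two alignment tasks on vm1 and a report task on vm2.
sizes_gb = {"genome": 50, "index": 5}
reads = {
    "align1": ["genome", "index"],
    "align2": ["genome", "index"],
    "report": ["index"],
}
task_vm = {"align1": "vm1", "align2": "vm1", "report": "vm2"}

placement, moved_gb = place_datasets(sizes_gb, reads, task_vm)
print(placement, moved_gb)
```

    In this example both datasets end up on vm1, so only the small index has to be copied to vm2; placing the large dataset on vm2 instead would force two 50 GB transfers.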

    CGM Data Analysis

    Diabetes is a complex disease, and there is considerable demand to capture its full range, including the creation of predictive algorithms for disease progression using interdisciplinary methods from modeling, computation, and medicine. The goal of this research project is to develop novel, fully automated machine learning tools that apply supervised and unsupervised machine learning techniques to classify and predict disease progression. The real patient data comes from a continuous glucose monitoring (CGM) study.