Sustainable Computing Lab

Optimizing Distributed Machine Learning for Transient Servers

Large-scale distributed machine learning (ML) is expensive to run on cloud platforms. Cloud platforms now offer cheap transient servers that lower this cost, but they may revoke these servers at any time. This project redesigns traditional distributed ML algorithms to use looser forms of synchrony, and designs adaptive policies that select transient servers based on their performance, cost, and volatility, so that distributed ML runs efficiently on them. This project is funded by the National Science Foundation under grant CNS-1908536.

Publications

Understanding the Synchronization Costs of Distributed ML on Transient Cloud Resources

Pradeep Ambati, David Irwin, Prashant Shenoy, Lixin Gao, Ahmed Ali-Eldin, and Jeannie Albrecht. IC2E 2019

Sync-on-the-fly: A Parallel Framework for Gradient Descent Algorithms on Transient Resources

Guoyi Zhao, Lixin Gao, and David Irwin. BigData 2018
