Optimizing Distributed Machine Learning for Transient Servers
Large-scale distributed machine learning (ML) is expensive to run on cloud platforms. To reduce costs, cloud providers now offer cheap transient servers, which they may revoke at any time. This project re-designs traditional distributed ML algorithms to use looser forms of synchrony, and designs adaptive policies that select transient servers based on their performance, cost, and volatility, so that distributed ML can run efficiently on transient servers. This project is funded by the National Science Foundation under grant CNS-1908536.
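To make the adaptive-selection idea concrete, here is a minimal sketch of a policy that ranks transient-server offers by expected useful work per dollar, discounting for revocation risk. All names, the scoring heuristic, and the checkpoint-overhead parameter are illustrative assumptions, not the project's actual policy.

```python
from dataclasses import dataclass

@dataclass
class ServerOffer:
    """A hypothetical transient-server offer (illustrative fields only)."""
    name: str
    throughput: float       # training work units per hour
    price: float            # dollars per hour
    revocation_rate: float  # expected revocations per hour

def expected_useful_work(offer: ServerOffer, checkpoint_overhead: float = 0.05) -> float:
    # Discount raw throughput by revocation risk and by the fraction of
    # time assumed to be spent checkpointing (assumed 5% here).
    availability = 1.0 / (1.0 + offer.revocation_rate)
    return offer.throughput * availability * (1.0 - checkpoint_overhead)

def pick_servers(offers: list[ServerOffer], budget_per_hour: float) -> list[ServerOffer]:
    # Greedy heuristic: rank offers by expected useful work per dollar,
    # then add servers until the hourly budget is exhausted.
    ranked = sorted(offers,
                    key=lambda o: expected_useful_work(o) / o.price,
                    reverse=True)
    chosen, spent = [], 0.0
    for offer in ranked:
        if spent + offer.price <= budget_per_hour:
            chosen.append(offer)
            spent += offer.price
    return chosen
```

A volatile-but-fast server can lose to a slower, stable one under this scoring, which is the trade-off between performance, cost, and volatility the abstract refers to.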
Publications
Understanding the Synchronization Costs of Distributed ML on Transient Cloud Resources
Pradeep Ambati, David Irwin, Prashant Shenoy, Lixin Gao, Ahmed Ali-Eldin, and Jeannie Albrecht. IC2E 2019.

Sync-on-the-fly: A Parallel Framework for Gradient Descent Algorithms on Transient Resources
Guoyi Zhao, Lixin Gao, and David Irwin. BigData 2018.