Optimizing Distributed Machine Learning for Transient Servers

Large-scale distributed machine learning (ML) is expensive to run on cloud platforms. To reduce cost, cloud platforms now offer cheap transient servers, which they may revoke at any time. This project re-designs traditional distributed ML algorithms to use looser forms of synchrony, and designs adaptive policies that select transient servers based on their performance, cost, and volatility, so that distributed ML can run efficiently despite revocations. This project is funded by the National Science Foundation under grant CNS-1908536.
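To illustrate the flavor of such an adaptive selection policy, the sketch below scores each transient-server offer by its expected cost per unit of training work, discounting throughput by the expected loss from revocations. All names, fields, and numbers here are hypothetical assumptions for illustration, not the project's actual policy:

```python
from dataclasses import dataclass

@dataclass
class ServerOffer:
    name: str
    price_per_hour: float    # current transient (spot/preemptible) price
    throughput: float        # training samples processed per hour
    revocation_prob: float   # estimated probability of revocation per hour

def expected_cost_per_sample(offer: ServerOffer,
                             recovery_overhead: float = 0.2) -> float:
    """Expected cost per useful training sample.

    Throughput is discounted by the expected recomputation overhead a
    revocation would cause; recovery_overhead (fraction of an hour's work
    assumed lost per revocation) is an illustrative assumption.
    """
    effective = offer.throughput * (1 - offer.revocation_prob * recovery_overhead)
    return offer.price_per_hour / effective

def select_servers(offers: list[ServerOffer], k: int) -> list[ServerOffer]:
    """Greedily pick the k offers with the lowest expected cost per sample."""
    return sorted(offers, key=expected_cost_per_sample)[:k]
```

A real policy would update the price, throughput, and revocation estimates online as the market and workload change; this greedy one-shot ranking only shows how the three factors (performance, cost, volatility) can be combined into a single score.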