From reading this launch post, I'm not convinced this is going to save too much money.
The project automatically selects the cheapest cloud to run a job, and does it there - which sounds sensible. In reality though, these jobs presumably need large volumes of input data. If your input data is in cloud A, and you run a job in cloud B, typically any cost saving from running in cloud B will be more than offset by the egress cost to get the data out of cloud A.
This project is therefore only useful for scenarios where you need to do large amounts of compute on relatively small volumes of data. Is that really a common scenario?
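To put rough numbers on that objection, here is a minimal sketch of the break-even arithmetic. The function name and every price below are hypothetical placeholders, not real quotes from any provider; plug in your own figures.

```python
def egress_break_even_hours(data_gb, egress_per_gb, hourly_a, hourly_b):
    """Hours of compute needed before moving to cloud B pays for the egress.

    All arguments are illustrative assumptions:
      data_gb        size of the input data sitting in cloud A
      egress_per_gb  what cloud A charges per GB leaving it
      hourly_a/b     hourly compute price in cloud A and cloud B
    """
    saving_per_hour = hourly_a - hourly_b
    if saving_per_hour <= 0:
        # Cloud B is not cheaper, so the move never pays off.
        return float("inf")
    one_time_egress = data_gb * egress_per_gb
    return one_time_egress / saving_per_hour


# Hypothetical numbers: 10 TB of data, $0.09/GB egress, $30/h vs $20/h.
hours = egress_break_even_hours(10_000, 0.09, 30.0, 20.0)
print(f"break-even after about {hours:.0f} hours of compute")
```

With these made-up numbers the move pays off after roughly 90 hours of compute, so whether the objection holds depends on how long the job runs relative to how much data it reads.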
I'm one of the creators of Skyplane. Skyplane can migrate large datasets between cloud regions at 10s of Gbps while compressing data to reduce egress fees. Happy to chime in!
Congrats on the launch! I had a similar idea a few years back but never managed to build it. You might want to consider other cloud providers like Sushi Cloud to get costs even lower. Happy to do an intro if it seems interesting.
I'm one of the creators of SkyPilot. Thanks for the thoughtful questions; let me take a stab:
SkyPilot is not just for multi-cloud use. It's useful for all of these scenarios:
- using a single region of one cloud
- using multiple regions of one cloud
- using multiple clouds
Data transfer between zones/regions within a cloud is much cheaper than across clouds. We see many users falling in the "one cloud" category and they frequently read 10s of TBs of data across regions to do ML training.
Finally, saving money is one of several key problems we aim to solve, and there are quite a few ways to save other than lots-of-compute-on-small-data. Other reasons why you may want to use a system like SkyPilot include:
(1) improving resource availability (big pain point for GPUs/TPUs)
(2) using one interface and knowing that your jobs can migrate across regions or clouds
And isn’t the biggest issue with running potentially large jobs in the cloud the cutoff point at which it’s cheaper to use your own hardware? After a few months or dozens of runs of your large model in the cloud, you may have reached the point where purchasing would have been cheaper.
Something that could look at your code, data, and budget and say "for up to X runs use cloud A; for more than Y runs it would be cheaper to buy or lease these GPUs" would be interesting.
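The core of such a tool would be a simple break-even calculation. A rough sketch, where the function name and all dollar amounts are hypothetical and things like depreciation, resale value, and ops time are ignored:

```python
import math


def runs_until_buying_wins(hardware_cost, own_cost_per_run, cloud_cost_per_run):
    """Smallest number of runs at which owning costs no more than renting.

    hardware_cost       up-front purchase price (assumed, one-time)
    own_cost_per_run    power, cooling, etc. per run on your own hardware
    cloud_cost_per_run  what the same run costs in the cloud
    Returns None if the cloud is never more expensive per run.
    """
    saving_per_run = cloud_cost_per_run - own_cost_per_run
    if saving_per_run <= 0:
        return None
    return math.ceil(hardware_cost / saving_per_run)


# Hypothetical: $100k of GPUs, $500 per run to operate, $3,000 per cloud run.
print(runs_until_buying_wins(100_000, 500, 3_000))  # prints 40
```

Below that run count the cloud is cheaper; above it the hardware has paid for itself, at least under these simplified assumptions.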
I think it’s common to train 100s of models on the same data for experiments. Then you would only need to copy data once to all the cloud storage and run experiments as you wish.
Also, most cloud providers don’t charge for ingress, so you could move the data from something like R2 to a cloud as many times as you want.
As outlined in the position paper (linked by another commenter), we believe such tailwinds are increasingly helping foster the "Sky" and making it much easier to move workloads between clouds.
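A quick sketch of why the one-time copy matters less as the number of experiments grows. The function and the figures are illustrative assumptions only; with a source that has no egress fee, `transfer_cost` would simply be zero.

```python
def cost_per_experiment(transfer_cost, compute_cost, n_experiments):
    """Average cost of one experiment when the data is copied only once.

    transfer_cost  one-time cost of copying the dataset (assumed)
    compute_cost   compute cost of a single experiment (assumed)
    """
    if n_experiments < 1:
        raise ValueError("n_experiments must be at least 1")
    return transfer_cost / n_experiments + compute_cost


# Hypothetical: $900 to copy the data once, $50 of compute per experiment.
print(cost_per_experiment(900.0, 50.0, 1))    # prints 950.0
print(cost_per_experiment(900.0, 50.0, 100))  # prints 59.0
```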