I've recently been going down the rabbit hole of creating a "fast start dev env" and it's interesting to see how this article differs from other approaches (codesandbox has some fantastic blogs, the fly.io blog on sprites has interesting pointers, e2b and daytona are related open source tools). Everyone has a different solution based on their tradeoffs.
I thought the memory snapshotting part in particular was clever since most container based systems don't bother (VM/firecracker based ones can use UFFD and call it a day), but by having emulated syscalls you can actually do single-process restore pretty well.
I am a bit dubious of the use of FUSE (though it clearly works well!), and I wonder if ublk (what I ended up using) might alleviate some of the pain/magic in FUSE tuning. I'd personally also be looking at forking gVisor to take a memfd which you enable UFFD on for the page loading (I have some Firecracker patches where I do the same). It's nice because you can optimistically push pages, rather than waiting for the requests to come in. The series of three codesandbox blog posts is good background reading.
Our system somewhat predates ublk; at the time we wrote it, FUSE was the most reasonable option. Moving over to ublk would require re-architecting around block devices rather than filesystems, but is indeed something we're looking at!
Fair enough, would love to see another writeup on the performance you observe even if it fails - in practice numbers are hard to come by.
I saw plenty of cool advancements in reducing inference cold start when I was meeting with folks in person at FOSDEM this year. However, I still struggle to understand: why would folks care about this?
Major AI labs have all secured their own compute in the form of hardware, data centers, and power generation. That means their resource pool is fixed, and they can do all sorts of tricks to pre-load, pre-allocate, etc. to improve inference latency.
Cold start optimization usually matters in "cloud" environments, where your pool is flexible and you only pay for what you use. It matters less in bare-metal settings, where folks do not care as much about scaling up and down.
So my question is: who is this for? AWS and GCP running Anthropic models?
At least folks like me care about it. My local hardware is more than enough to handle my app, but given Spectrum's internet service is as fickle as a broken fiddle I'm forced to rent a dedicated cloud gpu that sits idle most days. However, I would save a serious chunk of change if I could boot up a GPU snapshot in ~10s. I evaluated various options a while back and, while modal.com was the fastest, it still took around a minute-ish. Granted, my use case is unique, but I imagine this could be a decent solution for gpu-poor ComfyUI users.
I work in a slightly different domain and I focus a lot on optimizing coldstarts.
Here's my 2 cents: improving cold starts also means utilizing resources more effectively.
From cloud providers to end users - every ms both adds up and translates to additional waste of electricity/hardware and costs.
So, does this snapshotting optimization support arbitrary containers?
I'm currently planning to deploy using Amazon SageMaker, but a cold start takes a whopping ~9 minutes: 6 minutes for instance provisioning + 3 minutes for PyTorch initialization. My Docker image is ~14 GB, and the weights are a few GB. How long would it take to cold start this configuration on Modal?
SageMaker's performance makes it pretty much useless without many warm instances around (= tens of thousands of dollars per month), because users won't be happy if they have to randomly wait 9 minutes.
Yep! That should start in ten seconds or so -- about a second per gigabyte of weights, plus a second to start the container and a few seconds to load the memory snapshot.
There are a few limitations with snapshotting, e.g. it generally fails when using multiple GPUs, which we document here: https://modal.com/docs/guide/memory-snapshots.
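For anyone who wants to plug in their own numbers, here is a back-of-envelope sketch of the estimate above. The rates come from the reply (about a second per gigabyte of weights, plus a second for the container), but the 5 GB weight size and the 3 second snapshot load are my assumptions standing in for "a few GB" and "a few seconds", and `estimate_cold_start_s` is a hypothetical helper, not anything from Modal's API.

```python
def estimate_cold_start_s(
    weights_gb: float,
    weights_s_per_gb: float = 1.0,   # "about a second per gigabyte of weights"
    container_start_s: float = 1.0,  # "a second to start the container"
    snapshot_load_s: float = 3.0,    # assumption for "a few seconds"
) -> float:
    """Rough cold start estimate in seconds; a sketch, not a benchmark."""
    return weights_gb * weights_s_per_gb + container_start_s + snapshot_load_s


# Assumed 5 GB of weights, standing in for "a few GB".
estimate = estimate_cold_start_s(weights_gb=5.0)
print(f"estimated cold start: ~{estimate:.0f} s")
```

With those assumed inputs the estimate lands in the "ten seconds or so" range quoted above; real numbers will depend on the model and hardware.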
charles, amit, can you go into more detail about the path-based caching? Particularly "shared bytes aren’t guaranteed to be in the exact same container image layer"? I've built something that solves issues around sharing data between layers, and am interested to see if it fits use cases like Modal's.
Edit: "The solution is to disaggregate the container launcher (runc for Docker, runsc for gVisor) from the container image delivery" is exactly what I've done! I've not built a lazy FUSE on top of it (yet! except for cache mounts in BuildKit), but it's on my TODO list. I guess I'm mainly curious what stops bytes from being shared in your case.
To clarify: we do content-based hashing, and when we say "shared bytes aren’t guaranteed to be in the exact same container image layer", what we mean is that
FROM some/image
RUN pip install torch==2.7.1
and
FROM another/image
RUN pip install torch==2.7.1
will produce images with very high overlap in contents, which will be shared by a content-based cache, but those images' final layers are disjoint from the perspective of a layerwise cache.
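To make the distinction concrete, here is a toy sketch of the two cache keys. It is only an illustration under stated assumptions: the file names and contents are made up, `layer_digest` hashes a layer as one unit (a simplification of how real image layers are identified), and `content_digests` hashes each file separately. None of this is Modal's actual implementation.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 of a single blob, used as a content-addressed cache key."""
    return hashlib.sha256(data).hexdigest()


def layer_digest(files: dict[str, bytes]) -> str:
    """Hash a whole layer as one unit, the way a layerwise cache keys it."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode())
        h.update(b"\0")
        h.update(files[path])
    return h.hexdigest()


def content_digests(files: dict[str, bytes]) -> set[str]:
    """Hash each file on its own, the way a content-based cache keys it."""
    return {digest(data) for data in files.values()}


# Hypothetical contents of the final layer of each image: the same torch
# files, plus one stray file that differs between the two builds.
torch_files = {
    "site-packages/torch/__init__.py": b"torch 2.7.1 init",
    "site-packages/torch/lib/libtorch.so": b"torch 2.7.1 shared library",
}
layer_a = {**torch_files, "root/.cache/pip/log-a": b"built on some/image"}
layer_b = {**torch_files, "root/.cache/pip/log-b": b"built on another/image"}

layers_match = layer_digest(layer_a) == layer_digest(layer_b)
shared = content_digests(layer_a) & content_digests(layer_b)

print("layerwise cache hit:", layers_match)
print("files shared by content hash:", len(shared))
```

In this toy setup a single differing file is enough to make the layer keys disjoint, while the per-file keys still line up for everything torch installed.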
Thank you
What is "cutting by 40x" supposed to mean?
Cutting latencies by 40x! Unfortunately couldn't fit the whole title in the character limit :<
How can you cut latency by more than 1x? I am not intending to be snarky; it just doesn’t fit my brain how you can reduce a measured time by more than the original starting time.
There are two ways to express such ratios, and both are equally valid: a speedup factor ("40x") or a percentage reduction ("97.5%"). (The "x" suffix is usually reserved for the former and "%" for the latter.)
Put differently, 1/40 is not the same as 1x - 40x. I’d phrase it as "reduced by 97.5%" or "reduced by 0.975x".
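For what it's worth, the arithmetic connecting the two phrasings is short. A minimal sketch, where `speedup_to_reduction` is a hypothetical helper name and the 40x figure is the one from the title:

```python
def speedup_to_reduction(speedup: float) -> float:
    """Fraction of the original latency removed by an N-fold speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# New latency is 1/40 of the old one, so the reduction is 1 - 1/40.
reduction = speedup_to_reduction(40.0)
print(f"40x faster = latency reduced by {reduction:.1%}")  # 97.5%
```

So "40x" describes the ratio old/new, while "97.5%" describes the share of the old latency that went away.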
probably just AI slop and using wrong semantics, they mean speedup ratio.
You're absolutely right!