Teleporting VMs across continents without downtime

At webapp.io, we create hundreds of thousands of VMs for our users. We want to be able to move VMs across physical machines so that we can upgrade the underlying hardware without any downtime for our users.

It turns out to be pretty hard to move a VM across physical servers without shutting it off, so let's start with an easier but related operation: snapshotting a VM to a file.

Turning VMs into files and vice-versa

It's not actually that hard to turn VMs into files and vice-versa.

VMs have a lot of "state": primarily the state of the disk (the files) and the state of the memory (the running programs).

Hypervisors (the programs which create VMs) generally provide an operation along the lines of "save all of the state to a destination." All you have to do is pause the VM, ask the hypervisor to stream the state into a file, and later restore from that file. The literal example one popular hypervisor, QEMU, gives is (qemu) migrate "exec:cat > memstate"

Steps to snapshot a VM to a file:

  1. Pause the VM
  2. Stream the memory state to a file
  3. Save the disk (cp old-disk /backups/1/disk.backup)
  4. Wake the VM

That's exactly how providers like DigitalOcean do it when you click "snapshot VM."
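
As a rough sketch, here's what those four steps look like from QEMU's monitor (this assumes the VM was started with a -monitor flag like the one in the example later in this post, and the backup paths are made up):

# Connect to the VM's monitor
telnet localhost 44531
(qemu) stop                                       # 1. pause the VM
(qemu) migrate "exec:cat > /backups/1/memstate"   # 2. stream the memory state to a file

# In another shell, while the VM is still paused:
cp old-disk /backups/1/disk.backup                # 3. save the disk

# Back in the monitor:
(qemu) cont                                       # 4. wake the VM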

Why not just stream the VM to a file, copy the file onto another server, and wake it up there?

It's too slow for most use-cases. It can take 5 or more minutes to do these four steps plus the copy, and the VM would be hibernated the entire time. Most people can't afford a random 5-minute outage on their VMs, so we need to find a more efficient approach.
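
For reference, the naive approach looks something like this (the paths are reused from the snapshot sketch above, and details like the disk format are assumptions):

# Copy the snapshot files to the other server
scp /backups/1/memstate /backups/1/disk.backup server2:/backups/1/

# On server2, boot a VM that restores from the saved memory state
qemu-system-x86_64 \
	-m "512m" \
	-drive "id=root,file=/backups/1/disk.backup,format=qcow2,if=none" \
	-device "virtio-blk-pci,drive=root" \
	-incoming "exec:cat /backups/1/memstate"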

Half the battle: Network disks

If we could avoid copying the entire disk of the VM across servers, we could save a lot of time.

Luckily, network storage like NFS, or its cloud-provider cousins like AWS EBS, is mature and relatively easy to set up.

All we'd have to do is have the new VM read its disk from the old server over NFS, and we wouldn't have to copy anything at all.
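
Here's a minimal sketch of that setup (the hostnames and the /var/lib/vms path are made up for illustration):

# On the old server: export the directory that holds the VM's disk image
echo "/var/lib/vms  newserver(rw,sync,no_subtree_check)" >> /etc/exports
exportfs -ra

# On the new server: mount it at the same path, so the VM's disk path doesn't change
mount -t nfs oldserver:/var/lib/vms /var/lib/vms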

Making things faster: CacheFS

If you teleported a VM from the US to Europe without relocating the disk, things would be pretty slow. Every time the VM asked for a file, it'd have to cross the Atlantic twice!

In an ideal world, the disk would be "copied up" incrementally. The first time you read a file, it'd have to be pulled from the network drive, but subsequent reads would come from the same continent.

It'd be pretty hard to implement "transparent file copy-up" yourself, but there's luckily a Linux kernel feature which has it built in: FS-Cache (commonly known as CacheFS), which caches reads from network filesystems like NFS on local disk.

By using CacheFS, the disk is cached on the new server incrementally as files are read, and after a few minutes most reads are served locally, so we get roughly the same performance on our new server as we did on our old one.
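
A minimal sketch of turning it on, assuming a Debian/Ubuntu-style server and the made-up NFS mount from the example above:

# On the new server: install and start the userspace cache daemon that backs FS-Cache
apt-get install cachefilesd
echo "RUN=yes" >> /etc/default/cachefilesd
systemctl start cachefilesd

# Mount the NFS share with the "fsc" option so reads get cached on local disk
mount -t nfs -o fsc oldserver:/var/lib/vms /var/lib/vms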

Streaming memory state without pausing

By using NFS with a "copy up" cache like CacheFS above, we can "lazily" transfer the files from one continent to another.

Can we do the same for the memory state of the VM? This is the hardest part of our quest, and it requires some prerequisite knowledge:

Memory pages and dirty bits

In almost every VM you use, memory is split into fixed-size chunks called pages. If a computer has 4GB of memory, that would usually be represented to the operating system as roughly "1 million 4,096-byte pages".

Hypervisors often want to know when pages are being used, so they track an array of booleans called the "dirty bitmap." You can set the bit corresponding to a certain page of memory to 0, and the hypervisor will set it to 1 when that page is written to.

Here, the computer has 4GB of memory, so there are about 1 million pages of 4,096 bytes each. We ask the hypervisor to set the "written to" bit to false for all of the pages, and it will flip a page's bit to true whenever that page is written to.

var my_arr[4096]

my_arr[0] = 5

Let's say the program above executes, and my_arr happens to live in page 100 (exactly which page gets chosen doesn't matter much).

If we checked the dirty bitmap now, we'd see that page[100].is_dirty is true! The hypervisor noticed that we wrote to that page and automatically marked it as "dirty" (recently written to) for us.

All this to say, the hypervisor can tell you when a specific block of memory has changed since you last looked at it.

Can you transfer a VM's memory without pausing it?

Yes, using dirty bits! The algorithm is conceptually quite simple:

// Round 1: copy every page while the VM keeps running.
// Clear the dirty bit *before* sending, so a write that lands mid-copy
// still marks the page dirty and gets re-sent in round 2.
for i = 0; i < num_pages; i += 1 {
  pages[i].dirty = false;
  send_page_to_other_server(pages[i])
}

// Round 2: pause briefly, then re-send only the pages that changed
// while round 1 was running.
vm.pause()
for i = 0; i < num_pages; i += 1 {
  if (pages[i].dirty) {
    send_page_to_other_server(pages[i])
  }
}
wake_up_vm_on_other_server()

We transfer the entire memory once while the VM keeps running, remembering what we've sent by clearing each page's dirty bit, and then pause the VM and re-send only the pages whose dirty bit is set: exactly the pages that changed in the meanwhile. (In practice, hypervisors repeat the first loop a few times, re-sending freshly dirtied pages, until the remaining dirty set is small enough to send during a very short pause.)

Combined with the network disks above, this means we can almost instantaneously transfer the VM state.

Example (qemu)

QEMU is one of the most popular hypervisors that implements this sort of migration, and it's relatively simple to run.

# On server1, run the VM:
qemu-system-x86_64 \
	-m "512m" \
	-monitor "tcp:127.0.0.1:44531,server,nowait" \
	-drive "id=root,file=/mnt/networkdrive/disk.qcow2,format=qcow2,if=none" \
	-device "virtio-blk-device,drive=root"

        
# On server2, run a blank "recipient VM"
qemu-system-x86_64 \
	-m "512m" \
	-monitor "tcp:127.0.0.1:44531,server,nowait" \
	-drive "id=root,file=/mnt/networkdrive/disk.qcow2,format=qcow2,if=none" \
	-device "virtio-blk-device,drive=root"
	-incoming tcp:0:4444

# On server1, initiate the migration. The "monitor" flag above lets us connect to port 44531 to run administration commands on the VM

telnet localhost 44531
> migrate -d tcp:<server2 ip>:4444
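
You can watch the migration's progress from the same monitor session; it's not required, but it's handy for seeing when the handoff completes:

# Still on server1's monitor:
> info migrate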

Conclusion

It's possible to teleport VMs across continents with little to no downtime, and little to no performance impact. Doing so lets you do all sorts of things you wouldn't otherwise be able to:

  • You can upgrade a VM's CPU or memory by moving it to a more powerful machine.
  • If your customers are worldwide (like ours!), you can dynamically move API servers around the world depending on where people are making requests from.
  • You can maintain servers without causing any meaningful downtime to their VMs. That means you can hypothetically have VMs that never completely shut down!