
Could you explain more? Sounds like a fun problem.



I'll do my best! Dodging proprietary bits (and bad term usage) along the way :D

First, some background: we had to pivot from one NFS instance to another. Assume the data was already consistent.

The goal was to minimize observable disruption to the guests. We could pause time, but couldn't restart the instances -- the services involved should be unaware.

Processes hold on to the files they have open. This is pretty well understood - many have heard of file descriptors.

These are very sticky -- particularly for things on mounted filesystems. This is where my path to glory appeared.
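To make the "sticky" part concrete, here's a rough sketch (not what we ran, just an illustration on a plain local temp directory, assuming a POSIX system). The file name disk.img is made up. It opens a file, puts a different file at the same path, and shows the old descriptor still reads the original.

```python
import os
import tempfile


def read_after_path_swap():
    """Open a file, replace what lives at its path, then read the old descriptor."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "disk.img")  # hypothetical file name
        with open(path, "w") as f:
            f.write("old backend")

        fd = os.open(path, os.O_RDONLY)  # descriptor is bound to the original file
        try:
            os.rename(path, path + ".moved")  # the path no longer names that file
            with open(path, "w") as f:  # a different file now sits at the same path
                f.write("new backend")

            via_fd = os.read(fd, 64).decode()  # read through the old descriptor
            with open(path) as f:  # read through a fresh path lookup
                via_path = f.read()
        finally:
            os.close(fd)
        return via_fd, via_path


if __name__ == "__main__":
    print(read_after_path_swap())
```

The descriptor follows the file it was opened on, not whatever the path resolves to later.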

The thinking was... as long as that path was there when the VM process was resumed, we'd be fine...

In reality, we weren't! The kernel isn't really concerned with the fully qualified path.

From the example above, /somepath is really just like "mount ID 2" to the kernel.
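If you want to poke at this yourself on Linux, the mount IDs are visible in /proc. A small sketch, with assumptions: the SAMPLE text below is invented (including the oldhost:/export source) and only mimics the /proc/self/mountinfo layout, and fd_mount_id relies on the mnt_id line in /proc/self/fdinfo, which only exists on reasonably recent Linux kernels.

```python
import os


def parse_mountinfo(text):
    """Map mount ID -> mount point from /proc/<pid>/mountinfo-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Field 1 is the mount ID, field 5 is the mount point.
        mounts[int(fields[0])] = fields[4]
    return mounts


def fd_mount_id(fd):
    """Return the mnt_id recorded for an open descriptor, or None if unavailable."""
    try:
        with open(f"/proc/self/fdinfo/{fd}") as f:
            for line in f:
                if line.startswith("mnt_id:"):
                    return int(line.split()[1])
    except OSError:
        pass  # not Linux, or the descriptor is not open
    return None


# Invented sample in mountinfo layout; IDs, devices and hosts are made up.
SAMPLE = (
    "1 0 8:1 / / rw,relatime - ext4 /dev/sda1 rw\n"
    "2 1 0:45 / /somepath rw,relatime - nfs4 oldhost:/export rw\n"
)

if __name__ == "__main__":
    print(parse_mountinfo(SAMPLE))
    print(fd_mount_id(0))
```

Comparing fd_mount_id(fd) against the live mountinfo table is one way to spot a descriptor that still points at a mount you thought you had replaced.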

In the end we had to renew those file descriptors so that they picked up the new mount IDs.
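For a feel of what "renewing" a descriptor means in general, here's a generic in-process sketch. To be clear, this is not how we did it for the VMs, it's just the textbook reopen-and-dup2 trick, run against a local temp directory where a rename stands in for the mount swap. renew_fd, demo and disk.img are all names I made up for the example.

```python
import os
import tempfile


def renew_fd(fd, path, flags=os.O_RDONLY):
    """Reopen `path` and splice the new open file into the same descriptor number."""
    offset = os.lseek(fd, 0, os.SEEK_CUR)  # remember where the old descriptor was
    new_fd = os.open(path, flags)  # fresh lookup, resolved against the current mounts
    try:
        os.lseek(new_fd, offset, os.SEEK_SET)  # restore the position
        os.dup2(new_fd, fd)  # fd now refers to the newly opened file
    finally:
        os.close(new_fd)
    return fd


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "disk.img")  # hypothetical file name
        with open(path, "w") as f:
            f.write("AAAA-old")
        fd = os.open(path, os.O_RDONLY)
        try:
            first = os.read(fd, 5).decode()  # consume the first 5 bytes
            os.rename(path, path + ".old")  # stand-in for swapping the mount
            with open(path, "w") as f:
                f.write("AAAA-new")
            renew_fd(fd, path)
            rest = os.read(fd, 3).decode()  # continues at offset 5 in the new file
        finally:
            os.close(fd)
        return first, rest


if __name__ == "__main__":
    print(demo())
```

The descriptor number and offset survive, but everything behind it comes from the new lookup, which is the property we needed.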

We ended up pausing the instances, saving the memory state locally, swapping the mounts, and then resuming the VMs.

Time only briefly skipped, and we successfully moved thousands of instances from one NFS 'host' to another.



