Tristan Hume

Github Resume + Project List Blog

Teleforking a process onto a different computer!

One day a coworker mentioned that he was thinking about APIs for distributed compute clusters and I jokingly responded “clearly the ideal API would be simply calling telefork() and your process wakes up on every machine of the cluster with the return value being the instance ID”. I ended up captivated by this idea: I couldn’t get over how it was clearly silly, yet way easier than any remote job API I’d seen, and also seemingly not a thing computers could do. I also kind of knew how I could do it, and I already had a good name which is the hardest part of any project, so I got to work.

In one weekend I had a basic prototype, and in another weekend I had a demo where I could telefork a process to a giant VM in the cloud, run a path tracing render job on lots of cores, then telefork the process back, all wrapped in a simple API.

Here’s a video of it running a render on a 64 core cloud VM in 8 seconds (plus 6s each for the telefork there and back). The same render takes 40s running locally in a container on my laptop:

How is it possible to teleport a process? That’s what this article is here to explain! The basic idea is that at a low level a Linux process has only a few different parts, and for each of them you just need a way to retreive it from the donor, stream it over the network, and copy it into the cloned process.

You may be thinking, “but wait, how can you replicate [some reasonable thing like a TCP connection]?” Basically I just don’t replicate tricky things so that I could keep it simple, meaning it’s just a fun tech demo you probably shouldn’t use for anything real. It can still teleport a broad class of mostly computational programs though!

What does it look like

I wrote it as a Rust library but in theory you could wrap it in a C API and then use it via FFI bindings to teleport even a Python process. The implementation is only about 500 lines of code (plus 200 lines of comments) and you use it like this:

use telefork::{telefork, TeleforkLocation};

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let destination = args.get(1).expect("expected arg: address of teleserver");

    let mut stream = std::net::TcpStream::connect(destination).unwrap();
    match telefork(&mut stream).unwrap() {
        TeleforkLocation::Child(val) => {
            println!("I teleported to another computer and was passed {}!", val);
        }
        TeleforkLocation::Parent => println!("Done sending!"),
    };
}

I also provide a helper called yoyo that teleforks to a server, executes a closure you give it, then teleforks back. This provides the illusion that you can easily run a snippet of code on a remote server, perhaps one with much more compute power available.

// load the scene locally, this might require loading local scene files to memory
let scene = create_scene();
let mut backbuffer = vec![Vec3::new(0.0, 0.0, 0.0); width * height];
telefork::yoyo(destination, || {
  // do a big ray tracing job on the remote server with many cores!
  render_scene(&scene, width, height, &mut backbuffer);
});
// write out the result to the local file system
save_png_file(width, height, &backbuffer);

Anatomy of a Linux process

Let’s look at what a process on Linux (the OS telefork works on) looks like:

Anatomy of a process diagram

So we can make a basic telefork implementation by just figuring out how to recreate the memory mappings and main thread registers. This should handle simple programs that mostly do computation without interacting with OS resources like files (in a way that needs to be teleported, opening a file on one system and closing it before calling telefork is fine).

How to telefork a process

I wasn’t the first to think of the possibility of recreating a process on another machine. I emailed @rocallahan, the author of the rr record and replay debugger to ask some questions since rr does some very similar things to what I wanted to do. He let me know of the existence of CRIU, which is an existing system that can stream a Linux process to a different system, designed for live migrating containers between hosts. CRIU supports restoring all sorts of file descriptors and other state, however the code was really complex and used lots of syscalls that required special kernel builds or root permissions. Linked from the CRIU wiki page I found DMTCP which was built for snapshotting distributed supercomputer jobs so they could be restarted later, and it had easier to follow code.

These didn’t dissuade me from trying to implement my own system since they’re super complex and require special runners and infrastructure, and I wanted to show how simple a basic teleport can be and make it just a library call. So I read pieces of source code from rr, CRIU, DMTCP, and some ptrace examples, and put together my own telefork procedure. My method works in its own way that’s a hodgepodge of different techniques.

In order to teleport a process, there’s both work that needs to be done in the source process which calls telefork, and at the call to the function which receives a streamed process on the server and recreates it from the stream (telepad). These can happen concurrently, but it’s also possible to do all the serializing before loading, for example by dumping to a file then loading later.

Below is a simplified overview of both processes, if you want to know exactly how everything happens I encourage you to read the source. It’s heavily commented, all in one file, and ordered so you can read it top to bottom to understand how everything works.

Sending a process using telefork

The telefork function is given a writeable stream over which it sends all the state of its process.

  1. Fork the process into a frozen child. It can be hard for a process to inspect its own state since as it inspects the state things like the stack and registers change. We can avoid this by using a normal Unix fork and then have the child stop itself so we can inspect it.
  2. Inspect the memory mappings. This can be done by parsing /proc/<pid>/maps to find out where all the memory maps are. I used the proc_maps crate for this.
  3. Send the info for special kernel maps. Based on what DMTCP does, instead of copying the contents of special kernel maps we remap them, and this is best done before the rest of the mapping so we stream them first without their contents. These special maps like [vdso] are used to make certain syscalls like getting the time faster.
  4. Loop over the other memory maps and stream them to the provided pipe. I first serialize a structure containing info about the mapping and then I loop over the pages in it and use the process_vm_readv syscall to copy memory from the child to a buffer, then write that buffer to the channel.
  5. Send the registers. I use the PTRACE_GETREGS option for the ptrace syscall, which allows me to get all register values of the child process. Then I just write them in a message over the pipe.

Running syscalls in a child process

In order to mold a target process into a copy of the incoming process we’ll need to get the process to execute a bunch of syscalls on itself without having access to any code, because we’ve deleted it all. Here’s how I do remote syscalls using ptrace, which is a versatile syscall for manipulating and inspecting other processes:

  1. Find a syscall instruction. You need at least one syscall instruction for the child to execute to be in an executable mapping. Some people patch one in, but instead I use process_vm_readv to read the first page of the kernel [vdso] mapping, which as far as I know contains at least one syscall in all Linux versions so far, and then search through the bytes for its offset. I only do this once and update it when I move the [vdso] mapping.
  2. Set up the registers to execute a syscall using PTRACE_SETREGS. The instruction pointer points to the syscall instruction, rax holds the Linux syscall number, and rdi, rsi, rdx, r10, r8, r9 hold the arguments.
  3. Step the process one instruction using the PTRACE_SINGLESTEP option to execute the syscall instruction.
  4. Read the registers using PTRACE_GETREGS to retreive the syscall return value and see if it succeeded.

Receiving a process using telepad

Using this primitive and ones I’ve already described we can recreate the process:

  1. Fork a frozen child. Similar to sending except this time we need a child process we can manipulate to turn it into a clone of the process being streamed in.
  2. Inspect the memory mappings. This time we need to know all the existing memory maps so we can remove them to make room for the incoming process.
  3. Unmap the existing mappings. We loop over each of the mappings and manipulate the child process into calling munmap on them.
  4. Remap the special kernel mappings. Read their destinations from the stream and use mremap to remap them to their target destination.
  5. Stream in the new mappings. Use remote mmap to create the mappings, then process_vm_writev to stream memory pages into them.
  6. Restore the registers. Use PTRACE_SETREGS to restore the registers for the main thread that were sent over, with the exception of rax which is the return value for the raise(SIGSTOP) that the snapshotted process stopped on, which we overwrite with an arbitrary integer passed to telepad.
    • The arbitrary value is used so the telefork server can pass the file descriptor of the TCP connection the process came in on, so that it can send data back, or in the case of yoyo execute a telefork back over the same connection.
  7. Restart the process with its brand new contents by using PTRACE_DETACH.

Doing more things properly

There’s a few things that are still broken in my implementation of telefork. I know how to fix them all, but I’m satisfied with how much I’ve implemented and sometimes they’re tricky to fix. This describes a few interesting examples of those things:

Even crazier ideas

I think this shows that some crazy things you might have thought weren’t possible can in fact be done given the right low level interfaces. Here’s some ideas extending on the basic telefork ideas that are totally possible to implement, although perhaps only with a very new or patched kernel:

Conclusion

I think this stuff is really cool because it’s an instance of one of my favourite techniques, which is diving in to find a lesser-known layer of abstraction that makes something that seems nigh-impossible actually not that much work. Teleporting a computation may seem impossible, or like it would require techniques like serializing all your state, copying a binary executable to the remote machine, and running it there with special command line flags to reload the state. But underneath your favourite programming language there’s a layer of abstraction where you can choose a fairly simple subset of things that make it possible to teleport at least most pure computations in any language in 500 lines of code and a single weekend. I think this kind of diving down often leads to solutions that are simpler and more universal. Another one of my projects like this is Numderline.

Of course, they often seem like extremely cursed hacks and to a large extent they are. They do things in a way nobody expects, and when they break they break at a layer of abstraction they aren’t supposed to break at, like your file descriptors mysteriously dissapearing. Sometimes though you can hit the layer of abstraction just right and handle all the cases such that everything is seamless and magic, I think good examples of this are rr (although telefork manages to be cursed enough to segfault it) and cloud VM live migration (basically telefork at the hypervisor layer).

I also like thinking about these things as inspiration for alternative ways computer systems could work. Why are our cluster computing APIs so much more difficult to use than just running a program that broadcasts functions to the cluster? Why is networked systems programming so much harder than multithreaded programming? Sure you can give all sorts of good reasons, but they’re mostly based on how difficult it would be given how other existing systems work. Maybe with the right abstraction or with enough effort a project could seamlessly make it work, it seems fundamentally possible.