Raunaq Pahwa
CS Grad at SBU, NY

Building a Remote Code Execution Service

May 15, 2024

I’ve always been intrigued by online code execution platforms like LeetCode and HackerRank. I decided to build a service of my own to understand what goes into the code execution functionality of these platforms.

I’ll mainly discuss the low-level design and implementation in Java and Spring Boot.

Object Model

Endpoints

There are three POST endpoints, one per supported language, and each takes the code as a string in the request body. The service supports Java, Python and JavaScript and can be extended to other languages. A GET endpoint, /info, returns the supported languages and their versions.

@RestController
class ExecutionController {

  private final CodeExecutionServiceFactory codeExecutionFactory;

  @PostMapping(value = "/python")
  public CodeResult executePythonCode(@RequestBody Code code) throws CodeSizeLimitException,
            TimeLimitException, ContainerNotCreatedException {
    return codeExecutionFactory.executeCode(Constants.EXECUTE_PYTHON, code.getRawCode());
  }

  @GetMapping(value = "/info")
  public List<LanguageInfo> getSupportedLanguages() {
    return Arrays.stream(ImageInfo.values()).map(ImageInfo::getLanguageInfo)
            .sorted(Comparator.comparing(LanguageInfo::getLanguageName)).toList();
  }
}

The CodeExecutionServiceFactory class follows the Factory design pattern. It looks up the CodeExecutionService bean associated with the constant passed as the first argument of the executeCode method (Python above) and delegates the execution to it.
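To make the lookup concrete, here is a minimal sketch of a factory that dispatches on a language key, using only the JDK. The names LanguageExecutor and ExecutorFactory are hypothetical stand-ins, not the classes from the service, and the real factory resolves Spring beans rather than entries of a plain map.

```java
import java.util.Map;

// Hypothetical stand-in for the article's CodeExecutionService hierarchy.
interface LanguageExecutor {
    String execute(String rawCode);
}

// Hypothetical factory: looks up the executor registered for a language key
// and delegates the call to it.
class ExecutorFactory {
    private final Map<String, LanguageExecutor> executors;

    ExecutorFactory(Map<String, LanguageExecutor> executors) {
        // Defensive, immutable copy so the registry cannot change after startup
        this.executors = Map.copyOf(executors);
    }

    String executeCode(String language, String rawCode) {
        LanguageExecutor executor = executors.get(language);
        if (executor == null) {
            throw new IllegalArgumentException("Unsupported language: " + language);
        }
        return executor.execute(rawCode);
    }
}
```

Supporting another language then only requires registering one more entry. The controller code stays unchanged.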

Service Classes

The CodeExecutionService is an abstract class. Its abstract methods are overridden in language-specific child classes and supply what differs between languages, such as the container type and the command used to execute the code.

public CodeResult executeCode(String rawCode) throws CodeSizeLimitException,
        TimeLimitException, ContainerNotCreatedException {
  /* Get code size by converting the string to bytes 
  and throw an exception if it exceeds 1MB */
  var codeSize = codeSizeMb(rawCode);
  if (codeSize > Constants.CODE_SIZE_LIMIT) {
    throw new CodeSizeLimitException(Constants.CODE_SIZE_LIMIT_EXCEPTION);
  }
  // Get the container type from the derived class
  var container = containerManagerService.createContainer(getContainerType());
  containerManagerService.startContainer(container);

  // Escape any tokens before sending the code to execute to the container
  return executeCodeInContainer(container, getEscapeFunc().apply(rawCode));
}
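The size check above depends on a codeSizeMb helper that isn't shown. A minimal sketch of what it could look like follows, assuming the size is measured in UTF-8 bytes and that the limit is 1 MB as the comment states. The class name CodeSizeCheck and the constant are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring the size check in executeCode.
class CodeSizeCheck {
    // Assumed limit, taken from the "1MB" mentioned in the comment above
    static final double CODE_SIZE_LIMIT_MB = 1.0;

    // Size of the source in megabytes, measured on its UTF-8 encoding
    static double codeSizeMb(String rawCode) {
        int bytes = rawCode.getBytes(StandardCharsets.UTF_8).length;
        return bytes / (1024.0 * 1024.0);
    }

    static boolean exceedsLimit(String rawCode) {
        return codeSizeMb(rawCode) > CODE_SIZE_LIMIT_MB;
    }
}
```

Measuring bytes rather than characters matters for non-ASCII source, since a single character can take several bytes in UTF-8.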

Note: The inheritance in this case is only one level deep as the service is simple. In more complex scenarios, composition with interfaces, using language-specific strategies, would be a better choice.
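As a rough illustration of that composition-based alternative, the sketch below moves the language-specific parts into a strategy interface that a single runner class is composed with. All names here (LanguageStrategy, PythonStrategy, JavaScriptStrategy, ExecutionPlanner) are hypothetical, and the commands are the interpreters' standard inline-code flags, not necessarily what the service runs.

```java
import java.util.List;

// Hypothetical strategy: everything that differs per language lives here.
interface LanguageStrategy {
    String containerType();
    List<String> command(String code);
}

final class PythonStrategy implements LanguageStrategy {
    public String containerType() { return "python"; }
    // python3 -c <code> runs the code passed as an argument
    public List<String> command(String code) { return List.of("python3", "-c", code); }
}

final class JavaScriptStrategy implements LanguageStrategy {
    public String containerType() { return "javascript"; }
    // node -e <code> evaluates the code passed as an argument
    public List<String> command(String code) { return List.of("node", "-e", code); }
}

// The runner is composed with a strategy instead of being subclassed.
final class ExecutionPlanner {
    private final LanguageStrategy strategy;

    ExecutionPlanner(LanguageStrategy strategy) {
        this.strategy = strategy;
    }

    String describe(String code) {
        return strategy.containerType() + " -> " + String.join(" ", strategy.command(code));
    }
}
```

An escape function like the one returned by getEscapeFunc would fit as one more method on the strategy.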

The ContainerManager class is responsible for the lifecycle of the container. It contains methods for creating, starting and removing containers and for increasing their removal priority. It also contains an asynchronous method for removing ghost containers, which I discuss below.

public DockerContainer createContainer(String language) throws ContainerNotCreatedException {
  // Create a language-specific container
  var container = containerCreatorServiceFactory.createContainer(language);

  /* Add it to the container priority queue (synchronize as
  multiple request threads access the queue) */
  synchronized(containerQueue) {
    containerQueue.offer(container);
  }
  return container;
}

The initial design for removing containers queried the Docker daemon for the running containers and removed those that had been running longer than a set duration (30 seconds below). The method ran every 5 seconds using the @Scheduled annotation from Spring.

@Scheduled(fixedRate = 5, timeUnit = TimeUnit.SECONDS)
public void terminateContainers() {
  var currentTime = System.currentTimeMillis() / 1000;
  var containers = client.listContainersCmd().exec();
  // Bottleneck (Querying the daemon and sorting by created time, every 5 seconds)
  containers.sort(Comparator.comparingLong(Container::getCreated));
  for (var container: containers) {
    var timeDiffInSeconds = currentTime - container.getCreated();
    if (timeDiffInSeconds >= 30) {
      terminateContainer(container.getId());
    } else {
      break;
    }
  }
}

However, every query to the Docker daemon is an HTTP request over loopback. Since all the containers are created through the application, a better option is to keep a pool of containers within the application. The application can then instruct the daemon to terminate a container by its containerId without first querying for the running containers.

To do this, we use a priority queue called containerQueue. It prioritizes containers with a high removal priority followed by containers that have been running for a longer duration.

@Bean
public Queue<DockerContainer> createContainerQueue() {
  // enum RemovalPriority { HIGH, LOW } 
  return new PriorityQueue<>(Comparator.comparing(DockerContainer::getRemovalPriority)
          .thenComparing(c -> c.getContainer().getCreated()));
}

The removal priority is an enum with two values, HIGH and LOW. A container's priority is raised from LOW to HIGH when its code doesn’t finish within the timeout set for its language, for example because of an infinite loop.
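The ordering works because enums compare by declaration order, so HIGH sorts before LOW. The following self-contained sketch shows it with a simplified, hypothetical TrackedContainer class standing in for DockerContainer. The created field plays the role of the container's creation timestamp.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Declaration order matters: HIGH has the lower ordinal, so it sorts first.
enum Priority { HIGH, LOW }

// Hypothetical, simplified stand-in for the article's DockerContainer.
final class TrackedContainer {
    private final String id;
    private final Priority priority;
    private final long created; // creation time in epoch seconds

    TrackedContainer(String id, Priority priority, long created) {
        this.id = id;
        this.priority = priority;
        this.created = created;
    }

    String getId() { return id; }
    Priority getPriority() { return priority; }
    long getCreated() { return created; }
}

class RemovalOrder {
    // Same shape as the containerQueue bean: priority first, then age
    static PriorityQueue<TrackedContainer> newQueue() {
        return new PriorityQueue<>(
                Comparator.comparing(TrackedContainer::getPriority)
                        .thenComparingLong(TrackedContainer::getCreated));
    }
}
```

With this comparator, a HIGH priority container is polled before any LOW priority one, even if the LOW priority container was created earlier.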

var result = false;
// Run the code and wait for its execution for timeout seconds (language dependent)
try {
  result = client.execStartCmd(execCreateCmdResponse.getId()).exec(resultCallbackTemplate)
          .awaitCompletion(timeout, TimeUnit.SECONDS);
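} catch (InterruptedException e) {
  // Restore the interrupt flag so callers can still observe the interruption
  Thread.currentThread().interrupt();
  codeResultBuilder.appendExceptions(e.getMessage());
}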

// Increase removal priority and throw an exception
if (!result) {
  containerManagerService.increaseRemovalPriority(container);
  throw new TimeLimitException(String.format(Constants.TIME_LIMIT_EXCEPTION, timeout));
}
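The pattern above is "wait up to a timeout and get back a boolean". The same shape can be sketched with JDK classes alone, which makes it easy to try without a Docker daemon. TimedRun and completesWithin are hypothetical names, and this is an analogy to the awaitCompletion call rather than a replacement for it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedRun {
    // Returns true if the task finished within the timeout, false otherwise
    static boolean completesWithin(Runnable task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            // Still running after the timeout, e.g. an infinite loop
            return false;
        } catch (ExecutionException e) {
            // The task finished in time, although it threw an exception
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupts the worker if the task is still running
            executor.shutdownNow();
        }
    }
}
```

A false result corresponds to the branch above that raises the removal priority and throws TimeLimitException.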

One way of writing the removal method would be to take a lock on the containerQueue, identify all the containers that have been running for at least 30 seconds, and schedule them for removal. However, this approach holds the lock on the containerQueue for the whole scan, so other requests cannot add containers in the meantime.

The stopAndRemoveContainer method is asynchronous and returns a CompletableFuture<Boolean> that completes with true if the removal was successful and false otherwise. Performing the termination synchronously would’ve delayed the API response, as Docker containers take a while to shut down.

/* Bottleneck (The container queue is locked throughout, it'll 
block the addition of new containers to the queue) */
synchronized (containerQueue) {
  while (!containerQueue.isEmpty()) {
    var container = containerQueue.peek();
    var timeDiffInSeconds = currentTime - container.getContainer().getCreated();
    if (timeDiffInSeconds >= 30) {
      containerQueue.poll();
      containerRemovalService.stopAndRemoveContainer(container.getContainer()).thenAccept(isRemoved -> {
        if (isRemoved) {
          // Increment the available containers of that type
          containerCreatorServiceFactory.incrementAvailableContainers(container);
        } else {
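          // If removal fails, add it back to the queue to try again. The callback
          // may run on another thread, so it takes the lock itself.
          synchronized (containerQueue) {
            containerQueue.offer(container);
          }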
        }
      });
    } else {
      break;
    }
  }
}

A better way of writing the same code is to take the lock on the containerQueue, extract a small batch of containers (up to three here) that have exceeded the time limit, release the lock, and only then try to remove them. This frees up the containerQueue for other operations, such as other requests adding new containers.

var containersToRemove = new ArrayList<DockerContainer>();
// Synchronize (take a lock) on containerQueue
synchronized (containerQueue) {
  int numContainersToRemove = Math.min(containerQueue.size(), 3);
  for (int i = 0; i < numContainersToRemove; i++) {
    var container = containerQueue.peek();
    var timeDiffInSeconds = currentTime - container.getContainer().getCreated();
    // Add to removal list if time difference exceeds 30 seconds
    if (timeDiffInSeconds >= 30) {
      containersToRemove.add(containerQueue.poll());
    } else {
      break;
    }
  }
}
// Lock released (containerQueue is free for access by other requests)
for (var container: containersToRemove) {
  // Try removing containers
  containerRemovalService.stopAndRemoveContainer(container.getContainer()).thenAccept(isRemoved -> {
    if (isRemoved) {
      containerCreatorServiceFactory.incrementAvailableContainers(container);
    } else {
      // If removal fails, synchronize, add to containerQueue again
      synchronized (containerQueue) {
        containerQueue.offer(container);
      }
    }
  });
}
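The "drain a small batch under the lock, then work outside it" step can be isolated into a small helper that is easy to test on its own. In the sketch below the queue holds creation timestamps only, to keep it free of Docker types. ExpiredBatch, drainExpired and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class ExpiredBatch {
    // Removes up to maxBatch entries from the head of the queue whose age is
    // at least maxAgeSeconds. The lock is held only while draining, so slow
    // work on the returned entries never blocks producers.
    static List<Long> drainExpired(Queue<Long> createdTimes, long now,
                                   long maxAgeSeconds, int maxBatch) {
        List<Long> expired = new ArrayList<>();
        synchronized (createdTimes) {
            while (expired.size() < maxBatch) {
                Long head = createdTimes.peek();
                // Stop at an empty queue or at the first entry that is too young
                if (head == null || now - head < maxAgeSeconds) {
                    break;
                }
                expired.add(createdTimes.poll());
            }
        }
        return expired;
    }
}
```

Capping the batch size bounds how long the lock is held per scheduled run. Anything left over is picked up on the next run.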

A scenario where a disparity might arise between the containers in the containerQueue and those within the Docker daemon occurs when the application crashes while there are pending removals of some containers. When the Spring Boot application restarts, the containerQueue will be empty and containers will keep running within the daemon.

To prevent this, we include an additional method called stopAndRemoveGhostContainers that runs every 10 minutes. I’ve used the term ghost containers as they keep running in the background and hogging resources without any use. This method queries the Docker daemon for running containers and terminates those that have been running for longer than 10 minutes.

@Scheduled(fixedRate = 10, timeUnit = TimeUnit.MINUTES)
public void stopAndRemoveGhostContainers() {
  var currentTime = (double) System.currentTimeMillis() / 1000.0;
  var containers = client.listContainersCmd().exec();
  containers.sort(Comparator.comparing(Container::getCreated));
  for (var container: containers) {
    var timeDiffInMinutes = (currentTime - container.getCreated()) / 60.0;
    if (timeDiffInMinutes >= 10) {
      containerRemovalService.stopAndRemoveContainer(container);
    } else {
      break;
    }
  }
}

Exceptions

Exceptions are handled through a @RestControllerAdvice, which centralizes exception handling. One shortcoming of this approach is that every method in the code flow needs a throws clause. This is fine as long as there are only a few exception types, but it can quickly become unmanageable. An alternative is to pass a result object through the service classes, or store it in the Spring context, and populate it whenever an error occurs in the code flow.

@RestControllerAdvice
public class ApplicationExceptionHandler {

  @ExceptionHandler({CodeSizeLimitException.class, TimeLimitException.class})
  public ResponseEntity<CodeResult> handleLimitExceptions(Exception exception) {
    CodeResult codeResult = new CodeResult.Builder().appendExceptions(exception.getMessage()).build();
    return new ResponseEntity<>(codeResult, HttpStatus.BAD_REQUEST);
  }

  @ExceptionHandler({ContainerNotCreatedException.class})
  public ResponseEntity<CodeResult> handleContainerNotCreatedException(Exception exception) {
    CodeResult codeResult = new CodeResult.Builder().appendExceptions(exception.getMessage()).build();
    return new ResponseEntity<>(codeResult, HttpStatus.SERVICE_UNAVAILABLE);
  }
}
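The result-object alternative mentioned above can be sketched as follows. Instead of declaring checked exceptions, each step returns an object that carries either the output or an error, and the caller checks it. Outcome and Pipeline are hypothetical names, and the 1 MB limit and the placeholder output are assumptions for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical result object: holds either an output or an error message.
final class Outcome {
    private final String output;
    private final String error;

    private Outcome(String output, String error) {
        this.output = output;
        this.error = error;
    }

    static Outcome success(String output) { return new Outcome(output, null); }
    static Outcome failure(String error) { return new Outcome(null, error); }

    boolean isSuccess() { return error == null; }
    String getOutput() { return output; }
    String getError() { return error; }
}

class Pipeline {
    static final int MAX_BYTES = 1024 * 1024; // assumed 1 MB limit

    // No throws clause: a failed check is reported through the return value
    static Outcome checkSize(String rawCode) {
        int bytes = rawCode.getBytes(StandardCharsets.UTF_8).length;
        return bytes > MAX_BYTES
                ? Outcome.failure("Code size limit exceeded")
                : Outcome.success(rawCode);
    }

    static Outcome run(String rawCode) {
        Outcome sized = checkSize(rawCode);
        if (!sized.isSuccess()) {
            return sized; // propagate the failure unchanged
        }
        // Placeholder for the actual execution step
        return Outcome.success("accepted " + rawCode.length() + " chars");
    }
}
```

The trade-off is that every caller has to remember to check isSuccess, which the compiler enforces for checked exceptions but not for a result object.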

Conclusion

While a full-fledged platform like LeetCode also involves tracking users, code execution metrics, stringent security mechanisms, leaderboards and several other features, this service captures the essence of the core code execution functionality.

You can check out the full code here: Remote Code Execution Service.