I’ve always been intrigued by online code execution platforms like LeetCode and HackerRank, so I decided to build a service to understand what goes into the code execution functionality of these platforms.
I’ll mainly discuss the low-level design and implementation in Java and Spring Boot.
Object Model
- The Code class has a rawCode string field that takes the code from the user.
- The CodeResult class has three fields: stdout for the standard output stream, stderr for the standard error stream, and exceptions for Docker and API exceptions. It also includes a builder, which makes it convenient to construct the object from the Docker daemon’s streaming responses.
  Note: In interpreted languages like Python, both stdout and stderr can contain data when an error occurs after something has been printed, because the statements that ran before the error have already produced their output.
- The DockerContainer class builds on top of the Container class provided by the docker-java library. It adds a removalPriority enum field to handle long-running containers, which I’ll discuss further below.
- The ImageInfo class stores the details of the Docker images to use for the various languages, and the LanguageInfo class is a helper for a GET endpoint that returns the supported programming languages.
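To make the builder idea concrete, here is a minimal sketch of what such a result class could look like. It is a simplified stand-in rather than the service's actual class: appendExceptions and build appear later in this post, while appendStdout, appendStderr and the getters are names I'm assuming for illustration.

```java
// Simplified stand-in for the CodeResult class described above.
final class CodeResult {
    private final String stdout;
    private final String stderr;
    private final String exceptions;

    private CodeResult(Builder builder) {
        this.stdout = builder.stdout.toString();
        this.stderr = builder.stderr.toString();
        this.exceptions = builder.exceptions.toString();
    }

    String getStdout() { return stdout; }
    String getStderr() { return stderr; }
    String getExceptions() { return exceptions; }

    // Accumulates chunks as they arrive, e.g. from a streaming response.
    static final class Builder {
        private final StringBuilder stdout = new StringBuilder();
        private final StringBuilder stderr = new StringBuilder();
        private final StringBuilder exceptions = new StringBuilder();

        // appendStdout and appendStderr are assumed names, not taken from the service.
        Builder appendStdout(String chunk) { stdout.append(chunk); return this; }
        Builder appendStderr(String chunk) { stderr.append(chunk); return this; }
        Builder appendExceptions(String message) { exceptions.append(message); return this; }

        CodeResult build() { return new CodeResult(this); }
    }
}
```

Because every append method returns the builder, chunks can be added one at a time as they stream in, and the immutable result is only created once the stream ends.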
Endpoints
There are three POST endpoints that all take the code as a string in the request body.
The service supports Java, Python and JavaScript and can easily extend to other languages. A GET endpoint, /info, returns the supported languages and their versions.
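@RestController
class ExecutionController {

    private final CodeExecutionServiceFactory codeExecutionFactory;

    // Spring injects the factory through the constructor
    ExecutionController(CodeExecutionServiceFactory codeExecutionFactory) {
        this.codeExecutionFactory = codeExecutionFactory;
    }

    // Runs the submitted Python code (the other POST endpoints follow the same shape)
    @PostMapping(value = "/python")
    public CodeResult executePythonCode(@RequestBody Code code) throws CodeSizeLimitException,
            TimeLimitException, ContainerNotCreatedException {
        return codeExecutionFactory.executeCode(Constants.EXECUTE_PYTHON, code.getRawCode());
    }

    // Lists the supported languages, sorted by name
    @GetMapping(value = "/info")
    public List<LanguageInfo> getSupportedLanguages() {
        return Arrays.stream(ImageInfo.values()).map(ImageInfo::getLanguageInfo)
                .sorted(Comparator.comparing(LanguageInfo::getLanguageName)).toList();
    }
}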
The CodeExecutionServiceFactory class follows the Factory Design Pattern. It looks up the CodeExecutionService bean associated with the constant that’s passed as the first argument of the executeCode method (Python above) and delegates the execution to it.
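The lookup can be sketched without Spring as a map from a language constant to a service. Everything here (LanguageService, ExecutionServiceFactory, the string keys) is hypothetical and only illustrates the pattern; the real factory resolves Spring beans.

```java
import java.util.Map;

// Hypothetical stand-in for a language-specific execution service.
interface LanguageService {
    String execute(String rawCode);
}

// Hypothetical, Spring-free factory: maps a language constant to its service.
final class ExecutionServiceFactory {
    private final Map<String, LanguageService> services;

    ExecutionServiceFactory(Map<String, LanguageService> services) {
        this.services = Map.copyOf(services); // defensive, immutable copy
    }

    String executeCode(String languageKey, String rawCode) {
        LanguageService service = services.get(languageKey);
        if (service == null) {
            throw new IllegalArgumentException("Unsupported language: " + languageKey);
        }
        return service.execute(rawCode);
    }
}
```

Supporting a new language then comes down to registering one more entry in the map, which is what keeps the controller free of language-specific branching.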
Service Classes
The CodeExecutionService is an abstract class. It declares abstract methods that are overridden in language-specific child classes. These methods supply whatever differs between languages, such as the container type and the command used to execute the code.
public CodeResult executeCode(String rawCode) throws CodeSizeLimitException,
        TimeLimitException, ContainerNotCreatedException {
    /* Get code size by converting the string to bytes
       and throw an exception if it exceeds 1MB */
    var codeSize = codeSizeMb(rawCode);
    if (codeSize > Constants.CODE_SIZE_LIMIT) {
        throw new CodeSizeLimitException(Constants.CODE_SIZE_LIMIT_EXCEPTION);
    }
    // Get the container type from the derived class
    var container = containerManagerService.createContainer(getContainerType());
    containerManagerService.startContainer(container);
    // Escape any tokens before sending the code to execute to the container
    return executeCodeInContainer(container, getEscapeFunc().apply(rawCode));
}
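The codeSizeMb helper isn't shown in this post, so here is one way it might be written. This is a sketch under assumptions: I'm measuring the UTF-8 encoding and treating 1MB as 1024 × 1024 bytes, and the CodeSize class and exceedsLimit method are names made up for the example.

```java
import java.nio.charset.StandardCharsets;

final class CodeSize {
    static final double CODE_SIZE_LIMIT_MB = 1.0; // the 1MB limit mentioned in the comment above

    // Size of the submitted source in megabytes, measured on its UTF-8 bytes (an assumption).
    static double codeSizeMb(String rawCode) {
        return rawCode.getBytes(StandardCharsets.UTF_8).length / (1024.0 * 1024.0);
    }

    static boolean exceedsLimit(String rawCode) {
        return codeSizeMb(rawCode) > CODE_SIZE_LIMIT_MB;
    }
}
```

Measuring bytes rather than String.length() matters for non-ASCII source, since a single character can take several bytes once encoded.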
Note: The inheritance in this case is only one level deep, as the service is simple. In more complex scenarios, composing language-specific strategies behind interfaces would be a better choice.
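For comparison, the composition-based alternative could look roughly like the sketch below. All the names (LanguageStrategy, PythonStrategy, CodeExecutor, runCommand) are hypothetical, and the escaping shown is only illustrative; it is not the escape function the service uses.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical strategy: everything that differs per language sits behind one interface.
interface LanguageStrategy {
    String containerType();
    UnaryOperator<String> escapeFunc();
    List<String> runCommand(String escapedCode);
}

final class PythonStrategy implements LanguageStrategy {
    public String containerType() { return "python"; }

    // Illustrative escaping only: keeps the code intact inside a single-quoted shell argument.
    public UnaryOperator<String> escapeFunc() {
        return code -> code.replace("'", "'\\''");
    }

    public List<String> runCommand(String escapedCode) {
        return List.of("sh", "-c", "python3 -c '" + escapedCode + "'");
    }
}

// One concrete executor for every language; the strategy is injected rather than inherited.
final class CodeExecutor {
    private final LanguageStrategy strategy;

    CodeExecutor(LanguageStrategy strategy) { this.strategy = strategy; }

    List<String> buildCommand(String rawCode) {
        return strategy.runCommand(strategy.escapeFunc().apply(rawCode));
    }
}
```

With this shape, the executor never needs subclassing, and a strategy can be unit-tested on its own without starting a container.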
The ContainerManager class is responsible for the lifecycle of the container. It contains methods for creating, starting and removing containers, and for increasing their removal priority. It also contains an asynchronous method for removing ghost containers, something I’ll discuss further below.
public DockerContainer createContainer(String language) throws ContainerNotCreatedException {
    // Create a language-specific container
    var container = containerCreatorServiceFactory.createContainer(language);
    /* Add it to the container priority queue (synchronize as
       multiple request threads access the queue) */
    synchronized (containerQueue) {
        containerQueue.offer(container);
    }
    return container;
}
The initial design for removing containers involved querying the Docker daemon for the running containers and removing those that had been running longer than a set duration. It was an async method that executed every 5 seconds using the @Scheduled annotation from Spring.
@Scheduled(fixedRate = 5, timeUnit = TimeUnit.SECONDS)
public void terminateContainers() {
    var currentTime = System.currentTimeMillis() / 1000;
    var containers = client.listContainersCmd().exec();
    // Bottleneck (Querying the daemon and sorting by created time, every 5 seconds)
    containers.sort(Comparator.comparingLong(Container::getCreated));
    for (var container : containers) {
        var timeDiffInSeconds = currentTime - container.getCreated();
        if (timeDiffInSeconds >= 30) {
            terminateContainer(container.getId());
        } else {
            break;
        }
    }
}
However, queries to the Docker daemon are performed over HTTP via loopback. Since all the containers are created through the application, a better option is to keep a pool of containers within the application. This way you can instruct the daemon to terminate a container by its containerId without querying for the running containers every time.
To do this, we use a priority queue called containerQueue. It prioritizes containers with a high removal priority, followed by containers that have been running for a longer duration.
@Bean
public Queue<DockerContainer> createContainerQueue() {
    // enum RemovalPriority { HIGH, LOW }
    return new PriorityQueue<>(Comparator.comparing(DockerContainer::getRemovalPriority)
            .thenComparing(c -> c.getContainer().getCreated()));
}
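To see how this comparator orders containers, here is a small self-contained sketch. TrackedContainer and QueueOrderDemo are simplified stand-ins I've made up for the example; only the RemovalPriority enum and the ordering rule come from the code above.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Enums compare by declaration order, so HIGH sorts before LOW.
enum RemovalPriority { HIGH, LOW }

// Simplified stand-in for DockerContainer: an id, a creation time in epoch seconds, a priority.
final class TrackedContainer {
    final String id;
    final long created;
    final RemovalPriority priority;

    TrackedContainer(String id, long created, RemovalPriority priority) {
        this.id = id;
        this.created = created;
        this.priority = priority;
    }
}

final class QueueOrderDemo {
    // Same ordering rule as the bean above: priority first, then oldest creation time.
    static PriorityQueue<TrackedContainer> newQueue() {
        return new PriorityQueue<>(
                Comparator.comparing((TrackedContainer c) -> c.priority)
                        .thenComparingLong(c -> c.created));
    }
}
```

If a LOW container created at 100, a HIGH container created at 200 and a LOW container created at 50 are offered, they are polled in the order HIGH/200, LOW/50, LOW/100. One thing to keep in mind with java.util.PriorityQueue is that it doesn't reorder an element whose fields change while it sits in the queue, so a container whose priority is raised has to be removed and offered again for the new priority to take effect.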
The removal priority is an enum with two values, HIGH and LOW. The priority is increased from LOW to HIGH for long-running containers, that is, when the code doesn’t finish within the timeout set for its language (an infinite loop, for example).
var result = false;
// Run the code and wait up to timeout seconds (language dependent) for its execution
try {
    result = client.execStartCmd(execCreateCmdResponse.getId()).exec(resultCallbackTemplate)
            .awaitCompletion(timeout, TimeUnit.SECONDS);
} catch (InterruptedException e) {
    // Restore the interrupt flag so that callers can still observe the interruption
    Thread.currentThread().interrupt();
    codeResultBuilder.appendExceptions(e.getMessage());
}
// Increase removal priority and throw an exception
if (!result) {
    containerManagerService.increaseRemovalPriority(container);
    throw new TimeLimitException(String.format(Constants.TIME_LIMIT_EXCEPTION, timeout));
}
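The snippet above relies on awaitCompletion returning false when the timeout elapses before the execution finishes. The same boolean contract can be tried out with plain JDK classes; the sketch below uses a CountDownLatch as a stand-in and doesn't involve Docker at all. TimeoutDemo and runWithTimeout are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class TimeoutDemo {
    // Returns true if the task finished within the timeout, false otherwise.
    static boolean runWithTimeout(Runnable task, long timeout, TimeUnit unit) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                task.run();
            } finally {
                done.countDown(); // signal completion even if the task throws
            }
        });
        worker.setDaemon(true); // a stuck task must not keep the JVM alive
        worker.start();
        try {
            return done.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }
}
```

A task that returns immediately yields true, while a task that loops forever yields false once the timeout passes, which is the point at which the service raises the container's removal priority.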
One way of writing the removal method would be to take a lock on the containerQueue, identify all the containers that have been running for 30 seconds or more, and schedule them for removal. However, this approach holds the lock on the containerQueue throughout, making it inaccessible to other requests that need to add containers.
The stopAndRemoveContainer method is asynchronous and returns a CompletableFuture<Boolean> that completes with true if the removal was successful and false otherwise. Performing the termination synchronously would’ve led to a delayed API response, as Docker containers take a while to shut down.
/* Bottleneck (The container queue is locked throughout, it'll
   block the addition of new containers to the queue) */
synchronized (containerQueue) {
    while (!containerQueue.isEmpty()) {
        var container = containerQueue.peek();
        var timeDiffInSeconds = currentTime - container.getContainer().getCreated();
        if (timeDiffInSeconds >= 30) {
            containerQueue.poll();
            containerRemovalService.stopAndRemoveContainer(container.getContainer()).thenAccept(isRemoved -> {
                if (isRemoved) {
                    // Increment the available containers of that type
                    containerCreatorServiceFactory.incrementAvailableContainers(container);
                } else {
                    // If removal fails, add it back to the queue to try again
                    containerQueue.offer(container);
                }
            });
        } else {
            break;
        }
    }
}
A better way of writing the same code would be to take the lock on the containerQueue, extract a few containers that have exceeded the time limit, release the lock, and only then try removing those containers. This frees up the containerQueue for other operations, like other requests adding new containers to the queue.
var containersToRemove = new ArrayList<DockerContainer>();
// Synchronize (take a lock) on containerQueue
synchronized (containerQueue) {
    int numContainersToRemove = Math.min(containerQueue.size(), 3);
    for (int i = 0; i < numContainersToRemove; i++) {
        var container = containerQueue.peek();
        var timeDiffInSeconds = currentTime - container.getContainer().getCreated();
        // Add to removal list if time difference exceeds 30 seconds
        if (timeDiffInSeconds >= 30) {
            containersToRemove.add(containerQueue.poll());
        } else {
            break;
        }
    }
}
// Lock released (containerQueue is free for access by other requests)
for (var container : containersToRemove) {
    // Try removing containers
    containerRemovalService.stopAndRemoveContainer(container.getContainer()).thenAccept(isRemoved -> {
        if (isRemoved) {
            containerCreatorServiceFactory.incrementAvailableContainers(container);
        } else {
            // If removal fails, synchronize, add to containerQueue again
            synchronized (containerQueue) {
                containerQueue.offer(container);
            }
        }
    });
}
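The same drain-then-remove pattern can be exercised in isolation. In the sketch below the queue simply holds creation times in epoch seconds as a stand-in for containerQueue, and the remover function is a hypothetical replacement for stopAndRemoveContainer; BatchRemover and its method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class BatchRemover {
    static final int BATCH_SIZE = 3;
    static final long MAX_AGE_SECONDS = 30;

    // Holds the lock only long enough to pull out at most BATCH_SIZE expired entries.
    static List<Long> drainExpired(Queue<Long> queue, long nowSeconds) {
        List<Long> expired = new ArrayList<>();
        synchronized (queue) {
            while (expired.size() < BATCH_SIZE && !queue.isEmpty()
                    && nowSeconds - queue.peek() >= MAX_AGE_SECONDS) {
                expired.add(queue.poll());
            }
        }
        return expired;
    }

    // Runs the removals outside the lock; failed removals are re-queued under the lock.
    static CompletableFuture<Void> removeAll(Queue<Long> queue, List<Long> expired,
            Function<Long, CompletableFuture<Boolean>> remover) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (Long created : expired) {
            pending.add(remover.apply(created).thenAccept(removed -> {
                if (!removed) {
                    synchronized (queue) {
                        queue.offer(created);
                    }
                }
            }));
        }
        return CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]));
    }
}
```

Capping the batch bounds how long the lock is held per run, at the cost of needing several scheduler runs to clear a large backlog.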
The containers in the containerQueue and those within the Docker daemon can get out of sync when the application crashes while some container removals are still pending. When the Spring Boot application restarts, the containerQueue will be empty and those containers will keep running within the daemon.
To prevent this, we include an additional method called stopAndRemoveGhostContainers that runs every 10 minutes. I’ve used the term ghost containers as they keep running in the background, hogging resources without any use. This method queries the Docker daemon for running containers and terminates those that have been running for longer than 10 minutes.
@Scheduled(fixedRate = 10, timeUnit = TimeUnit.MINUTES)
public void stopAndRemoveGhostContainers() {
    var currentTime = (double) System.currentTimeMillis() / 1000.0;
    var containers = client.listContainersCmd().exec();
    containers.sort(Comparator.comparing(Container::getCreated));
    for (var container : containers) {
        var timeDiffInMinutes = (currentTime - container.getCreated()) / 60.0;
        if (timeDiffInMinutes >= 10) {
            containerRemovalService.stopAndRemoveContainer(container);
        } else {
            break;
        }
    }
}
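The age check itself is easy to pull out and test without a Docker daemon. The sketch below filters instead of sorting and breaking early, which selects the same containers; GhostFinder and findGhosts are hypothetical names, and the map of container id to creation time (epoch seconds, as the code above treats getCreated()) stands in for the daemon's response.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class GhostFinder {
    static final long GHOST_AGE_SECONDS = 10 * 60;

    // Returns the ids of containers that are at least 10 minutes old, sorted for determinism.
    static List<String> findGhosts(Map<String, Long> createdById, long nowSeconds) {
        List<String> ghosts = new ArrayList<>();
        for (Map.Entry<String, Long> entry : createdById.entrySet()) {
            if (nowSeconds - entry.getValue() >= GHOST_AGE_SECONDS) {
                ghosts.add(entry.getKey());
            }
        }
        Collections.sort(ghosts);
        return ghosts;
    }
}
```

Keeping the selection as a pure function of the creation times and the current time makes the 10-minute rule straightforward to unit-test.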
Exceptions
Exceptions are handled through a @RestControllerAdvice, which centralizes exception handling. However, one shortcoming of this approach is that every method in the code flow needs a throws clause. This is fine as long as there are only a few exception types, but it can quickly become unmanageable. An alternative is to pass a result object through the service classes, or store it in the Spring context, and populate it whenever an error occurs in the code flow.
@RestControllerAdvice
public class ApplicationExceptionHandler {

    @ExceptionHandler({CodeSizeLimitException.class, TimeLimitException.class})
    public ResponseEntity<CodeResult> handleLimitExceptions(Exception exception) {
        CodeResult codeResult = new CodeResult.Builder().appendExceptions(exception.getMessage()).build();
        return new ResponseEntity<>(codeResult, HttpStatus.BAD_REQUEST);
    }

    @ExceptionHandler({ContainerNotCreatedException.class})
    public ResponseEntity<CodeResult> handleContainerNotCreatedException(Exception exception) {
        CodeResult codeResult = new CodeResult.Builder().appendExceptions(exception.getMessage()).build();
        return new ResponseEntity<>(codeResult, HttpStatus.SERVICE_UNAVAILABLE);
    }
}
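The result-object alternative mentioned earlier in this section could be sketched as follows. This isn't how the service is written; ExecutionOutcome and SizeCheckingService are hypothetical, and the container call is replaced by a placeholder.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical result object that collects errors instead of propagating checked exceptions.
final class ExecutionOutcome {
    private final List<String> errors = new ArrayList<>();
    private String stdout = "";

    void addError(String message) { errors.add(message); }
    void setStdout(String stdout) { this.stdout = stdout; }

    boolean isSuccess() { return errors.isEmpty(); }
    String getStdout() { return stdout; }
    List<String> getErrors() { return Collections.unmodifiableList(errors); }
}

final class SizeCheckingService {
    static final int LIMIT_BYTES = 1024 * 1024;

    // No throws clause: a failure is recorded on the outcome and the caller inspects it.
    static ExecutionOutcome execute(String rawCode) {
        ExecutionOutcome outcome = new ExecutionOutcome();
        if (rawCode.getBytes(StandardCharsets.UTF_8).length > LIMIT_BYTES) {
            outcome.addError("Code size exceeds 1MB");
            return outcome;
        }
        outcome.setStdout("(executed)"); // placeholder for the real container call
        return outcome;
    }
}
```

The trade-off is that the compiler no longer forces callers to deal with failures, so the controller has to remember to check isSuccess() before choosing the HTTP status.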
Conclusion
While a full-fledged platform like LeetCode involves tracking users, code execution metrics, stringent security mechanisms, leaderboards and several other features, this service captures the essence of the core code execution functionality.
You can check out the full code here: Remote Code Execution Service.