With version 21, Java got virtual (lightweight) threads. This feature has been in the works since 2017, and has received a lot of press over the years. But will it actually change the way you build applications?
The Benefit of Virtual Threads
Let’s start with some mythbusting: threads — of any sort — do not necessarily improve your program’s performance. You have a fixed number of CPU cores, so adding threads obviously won’t help with CPU-bound programs. Less obvious, threads won’t help an I/O-bound program either: once you saturate your network, you can’t move more bytes across the wire, no matter how many threads try to do it.
Where threads can help is if you have a mix of CPU and I/O. For a simple example, let’s say you have a data pipeline that downloads files, performs some CPU-intensive transformation operations on them, and then writes the transformed file somewhere else. If you separate the steps of this pipeline, then you can perform the CPU-intensive activity on one thread while you download files on a second and upload on a third. At some point you may saturate both CPU and I/O, but until that time you’ll improve performance.
That’s the benefit of threads in general, but why virtual threads?
Traditional (platform) threads are independent entities managed by the operating system: a thread starts, runs some code, and eventually stops. During that time, it takes up a slot in the operating system’s process table, and competes with other threads for resources. Most important, that thread continues to take up OS resources even when it’s waiting for something to do.
Virtual threads, by comparison, are managed in “user space”: the JVM schedules work on a (small) pool of platform threads that it owns. I/O operations that would normally block in the kernel, such as reading from a socket, are translated into non-blocking calls behind the scenes (indeed, I suspect this was by far the largest piece of work to make Java virtual threads a reality).
By doing this, virtual threads give you the following benefits over platform threads:
Eliminates execution overhead for short-lived threads
There are some tasks that are inherently parallel. In the field of data engineering, one example processing records from a data source. However, you wouldn’t want to spin up a platform thread to process each record, because you’d spend more time starting the thread than performing the task.
Virtual threads, because they do not involve the operating system, can (or at least should) start up extremely quickly. With them, it’s reasonable to process each record on a separate thread, as an independent task.
Eliminates memory overhead for thread stacks and allocation buffers
As I mentioned above, threads take up a slot in the operating system’s scheduler, and compete for resources even if they’re not actively doing something. But they also consume dedicated memory for their stack, and for the per-thread allocator in the JVM. In a program that spawns lots of threads, this is significant. And most of it is wasted.
Virtual threads don’t use a dedicated stack, and they perform allocations (as far as I can tell) using the underlying platform thread’s allocator. This saves physical RAM, and it also reduces resources used by the operating system for memory management.
Eliminates complex non-blocking code
Whenever your program performs some I/O operation, such as reading or writing a file or socket, it makes a request to the operating system kernel and waits for that request to complete. This wait may be extremely short: under a millisecond to read from an SSD. Or it may be extremely long: perhaps several seconds until a database query finishes and data starts to flow. In either case, these operations block the thread: it’s suspended, and its time slice is given to another thread. Not a problem if you only have a few threads, a big problem is you have hundreds or thousands.
Java has had non-blocking I/O for a long time. Socket channels, introduced with Java 1.4 in 2002, allowed one thread to manage multiple sockets, only calling the kernel when there was something to be done. Java 7, introduced in 2011, added asynchronous I/O for files and sockets, in which you define a callback handler that’s invoked when anything happens. The problem with these mechanisms is that they are complex to understand and use.
By comparison, virtual threads hide the complexity completely: as far as your code is concerned, it’s reading from a normal
java.net.Socket. Behind the scenes, the socket operation is submitted and then your lightweight thread is suspended until it completes. The platform thread that was used to run your virtual thread can then be used for a different virtual thread.
Eliminates thread pool deadlocks
Threadpools typically have a maximum number of threads (and if they don’t the operating system imposes a limit on the number of threads per process). Often, programs set the size of the pool as a multiple of the number of CPU cores. For independent tasks, this is a good solution: you can adjust the size of the pool to handle your expected workload, and eventually everything you submit to the pool will run.
Threadpool deadlock raises its ugly head when the tasks are not independent. For example, you might have a series of tasks, each submitting its successor and then waiting for results. As long as you have fewer tasks than threads, you’re OK. However, it’s all too easy to find yourself in a situation where the thread pool is completely deadlocked, with all of its tasks waiting for other tasks to complete.
The fork-join pool, introduced with Java 8, provides a partial solution to this problem: if task A spawns task B and then waits for it, the pool will run task B on the thread used by task A. But in the general case, where you have non-linear dependencies (or perhaps a message queue in the middle), the fork-join pool will become deadlocked just like a regular thread pool.
When looking for an example, I wanted something that demonstrated the benefits of virtual threads without being overly complex. This ruled out examples involving non-blocking I/O (you won’t see a difference without running hundreds or thousands of threads), as well as the sorts of data engineering transforms that I expect to be my primary use case (anything realistic would bury the threading code). I ended up with a parallel merge sort.
Merge sort is “embarrassingly parallel”: it divides the data to be sorted into two halves, independently sorts each half recursively, and then combines them. My
ThreadSort program, which sorts an array of
double values, turns each partition into a task, that it then runs on a thread pool. You can find the complete program here, but this is the relevant part:
// identify split point Future> future1 = executorService.submit(new Sorter(executorService, data, scratch, low, split)); Future> future2 = executorService.submit(new Sorter(executorService, data, scratch, split, high)); future1.get(); future2.get(); // merge the two halves
I kick off a child thread for each half of the array, and then the task blocks until they’re done. This code can’t run on a traditional thread pool, because it would quickly deadlock or run out of platform threads. It can be run on the fork-join thread pool, which executes either or both child tasks on the same thread as the parent. And it can be run on virtual threads, because the
Future.get() call will release the parent thread.
To create the thread pools, I used functions from the
newWorkStealingPool() for the fork-join pool, and
newVirtualThreadPerTaskExecutor() for the virtual thread pool. The former method uses the number of processors to control the number of platform threads used by the pool. The latter doesn’t expose a mechanism to control parallelism; I’ll assume that the JVM developers picked a reasonable default.
I also wrote an
Executor that ran its tasks inline, and ripped out all of the threading code to get a single-threaded baseline. I also ran the benchmark using the the built-in
Arrays.sort() function. This function uses QuickSort rather than MergeSort; it also has an optimization (using SelectionSort) for sub-arrays that fit in a single cache line.
I ran my tests on my laptop, which has an Intel Core i9-8950HK with 6 cores and 12 hyper-threads at 2.9 GHz; it runs Ubuntu 20.04. I didn’t make any special effort to keep the system quiet while running the tests, but neither did I do anything that would use excessive CPU. For a JVM, I used Amazon Corretto.
My test program executed each sort 60 times. I threw away the numbers from the first 10, to compensate for Hotspot warmup, and averaged the remaining 50. And I tried three different array sizes, so see if that had any effect on the outcome.
All times are milliseconds for a single iteration.
|Implementation||100,000 elements||1,000,000 elements||10,000,000 elements|
|Fork-join thread pool||4||43||555|
I wasn’t expecting those numbers. I thought that virtual threads would be about as fast as the fork-join pool (which it uses under the covers), not 10x slower. And I definitely didn’t expect the largest array to take 100x longer, when the size only increased by 10x. When I looked at the output of the Linux
time command I saw that all of my CPU cores were saturated, as expected. But I also saw a high “system” time value, indicating that virtual threads were making calls to the operating system; that was unexpected.
But, you may respond, that’s a CPU-intensive job! Virtual threads are intended for tasks that have blocking IO!
While I think that’s debatable, it brings up another important point: blocking can be a good thing! If you spin up thousands of virtual threads that each make a request to a server somewhere, then congratulations, you’ve just created a denial-of-service attack. Back-pressure is one of the key features of a well-written distributed application, and limited-size thread or connection pools are one of the easier ways to provide it. If you have an unbounded set of threads that make their own connections, then it becomes the responsibility of downstream servers to provide that back-pressure.
So, bottom line is that I think virtual threads don’t quite live up to the hype. Sure, if you want to write a server that can handle 10,000 concurrent connections, they can help you do that (but, really, why reinvent that particular wheel?). If you know that your downstream server can handle an unbounded number of connections, then they simplify the code to access that server. But for most use-cases, I think a bounded thread pool is the way to go.