Because it will be almost 6 months since I started flirting with the idea to base my thesis on HelenOS somehow, it is time to report some progress.
Actually, there are two reasons for this post. First, I think that something has been implemented and it is time to tell the community about it. And also I feel guilty after JS and JJ complained that they feel lack of communication from students participating on HelenOS. So here you have it ;-).
First about intended goal and target platform: my work is aimed at supporting ccNUMA (cache coherent NUMA) hardware on platform
amd64 (this is represented e.g. by servers with Opteron processors on NUMA capable motherboard). More precisely, my work is targeted at such (hardware) NUMA configurations where HelenOS is able to run without any NUMA support at all. This means that the address space is continuous across individual nodes and the node boundary is transparent to the OS (i.e. information about memory distribution on the nodes is obtained through other channels and is not necessary for correct memory usage and management), the same for CPUs and also that the hardware does not require any special instructions to ensure cache coherency. And, of course, the main goal is to defend my diploma thesis about making HelenOS NUMA-aware. My supervisor for the thesis is Mr. Děcký.
Now, a few words about simulators I use for development (so far I have not tried to run HelenOS on physical NUMA hardware). Most of the time, I use QEMU which is able to simulate NUMA machine, using the
-numa switch. Because it took me some time to understand the syntax of this switch, I post here example command that launches machine with 8 CPUs and 512M of memory, both CPU and memory distributed evenly on 4 nodes:
qemu-system-x86_64 -m 512 -smp 8 \ -numa node,mem=128,cpus=0-1 \ -numa node,mem=128,cpus=2-3 \ -numa node,mem=128,cpus=4-5 \ -numa node,mem=128,cpus=6-7 \ -boot d -cdrom ./image.isoSecond simulator I sometimes use is SimNow created by AMD which allows you to build interactively your own machine (including configuration of low level stuff such as northbridges and various other controllers) and run it. The downside is that the simulator is very precise (in the detail of the simulation) and thus extremely slow.
So, this was a bit of the background of the work and now about what was actually added to the source codes of HelenOS. My branch is registered at Launchpad, under HelenOS team as
bzr branch lp:~vojtech-horky/helenos/numaIf you want to try it, go ahead: check out the branch and in the configure dialog select
amd64from the preconfigured defaults and do not forget to check the
NUMA supportoption (I added several other options which are useful for debugging, I will remove them later). However, do not be disappointed after the machine boots up. There is nothing shiny and new to be seen - all changes are quite low level and currently the only way to see them is to try the
That was the theoretical background. Now a brief summary about what is actually implemented.
First of all, the hardware detection uses ACPI tables (
SLIT) and assigns to each CPU node id (node index). With memory, it is a bit more complicated. I decided not to add any other layer above existing zones (because that would mean changing a lot of code and, furthermore, each node is usually formed by single zone) and only split zones that run across node boundaries. On AMD64, this typically means that only on the first node is memory split into several zones (some of them are reserved) and all other nodes have only single zone. And that is actually all. There were several problems when to read the topology information from ACPI (I might write about it later), but it works now pretty well.
Next, I made some changes to kernel slab allocator. Here, I decided to add extra layer to the allocator (if you want to know the details of the implementation of the allocator, the one in HelenOS is very similar to the one introduced by Jeff Bonwick), that groups the magazines of CPUs of the same node. This is, of course, enabled only in NUMA build (
#ifdefs are used). So far, I have not done any benchmarking so I cannot report whether the change was worth the effort.
To allow more precise thread placement on nodes, I added support for thread affinity. It is nothing extra: each thread has bitmap of CPUs where setting/clearing bit allows the thread to run on respective CPU. The load balancer respects this setting and when the scheduler finds out that it is supposed to schedule thread that cannot run on current CPU, it migrates it away. By default, the thread can run on all CPUs.
For address space areas, I added similar thing plus something that could be called allocation policy. The affinity for AS areas determines from which nodes could be allocated frames that back the area. And the policy specifies how the allocation shall be done. So far, there is first touch (default, allocate from node where is currently running the thread that caused the fault-in) and interleave (allocate from all nodes in the affinity mask evenly, with precise ordering).
Last thing in kernel is automatic page migration (by the way, this is something I am really looking forward to measure). What is it about in a few words? When a thread is migrated on UMA system, its distance to the memory is not changed. However, when it is migrated to different node on NUMA system, it suddenly has the memory much farer. So, obvious remedy is to migrate the memory as well. However, copying huge amount of memory takes some time and it also may not be the best solution for multi threaded programs. Thus, I implemented something that could be called lazy migration: the pages are simply marked as not present and when they are accessed, the are migrated if necessary. As I stated at the beginning of this paragraph, I am very itchy about this and I will definitely write about this more.
And that is all for the kernel side. Still is missing some better load balancing that respects the fact that it is more expensive to migrate thread to different node. And, of course, I need to do some benchmarking. That will be real fun because I do not have many ideas how to create fair benchmarks that would measure performance gains and losses when the difference between accessing local and remote memory is almost imperceptible.
For userspace, there is almost complete port of libnuma API (the one from Linux). The implementation was written from scratch, but the API shall be compatible with the Linux version. There are, of course, some changes due to fact that Linux expects process tree (i.e. the relation parent/child process is well defined and widely used) but the core works. If you want to see it, the already mentioned
numa application has several tests and samples to show what is ready. What is still missing is port of the
numactl(1) that is used to set policies and affinities of new tasks. The problem is again that it expects
exec() style of launching new tasks. This will be probably solved by adding some optional methods to the loader.
Uff, that is all for now. I will try to write some kind of "progress report" on regular basis.