Linux performance, where's the beef?

Anyone who has had any experience measuring Linux servers under VM and has done the arithmetic might start to be a little suspicious about the claims of running 100's or 1000's of Linux servers on one system. And rightfully so. The reality is that other than one David Boyes (who has not exactly published performance data, but has run 1000's of Linux servers on one S/390 system), nobody is saying that they have actually succeeded in making this work well. In fact, if you listen closely to people talking about Linux under VM, there are lots of installations that have ported Linux applications to run on S/390 under VM, but this is not what we are interested in, is it? An installation running 3 or 5 Linux servers proves that it works, but not that it is great. And there are lots of installations that are complaining about Linux under VM performance. Enough that if you haven't heard, you are not listening....

The basic performance problem is in the arithmetic. It is really very simple. The issue is both processor and storage. Take the recommendations of running 128MB up to 256MB virtual machines. And from experience, on a G5 processor, running any kind of Linux (SuSE, Red Hat, etc.), an idle server requires about 0.3 percent of a processor. This is measured, and the results are on linuxvm.com if you are interested in seeing the numbers.

Idle Linux Processor Analysis: Take David Boyes's published numbers of running 9000 Linux servers on one system. If one Linux server requires 0.3 percent of a processor, forget about real work, what do 9000 servers require? 9000 times 0.3 percent is 2700 percent, or 27 processors doing nothing useful. How about a 30-way system? Oh, sure, that sounds cost effective, doesn't it? It's understood that a server doing work will consume processing resources, but you would hope the cost of an idle server is minimal - it's NOT.
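For anyone who wants to plug in their own numbers, the idle processor arithmetic can be sketched in a few lines of Python. The 0.3 percent figure is the G5 measurement quoted above; the function name is made up for illustration, and the model assumes idle cost simply adds up across guests.

```python
# Sketch of the idle-processor arithmetic. The 0.3 percent default is the
# article's G5 measurement; the function name is illustrative only.

def idle_processors_needed(servers, idle_percent_per_server=0.3):
    """Processor equivalents consumed by servers that are doing no work.

    Assumes the idle cost scales linearly with the number of guests.
    """
    return servers * idle_percent_per_server / 100.0

for count in (100, 1000, 9000):
    needed = idle_processors_needed(count)
    print(f"{count:5d} idle servers -> {needed:.1f} processors")
```

Substitute the idle percentage measured on your own processor model; the G5 number will not carry over to other hardware.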

Idle Linux Storage Analysis: Under the recommendations (which I'm pretty sure were not researched all that well, as nobody has published any data showing research), a Linux server keeps all of its working set intact, and therefore its storage resident. The root of the problem is that Linux wakes up 100 times per second to look for work. This is ok on a dedicated server, but under VM, it creates some side effects. There are a couple of issues with storage requirements under VM - VM has a history of being very efficient in terms of sharing storage and processor power. But with Linux, VM determines that the Linux guest is always active (any server that wakes up 100 times per second is active under any definition), and therefore VM leaves the guest's working set intact. A "recommendation" of a 128MB virtual machine leads to a working set of 128MB - this is a very large working set, especially for an idle virtual machine. The Velocity Software experience was with a single Linux server with a defined virtual machine of 128MB that was doing a simple task and totally consuming the processor storage of our 128MB system. Now, do the arithmetic: if you were to run 9000 guests at the (IBM) recommended value, how much storage would be required to hold the working sets of these 9000 idle servers? How about a terabyte just for idle user storage? Very likely? Not....
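The storage side of the idle arithmetic is just as short. This sketch assumes, as described above, that VM never trims an always-active guest, so the resident working set equals the defined virtual machine size. The function name is hypothetical.

```python
# Sketch: resident storage for idle guests that VM never trims.
# Assumes working set == defined virtual machine size, per the article.

def idle_resident_storage_mb(servers, vm_size_mb):
    """Total resident storage in MB if every guest keeps its whole VM resident."""
    return servers * vm_size_mb

for vm_size_mb in (128, 256):
    total_mb = idle_resident_storage_mb(9000, vm_size_mb)
    print(f"9000 guests at {vm_size_mb}MB -> {total_mb / 1024:.0f}GB resident")
```

At the low end of the recommendation this already comes out a little over a terabyte, before any guest does real work.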

Non-Idle Storage Analysis: Once you get beyond the problem of the idle user storage requirements, you need to size the requirements for storage for non-idle servers. Given a large virtual machine, Linux will try to cache as much of its data as possible - on dedicated servers with slow disks, this was appropriate. But the purpose of running multiple guests under VM is to share resources. Having every Linux waste large amounts of storage caching the same data is expensive; we're not talking PC storage here, we are talking about S/390 storage. How many active Linux servers can you support with say 2GB of memory? The arithmetic says something less than 20 with 128MB servers (2GB divided by 128MB is 16), unless you want to perform lots of paging I/O.
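A rough sizing helper for the non-idle case might look like the following. It assumes every active guest fills its whole virtual machine with cache, which is the behavior described above. The `reserved_for_cp_mb` parameter is a hypothetical allowance for storage that VM itself needs; the article gives no figure for it, so the 256MB used in the example is purely a placeholder.

```python
# Sketch: how many fully resident guests fit in real storage without paging.
# reserved_for_cp_mb is a hypothetical allowance, not a figure from the article.

def servers_without_paging(real_storage_mb, vm_size_mb, reserved_for_cp_mb=0):
    """Whole guests that fit if each one keeps its entire VM resident."""
    usable_mb = max(real_storage_mb - reserved_for_cp_mb, 0)
    return usable_mb // vm_size_mb

print(servers_without_paging(2048, 128))        # 2GB system, 128MB guests
print(servers_without_paging(2048, 128, 256))   # same, with a placeholder reserve
```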

The Solution(s): There are some alternative ways to run Linux under VM. And new kernel developments currently in test will help significantly. The one development that will help the most in reducing the requirements of idle users is 'no more jiffies'. This is a new scheduling mechanism that is interrupt driven instead of relying on the 100 times per second timer pop. This will reduce the idle processor requirement to maybe even nothing. Another benefit of Linux actually "going to sleep" is that the virtual machine will drop from queue, and VM will trim its working set - and page it out. So the resident storage requirement of an idle Linux virtual machine will not be 128MB, but maybe less than 100 pages (under 400KB at 4KB per page). Almost manageable.

How does David do it? When David Boyes talks about running several thousand Linux servers under VM, he is talking about work done without the benefit of the new kernel. There's no question about whether he has done this: he has. But he doesn't often talk about how he does it, just about what he does. Having been fortunate enough to buy him a couple of beers in Prague, I got some hints about what he does. First of all, running large virtual machines? Nope. Doesn't do that. 8MB, maybe up to 16MB. So now you've got at least 10 times more servers fitting in one machine, but even 9000 servers times 8MB of active storage will consume a 72GB system (wouldn't you like a system with 72GB of S/390 storage?). If 10% of them were active, and the rest were paged out, now we're talking about systems that are very effective. And as many existing dedicated servers run at less than 10% utilization, this is reasonable. But there is still the problem of paging out the idle servers.
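The effect of small virtual machines combined with paging out the idle ones can be sketched as below. The model is deliberately crude and is an assumption, not a measurement: active guests keep their whole virtual machine resident, and idle guests are paged out completely. The function and parameter names are illustrative.

```python
# Sketch: resident storage when only a percentage of guests is active.
# Crude model (an assumption): active guests are fully resident,
# idle guests are fully paged out.

def resident_mb(servers, vm_size_mb, active_percent=100):
    """Resident storage in MB for the active share of the guests."""
    active_servers = servers * active_percent // 100
    return active_servers * vm_size_mb

print(resident_mb(9000, 8))        # everything resident: the 72GB case
print(resident_mb(9000, 8, 10))    # 10% active, 8MB guests
print(resident_mb(9000, 16, 10))   # 10% active, 16MB guests
```

In practice a paged-out guest still leaves a few pages behind, so treat the results as a lower bound.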

How do you force idle servers to page out? The hints that David provided make sense. You need to force a service machine to go idle. VM has a scheduling function that dates back to the early HPO days of about 1983, on processors that were 15 MIPS (and often running over 1000 users in 16MB). This function is called Q-drop delay, which delays CP from performing storage analysis of a guest long enough to ensure the guest is really done. The purpose of this is to reduce the overhead associated with adding users to queue and then dropping them. This function was included in VM/XA and is still with us in z/VM - and there is no published research showing whether it is still valid on the machines of today with the workloads of today.

Forcing guests to drop from queue: CP won't drop a guest for 300ms, and Linux wakes up every 10ms, so an idle Linux never drops. That is the problem David worked around. You have to make some changes such that the idle Linux server will drop from queue. Installations in the past have modified VM to reduce the queue drop delay from 300ms to numbers such as 5ms. (Check our website for the CP mod.) There are two more tricks. The first is to change the HZ value such that Linux wakes up 12-16 times a second instead of 100. At this point, you have a working solution. A second, maybe more elegant trick is to invoke Linux's power management. Anybody who uses a laptop at 35,000 feet knows the value of power management - that is what allows the batteries to last several hours instead of running out of juice over the Atlantic. If you could invoke power management, then Linux would wake up even less often. You just have to figure out how to do this (an exercise left for the reader).
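The interaction between the Linux timer rate and the queue drop delay can be illustrated with a toy model. This is a simplification of my own, not how the CP scheduler is actually implemented: it assumes a guest drops from queue only when the idle gap between timer pops is longer than the queue drop delay, and it ignores the time Linux spends servicing each wake-up. The function names are hypothetical.

```python
# Toy model (an assumption, not CP's real algorithm): an idle guest drops
# from queue only if the gap between timer pops exceeds the queue drop delay.

def wake_interval_ms(hz):
    """Milliseconds between timer pops for a given Linux HZ value."""
    return 1000.0 / hz

def drops_from_queue(hz, queue_drop_delay_ms):
    """True if the idle gap is long enough for CP to drop the guest."""
    return wake_interval_ms(hz) > queue_drop_delay_ms

for hz, delay_ms in [(100, 300), (16, 300), (12, 300), (16, 5), (12, 5)]:
    verdict = "drops" if drops_from_queue(hz, delay_ms) else "stays in queue"
    print(f"HZ={hz:3d}, delay={delay_ms:3d}ms, "
          f"gap={wake_interval_ms(hz):.1f}ms -> {verdict}")
```

Under this model, lowering HZ to 12-16 is not enough on its own against a 300ms delay, since the gap is still well under 100ms. That is consistent with making the CP change as well.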

How many idle users can we support now? I have a bet with Rob Van der Heij that we can run 100 Linux servers on a 128MB P390. Results of this bet to be posted....

Another opportunity for reducing storage: Using small virtual machines has a second benefit of eliminating the opportunity for Linux to cache data unnecessarily. But there are times that Linux will need more storage than the 8MB or 16MB. If you talk to an old-time Linux or Unix administrator, they will tell you to avoid swap like the plague. Well, our technology is a bit better than a SCSI swap disk. Under VM, you could use a virtual disk. This has many benefits. First, to avoid overhead, Linux will only use the swap space if it is absolutely needed. But if it IS needed, it is available. When the need for additional storage is gone, VM will page the storage out to disk. Thus you have reduced the storage required to support the work. The experience on our system was to take a task that runs 5 hours in a 128MB virtual machine, run it in a 24MB virtual machine with a 100MB swap disk in a VM Virtual Disk, and the task ran in 45 minutes. Our system went from very storage constrained and paging a lot to not paging at all and not storage constrained. Not bad, eh?

How many non-idle users can we support now? That will depend on the amount of work being performed, which is the correct limitation.....