VM Scheduler Tips

The most common mistake installations make when running Linux servers under VM is to overcommit resources. (See Storage hints and tips.)

Thrashing

Thrashing is a situation caused by servers wanting more storage at one time than exists on a box. At the point where VM does not have enough storage to meet each server's needs, VM starts paging. Beyond a certain point, VM can spend more time paging than performing work; that point is what is meant by thrashing.

The VM scheduler has a very sophisticated mechanism to stop users from thrashing. It controls thrashing from three different perspectives: storage, paging, and processor.

The mechanism for controlling thrashing is to put users on an eligible list, meaning these users want to consume resources, but there are not enough to sustain them, so they will not be dispatched. When enough resource becomes available, or certain deadlines pass, the user is dispatched. This can make users look 'hung' - they are NOT hung; they are waiting until there is sufficient resource to run efficiently.
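To check whether users are on the eligible list rather than actually hung, the CP INDICATE QUEUES EXP command lists the members of the dispatch and eligible lists. The sample output below is illustrative only; the exact format varies by release:

```
indicate queues exp
LINUX001 Q3 PS  00098231/00104226 .I..  -0.391 A01
TCPIP    Q0 PS  00002154/00002154 .I..  -1.952 A00
LINUX007 E3     00152488/00160024
```

A user whose list class shows as E1, E2, or E3 is on the eligible list, waiting for resource rather than being dispatched.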

SRM Controls

There are three scheduler controls called the SRM controls, and these are user-tailorable with the 'SET SRM STORBUF', 'SET SRM LDUBUF', and 'SET SRM DSPBUF' commands.
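The current settings can be displayed with QUERY SRM. The output below is illustrative (format differs between releases); the values shown are the usual defaults:

```
query srm
LDUBUF : Q1=100% Q2=75% Q3=60%
STORBUF: Q1=125% Q2=105% Q3=95%
DSPBUF : Q1=32767 Q2=32767 Q3=32767
```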

SET SRM STORBUF

There have been many recommendations for each of these that are not practical upon analysis. When VM first implemented this scheduler, the immediate impact for some installations was very long eligible lists. It turned out the STORBUF control did not account for expanded storage, so the rule of thumb at the time was to increase STORBUF by the amount of expanded storage. That rule of thumb was then changed over time with little understanding of what was being accomplished. Most systems perform well with the default STORBUF settings.
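For reference, the default can be restated explicitly. The three operands are the percentages of storage the scheduler will let Q1+Q2+Q3, Q2+Q3, and Q3 users claim; 125% 105% 95% are the defaults:

```
set srm storbuf 125% 105% 95%
```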

SET SRM XSTORE

A SET SRM XSTORE command was added that allows you to also tell the STORBUF function there is more storage; setting this to 50% if and when you have expanded storage became a reasonable guideline.
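Following that guideline, telling the scheduler to count half of expanded storage looks like this:

```
set srm xstore 50%
```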

SET SRM LDUBUF

If you have eligible lists and you do not have expanded storage, you are likely paging a lot. This is when the LDUBUF control takes effect. The scheduler controls how many users are allowed to concurrently page their working sets into storage, based on how many paging devices are defined. In this architecture, a paging device can support only one page-in at a time; other requests for that device wait on a queue. If too many page requests queue up, you will find VM spending more time paging than working - a thrashing situation.

The default LDUBUF already encourages a thrashing situation. If you are thrashing and you raise LDUBUF, you will in effect reduce the amount of work that gets done. Several case studies showed this; see one of Velocity Software's performance days handouts if you can find one.
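For reference, the command takes three percentages of the paging exposures allowed to Q1+Q2+Q3, Q2+Q3, and Q3 loading users; 100% 75% 60% are the defaults, so the command below simply restates them:

```
set srm ldubuf 100% 75% 60%
```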

SET SRM DSPBUF

The last control is DSPBUF. If you are brave (or foolish) enough to play with this without understanding what it does, and without tools to evaluate its impact in real time, please ensure you have set QUICKDSP on your own user first. This command can modify your operating environment very quickly and can force an eligible list when there is no need for one - so quickly that you will likely find your own user(s) on the eligible list and, at that point, unable to enter commands.
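The safety step above, issued from the user you will experiment from (many CP commands accept '*' to mean the issuing user; if yours does not, give your userid explicitly):

```
set quickdsp * on
```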

SET QUICKDSP

Assuming you retain the use of the eligible list to stop your system from thrashing, you should use the 'SET QUICKDSP' option for your important servers. This option is set for a service machine and allows the server to bypass the governors set by the CP scheduler, so the server is always dispatched when necessary, even in a thrashing situation. Users such as TCPIP and your more important production Linux servers should have this option. Your test servers should not.
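QUICKDSP can be set dynamically, or made permanent with the OPTION statement in the server's CP directory entry. TCPIP here is just the example server named in the text:

```
set quickdsp tcpip on
```

To make it survive a logoff/logon, add OPTION QUICKDSP to the server's directory entry instead of (or in addition to) issuing the command.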



Recommendations

In most cases, the defaults have proven VERY effective at managing your system's resources. Modifying these commands is very often detrimental to performance. Using QUICKDSP and scheduling SHAREs is almost always a better approach than the quick fix.