The most common mistake made by installations running VM, and
Linux servers under VM, is to overcommit resources. (See
Storage hints and tips).
Thrashing
Thrashing is a situation caused by servers collectively wanting
more storage at one time than exists on the box. Once VM does
not have enough storage to meet each server's needs, VM starts
paging. Beyond a certain point, VM spends more time paging
than performing work. That point is what is meant by
thrashing.
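
The definition above can be put into numbers with a small sketch.
This is only an illustration of the "more time paging than working"
test, not how CP measures anything; the function names and the
millisecond figures are assumptions made up for the example.

```python
def paging_fraction(work_ms: float, page_faults: int, page_in_ms: float) -> float:
    """Fraction of elapsed time spent paging rather than doing work.

    All inputs are hypothetical measurements for one interval.
    """
    paging_ms = page_faults * page_in_ms
    total_ms = work_ms + paging_ms
    return paging_ms / total_ms if total_ms else 0.0


def is_thrashing(work_ms: float, page_faults: int, page_in_ms: float) -> bool:
    """True when more time goes to paging than to useful work."""
    return page_faults * page_in_ms > work_ms


if __name__ == "__main__":
    # 100 ms of work with 20 page-ins at 10 ms each: 200 ms of paging.
    print(paging_fraction(100, 20, 10))
    print(is_thrashing(100, 20, 10))
```
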
The VM scheduler has a very sophisticated mechanism to
stop users from thrashing. It controls thrashing from
three different perspectives: storage, paging and processor.
The mechanism for controlling thrashing is to put users on
an eligible list - meaning these users want to consume resource,
but there is not enough resource to sustain them, so they will
not be dispatched. When there is enough resource or certain
deadlines pass, the user will be dispatched. This can look
like users are 'hung' - they are NOT hung, they are waiting until
there is sufficient resource to run efficiently.
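
A much simplified sketch of the eligible list idea follows. It
models storage only, admits users in arrival order while their
working sets fit, and leaves the rest waiting. The real scheduler
also weighs paging, processor and deadlines, none of which is
modeled here; the function name, user names and page counts are
hypothetical.

```python
def admit(users, available_pages):
    """Split users into a dispatch list and an eligible list.

    users: list of (name, working_set_pages) in arrival order.
    A user is dispatched only if its working set still fits in the
    remaining storage; otherwise it waits on the eligible list.
    """
    dispatch_list, eligible_list = [], []
    committed = 0
    for name, working_set in users:
        if committed + working_set <= available_pages:
            dispatch_list.append(name)
            committed += working_set
        else:
            # Not hung: waiting until enough resource is free.
            eligible_list.append(name)
    return dispatch_list, eligible_list


if __name__ == "__main__":
    servers = [("LINUX1", 400), ("LINUX2", 500), ("LINUX3", 300)]
    running, waiting = admit(servers, available_pages=1000)
    print("dispatched:", running)
    print("eligible:  ", waiting)
```

In this toy model LINUX3 sits on the eligible list until one of
the other servers releases storage, which is the behavior that can
look like a 'hung' user.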
SRM Controls
There are three controls, called
SRM controls, and these are user tailorable with the
'SET SRM STORBUF', 'SET SRM LDUBUF', and 'SET SRM DSPBUF'
commands.
SET SRM STORBUF
There have been many recommendations for each of these that are not practical upon analysis. When
VM first implemented this scheduler, the immediate impact for
some installations was very high eligible lists. It turns out
the STORBUF control did not account for expanded storage. The
immediate rule of thumb was to increase STORBUF by the amount
of expanded storage. This rule of thumb then was changed over
time with little understanding of what was being accomplished.
Most systems perform well with the default STORBUF settings.
SET SRM XSTORE
A SET SRM XSTORE command was added that allows a user to
also tell the STORBUF function there is more storage - and
setting this to 50% if and when you have expanded storage
became a reasonable guideline.
SET SRM LDUBUF
If you have eligible lists, and you do not have expanded
storage, you are likely paging a lot. This is when the LDUBUF
control takes effect. The scheduler controls how many users are
allowed to be concurrently paging their working sets into storage
based on how many paging devices are defined. In this architecture,
a paging device can only support one page-in at a time;
other requests for that device wait on a queue. If too many
page requests queue up, you will find VM spending more time doing
paging than working - this would be a thrashing situation.
The default LDUBUF encourages a thrashing situation. If you
are thrashing and you raise the LDUBUF, you will in effect reduce
the amount of work you will get done. Several case studies
have shown this; see one of Velocity Software's performance daze
handouts, if you can find one.
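
To see why queued page requests hurt, here is a rough sketch of the
one-page-in-per-device rule described above. It assumes requests
are spread evenly across devices and that every page-in takes the
same time, which real systems will not match; the function name
and the 10 ms service time are assumptions for illustration.

```python
import math


def time_to_drain(page_requests: int, paging_devices: int, service_ms: float) -> float:
    """Elapsed ms until the last queued page-in completes.

    Each device handles one page-in at a time, so the deepest queue
    holds ceil(requests / devices) entries, served one after another.
    """
    if paging_devices <= 0:
        raise ValueError("at least one paging device is required")
    queue_depth = math.ceil(page_requests / paging_devices)
    return queue_depth * service_ms


if __name__ == "__main__":
    for requests in (4, 12, 40):
        print(requests, "requests on 4 devices:", time_to_drain(requests, 4, 10.0), "ms")
```

Allowing more users to page concurrently on the same number of
devices only deepens the queues in this model, so each user waits
longer for its working set.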
SET SRM DSPBUF
The last control is DSPBUF. If you are brave (or foolish) enough
to play with this without understanding what it does and without
tools to evaluate its impact in real time, please ensure you have
set QUICKDSP for your own user first. This command can modify
your operating environment very quickly and can force an eligible
list when there is no need for one, so quickly that you will
likely find your own user(s) on the eligible list and at that
point unable to enter commands.
Processor
SET QUICKDSP
Assuming you retain the use of the eligible list to stop your
system from thrashing, you should use the 'SET QUICKDSP' option
for your important servers. This option is set for a service machine
and allows the server to bypass all governors set by the CP
scheduler. In this way, the server is always dispatched when
necessary, even under a thrashing situation. Users such as TCPIP
and your more important production Linux servers should have this
option. Your test servers should not.
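
As a small convenience, the commands for a list of important
servers can be generated rather than typed. The sketch below only
builds the command text in the form 'SET QUICKDSP userid ON'; it
does not issue anything to CP. TCPIP comes from the text above,
while LINPROD1 and the function name are hypothetical.

```python
def quickdsp_commands(important_servers):
    """Return one 'SET QUICKDSP userid ON' command per server.

    User IDs are upper-cased; the list of servers is supplied by the
    caller and is an assumption of this example.
    """
    return [f"SET QUICKDSP {userid.upper()} ON" for userid in important_servers]


if __name__ == "__main__":
    # Production servers only; test servers are deliberately left out.
    for command in quickdsp_commands(["tcpip", "linprod1"]):
        print(command)
```
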