1
Managing Linux on z/VM (2008)
  • Barton@VelocitySoftware.com
  • HTTP://VelocitySoftware.com
  • HTTP://LinuxVM.com
2
Topics
  • Velocity Software
  • Performance Management Infrastructure
    • Performance Analysis
    • Operational Alerts
    • Capacity Planning
    • Accounting/Charge back


  • Importance of technology
    • z/VM technology
    • Linux (and SUN, NT, AIX, etc.) Agent technology
  • Showcase Demonstration, Live


3
Velocity Software - Business
  • Founded 1988 to provide VM Performance Software and Services (Big party at SHARE, Summer 2008)
    • ESAMAP,ESAMON
    • ESATCP, ESAWEB
    • zTUNE

  • Performance Workshops, Education
    • Next performance workshop JUNE
    • Performance seminars scheduled often
    • The best marketing is education


4
Velocity Software – What we do
  • IBM Partner in Development since 1989
  • Participate in IBM's VM Early Support Programs
    • Every VM Early Support Program since 1988 (XA, ESA, z)
  • Relationship with IBM’s Linux lab in Boeblingen
  • Performance research
    • Customer problems
    • Redbooks
  • Conference participation to present research
    • SHARE
    • GSE
    • CMG
    • Local VM/Linux user groups
5
Performance Resources
6
zLinux Level Set
  • This is a SHARED resource environment
    • z/VM performance is critical
    • Any one server can impact all applications
  • This is not z/OS
    • This is not a mature environment
    • Some metrics are not yet available
  • This is not a distributed environment
    • We do not have cycles to waste
    • We DO have capacity planning, chargeback requirements


  • Tools are needed that understand the environment


7
Linux Infrastructure Requirements
  • Instrumentation Requirements
    • Performance Analysis
    • Operational Alerts
    • Capacity Planning
    • Accounting/Charge back
  • Correct data (Virtual Linux CPU data wrong)
  • Capture ratios
  • Instrumentation is NOT the performance problem
8
Infrastructure Requirements: Performance Analysis

  • Why Performance Analysis:   Service Levels.
    • Diagnose problems real time
    • Manage Shared resource environment
    • Any application may impact other applications


  • Infrastructure Requirements
    • Analyze all z/VM Subsystems in detail, real time
      • (DASD, Cache, Storage, Paging, Processor, TCPIP)
    • Analyze Linux
      • (applications, processes, processor, storage, swap)
    • Historical view of same data important
      • Why are things worse today than yesterday?
      • Did adding new workload affect overall throughput?
9
Infrastructure Requirements: Capacity Planning

  • Why Capacity Planning: Future Service Levels
    • How many more servers can you support with existing z9?
    • What are the capacity requirements for an application?
    • Avoid crises in advance
    • Consolidation Planning – Projecting requirements of the next 1000 servers


  • Infrastructure Requirements
    • Performance database (long term)
    • z/VM AND Linux data
    • Resource requirements by Server, Application, User
    • z/VM and z/Linux data must be usable by existing planners
    • Interface to MICS, MXG, CIMS, TDS


10
Infrastructure Requirements: Accounting and Chargeback

  • Why Chargeback?
    • Distributed chargeback model is by server
    • Shared chargeback model is by resource utilized
    • Convincing customers to move applications to “z”
    • Encourages efficient/effective resource use


  • Infrastructure Requirements
    • Identify Resource by server
    • Identify Resource by Linux Application
    • High capture ratio
    • Every site does it differently, so flexible data is key
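
The "by resource utilized" model above can be sketched as a proportional split of a shared cost. This is a minimal illustration, not how any particular accounting product works: the function name and the total_cost figure are hypothetical, and the CPU percentages are the per-server totals shown in the ESALNXA sample later in this deck.

```python
def allocate_charges(cpu_by_server, total_cost):
    """Split total_cost across servers in proportion to measured CPU use."""
    total_cpu = sum(cpu_by_server.values())
    if total_cpu <= 0:
        raise ValueError("no CPU consumption recorded")
    return {server: total_cost * cpu / total_cpu
            for server, cpu in cpu_by_server.items()}


# CPU percent totals per server (from the ESALNXA sample report in this deck).
usage = {"LINUX16": 16.9, "LINUX13": 2.7, "LINUX15": 1.8}

# total_cost is a made-up figure, purely for illustration.
charges = allocate_charges(usage, total_cost=1000.0)
for server, charge in sorted(charges.items()):
    print(f"{server}: {charge:.2f}")
```

A split like this only adds up if the measured consumption accounts for nearly all of the resource actually used, which is why a high capture ratio is listed as a requirement.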



11
Infrastructure Requirements: Operational Alerts

  • Operational Requirements
    • Operations will manage 100’s (1000’s) of servers
      • Requires active performance management
    • Alerts for processes in loops, disks 90% full, missing processes
    • One test server in a loop impacts all other servers


  • Infrastructure Requirements
    • Fast problem detection
    • Interface to SNMP management console (HP, IBM, CA)
    • User tailored alerts
    • Web based alerts
12
Data Requirement Summary

  • Performance data requirements
    • Valid, correct – CPU data typically wrong or very wrong.
      • Linux getting better with SLES10/RHEL5
    • z/VM and Linux data integrated?
    • Helpful in solving problems?
    • Validate benefits of tuning
  • Historical data requirements
    • Capacity Planning input
    • Problem Analysis
    • Linux
    • z/VM
  • Accounting / Charge back
    • By server, by application, by process, by Linux userid
  • Manage Infrastructure cost
    • Turning off agent solves the performance problem?
13
Velocity Linux Performance Suite
14
z/VM Performance
15
Linux Performance Data?
  • Linux (and networks) adds requirement
    • Correct data
    • Complete data
    • Low cost data


  • Support requirements:
    • z/VM 3.x, 4.x, 5.1, 5.2, 5.3, next….
    • SLES 7,8,9,10 (Installations still have 7 and 8)
    • RHEL 3,4,5
    • Other distributions
    • Other platforms
  • Must support:
    • Performance tuning
    • Capacity planning
    • Operational alerts
    • Chargeback/Accounting
16
Correct Linux Performance Data?
  • Valid and Correct?
    • Process data from Linux under z/VM is wrong
      • All process accounting based on timer ticks
      • Corrected in SLES 10, RHEL5
    • TOP, ALL other agents “lie” when under z/VM
    • Sample case: data wrong by a factor of 10
      • Well known issue since 2001
      •  HTTP://velocitysoftware.com/present/CaseAFS


  • Leads to solving performance problems?
    • z/VM owns the shared resources
    • “Native” tools will not detect many problems
    • “performance was unexplainably bad so we abandoned the project”
    • Skills, experience and Education help…
17
Instrumentation Issues
  • Operational cost of agents
    • Does your agent use 2%? 5%? 95%? of a processor per image?
    • Does this matter on distributed servers where agents were created?
    • Will local data collection fill up your file system?
    • Does turning off performance monitoring solve the performance problem?
    • Do you only turn on your agent when you have a problem???


    • Customer quote: “An agent that costs 1% of a processor will cost me 10 IFLs”
  • Agents must provide correct data
    • Is your data correct? Or wrong by order of magnitude?
    • Prior to SLES10/RHEL5, all “Virtual” agents provide wrong data
    • Why collect bad data?
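
The arithmetic behind the customer quote can be sketched as follows. The server count of 1000 is an assumption inferred from the quote itself (1% per server adding up to 10 IFLs); the function name is hypothetical.

```python
def agent_cost_in_processors(servers, agent_cpu_percent):
    """Total processors consumed by agents across all servers.

    agent_cpu_percent is the share of one processor the agent uses per image.
    """
    return servers * agent_cpu_percent / 100.0


# Assumed server count, implied by the quote (1% per server -> 10 IFLs).
SERVERS = 1000

# Per-image agent costs taken from the questions on this slide.
for pct in (1.0, 2.0, 5.0):
    cost = agent_cost_in_processors(SERVERS, pct)
    print(f"{pct}% per image x {SERVERS} servers = {cost:.0f} processors")
```

On a distributed server the same 1% is invisible, since it comes out of cycles that would otherwise be idle; in a shared environment it is multiplied by every image.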
18
Network, Linux Instrumentation
  • Performance Data infrastructure existed (ESAMON/ESAMAP)
    • PDB already existed for performance analysis and Capacity Planning
    • Data presentation tools existed


  • Data source needed for Linux and Network:
    • Passive agent (do not measure idle servers)
    • Low overhead (want to monitor 100 / 1000 servers under z/VM)
      • Most Agents developed for Intel did not care about overhead
    • Open Source (fast development time)
    • Standard interface


  • SNMP: Standard interface
    • TCPIP application provided by TCPIP Vendor
    • Used to collect network, host data from NT, SUN, HP
    • NETSNMP available for Linux - Meets all requirements
      • (Distributed with RHEL 3,4,5 SLES 7,8,9,10)
19
Competing Agent Technologies
  • NETSNMP
    • Default from Red Hat or SUSE uses about 1% CPU
    • Velocity Software version uses less than 0.1%
    • Velocity Software version for idle server: 0.01%
    • Currently installed on >10,000 z/Linux servers
      • (Actually, installed on all of them, but used on >10,000)
  • RMFPMS (IBM’s direction 2003)
    • Active agent, writes data to log
    • Not recommended because of overhead
  • New “Monitor Record” (IBM’s direction 2005)
    • zLinux only, non-standard
    • No process data
    • CPU data cannot be corrected
    • What problem are we trying to solve?
  • Proprietary agents
    • Written for Intel or other Unix platforms, CPU cost didn’t matter
    • Can be Expensive
    • Ask for references for “z”
20
Linux and Network Data Acquisition
21
Operational Costs
  • Low cost agent - Cost of snmpd very low (0.1%-0.4%)
  • (Objective: determine what process spikes at 1am Monday morning)
  • See “http://velocitysoftware.com/applic.html” for full listing (24 Linux servers)


  • Report: ESALNXA      LINUX HOST Application Report
  • ----------------------------------------------------
  • Node/    Process/    ID    <---Processor Percent--->
  • Date     Application             <Process><Children>
  • Time     name              Total sys  user syst usrt
  • -------- ----------- ----- ----- ---- ---- ---- ----
  • 00:15:57
  • LINUX16  *Totals*        0  16.9  2.5 11.6  1.9  1.1
  •          amqpcsea      674   0.4  0.1  0.3    0    0
  •          amqzxma0      600   0.8  0.1  0.7  0.0  0.0
  •          cron          473   2.1  0.2  0.2  1.7  0.0
  •          dsmc          938   0.1  0.0  0.0  0.0  0.0
  •          httpd       31993   2.8  0.2  2.5  0.0  0.1
  •          java        32066   8.0  1.3  6.7    0    0
  •          kjournal       85   0.1  0.1    0    0    0
  •          kswapd          6   0.1  0.1    0    0    0
  •          qpea         4642   0.1  0.0  0.1    0    0
  •          qpmon        4674   0.8  0.1  0.7  0.0    0
  •          snmpd         361   0.1  0.1  0.0    0    0 <=====
  •          sshd          370   1.0  0.0    0  0.1  0.9
  • LINUX13  *Totals*        0   2.7  0.8  0.3  0.6  1.0
  •          cron          421   1.2  0.0  0.0  0.5  0.7
  •          init            1   0.2  0.0  0.0  0.0  0.1
  •          master        394   0.3  0.0  0.1  0.0  0.1
  •          ntpd          453   0.8  0.6  0.2    0    0
  • LINUX15  *Totals*        0   1.8  0.3  0.5  1.1  0.0
  •          amqzxma0      844   0.2  0.0  0.1    0    0
  •          cron          457   1.1  0.0  0.0  1.1  0.0
  •          qpmon        4726   0.1  0.0  0.1    0    0
  •          snmpd         354   0.4  0.2  0.2    0    0 <======
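
A report like this can be scanned mechanically to answer the "what process spikes" question. The sketch below is an assumption-laden illustration, not a documented interface: it parses a few rows copied from the listing above by whitespace, treats any row whose second field is `*Totals*` as the start of a node, and reports each node's largest consumer alongside the snmpd cost. The function names are hypothetical.

```python
# A few rows copied from the ESALNXA listing above (bullets removed).
REPORT = """\
LINUX16  *Totals*        0  16.9  2.5 11.6  1.9  1.1
         httpd       31993   2.8  0.2  2.5  0.0  0.1
         java        32066   8.0  1.3  6.7    0    0
         snmpd         361   0.1  0.1  0.0    0    0
LINUX15  *Totals*        0   1.8  0.3  0.5  1.1  0.0
         cron          457   1.1  0.0  0.0  1.1  0.0
         snmpd         354   0.4  0.2  0.2    0    0
"""


def parse_report(text):
    """Return {node: {process_name: total_cpu_percent}}.

    Processes sharing a name on one node overwrite each other in this sketch.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[1] == "*Totals*":
            # Node header row: node name, *Totals*, id, then five CPU columns.
            current = fields[0]
            nodes[current] = {}
        elif len(fields) == 7 and current is not None:
            # Process row: name, id, total, then four CPU columns.
            nodes[current][fields[0]] = float(fields[2])
    return nodes


def top_process(processes):
    """Return (name, total_cpu_percent) for the largest consumer."""
    return max(processes.items(), key=lambda item: item[1])


nodes = parse_report(REPORT)
for node, processes in nodes.items():
    name, pct = top_process(processes)
    print(f"{node}: top={name} ({pct}%), snmpd={processes.get('snmpd')}%")
```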
22
Process Capture Ratio
  • High CPU capture ratio
  • Report: ESALNXV      LINUX Virtual Processor Analysis Report
  • -----------------------------------------------------------------
  • Node/     VM       <Linux Pct CPU> <Process  Data> Capture Prorate
  •  Name     ServerID Total Syst User Total Syst User   Ratio Factor
  • --------- -------- ----- ---- ---- ----- ---- ---- ------- ------
  • 10:03:00
  • NEALE1    LNEALE1  100.0 11.4 88.6 100.2 11.5 88.7   1.002  1.000
  • -----------------------------------------------------------------




  • Report: ESALNXP      LINUX HOST Process Statistics Report
  • ---------------------------------------------------------
  • node/     <-Process Ident-> Nice <------CPU Percents---->
  •  Name     ID    PPID   GRP  Valu  Tot  sys user syst usrt
  • --------- ----- ----- ----- ---- ---- ---- ---- ---- ----
  • 10:03:00
  • NEALE1        0     0     0    0  100 0.43 3.35 11.0 85.4
  •  kswapd0    100     1     1    0 0.12 0.12    0    0    0
  •  snmpd     1013     1  1012  -10 0.13 0.03 0.10    0    0
  •  sh        3653  3652 30124    0 52.7    0    0 9.37 43.3
  •  gmake     9751  9750 30124    0 43.4 0.02 0.02 1.37 42.0
  •  sh       10129  9751 30124    0 0.02 0.02    0    0    0
  •  sh       10130 10129 30124    0 0.63 0.03 0.23 0.28 0.08
  •  cc1      10307 10306 30124    0 3.12 0.18 2.93    0    0
  •  rpmbuild 30124 16382 30124    0 0.07 0.03 0.03    0    0
  •  sh       30125 30124 30124    0 0.02    0 0.02    0    0
  •  gmake    30126 30125 30124    0 0.02    0 0.02    0    0
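
The capture ratio in ESALNXV appears consistent with dividing the CPU accounted to processes by the CPU Linux reports in total (100.2 / 100.0 = 1.002). As a rough check, and assuming that definition, the sketch below sums the per-process "Tot" column from the ESALNXP rows listed above and divides by the Linux total. The variable names are hypothetical; the numbers are copied from the two reports.

```python
# (name, process id, total CPU percent) from the ESALNXP rows above.
# The NEALE1 node-total row is deliberately excluded.
PROCESS_ROWS = [
    ("kswapd0", 100, 0.12),
    ("snmpd", 1013, 0.13),
    ("sh", 3653, 52.7),
    ("gmake", 9751, 43.4),
    ("sh", 10129, 0.02),
    ("sh", 10130, 0.63),
    ("cc1", 10307, 3.12),
    ("rpmbuild", 30124, 0.07),
    ("sh", 30125, 0.02),
    ("gmake", 30126, 0.02),
]

# Linux total CPU percent for NEALE1 from the ESALNXV report.
LINUX_TOTAL_PCT = 100.0

# Assumed definition: capture ratio = process-level CPU / Linux total CPU.
process_total = sum(total for _name, _pid, total in PROCESS_ROWS)
capture_ratio = process_total / LINUX_TOTAL_PCT

print(f"process total: {process_total:.2f}%")
print(f"capture ratio: {capture_ratio:.3f}")
```

A ratio close to 1.0 means nearly all of the consumed CPU can be attributed to a specific process, which is what makes chargeback by application or process feasible.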
23
ESALPS (Linux Performance Suite)
24
zTUNE
  • New installations lack z/VM and Linux on z/VM tuning skills
  • Velocity Software’s objective is to ensure our customers’ performance problems are resolved quickly.
  • zTUNE includes configuration guidance, health checks whenever an installation requests them, and assistance in all areas of Linux on z/VM and z/VM performance


25
Health Checker for z/VM, Linux: zTUNE
  • Focus more now on simplifying problem resolution
  • Customer reports that application people are complaining about zLinux performance:


    • Report: ESATUNE      Tuning Recommendation Report
    • Monitor initialized:                      on 2084 serial 9ABED
    • ---------------------------------------------------------------
    • The following changes are suggestions by Velocity Software
    •  to enhance performance of this system.
    • However, Velocity Software takes no responsibility -
    •  all tuning is the responsibility of the installations.
    • Please call 650-964-8867 if you have any questions about
    •  these values, or suggestions on report enhancements.
    • USR2 User LINUX160 is paging excessively (75.0 per second)
    •      This user can be protected using SET RESERVED
    • SPL5 Spool utilization is 100% full.
    •      Perform Spool file analysis and purge large
    •      spool files, or force users currently writing
    •      excessively to spool.


    • *****zTUNE Evaluation   *************
    • XAC1 User total PROCESSOR WAIT excessive at 33 percent.
    •      Current reporting threshold set to 20.
    •      This is percent of inqueue time waiting for
    •      specific (PROCESSOR)resources to become available.
    • LPR3 LPAR share is too low, causing USER CPU Wait
    •      VM LPAR allocated share: 0.94 percent of total
    •      VM LPAR used 389 percent of allocated share
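
The two LPR3 figures can be combined to see what the LPAR actually consumed. This is a small worked calculation under one assumption: that "389 percent of allocated share" is relative to the 0.94 percent allocation, so multiplying the two gives the LPAR's use as a percent of the total. The function name is hypothetical.

```python
def lpar_use_percent_of_total(allocated_share_pct, used_pct_of_allocation):
    """Convert 'percent of allocated share used' into percent of the total."""
    return allocated_share_pct * used_pct_of_allocation / 100.0


# Values from the LPR3 message in the ESATUNE report above.
used = lpar_use_percent_of_total(allocated_share_pct=0.94,
                                 used_pct_of_allocation=389)
print(f"LPAR used about {used:.2f} percent of the total")
```

Under that reading the LPAR is running at roughly four times its guaranteed share, which only works while other LPARs leave capacity unused; this fits the processor wait reported by XAC1.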
26
Point and click Analysis
27
Linux Storage
28
Add Enterprise Support
  • VM
  • CP Monitor
29
Linux Operational Support
  • Alerts
    • User tailorable
    • 3270 based, web based, and / or SNMP
    • Alerts can be set on any variable or calculated variable


  • Linux alert examples:
    • Disk full
    • Missing processes (requires complete data)
    • Looping processes (requires correct data)


  • z/VM alert examples
    • Page/spool space full (avoid abends)
    • Looping servers
    • DASD service times


  • Network alert examples
    • Transport errors
    • ICMP rates
    • Bandwidth thresholds
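
The three Linux alert examples can be sketched as simple threshold checks on one collected sample. This is an illustration only, not the product's alert definition syntax: the sample layout, function name, and the looping threshold are hypothetical, while the 90% disk threshold comes from the operational requirements slide earlier in this deck.

```python
def evaluate_alerts(sample, required_processes,
                    disk_full_pct=90.0, loop_cpu_pct=95.0):
    """Return a list of alert messages for one server sample.

    sample["disks"] maps mount point -> percent used.
    sample["processes"] maps process name -> CPU percent.
    loop_cpu_pct is an assumed threshold for a suspected loop.
    """
    alerts = []
    for mount, used_pct in sample["disks"].items():
        if used_pct >= disk_full_pct:
            alerts.append(f"disk {mount} is {used_pct:.0f}% full")
    running = sample["processes"]
    # Missing-process detection only works if process data is complete.
    for name in required_processes:
        if name not in running:
            alerts.append(f"process {name} is missing")
    # Loop detection only works if CPU data is correct.
    for name, cpu_pct in running.items():
        if cpu_pct >= loop_cpu_pct:
            alerts.append(f"process {name} may be looping ({cpu_pct:.0f}% CPU)")
    return alerts


# Hypothetical sample for one server.
sample = {
    "disks": {"/": 45.0, "/var": 92.0},
    "processes": {"snmpd": 0.1, "java": 99.0},
}
alerts = evaluate_alerts(sample, required_processes=["snmpd", "httpd"])
for message in alerts:
    print(message)
```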
30
Linux Storage Case Study
  • Linux tries to use all real storage


  • Linux minimizes storage used for swap
    • Swap historically was a slow SCSI device
    • In one Vdisk experiment, Linux swapped 40,000 per second


  • First case study:
    • Process took hours, system paged significantly
    • Reduced size of Linux virtual machine from 128MB to 24MB
    • Defined 100MB Swap disk
    • Linux reduces storage requirement
    • Process took minutes


  • Virtual Disk paged out when not in use
    • This works!!!  Paging greatly reduced, Linux performance greatly improved!!!
  • This research critical to using Collaborative Memory Mgmt (CMM)
31
LINUX Swapping to VDISK
  • Change 128MB Server to 24MB with 100MB Swap
  • Reduction of Overall Storage Requirements of 100MB
    • Unused VDISK is paged out
32
Tailoring Linux Storage
33
Analyzing Linux CPU
34
Analyzing Linux CPU
35
Analyzing Linux CPU by Application
36
Analyzing Linux CPU by Userid
37
Analyzing Linux Disks
38
Storage Map, z/VM 5.2
  • Storage map  - CAPTURE RATIOs always critical for any instrumentation:
    • CP Fixed Storage
    • CP Non Pageable
      • Free storage (only VMDBLKs)
      • Frame tables
    • Dynamic Paging Area(DPA)
      • System Execution Space
      • User storage, MDC, Address Space, Vdisk
      • Available List (greater/less than 2gb)



  • Report: ESASTR1      Main Storage Analysis               Velocity Software, Inc.       ESAMAP 3.6.0 05/15/06  Page  57
  • Monitor initialized: 06/06/05 at 08:42:16 on 2064 serial 11542     First record analyzed: 06/06/05 08:42:42
  • -------------------------------------------------------------------------------------------------------------------------
  •          Users <-----------------------------Pages---------------------------------------------------------->
  •          Loggd System  Fixed Non-  Free Frame <Available> Systm  User  NSS/DCSS  <-AddSpace> VDISK <MDC> Diag    Capt-
  • Time        On Storage Store Pgble Stor Table <2gb  >2gb  ExSpc Resdnt Resident  Systm User  Rsdnt Rsdnt  98     Ratio
  • -------- ----- ------- ----- ----- ---- ----- ----- ----- ----- ------ --------  ----- ----- ----- ----- ----    -----
  • 08:45:42    22 7864304  2907  3816    5 61440  513K 6292K 33150 778370     8408   1090  6235     0  133K  333    0.995
  • **************************************************Summary**********************************************************
  • Average:    22 7864304  2907  3816    5 61440  513K 6292K 33150 778370     8408   1090  6235     0  133K  333    0.995
39
ESALPS Measurement Summary
  • ESALPS Meets Data Requirements:
    • Sufficient for performance, capacity planning, accounting, Operations
    • Linux and z/VM data – Integrated
    • Complete and correct data
  • ESALPS Meets Infrastructural requirements
    • Support all releases (SLES7,8,9,10, RHEL 3,4,5, z/VM V3,4,5…)
    • Standard interfaces
    • Low resource requirements


  • ESALPS References (many):
    • Many installations instrument hundreds of servers today on single LPARs


  • zTUNE (Health Check for z/VM, Linux)
    • zTUNE “http://velocitysoftware.com/products.html”

  • Performance Education:
    • Performance education, see: “http://velocitysoftware.com/workshop.html”