The most common reasons for Dynamic Logical Partitioning failures
Question
I need to add or remove processors, memory, or I/O devices from an LPAR using the HMC. It is not working, and I would like to know why.
Cause
Most likely, the cause is an RMC connection failure between the HMC and the LPAR.
Answer
This is a starting point for troubleshooting problems with Dynamic Logical Partitioning.
The procedures listed below apply to POWER4, POWER5, and POWER6 HMCs.
The most common cause is an RMC connection failure between the HMC and the LPAR.
The first place to check is the HMC, using query commands at the HMC restricted shell command prompt.
# lspartition -dlpar
If there is no output at all, there is an RMC problem affecting all LPARs attached to this particular HMC. In that case it is OK to close all Serviceable Events (under Service Focal Point) and reboot the HMC:
# hmcshutdown -r -t now
Once the HMC reboots, wait about 15 minutes and re-run
# lspartition -dlpar
If there is still no output, it is recommended to open a call with technical support.
In order for RMC to work, port 657 (TCP and UDP) must be open in both directions between the HMC public interface and the LPARs.
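As a quick connectivity sketch (not part of the original procedure), the TCP half of that requirement can be probed with bash's built-in /dev/tcp redirection. This assumes a host with bash and the coreutils timeout command (AIX's default ksh lacks /dev/tcp, so run it from a Linux host or substitute telnet); the HMC address is a placeholder, and UDP 657 cannot be verified this way.

```shell
# Hedged sketch: probe TCP reachability of the RMC port (657) using
# bash's /dev/tcp. Requires bash and the coreutils "timeout" command;
# UDP 657 cannot be checked this way.
check_rmc_port() {
  host=$1; port=${2:-657}
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "OPEN: $host tcp/$port"
  else
    echo "CLOSED/UNREACHABLE: $host tcp/$port"
  fi
}
# Example (placeholder HMC address):
# check_rmc_port 9.3.55.192
```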
Next, check the partition in question. For DLPAR to function, the partition must be returned by the query with the correct IP address of the LPAR, the Active value must be greater than zero, and the DCaps value must be greater than 0x0.
Example of a working LPAR:
<#1> Partition:<11*9117-570*10XXXX, correct_hostname.domain, correct_ip> Active:<1>, OS:<AIX, 5.3, 5.3>, DCaps:<0x3f>, CmdCaps:<0xb, 0xb>, PinnedMem:<146>
Example of a non-working LPAR:
<#9> Partition:<10*9117-570*10XXXX, hostname, ip> Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>
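As an illustrative sketch (the parsing is mine, not part of the official procedure, and it assumes the output format shown above), saved `lspartition -dlpar` output can be screened mechanically for partitions that fail these checks:

```shell
# Hedged sketch: flag lspartition -dlpar records whose Active count is 0
# or whose DCaps value is 0x0. The sample line is the non-working
# example from above.
sample='<#9> Partition:<10*9117-570*10XXXX, hostname, ip> Active:<0>, OS:<, , >, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<0>'
result=$(echo "$sample" | awk '
  match($0, /Active:<[0-9]+>/) {
    active = substr($0, RSTART + 8, RLENGTH - 9)
    match($0, /DCaps:<0x[0-9a-fA-F]+>/)
    dcaps = substr($0, RSTART + 7, RLENGTH - 8)
    if (active == 0 || dcaps == "0x0")
      print "BAD: Active=" active " DCaps=" dcaps
    else
      print "OK: Active=" active " DCaps=" dcaps
  }')
echo "$result"
```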
If you see the condition in the second example above (and DLPAR is working for other LPARs on this HMC), the next step is to check the RMC status from the LPAR (AIX root access will be needed).
# lssrc -a | grep rsct
 ctcas            rsct             inoperative
 ctrmc            rsct             inoperative
 IBM.ERRM         rsct_rm          inoperative
 IBM.HostRM       rsct_rm          inoperative
 IBM.ServiceRM    rsct_rm          inoperative
 IBM.CSMAgentRM   rsct_rm          inoperative
 IBM.DRM          rsct_rm          inoperative
 IBM.AuditRM      rsct_rm          inoperative
 IBM.LPRM         rsct_rm          inoperative
This example output shows all the RSCT daemons as inoperative. In many cases, some daemons will be active and some missing. The key daemon for dynamic logical partitioning is IBM.DRM.
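That check can also be scripted; this is an illustration only, assuming the `lssrc` column layout shown above (the sample text stands in for live output from an LPAR where IBM.DRM happens to be running):

```shell
# Hedged sketch: report whether the IBM.DRM subsystem shows as active
# in `lssrc -a` output. The sample text stands in for live output.
lssrc_output='IBM.DRM          rsct_rm          1234567      active
IBM.HostRM       rsct_rm                       inoperative'
if echo "$lssrc_output" | grep -q '^IBM\.DRM .*active'; then
  drm_state=active
else
  drm_state=inoperative
fi
echo "IBM.DRM is $drm_state"
```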
*Update 1/13/2011* Beginning with csm.client 1.7.1.0, IBM.DRM only becomes active when it is needed. After running the rmcctrl or recfgct commands discussed below, IBM.DRM, if it starts successfully, will remain active for only five to ten minutes before it stops. The best way to check for a good RMC connection from the LPAR is therefore to run "lsrsrc IBM.ManagementServer" AFTER recycling ctrmc with rmcctrl or rebuilding with recfgct. The output returns a "resource" for each HMC or other type of management server, such as a CSM server.
# lsrsrc IBM.ManagementServer
Resource Persistent Attributes for IBM.ManagementServer
resource 1:
        Name             = "9.3.55.192"
        Hostname         = "9.3.55.192"
        ManagerType      = "HMC"
        LocalHostname    = "9.3.55.166"
        ClusterTM        = "9078-160"
        ClusterSNum      = ""
        ActivePeerDomain = ""
        NodeNameList     = {"myhost"}
resource 2:
        Name             = "9.3.55.193"
        Hostname         = "9.3.55.193"
        ManagerType      = "HMC"
        LocalHostname    = "9.3.55.166"
        ClusterTM        = "9078-160"
        ClusterSNum      = ""
        ActivePeerDomain = ""
        NodeNameList     = {"myhost"}
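Scripted, the same check amounts to counting ManagerType = "HMC" entries. This sketch is illustrative only, assuming the attribute layout shown above (the sample is a trimmed copy of that output):

```shell
# Hedged sketch: count management-server resources that report
# ManagerType = "HMC" in saved `lsrsrc IBM.ManagementServer` output.
lsrsrc_output='resource 1:
        Name        = "9.3.55.192"
        ManagerType = "HMC"
resource 2:
        Name        = "9.3.55.193"
        ManagerType = "HMC"'
hmc_count=$(echo "$lsrsrc_output" | grep -c 'ManagerType *= *"HMC"')
echo "HMC management-server resources: $hmc_count"
```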
An appropriate way to stop and start RMC without erasing the configuration is to use the following commands:
# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p
Check the daemon states.
# lssrc -a | grep rsct
Is IBM.DRM active now? If so, the problem may have been resolved.
Go back to the HMC restricted shell command prompt:
# lspartition -dlpar
The partition should now show the correct hostname and IP, Active:<1>, and a DCaps value such as 0x3f. These values mean the partition is capable of a DLPAR operation.
* Other notes * An LPAR cloned from a mksysb image may still carry the RMC configuration of the mksysb source. In this case, IBM.DRM is shown as active even though DLPAR operations may still fail.
* Using the recfgct command *
recfgct deletes the RMC database, does a discovery, and recreates the RMC configuration.
In many cases, where the LPARs are not configured for one of the special purposes below, recfgct is safe to use. There are cases where you would not use recfgct: for example, if the LPAR is a CSM Management Server, or if the LPAR has RMC Virtual Shared Disks (VSDs). VSDs are usually found only in very large GPFS clusters. If you are using VSDs, these filesets will be installed on your AIX system: rsct.vsd.cmds, rsct.vsd.rvsd, rsct.vsd.vsdd, and rsct.vsd.vsdrm.
# lslpp -L | grep vsd
If there is no output, you are not using VSDs.
The other, rarely used, application that can be interrupted by recfgct (though without significant consequences) is CSM: that is, if the node is a CSM management node or CSM client node. All AIX LPARs should have these filesets:
# lslpp -L | grep csm
  csm.client      1.7.0.10    C     F    Cluster Systems Management
  csm.core        1.7.0.10    C     F    Cluster Systems Management
  csm.deploy      1.7.0.10    C     F    Cluster Systems Management
  csm.diagnostics 1.7.0.10    C     F    Cluster Systems Management
  csm.dsh         1.7.0.10    C     F    Cluster Systems Management Dsh
  csm.gui.dcem    1.7.0.10    C     F    Distributed Command Execution
If you have additional filesets that start with csm, such as csm.server, csm.hpsnm, csm.ll, or csm.gpfs, you may have an LPAR that is part of a larger CSM cluster. The csm.server fileset should only be installed on a CSM Management Server. The following are a few additional checks you can perform to see whether a management server is configured.
# csmconfig -L
(if csmconfig is not found, this is not a CSM server)
# lsrsrc IBM.ManagementServer
This lists the resources that manage the LPAR, including the HMC and/or a CSM server. Look at the ManagerType field: ManagerType = "CSM" means this is a CSM node.
If it turns out the node is a CSM management server, you would have to re-add all the nodes after running recfgct. If the system is a CSM client node, you would need to get onto the management server and re-add the node.
That is it for the warnings on recfgct. If you think you might be using VSDs and/or a CSM cluster but are not sure, please open a PMR and support can assist you in determining this.
Assuming you have no reason to be concerned about the warnings discussed above, proceed:
# /usr/sbin/rsct/install/bin/recfgct
Wait several minutes
# lssrc -a | grep rsct
If you see IBM.DRM active, you have probably resolved the issue.
# lsrsrc IBM.ManagementServer
Check whether the output contains this entry:
ManagerType = "HMC"
Try the dlpar operation again. If it fails, then you will likely need to open a software PMR.
The other main reason for a DLPAR failure is that the LPAR has reached its minimum or maximum limit on processors or memory.
Note: the partition profile does not give a true picture of the current running configuration. If the profile was edited, but the partition did not go down to a "Not Activated" state and get reactivated, the profile edits have not been read.
To check the current running configuration, check the Partition Properties instead of the profile properties. There you will see the minimum, maximum, and current values. You cannot add or remove processors or memory outside these boundaries. The commands to check the running properties from the HMC restricted shell are listed here:
# lssyscfg -r sys -F name
(you need the value of name for use with the -m flag on many HMC commands)
# lshwres -r proc -m <server_name> --level lpar
(this lists just the LPAR-level processor settings)
# lshwres -r proc -m <server_name> --level sys
(this lists the system-level processor settings for the entire server)
If you are checking for memory, replace “proc” in the above commands with “mem”
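To make the boundary rule concrete, here is an illustrative sketch: given one comma-separated record from `lshwres -r proc -m <server_name> --level lpar -F lpar_name,curr_min_procs,curr_procs,curr_max_procs`, it decides whether a requested processor count stays within the running minimum and maximum. The -F attribute names are assumptions that can vary by HMC level, and the record below is made-up sample data.

```shell
# Hedged sketch: validate a requested processor count against a
# partition's running min/current/max, parsed from one comma-separated
# lshwres record. Attribute names are assumptions; sample data below.
record='myhost,1,2,4'
requested=6
IFS=, read -r name min cur max <<EOF
$record
EOF
if [ "$requested" -lt "$min" ] || [ "$requested" -gt "$max" ]; then
  verdict="REJECTED: $requested is outside $min-$max for $name"
else
  verdict="OK: $name can change from $cur to $requested processors"
fi
echo "$verdict"
```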
DLPAR can fail for many reasons, and it may be necessary to contact Remote Technical Support. However, the above may solve your problem.