EquipmentChassisThermalThresholdCritical – BUG Hit Cisco UCS Chassis

Two weeks ago I was working on a DC deployment with a large number of Cisco UCS chassis carrying all of our core infrastructure services: Active Directory, Exchange servers, SAP application servers running on RHEL, backup servers, and so on.

After carefully reviewing the power and cooling options in the DC and choosing N+1 grid power, we started UAT (User Acceptance Testing). One of the tests was to check what happens when we pull the IOMs and fans out of the chassis.

At that point we did not know there was a bug in this area that could hit us as well. It was, of course, not in any release notes. So let me tell you what the bug is and how it can hunt you down.

Once you remove the IOM and push it back into its position, you may see a Critical alert pop up for almost every chassis on which you ran the same test. The alert is the EquipmentChassisThermalThresholdCritical fault from the title of this post.

Looking at this, anyone would suspect an issue with the DC power and cooling, and I am no superhuman who dreams of bugs in his sleep 🙂

So we started investigating the DC power and cooling and concluded that there was no problem at all, at least from those two perspectives. We also tried to find the threshold range, but could not (can anyone out there help me with this?).

In the end we decided to look at the logs. I found that one IOM was seeing the fans as unknown/missing while the other IOM saw them as “OK”. TAC explained that there are known issues triggered by multiple fans being pulled out at the same time, or by one of the IOMs having a stuck thermal condition.

For chassis 3:

IOM 1 :

maxfans:                             8
fan[1].fault/read/req:   3/0/100                # UNKNOWN
fan[2].fault/read/req:   3/0/100                # UNKNOWN
fan[3].fault/read/req:   3/0/100                # UNKNOWN
fan[4].fault/read/req:   3/0/100                # UNKNOWN
fan[5].fault/read/req:   1/0/100                # MISSING
fan[6].fault/read/req:   1/0/100                # MISSING
fan[7].fault/read/req:   1/0/100                # MISSING
fan[8].fault/read/req:   3/0/100                # UNKNOWN
nblades:              8
blade[1].present/policy_state: 2/1         # PRESENT/COOL
blade[2].present/policy_state: 2/1         # PRESENT/COOL
blade[3].present/policy_state: 2/1         # PRESENT/COOL
blade[4].present/policy_state: 2/1         # PRESENT/COOL
blade[5].present/policy_state: 2/1         # PRESENT/COOL
blade[6].present/policy_state: 2/1         # PRESENT/COOL
blade[7].present/policy_state: 2/1         # PRESENT/COOL
blade[8].present/policy_state: 2/1         # PRESENT/COOL
IOM.RWTEMPB: 44
IOM_THERM: 1  # COOL
PEER_STATUS:                   1                              # ACTIVE
PEER_IOM_THERM: 1      # COOL

IOM 2:

maxfans:                             8
fan[1].fault/read/req:   0/30/30                # OK
fan[2].fault/read/req:   0/30/30                # OK
fan[3].fault/read/req:   0/30/30                # OK
fan[4].fault/read/req:   0/30/30                # OK
fan[5].fault/read/req:   0/30/30                # OK
fan[6].fault/read/req:   0/30/30                # OK
fan[7].fault/read/req:   0/30/30                # OK
fan[8].fault/read/req:   0/30/30                # OK
nblades:              8
blade[1].present/policy_state: 2/1         # PRESENT/COOL
blade[2].present/policy_state: 2/1         # PRESENT/COOL
blade[3].present/policy_state: 2/1         # PRESENT/COOL
blade[4].present/policy_state: 2/1         # PRESENT/COOL
blade[5].present/policy_state: 2/1         # PRESENT/COOL
blade[6].present/policy_state: 2/1         # PRESENT/COOL
blade[7].present/policy_state: 2/1         # PRESENT/COOL
blade[8].present/policy_state: 2/1         # PRESENT/COOL
IOM.RWTEMPB: 45
IOM_THERM: 1  # COOL
PEER_STATUS:                   2                              # PASSIVE
PEER_IOM_THERM: 1      # COOL
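If you have many chassis to check, the comparison I did by eye above can be scripted. Here is a minimal Python sketch, assuming you have already saved each IOM's dump as text. The fault codes (0 = OK, 1 = MISSING, 3 = UNKNOWN) are taken only from the annotations in the dumps above; the function names and the shortened sample dumps are my own, for illustration.

```python
import re

# Fault codes as annotated in the dumps above; any other value is reported raw.
FAULT_NAMES = {0: "OK", 1: "MISSING", 3: "UNKNOWN"}

# Matches lines such as: fan[5].fault/read/req:   1/0/100
FAN_LINE = re.compile(r"fan\[(\d+)\]\.fault/read/req:\s*(\d+)/(\d+)/(\d+)")


def parse_fans(dump):
    """Map fan index -> fault name for every fan line in one IOM's dump."""
    fans = {}
    for match in FAN_LINE.finditer(dump):
        index, fault = int(match.group(1)), int(match.group(2))
        fans[index] = FAULT_NAMES.get(fault, "CODE_%d" % fault)
    return fans


def disagreements(iom1, iom2):
    """Return the fans whose state differs between the two IOMs."""
    return {
        i: (iom1.get(i, "ABSENT"), iom2.get(i, "ABSENT"))
        for i in sorted(set(iom1) | set(iom2))
        if iom1.get(i) != iom2.get(i)
    }


# Shortened sample dumps (hypothetical excerpts in the same format as above).
iom1_dump = """
fan[1].fault/read/req:   3/0/100
fan[5].fault/read/req:   1/0/100
"""
iom2_dump = """
fan[1].fault/read/req:   0/30/30
fan[5].fault/read/req:   0/30/30
"""

diff = disagreements(parse_fans(iom1_dump), parse_fans(iom2_dump))
for fan, (state1, state2) in diff.items():
    print("fan %d: IOM 1 sees %s, IOM 2 sees %s" % (fan, state1, state2))
```

Any fan that shows up in the output is one where the two IOMs disagree, which is the symptom we saw on chassis 3.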

TAC also told us that the issue matches defect CSCtx52556, and that many other known defects involving transient thermal/fan issues have been marked as duplicates of it. The workaround is to reseat the IO module. You can find the bug details at the link below.

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?caller=pluginredirector&method=fetchBugDetails&bugId=CSCtx52556

They also confirmed that the permanent fix for this defect is included in version 2.0(3c), whereas we were running firmware version 2.0(3a).

Finally, we upgraded the firmware to version 2.0(4a), which was released on 18th September, and that resolved the issue.
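If you want to check a list of chassis against the fixed release, the comparison can be sketched in a few lines of Python. This is only a sketch: it assumes version strings of the form 2.0(3a) and assumes they order by major, minor, maintenance number and then patch letter, which holds for the three versions mentioned in this post but is my assumption in general. The function names are hypothetical.

```python
import re

# Accepts "2.0(3a)" as well as "2.0 (3a)" with a space before the parenthesis.
VERSION = re.compile(r"^\s*(\d+)\.(\d+)\s*\((\d+)([a-z])\)\s*$")


def parse_version(text):
    """Turn '2.0(3a)' into a comparable tuple (2, 0, 3, 'a')."""
    match = VERSION.match(text)
    if match is None:
        raise ValueError("unrecognised version string: %r" % text)
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance), patch)


def has_fix(running, fixed_in="2.0(3c)"):
    """True if the running version is at or above the release carrying the fix."""
    return parse_version(running) >= parse_version(fixed_in)


print(has_fix("2.0(3a)"))  # the version we were running
print(has_fix("2.0(4a)"))  # the version we upgraded to
```

With the versions from this post, the first call reports that the fix is missing and the second reports that it is present.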

 

About Prasenjit Sarkar

Prasenjit Sarkar is a Product Manager at Oracle for their Public Cloud with primary focus on Cloud Strategy, Oracle Openstack, PaaS, Cloud Native Applications and API Platform. His primary focus is driving Oracle’s Cloud Computing business with commercial and public sector customers; helping to shape and deliver on a strategy to build broad use of Oracle’s Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings such as Compute, Storage, Java as a Service, and Database as a Service. He is also responsible for developing public/private cloud integration strategies, customer’s Cloud Computing architecture vision, future state architectures, and implementable architecture roadmaps in the context of the public, private, and hybrid cloud computing solutions Oracle can offer.