Major BUG Hit Cisco UCS Chassis – Beware before Updating Firmware 2.0 (4a)

When we were notified about the EquipmentChassisThermalThresholdCritical bug, tracked as CSCtx52556, TAC told us that the permanent fix would be available in the next firmware release "or later", which is 2.0(3c).

So we decided to update to the latest release, 2.0(4a), which of course includes that fix as well. We scheduled the activity over the weekend to avoid any downtime on our core infrastructure. But, as I always say, expect the unexpected 🙂

As a precaution, we updated the firmware one endpoint type at a time rather than pushing it to every endpoint at once. We started with the CIMC, then the Adapters, and after that the IOMs for all of our Chassis.

And then the unexpected happened: the update process got stuck at 66% for two Chassis and their IO Modules. I thought it might just need more time and would eventually complete, but even after two hours it had not moved at all. This is what I saw in the Faults tab:

<faultInst
ack="no"
cause="poll-update-status-failed"
changeSet=""
code="F16654"
created="2012-09-23T09:24:06"
descr="[FSM:STAGE:RETRY:]: waiting for IOM update(FSM-STAGE:sam:dme:MgmtControllerUpdateIOM:PollUpdateStatus)"
dn="sys/chassis-3/slot-2/mgmt/fault-F16654"
highestSeverity="warning"
id="407554"
lastTransition="2012-09-23T10:37:26"
lc=""
occur="14"
origSeverity="warning"
prevSeverity="cleared"
rule="fsm-updateiom"
severity="warning"
status="created"
tags="fsmstageretry"
type="fsm">
</faultInst>
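
If you want to keep an eye on this kind of fault without clicking through the GUI, the same faultInst objects can be pulled over the UCSM XML API. Below is a minimal sketch (this is not something TAC supplied); the hostname and credentials are placeholders, so adapt them and the filter to your environment.

# Minimal sketch: query UCSM over the XML API for active faults and flag the
# stuck-IOM FSM retry rule shown above. Host and credentials are placeholders.
import requests
import xml.etree.ElementTree as ET

UCSM_URL = "https://ucsm.example.com/nuova"   # hypothetical UCSM address

def ucsm_call(body):
    # The XML API takes a single XML document POSTed to /nuova.
    resp = requests.post(UCSM_URL, data=body, verify=False, timeout=30)
    resp.raise_for_status()
    return ET.fromstring(resp.text)

# Log in and grab the session cookie.
login = ucsm_call('<aaaLogin inName="admin" inPassword="password" />')
cookie = login.get("outCookie")

try:
    # Pull every fault instance and look for the fsm-updateiom rule.
    faults = ucsm_call(
        f'<configResolveClass cookie="{cookie}" classId="faultInst" '
        'inHierarchical="false" />'
    )
    for fault in faults.iter("faultInst"):
        if fault.get("rule") == "fsm-updateiom":
            print(fault.get("dn"), fault.get("severity"), fault.get("descr"))
finally:
    ucsm_call(f'<aaaLogout inCookie="{cookie}" />')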

As a basic troubleshooting step I did a software reset of the IOM, but it did not help. Then, with the help of the on-site hands and feet, I did a hardware reset of those IOMs, and that did not help either. The modules seemed to be stuck in a continuous loop: first they showed up as Inaccessible, then after a while as Identify, and then the cycle repeated.

We also tried swapping the IOMs into different bays to see if that would help. It did not, and instead threw a Fabric Port Problem / Inoperable error.

By now I understood this was something serious and out of the ordinary. We filed an SR with TAC and got someone extremely helpful. I must admit Cisco TAC are very sensible and act on a situation pretty quickly.

They listened to what we had already done and what the issue might be, took some time to discuss it internally with their engineering team, and then came back with those three magical words: "it's a BUG". It was filed only recently, so it is not publicly visible yet, but it should be updated soon.

http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCuc15009

At this point, if you run the command below, you will see that the cluster is unstable if the affected Chassis are among those providing the cluster HA capability.

# show cluster extended-state
Cluster Id: 0xfbd07576c07611e1-0x8919547fee9a4a04

Start time: Sun Sep 23 12:26:21 2012
Last election time: Sun Sep 23 12:40:01 2012

A: UP, PRIMARY
B: UP, SUBORDINATE

A: memb state UP, lead state PRIMARY, mgmt services state: UP
B: memb state UP, lead state SUBORDINATE, mgmt services state: UP
heartbeat state PRIMARY_OK

INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP

HA DOWNGRADED
HA not ready on peer Fabric Interconnect
Detailed state of the device selected for HA storage:
Chassis 1, serial: XXX, state: active with errors
Chassis 2, serial: XXX, state: active
Chassis 3, serial: XXX, state: active with errors

Fabric B, chassis-seeprom local IO failure:
XXX OPEN_FAILED, error: GENERAL, error code: -1, error count: 370
Fabric B, chassis-seeprom local IO failure:
XXX OPEN_FAILED, error: GENERAL, error code: -1, error count: 373
Warning: there are pending I/O errors on one or more devices, failover may not complete
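
If you would rather watch for this condition from a script than from an interactive session, the sketch below does the same check: it runs show cluster extended-state over SSH from local-mgmt and looks for the HA DOWNGRADED string. The IP address and credentials are placeholders, and it assumes SSH access to the Fabric Interconnect management address.

# Rough sketch: SSH to the Fabric Interconnect, run the same command as above
# from local-mgmt, and flag a degraded cluster. Host/credentials are placeholders.
import time
import paramiko

FI_MGMT_IP = "192.0.2.10"   # hypothetical FI management address

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(FI_MGMT_IP, username="admin", password="password")

shell = client.invoke_shell()
shell.send(b"connect local-mgmt\n")
time.sleep(2)                       # crude wait for the local-mgmt prompt
shell.send(b"show cluster extended-state\n")
time.sleep(2)
output = shell.recv(65535).decode(errors="ignore")
client.close()

if "HA DOWNGRADED" in output:
    print("Cluster HA is degraded -- check chassis SEEPROM / IOM state")
print(output)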

As a safety measure, we failed the cluster over to the other Chassis so that we could take corrective action more easily.

V001FI48-A# connect local-mgmt
V001FI48-A(local-mgmt)# show cluster state
Cluster Id: 0xfbd07576c07611e1-0x8919547fee9a4a04

A: UP, PRIMARY
B: UP, SUBORDINATE

HA READY
V001FI48-A(local-mgmt)# cluster lead b
Cluster Id: 0xfbd07576c07611e1-0x8919547fee9a4a04
V001FI48-A(local-mgmt)#

This fails the cluster over to the subordinate Fabric Interconnect and drops the connection, so we had to log in to the SSH session again. When we checked, HA had failed over to the remaining good Chassis :). But our original problem was still there.

# show cluster extended-state
Cluster Id: 0xfbd07576c07611e1-0x8919547fee9a4a04

Start time: Sun Sep 23 12:39:58 2012
Last election time: Sun Sep 23 12:53:31 2012

B: UP, PRIMARY
A: UP, SUBORDINATE

B: memb state UP, lead state PRIMARY, mgmt services state: UP
A: memb state UP, lead state SUBORDINATE, mgmt services state: UP
heartbeat state PRIMARY_OK

INTERNAL NETWORK INTERFACES:
eth1, UP
eth2, UP

HA READY
Detailed state of the device selected for HA storage:
Chassis 1, serial: XXX, state: inactive
Chassis 2, serial: XXX, state: active
Chassis 3, serial: XXX, state: inactive

The next steps were engineering steps, performed by Cisco themselves. Although I know each command, I am not allowed to reveal them here. I can, however, tell you what happened and how they resolved it.

  1. They asked us to upload a debug firmware image for the same release version to UCSM.
  2. They logged in over SSH, copied the debug image to volatile memory, and entered the emergency mode.
  3. They already knew the issue: when the Fabric Interconnect pushes the firmware image to the IOM, the IOM mounts an /altflash location, receives the update images there, and starts the installation.
  4. The /altflash mount point was 96% full, and because of that the update was stuck in a loop (see the sketch after this list).
  5. They removed two files from the mount point to free up space, which triggered the software install process.
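
We cannot publish the engineering commands, but the check itself is simple enough to illustrate. The sketch below is purely illustrative (it is not the TAC procedure, and the sample df line is made up): it parses df-style output and flags any mount point, such as /altflash, that is nearly full.

# Illustration only: parse "df -P"-style output and warn when a mount point
# (here /altflash, where the IOM stages its update images) is nearly full.
def mounts_over_threshold(df_output, threshold=90):
    """Return (mount, use%) pairs whose usage meets or exceeds the threshold."""
    full = []
    for line in df_output.strip().splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        use_pct = int(fields[4].rstrip("%"))
        mount = fields[5]
        if use_pct >= threshold:
            full.append((mount, use_pct))
    return full

# Made-up sample line, roughly what a 96% full /altflash would look like.
sample = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/mtdblock3 65536 62915 2621 96% /altflash"""
print(mounts_over_threshold(sample))   # [('/altflash', 96)]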

# show_swupdate_progress
Update progress for Image 1:

Package being processed:  basepkg.sh
Stage:                    Installing package
Completed packages:       0/7
Completed bytes:          0/34945130

# show_swupdate_progress
Update progress for Image 1:

Package being processed:  ciscowoodsidepkg.sh
Stage:                    Installing package
Completed packages:       1/7
Completed bytes:          3254550/34945130

# show_swupdate_progress
Update progress for Image 1:

Package being processed:  cmcapppkg.sh
Stage:                    Installing package
Completed packages:       2/7
Completed bytes:          21052595/34945130

# show_swupdate_progress

Update progress for Image 1:

Package being processed:  uImage.bin
Stage:                    Installing package
Completed packages:       6/7
Completed bytes:          32073727/34945130

# show_swupdate_progress

Installation on Image 1 succeeded or not in progress

Once you see the last message, you know the new firmware has been installed and the IOM is rebooting. You will again see the IOM as Inaccessible for a while, and then it will come up fine with all Backplane Ports visible.
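
If you get to watch the install yourself, the progress output above is also easy to poll from a script. The small helper below (illustration only; the strings are exactly the ones that appear in the output above) reports whether the install is still running or finished.

# Illustration only: interpret the text printed by show_swupdate_progress.
import re

def parse_swupdate_progress(text):
    """Return a dict with package/stage/percent, or {'done': True} when finished."""
    if "succeeded or not in progress" in text:
        return {"done": True}
    progress = {"done": False}
    pkg = re.search(r"Package being processed:\s+(\S+)", text)
    stage = re.search(r"Stage:\s+(.+)", text)
    counts = re.search(r"Completed bytes:\s+(\d+)/(\d+)", text)
    if pkg:
        progress["package"] = pkg.group(1)
    if stage:
        progress["stage"] = stage.group(1).strip()
    if counts:
        done_bytes, total_bytes = int(counts.group(1)), int(counts.group(2))
        progress["percent"] = round(100.0 * done_bytes / total_bytes, 1)
    return progress

print(parse_swupdate_progress("Installation on Image 1 succeeded or not in progress"))
# {'done': True}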

Now if you run the command below, you should see similar output:

# show version
CMC Version: 2.0(4a)

After this, all of your servers should start powering on (depending on the policy). My thanks to TAC, who saved our day, but I have to say Cisco really needs to pull up its socks and better align its QA team to verify every small thing.

 

