Hardware: Cisco UCS Memory Bug B250 Blades

We started receiving DIMM (RAM) errors in our UCS environment about a week ago. This has only occurred on our B250-M1 blades. The error could be found both in the UCS System Manager (Java GUI) and from within the CLI (Command Line Interface).

Initial RAM Error - Server 1

“scope server” help output via CLI

ucs_6120-A# scope server ?
  WORD          <chassis-id>/<blade-id>
  dynamic-uuid  Dynamic UUID

Since the problem is currently showing up on chassis 2, blade 1, we set the scope to that server:

ucs_6120-A# scope server 2/1

Now we use “show sel” to view the System Event Log (SEL), filtering the output with “include”:

ucs_6120-A /chassis/server # show sel | include error

4ef | 04/15/2010 10:40:28 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 0, DIMM Socket: 4, Channel: A, Socket: 1 | Asserted
4f1 | 04/15/2010 10:40:29 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 1, DIMM Socket: 5, Channel: A, Socket: 1 | Asserted
4f3 | 04/15/2010 10:40:30 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 2, DIMM Socket: 6, Channel: A, Socket: 1 | Asserted
4f5 | 04/15/2010 10:40:31 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 3, DIMM Socket: 7, Channel: A, Socket: 1 | Asserted
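
If you capture this output to a file, a short script can summarize which DIMMs are reporting errors. The following is only a minimal sketch in Python: it assumes the six pipe-delimited fields seen in the lines above, and the function names (parse_sel_line, summarize_dimm_errors) are my own, not part of any Cisco tool.

```python
from collections import Counter

# Sample lines copied from the "show sel | include error" output above.
SEL_OUTPUT = """\
4ef | 04/15/2010 10:40:28 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 0, DIMM Socket: 4, Channel: A, Socket: 1 | Asserted
4f1 | 04/15/2010 10:40:29 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 1, DIMM Socket: 5, Channel: A, Socket: 1 | Asserted
4f3 | 04/15/2010 10:40:30 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 2, DIMM Socket: 6, Channel: A, Socket: 1 | Asserted
4f5 | 04/15/2010 10:40:31 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 3, DIMM Socket: 7, Channel: A, Socket: 1 | Asserted
"""


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a dict.

    Returns None when the line does not have the six fields seen in the
    memory-error entries above (assumed layout, not a documented format).
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        return None
    record_id, timestamp, sensor, description, location, state = fields
    # The location field looks like "Rank: 0, DIMM Socket: 4, Channel: A, Socket: 1".
    details = {}
    for part in location.split(","):
        key, sep, value = part.partition(":")
        if sep:
            details[key.strip()] = value.strip()
    return {
        "id": record_id,
        "timestamp": timestamp,
        "sensor": sensor,
        "description": description,
        "location": details,
        "state": state,
    }


def summarize_dimm_errors(text):
    """Count events per (CPU socket, channel, DIMM socket) tuple."""
    counts = Counter()
    for line in text.splitlines():
        record = parse_sel_line(line)
        if record is None or "DIMM Socket" not in record["location"]:
            continue
        loc = record["location"]
        counts[(loc.get("Socket"), loc.get("Channel"), loc["DIMM Socket"])] += 1
    return counts


if __name__ == "__main__":
    for (cpu, channel, dimm), count in sorted(summarize_dimm_errors(SEL_OUTPUT).items()):
        print(f"CPU socket {cpu}, channel {channel}, DIMM socket {dimm}: {count} event(s)")
```

Running the same summary against a second blade's log makes it easy to see whether the same DIMM sockets are affected on both.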

The messages above show that there is definitely an issue. I originally thought this was an actual hardware failure, yet the server never had a problem and continued to run. At first only two DIMMs showed errors, but about 3 hours later two more DIMMs reported errors.

To be safe, I disassociated the service profile from this blade and moved it to a spare. About 30 minutes after the service profile booted on the new blade, DIMM errors showed up there as well. This definitely hinted at a bug.

Initial RAM Error - Server 3

It turns out that this is a bug in UCS firmware 1.2.

Bug ID: CSCtg34032
URL: Cisco Bug Toolkit (requires Cisco login)
Problem: Voltage on DIMM dropping below 0.71 volt threshold
Resolution: Upgrade to v1.3 firmware

Interestingly, Cisco lists the condition as “Fully populated DIMM slots with X5570 CPU on B250 blade” and “2 X5570 CPUs”. In our case they are right about the CPU count, but our DIMM slots are not fully populated; we are only at half capacity.

I upgraded the firmware to version 1.3(1c) and the errors went away. Per Cisco, this is a “cosmetic” error that did not actually affect the server.

Just in case you do not have access to Cisco.com to view the bug ID, the text is listed below:

Ventura: 1.2(1b): X5570:Sel Events P0V75_DDR3_P2 errors filling up.
Symptom:

Amber LED/Sel Events on B250 blades with 2 X5570 CPU.
A /chassis/server # show sel
17 | | Voltage P0V75_DDR3_P2 | Lower critical - going low | Asserted | Reading 0.71 < Threshold 0.71 Volts
18 | | Platform alert LED_BLADE_STATUS | LED color is amber | Asserted
19 | | Platform alert LED_BLADE_STATUS | LED color is green | Deasserted
1a | | Voltage P0V75_DDR3_P2 | Lower critical - going low | Deasserted | Reading 0.73 > Threshold 0.71 Volts

Conditions:

Fully populated DIMM slots with X5570 CPU on B250 blade

Workaround:

a. sel events can be cleared for server X/Y by

scope chassis X
scope server Y
clear sel
commit
You can use sel backup policy to clear it automatically.

b. threshold values can be changed by
ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x26 20 0x36 0x00 0x47 0x44 0x00 0x51 0x55

ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x26 19 0x36 0x00 0x47 0x44 0x00 0x51 0x55

Before making this change, run the following commands and confirm the output equals these values:

ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x27 19

36 00 49 46 00 51 55

ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x27 20

36 00 49 46 00 51 55

If the output does not match please contact TAC for assistance on implementing this workaround.

Threshold needs to be set again after BMC reset.
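
The "confirm the output equals these values" step is easy to get wrong by eye, so here is a rough sketch in Python of how that comparison could be scripted. The helper names (parse_raw_bytes, thresholds_match, build_read_command) are hypothetical, and the sketch assumes that ipmitool prints the raw response as space-separated hex bytes, as in the expected values quoted in the bug text. It only builds the read command and compares output; it does not change any thresholds.

```python
# Expected response for sensors 19 and 20, taken from the bug text above.
EXPECTED_THRESHOLDS = "36 00 49 46 00 51 55"


def parse_raw_bytes(text):
    """Convert space-separated hex bytes such as ' 36 00 49 46' to a list of ints."""
    return [int(token, 16) for token in text.split()]


def thresholds_match(output, expected=EXPECTED_THRESHOLDS):
    """Return True when the raw output carries exactly the expected bytes."""
    return parse_raw_bytes(output) == parse_raw_bytes(expected)


def build_read_command(host, user, password, sensor):
    """Build the argument list for the read command from the bug text
    ('raw 4 0x27 <sensor>'); the values passed in are placeholders."""
    return [
        "ipmitool", "-H", host, "-U", user, "-P", password,
        "-I", "lanplus", "raw", "4", "0x27", str(sensor),
    ]


if __name__ == "__main__":
    for sensor in (19, 20):
        command = build_read_command("BMC_IP_ADDRESS", "user", "password", sensor)
        # The list could be handed to subprocess.run(command, capture_output=True,
        # text=True) and the resulting stdout passed to thresholds_match().
        print(" ".join(command))
    print(thresholds_match(" 36 00 49 46 00 51 55\n"))
```

If thresholds_match returns False for either sensor, the bug text's advice applies: stop and contact TAC rather than writing new thresholds.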

~ by Kevin Goodman on June 16, 2010.

9 Responses to “Hardware: Cisco UCS Memory Bug B250 Blades”

  1. [...] the original post: Hardware: Cisco UCS Memory Bug B250 Blades « Colocation to … Tags: a-week-ago, and-linux, b250-, blades-, cisco, dimm, memory, occurred-on-our, ram, [...]

  2. [...] This post was mentioned on Twitter by Kevin Goodman and foz11, Jason Nash. Jason Nash said: RT @colovirt: New post- Hardware: Cisco UCS Memory Bug B250 Blades – http://wp.me/pm3nc-dU [...]

  3. Great post, we ran into the same issue across two blades. Too bad we just upgraded to 1.2 two weeks ago; it will be a month or two before we can get to 1.31.

  4. It is sometimes necessary to reset the BMC again after upgrading the firmware. Saw this today. Check sensor settings by logging into CLI and ‘connect bmc x/y’ where x is chassis and y is blade. Then issue ‘sensors’. The ddr3 voltage output around line 17 is what you’re looking for. Check it before upgrade and compare post upgrade to see if an extra bmc reset is necessary.

  5. Hey, nice share. I’ve planned to learn this Cisco product.

    keep in share..

    regards

    dika

  6. This is also a bug in 1.3(1c). We ran into it three weeks ago. There is a new firmware 1.3(1i) that may fix it, but we haven’t loaded it yet. I notice that you are going to 1.3(1c) this week. Email me for some more firmware follies.

  7. Luckily for us, the memory bug resolved itself when we went to 1.3(1c). Still working on a few other issues.

  8. Just noticed that you are running the M1 blades. We’re running the M2. The fix in 1.3(1c) was for the M1, not M2.

  9. Very good to know. Thanks for the information!
