Hardware: Cisco UCS Memory Bug B250 Blades
We started receiving DIMM (RAM) errors in our UCS environment about a week ago. This has only occurred on our B250-M1 blades. The error could be found both in the UCS System Manager (Java GUI) and from within the CLI (Command Line Interface).
“scope server” help output via CLI
ucs_6120-A# scope server ?
  WORD          <chassis-id>/<blade-id>
  dynamic-uuid  Dynamic UUID
Since the problem is currently showing up on chassis 2, blade 1, we set the scope to that server:
ucs_6120-A# scope server 2/1
Now we use “show sel” to view the SEL (System Event Log), filtering the output with “include”:
ucs_6120-A /chassis/server # show sel | include error
4ef | 04/15/2010 10:40:28 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 0, DIMM Socket: 4, Channel: A, Socket: 1 | Asserted
4f1 | 04/15/2010 10:40:29 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 1, DIMM Socket: 5, Channel: A, Socket: 1 | Asserted
4f3 | 04/15/2010 10:40:30 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 2, DIMM Socket: 6, Channel: A, Socket: 1 | Asserted
4f5 | 04/15/2010 10:40:31 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 3, DIMM Socket: 7, Channel: A, Socket: 1 | Asserted
So from the messages above, you can see that there is definitely an issue. I originally thought that this was an actual hardware failure, but the server never had a problem and continued to run. At first only two DIMMs showed errors, but about 3 hours later two more DIMMs errored.
To be safe, I disassociated the service profile from this blade and moved it to a spare. About 30 minutes after the service profile was booted on the new blade, DIMM errors showed up on it as well. This definitely hinted at a bug.
It turns out that this is a bug in UCS firmware 1.2:
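When several blades start logging these events, it can help to tally which DIMMs are named in the SEL. The following is a minimal Python sketch that assumes the pipe-delimited layout shown in the output above (record ID, timestamp, sensor, description, location details, state). The helper names `parse_sel_line` and `affected_dimms` are hypothetical and are not part of any Cisco tool.

```python
# Sample lines copied from the "show sel | include error" output in this post.
SEL_LINES = [
    "4ef | 04/15/2010 10:40:28 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 0, DIMM Socket: 4, Channel: A, Socket: 1 | Asserted",
    "4f1 | 04/15/2010 10:40:29 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 1, DIMM Socket: 5, Channel: A, Socket: 1 | Asserted",
    "4f3 | 04/15/2010 10:40:30 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 2, DIMM Socket: 6, Channel: A, Socket: 1 | Asserted",
    "4f5 | 04/15/2010 10:40:31 | Memory 0x02 | Uncorrectable ECC/other uncorrectable memory error | Rank: 3, DIMM Socket: 7, Channel: A, Socket: 1 | Asserted",
]


def parse_sel_line(line):
    """Split one SEL line into a dict; return None if the layout is unexpected."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        return None
    record_id, timestamp, sensor, description, details, state = fields
    # Details look like "Rank: 0, DIMM Socket: 4, Channel: A, Socket: 1".
    location = {}
    for part in details.split(","):
        key, _, value = part.partition(":")
        location[key.strip()] = value.strip()
    return {
        "id": record_id,
        "timestamp": timestamp,
        "sensor": sensor,
        "description": description,
        "location": location,
        "state": state,
    }


def affected_dimms(lines):
    """Return sorted, de-duplicated (CPU socket, channel, DIMM socket) tuples
    for asserted uncorrectable memory events."""
    found = set()
    for line in lines:
        rec = parse_sel_line(line)
        if rec is None or rec["state"] != "Asserted":
            continue
        if "Uncorrectable" not in rec["description"]:
            continue
        loc = rec["location"]
        found.add((loc.get("Socket"), loc.get("Channel"), loc.get("DIMM Socket")))
    return sorted(found)


if __name__ == "__main__":
    for cpu, channel, dimm in affected_dimms(SEL_LINES):
        print(f"CPU socket {cpu}, channel {channel}, DIMM socket {dimm}")
```

Feeding it the saved output from more than one blade makes it easier to see whether the same DIMM positions keep appearing, which was the pattern that pointed away from a real hardware fault here.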
| Bug ID: | CSCtg34032 |
| URL: | Cisco Bug Toolkit (requires Cisco login) |
| Problem: | Voltage on DIMM dropping below 0.71 volt threshold |
| Resolution: | Upgrade to v1.3 firmware |
Interestingly, Cisco lists the condition as “Fully populated DIMM slots with X5570 CPU on B250 blade” and “2 X5570 CPUs”. In our case they are right about the CPU count, but we are not fully populated with DIMMs. We are only at half capacity.
I upgraded the firmware to version 1.3(1c) and the errors went away. Per Cisco, this is a “cosmetic” error and did not actually affect the server.
Just in case you do not have access to Cisco.com to view the bug ID, the text is listed below:
Ventura: 1.2(1b): X5570:Sel Events P0V75_DDR3_P2 errors filling up.
Symptom: Amber LED/Sel Events on B250 blades with 2 X5570 CPU.
A /chassis/server # show sel
17 | | Voltage P0V75_DDR3_P2 | Lower critical – going low | Asserted | Reading 0.71 < Threshold 0.71 Volts
18 | | Platform alert LED_BLADE_STATUS | LED color is amber | Asserted
19 | | Platform alert LED_BLADE_STATUS | LED color is green | Deasserted
1a | | Voltage P0V75_DDR3_P2 | Lower critical – going low | Deasserted | Reading 0.73 > Threshold 0.71 Volts
Conditions: Fully populated DIMM slots with X5570 CPU on B250 blade
Workaround:
a. sel events can be cleared for server X/Y by
scope chassis X
scope server Y
clear sel
commit
You can use sel backup policy to clear it automatically.
b. threshold values can be changed by
ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x26 20 0x36 0x00 0x47 0x44 0x00 0x51 0x55
ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x26 19 0x36 0x00 0x47 0x44 0x00 0x51 0x55
Before making this change run the following commands and confirm the output equals these values:
ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x27 19
36 00 49 46 00 51 55
ipmitool -H BMC_IP_ADDRESS -U user -P password -I lanplus raw 4 0x27 20
36 00 49 46 00 51 55
If the output does not match please contact TAC for assistance on implementing this workaround.
Threshold needs to be set again after BMC reset.
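The "check first, then change" order in workaround (b) is easy to get wrong by hand, so here is a rough Python sketch of that guard. It only builds the ipmitool argument lists quoted in the bug text and compares readings you have already collected; it does not run anything against a BMC. The function names, the `readings` dictionary, and the decision to return an empty list on a mismatch are my own assumptions, not anything from Cisco.

```python
# Value the bug text says "raw 4 0x27 <sensor>" should return before any change.
EXPECTED_DEFAULT = "36 00 49 46 00 51 55"
# Bytes the bug text passes to "raw 4 0x26 <sensor>" to set the new thresholds.
NEW_THRESHOLD_BYTES = ["0x36", "0x00", "0x47", "0x44", "0x00", "0x51", "0x55"]
# Sensor numbers used in the bug text.
SENSORS = (19, 20)


def normalize(output):
    """Collapse whitespace and case so ' 36 00 49 ...\\n' compares cleanly."""
    return " ".join(output.lower().split())


def thresholds_match(output, expected=EXPECTED_DEFAULT):
    """True if the ipmitool output equals the expected default bytes."""
    return normalize(output) == normalize(expected)


def base_cmd(host, user, password):
    return ["ipmitool", "-H", host, "-U", user, "-P", password, "-I", "lanplus"]


def get_cmd(host, user, password, sensor):
    """Argument list for reading the current thresholds of one sensor."""
    return base_cmd(host, user, password) + ["raw", "4", "0x27", str(sensor)]


def set_cmd(host, user, password, sensor):
    """Argument list for writing the thresholds from the bug text."""
    return base_cmd(host, user, password) + ["raw", "4", "0x26", str(sensor)] + NEW_THRESHOLD_BYTES


def plan_updates(host, user, password, readings):
    """Return the set commands only if every sensor reads the expected default.

    `readings` maps sensor number -> output captured from get_cmd().
    An empty list means "stop and contact TAC", as the bug text advises.
    """
    for sensor in SENSORS:
        if not thresholds_match(readings.get(sensor, "")):
            return []
    return [set_cmd(host, user, password, s) for s in sorted(SENSORS)]


if __name__ == "__main__":
    # Readings would normally come from running get_cmd() for each sensor.
    sample = {19: " 36 00 49 46 00 51 55\n", 20: " 36 00 49 46 00 51 55\n"}
    for cmd in plan_updates("BMC_IP_ADDRESS", "user", "password", sample):
        print(" ".join(cmd))
```

Because the thresholds revert after a BMC reset, keeping the check and the change together like this means the same guard is applied each time the workaround has to be repeated.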



Great post, we ran into the same issue across two blades. Too bad we just upgraded to 1.2 two weeks ago, it will be a month or two before we can get to 1.31.
Mike Hurst said this on June 16, 2010 at 6:52 pm
It is sometimes necessary to reset the BMC again after upgrading the firmware. Saw this today. Check sensor settings by logging into CLI and ‘connect bmc x/y’ where x is chassis and y is blade. Then issue ‘sensors’. The ddr3 voltage output around line 17 is what you’re looking for. Check it before upgrade and compare post upgrade to see if an extra bmc reset is necessary.
Justin Walther said this on June 17, 2010 at 8:11 pm
hey, nice share.. i’ve planed to learn this cisco product..
keep in share..
regards
dika
dika said this on June 29, 2010 at 3:57 am
This is also a bug in 1.3(1c). We ran into it three weeks ago. There is a new firmware 1.3(1i) that may fix it, but we haven’t loaded it yet. I notice that you are going to 1.3(1c) this week. Email me for some more firmware follies.
Adam Baum said this on August 3, 2010 at 11:44 pm
Luckily for us, the memory bug resolved itself when we went to 1.3(1c). Still working on a few other issues.
Kevin Goodman said this on August 4, 2010 at 2:04 pm
Just noticed that you are running the M1 blades. We’re running the M2. The fix in 1.3(1c) was for the M1, not M2.
itvirtuality said this on August 5, 2010 at 9:52 am
Very good to know. Thanks for the information!
Kevin Goodman said this on August 5, 2010 at 2:30 pm