Cisco Nexus 1000v 4.2 Upgrade Problems And Fix
Update
I’m glad Jeff (see comments) caught where I went wrong. After his comment I went back through the n1000v_upgrade_software PDF from Cisco. For version 4.0(4)SV1(3a), the VEMs should have been upgraded first. It is on page 9 of the document.
Upgrading from Release 4.0(4)SV1(3, 3a, or 3b) to Release 4.2(1)SV1(4)
I was following a section lower down in the document that showed an upgrade of the VSM first, then the VEM. Looking closer, that section was for an upgrade path that required an intermediate upgrade between the currently running firmware and the newer 4.2(1)SV1(4) firmware. Thanks for the catch!
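To avoid misreading the order again, the path quoted above can be written down as a small lookup. This is only a sketch in Python: the names `VEM_FIRST_SOURCES`, `TARGET`, and `upgrade_order` are made up for illustration, and it deliberately covers nothing beyond the one path quoted from the Cisco document. Any other starting release raises an error instead of guessing.

```python
# Source releases that the quoted section of the upgrade guide covers.
VEM_FIRST_SOURCES = {"4.0(4)SV1(3)", "4.0(4)SV1(3a)", "4.0(4)SV1(3b)"}
TARGET = "4.2(1)SV1(4)"


def upgrade_order(current, target=TARGET):
    """Return the component order for the one upgrade path quoted in this post."""
    if target != TARGET:
        raise ValueError("only the upgrade to %s is covered here" % TARGET)
    if current in VEM_FIRST_SOURCES:
        # Per the quoted section: VEMs first, then the VSMs.
        return ["VEM", "VSM"]
    # Other releases may need an intermediate upgrade; do not guess.
    raise ValueError("release %s is not covered; check the upgrade guide" % current)


print(upgrade_order("4.0(4)SV1(3a)"))
```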
Original post:
So last weekend was our Cisco Nexus 1000v upgrade. We were upgrading from 4.0(4)SV1(3a) to 4.2(1)SV1(4). What should have been an easy upgrade really turned into a huge headache. Below is a walk-through of the upgrade process with notes on where things went wrong.
To start, let's verify the current running state of the VSM (1000v switch) and the VEMs (ESX host modules). Below you will see that they are all on version 4.0(4)SV1(3a). Also note that the standby VSM is module number 2. This will be the first one reloaded.
1000vSW# sh module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby
3    248    Virtual Ethernet Module           NA                  ok
4    248    Virtual Ethernet Module           NA                  ok
5    248    Virtual Ethernet Module           NA                  ok
6    248    Virtual Ethernet Module           NA                  ok
7    248    Virtual Ethernet Module           NA                  ok
8    248    Virtual Ethernet Module           NA                  ok
9    248    Virtual Ethernet Module           NA                  ok

Mod  Sw                Hw
---  ----------------  ------------------------------------------------
1    4.0(4)SV1(3a)     0.0
2    4.0(4)SV1(3a)     0.0
3    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
4    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
5    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
6    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
7    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
8    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
9    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
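With nine modules this is easy to check by eye, but the same check can be scripted. Here is a minimal Python sketch, assuming the sh module output has been captured as plain text with one row per line. The helper names (`module_versions`, `all_on_version`) are my own and not part of any Cisco tooling.

```python
import re

# A few rows in the layout of the "Mod Sw Hw" table from `sh module`.
SAMPLE = """\
Mod  Sw                Hw
---  ----------------  ------------------------------------------------
1    4.0(4)SV1(3a)     0.0
2    4.0(4)SV1(3a)     0.0
3    4.0(4)SV1(3a)     VMware ESXi 4.1.0 Releasebuild-348481 (2.0)
"""

# Module number, a version such as 4.0(4)SV1(3a), then the hardware column.
ROW = re.compile(r"^\s*(\d+)\s+(\d+\.\d+\(\d+\)SV\d+\(\w+\))\s+\S")


def module_versions(text):
    """Return {module number: software version} for each Sw/Hw row."""
    versions = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            versions[int(match.group(1))] = match.group(2)
    return versions


def all_on_version(text, expected):
    """True only if at least one module was found and all run `expected`."""
    versions = module_versions(text)
    return bool(versions) and all(v == expected for v in versions.values())


print(module_versions(SAMPLE))
print(all_on_version(SAMPLE, "4.0(4)SV1(3a)"))
```

Running the same check against the target version after the upgrade gives a quick pass/fail instead of reading the table line by line.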
Show boot displays the current location/firmware being used on bootup. This will be changed with the upgrade.
1000vSW# sh boot
sup-1
kickstart variable = bootflash:/nexus-1000v-kickstart-mz.4.0.4.SV1.3a.bin
system variable = bootflash:/nexus-1000v-mz.4.0.4.SV1.3a.bin
sup-2
kickstart variable = bootflash:/nexus-1000v-kickstart-mz.4.0.4.SV1.3a.bin
system variable = bootflash:/nexus-1000v-mz.4.0.4.SV1.3a.bin
No module boot variable set
Just as with any other device, make sure there is enough room on the bootflash device to store the new firmware.
1000vSW# dir
Usage for bootflash://
  481083392 bytes used
 1907187712 bytes free
 2388271104 bytes total
After adding up the sizes of both files, there is enough free space on the bootflash to hold the new firmware. I already had the files on a host running SCP, but they can be loaded from any host running a file-transfer protocol that the copy command supports.
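The arithmetic is simple enough to do by hand, but here is a small Python sketch of it. The free-space figure comes from the dir output above. The two image sizes are placeholders I made up for illustration, since the real sizes are not shown in this post; substitute the actual sizes of the two .bin files.

```python
import re


def bytes_free(dir_output):
    """Pull the 'bytes free' figure out of the summary printed by `dir`."""
    match = re.search(r"(\d+)\s+bytes free", dir_output)
    if match is None:
        raise ValueError("no 'bytes free' line found")
    return int(match.group(1))


def fits(dir_output, image_sizes):
    """True if the combined image sizes fit in the free space."""
    return sum(image_sizes) <= bytes_free(dir_output)


DIR_SUMMARY = """\
Usage for bootflash://
  481083392 bytes used
 1907187712 bytes free
 2388271104 bytes total
"""

# Hypothetical sizes in bytes, NOT the real image sizes.
KICKSTART_SIZE = 20_000_000
SYSTEM_SIZE = 80_000_000

print(bytes_free(DIR_SUMMARY))
print(fits(DIR_SUMMARY, [KICKSTART_SIZE, SYSTEM_SIZE]))
```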
The syntax for copying over SCP is as follows:
copy scp://user01@127.0.0.15/IOS/1000v/nexus-1000v-kickstart-mz.4.2.1.SV1.4.bin bootflash://
copy scp://user01@127.0.0.15/IOS/1000v/nexus-1000v-mz.4.2.1.SV1.4.bin bootflash://
From the active VSM (module 1 in this case), the new firmware is loaded into the configuration via the “install” command.
1000vSW# install all system bootflash:nexus-1000v-mz.4.2.1.SV1.4.bin kickstart bootflash:nexus-1000v-kickstart-mz.4.2.1.SV1.4.bin
System image sync to standby is in progress...
System image is synced to standby.
Kickstart image sync to Standby is in progress...
Kickstart image is synced to standby.
Boot variables are updated to running configuration.
The boot variables were updated per the above, but it is still worth verifying. Below shows that the running configuration was updated to load the new firmware.
1000vSW# sh running-config | inc boot
boot kickstart bootflash:/nexus-1000v-kickstart-mz.4.2.1.SV1.4.bin sup-1
boot system bootflash:/nexus-1000v-mz.4.2.1.SV1.4.bin sup-1
boot kickstart bootflash:/nexus-1000v-kickstart-mz.4.2.1.SV1.4.bin sup-2
boot system bootflash:/nexus-1000v-mz.4.2.1.SV1.4.bin sup-2
I almost skipped this command the first time. Make sure to save the configuration to startup-config, or the reload of the standby module will load the startup-config's boot variables. That would bring it up on the same 4.0(4)SV1(3a) version.
1000vSW# copy running-config startup-config
[########################################] 100%
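Since a forgotten save silently boots the old image, it is worth confirming that the two configurations agree before reloading anything. Below is a rough Python sketch of that comparison. It assumes the boot lines of the running and startup configurations have each been captured as text (for example with sh running-config | inc boot and the equivalent for startup-config); the function names are hypothetical.

```python
def boot_lines(config_text):
    """Return the 'boot ...' lines of a config, stripped, in original order."""
    return [line.strip() for line in config_text.splitlines()
            if line.strip().startswith("boot ")]


def safe_to_reload(running_text, startup_text, new_version_tag):
    """True if both configs have identical boot lines naming the new image."""
    running = boot_lines(running_text)
    startup = boot_lines(startup_text)
    return (bool(running)
            and running == startup
            and all(new_version_tag in line for line in running))


RUNNING = """\
boot kickstart bootflash:/nexus-1000v-kickstart-mz.4.2.1.SV1.4.bin sup-1
boot system bootflash:/nexus-1000v-mz.4.2.1.SV1.4.bin sup-1
boot kickstart bootflash:/nexus-1000v-kickstart-mz.4.2.1.SV1.4.bin sup-2
boot system bootflash:/nexus-1000v-mz.4.2.1.SV1.4.bin sup-2
"""

# What startup-config would still hold if the save had been skipped.
STARTUP_NOT_SAVED = RUNNING.replace("4.2.1.SV1.4", "4.0.4.SV1.3a")

print(safe_to_reload(RUNNING, RUNNING, "4.2.1.SV1.4"))
print(safe_to_reload(RUNNING, STARTUP_NOT_SAVED, "4.2.1.SV1.4"))
```

The second call models the case described above: the running configuration points at the new image while the startup configuration still points at 4.0(4)SV1(3a).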
We saw the module status earlier in this post, but I will reiterate it here. Module 2 shows as the standby VSM here, so it will be reloaded first.
1000vSW# sh module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby
Time to reboot the standby module. This is done as follows.
1000vSW# reload module 2
This command will reboot standby supervisor module. (y/n)? [n] y
about to reset standby sup
1000vSW# 2011 Feb 25 19:39:33 1000vSW %PLATFORM-2-PFM_MODULE_RESET: Manual restart of Module 2 from Command Line Interface
Module 2 is still restarting. Notice that it is missing under “Mod”:
1000vSW# sh module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          active *
3    248    Virtual Ethernet Module           NA                  ok
4    248    Virtual Ethernet Module           NA                  ok
* You can bring up the console of the standby switch in vCenter to check progress.
Ok, this is where everything went wrong for us. A typical upgrade would have the module reconnect showing the new version of code running on it. For us, this did not happen. Instead, modules 1 and 2 were not able to communicate with each other, and both decided to become active.
This became a big problem. All running VMs were still able to pass traffic across the network, but none of the ESX hosts showed up as modules in the VSM, thus isolating them. The first symptom I saw was loss of SSH connectivity to the VSM IP. It seemed to happen about every two minutes. Luckily, I saw the firmware version changing each time I did a show module after reconnecting. So the two 1000v modules were in an IP conflict, stealing the address from one another every few minutes.
I waited for the module running version 4.0(4)SV1(3a) to grab the IP again and issued a reload. This was module 1. Once it restarted and came up on the new firmware, it started talking with module 2.
1000vSW# sh module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          ha-standby
2    0      Virtual Supervisor Module         Nexus1000V          active *

Mod  Sw                Hw
---  ----------------  ------------------------------------------------
1    4.2(1)SV1(4)      0.0
2    4.2(1)SV1(4)      0.0
Now that both VSMs were on the same firmware, it was time to make module 1 primary/active once more. This was non-intrusive.
1000vSW# system switchover
Once more to verify that module 1 has become active:
1000vSW# sh module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby

Mod  Sw                Hw
---  ----------------  ------------------------------------------------
1    4.2(1)SV1(4)      0.0
2    4.2(1)SV1(4)      0.0
Also to be safe, have the VSM reconnect with Virtual Center.
1000vSW# configure t
Enter configuration commands, one per line. End with CNTL/Z.
1000vSW(config)# svs connection VirtualCenter
1000vSW(config-svs-conn)# connect
1000vSW(config-svs-conn)# end
Verify the connection as well.
1000vSW# show svs connections
connection vcenter:
ip address: 127.0.0.99
remote port: 80
protocol: vmware-vim https
certificate: default
datacenter name: Test
DVS uuid: 55 55 55 55 55 55 55 55
config status: Enabled
operational status: Connected
sync status: Complete
version: VMware vCenter Server
At this point we should be good. At least that is what I thought. Going back to what I said before:
“but none of the ESX hosts showed up as modules in the VSM.”
Yeah, I didn’t notice that until after the fact. Normally I vMotion over an individual host to a different server to test connectivity, but this time I didn’t.
I selected our first vSphere (ESX) host and entered maintenance mode, thus vMotioning all VMs off to other hosts. Immediately they fell off the network. All pings were lost. These were critical production servers. Of course we were in a maintenance window, but this was still the worst-case scenario. Normally each host's VEM would show up in the show module output, but they did not after the upgrade.
1000vSW# sh module
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby

Mod  Sw                Hw
---  ----------------  ------------------------------------------------
1    4.2(1)SV1(4)      0.0
2    4.2(1)SV1(4)      0.0

Mod  MAC-Address(es)                         Serial-Num
---  --------------------------------------  ----------
1    00-00-00-00-00-00 to 00-00-00-00-00-00  NA
2    00-00-00-00-00-00 to 00-00-00-00-00-00  NA

Mod  Server-IP        Server-UUID                           Server-Name
---  ---------------  ------------------------------------  --------------------
1    127.0.0.55       NA                                    NA
2    127.0.0.55       NA                                    NA
The above resembles what I was seeing: just the VSM modules. The MAC addresses were correct; I just changed them to all 0s here. We tried shutting down a VM and moving it to different hosts in the cluster, but still no pings. I also tried stopping and starting the VEM via the command line on the ESX host.
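Missing VEMs are easy to overlook when the VSM rows look healthy, which is exactly what happened to me. A simple before/after comparison would have flagged it right away. This is a minimal Python sketch under the assumption that sh module was captured as text before and after the VSM upgrade; `vem_modules` and `missing_vems` are illustrative names only.

```python
import re

# Matches rows such as: "3    248    Virtual Ethernet Module    NA    ok"
VEM_ROW = re.compile(r"^\s*(\d+)\s+\d+\s+Virtual Ethernet Module\b")


def vem_modules(show_module_text):
    """Return the set of module numbers listed as Virtual Ethernet Modules."""
    modules = set()
    for line in show_module_text.splitlines():
        match = VEM_ROW.match(line)
        if match:
            modules.add(int(match.group(1)))
    return modules


def missing_vems(before_text, after_text):
    """Module numbers present before the upgrade but absent afterwards."""
    return sorted(vem_modules(before_text) - vem_modules(after_text))


BEFORE = """\
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby
3    248    Virtual Ethernet Module           NA                  ok
4    248    Virtual Ethernet Module           NA                  ok
"""

AFTER = """\
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby
"""

print(missing_vems(BEFORE, AFTER))  # [3, 4]
```

An empty list means every VEM that was registered before the upgrade is still registered afterwards.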
The only thing that fixed this was pushing the VEM upgrades. This was done using VMware Update Manager. Very easy, but we had to put the hosts into maintenance mode to do so. This wound up being very time-consuming. Once server 1's VEM was updated, we were able to vMotion VMs to it without losing network connectivity to them. It went like this:
It would have been a lot easier if the hosts could have been placed into maintenance mode and VMs automatically migrated to other members in the cluster. Once completed, everything looked fine in the VSM. Even the hosts showed up!
Mod  Server-IP        Server-UUID                           Server-Name
---  ---------------  ------------------------------------  --------------------
1    127.0.0.55       NA                                    NA
2    127.0.0.55       NA                                    NA
3    127.0.0.24       22222222-2222-2222-2222-22222222221g  vm_host01
4    127.0.0.26       22222222-2222-2222-2222-22222222222g  vm_host03
5    127.0.0.27       22222222-2222-2222-2222-22222222223g  vm_host04
6    127.0.0.28       22222222-2222-2222-2222-22222222224g  vm_host05
7    127.0.0.25       22222222-2222-2222-2222-22222222225g  vm_host02
8    127.0.0.29       22222222-2222-2222-2222-22222222226g  vm_host06
9    127.0.0.120      22222222-2222-2222-2222-22222222228w  vm_host07
A new rule for us to now follow: Always put one host into maintenance mode before upgrading!
This would have allowed us to upgrade without downtime in this scenario.

I could be mistaken, but when I read the extensive 1(4) upgrade documentation (and watched the videos), I thought the process was to update the VEMs to 1(4) first, then update the VSMs to 1(4)?
JeffS said this on March 1, 2011 at 12:41 pm
You are right. I went back through the n1000v_upgrade_software PDF from Cisco. For version 4.0(4)SV1(3a), the VEMs should have been upgraded first. It is on page 9 of the document.
Upgrading from Release 4.0(4)SV1(3, 3a, or 3b) to Release 4.2(1)SV1(4)
Step 1 Upgrading the VEMs: Release 4.0(4)SV1(2, 3, 3a, 3b) to Release 4.2(1)SV1(4), page 20
Step 2 Upgrading the VSMs to Release 4.2(1)SV1(4) Using the Upgrade Application, page 33
I was following a section lower down in the document that showed an upgrade of the VSM first, then the VEM. Looking closer, that section was for an upgrade path that required an intermediate upgrade between the currently running firmware and the newer 4.2(1)SV1(4) firmware. Thanks for the catch! Will update the post shortly.
Kevin Goodman said this on March 1, 2011 at 2:40 pm