This was a very simple bug that we hit about 4 months ago. Unfortunately it threw us for a loop that night. One of our 6120 switches needed to be replaced due to a few bad ports. What should have been very routine turned out to cause a lot of headache. This bug was hit in firmware version 1.2. Per Cisco, this should have been fixed in 1.3.
I removed the faulty switch after disconnecting all cabling. The new 6120 was put on the network and firmware updated to match the cluster. After the upgrade was done, I erased the configuration on the switch and connected the fabric interconnects.
At this point, I ran through the CLI setup utility. Everything looked fine, and I thought the configuration push was successful because the following information was returned:
Apply and save the configuration (select 'no' if you want to re-enter)? (yes/no): yes
Applying configuration. Please wait.
Configuration file - Ok
When I went into the UCS Manager, the switch didn't look right. Once in the CLI on the new 6120, the following error confirmed there was a problem.
6120-A# scope fabric-interconnect a
Note: SAMCLI process currently not running
That definitely wasn’t a good thing! The 6120-B member did not show 6120-A as working in its “show cluster state”. Below is the needle in the haystack. After connecting to local management on the switch, it threw a configuration error.
6120-A# connect local-mgmt
svcconfig init: Ini file syntax error at line 127.0.0.10
At that time of night (~2 a.m.), it didn't stand out what that IP address was. I called Cisco TAC and waited for an hour. It was getting pretty late, so we called off the maintenance. I put the faulty switch back in; it was still functioning, and the cables were moved from the bad ports over to working ones.
I erased the configuration of the replacement switch and brought it back up in standalone mode. The next day I went to check that switch out, and all errors were gone. It seemed to function fine when it was not a member of the cluster.
I had used the PuTTY client to log all output while I was connected via serial to the switches that night. After reviewing the logs, I came across the syntax error line multiple times. Then the light bulb came on: the IP 127.0.0.10 is the secondary DNS server on our network.
I already had a ticket open with Cisco TAC on this. I called back in and asked if they had any bugs related to DNS issues. It turned out there were two of them. The bug IDs are as follows:
Both bug reports require a cisco.com login to view on Cisco's site, but the information from them is also quoted below:
“CSCtg14999 has been superseded by CSCtf95156 displayed below.
CSCtf95156 Bug Details
secondary DNS entry not properly copied to peer during clustering
If you have 2 DNS servers configured in UCSM and then attempt to peer with a new FI (i.e. a subordinate box is added to the setup or a write erase is done to an existing subordinate) the first DNS entry is copied over properly but the second one is missing the ‘dnsServer’ prefix and ends up in the sam.config file (/opt/db/sam.config) as just an IP address on a line.
As a result the config is not properly read and clustering does not occur.
Deleting the second DNS IP address from the config will allow the FI to cluster (assuming all other criteria are correct and in place).”
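To make the failure mode concrete: per the bug description, the second DNS entry is written into /opt/db/sam.config as a bare IP address on its own line, with no `dnsServer` key in front of it, so the config parser chokes on that line. Below is a minimal sketch of a check that would have spotted the problem. The sample file contents, key names, and the `find_bare_ip_lines` helper are all hypothetical illustrations — the real sam.config format is Cisco-internal and not documented here.

```python
import re

# Hypothetical excerpt of /opt/db/sam.config after clustering.
# The first DNS entry keeps its key; the second was copied over
# as a bare IP address, which is what breaks the parser.
SAMPLE = """\
dnsServer=10.0.0.5
127.0.0.10
virtualIp=10.0.0.100
"""

# Matches a line that is nothing but a dotted-quad IP address.
BARE_IP = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")

def find_bare_ip_lines(text):
    """Return (line_number, line) pairs that are just an IP address."""
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if BARE_IP.match(line.strip())]

print(find_bare_ip_lines(SAMPLE))  # -> [(2, '127.0.0.10')]
```

A scan like this against the saved serial logs (or the config file itself) would have surfaced 127.0.0.10 as the orphaned entry immediately, rather than hours into the maintenance window.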
So per the above information, CSCtg14999 looks to have been replaced by, or is a child bug of, CSCtf95156.
The fix: I removed the DNS servers from the UCS configuration and replaced the switch during another maintenance window. This time it worked great; we were in and out in about 1.5 hours.