VMWare/NAS/SAN: How To DOS Your Old NetApp With Snapshots

I feel that I have a pretty good foundation of knowledge of the ESX product line. Well, I found out that I do not know everything. It is now 2am and the last server is being restored from a snapshot on the NetApp. Earlier I was auditing the VCB backup logs and found about 12 servers that were failing due to “open snapshots on disk”. This normally occurs when the previous VCB operation fails. Trying to be a good administrator, I decided to go ahead and delete the snapshots before leaving the office, so as to get a good run tonight. Here is where I made the mistake. The NetApp is an older, single-headed FAS3020. It performs well under normal conditions and only houses VMs for development.

I was outside hooking the kid's bike trailer to the 12-speed when my wife came out holding my cell phone.

Wife> Eric just called you twice
Me> Start up my laptop for me
... Call Eric
Eric> Sorry to bother you but ...

Luckily he had noticed that a couple of VMs had dropped off the face of the earth. To be exact, we lost 4 virtual machines. When told to power on, they would error with “Can not open disk: PathToVmdk.. The parent virtual disk has been modified since the child was created”.

I never trust the GUI, so I went straight to the ESX console. Sure enough, things failed there also.

Per /var/log/vmkernel

Mar  5 23:02:59 vmsrv06 vmkernel: 28:09:39:57.271 cpu6:1215)<6>Debug scsi underrun
Mar  5 23:02:59 vmsrv06 vmkernel: 28:09:39:57.342 cpu6:1183)<6>Debug scsi underrun
Mar  5 23:02:59 vmsrv06 vmkernel: 28:09:39:57.413 cpu6:1098)<6>Debug scsi underrun
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.898 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1497
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.898 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1501
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.899 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1499
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.899 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1504
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.900 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1496
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.901 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1494
Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.901 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1502
Mar  5 23:10:06 vmsrv06 vmkernel: 28:09:47:04.375 cpu2:1051)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
Mar  5 23:10:06 vmsrv06 vmkernel: 28:09:47:04.375 cpu2:1051)WARNING: FS3: 4785: Reservation error: SCSI reservation conflict
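When a log like this runs to thousands of lines, it helps to count how often each kind of message shows up. Below is a minimal sketch in Python that tallies the three message types visible in the excerpt above. The category labels and the `tally` function are my own naming for illustration, not anything VMware ships, and the matching is plain substring search.

```python
from collections import Counter

# Hypothetical labels mapped to substrings that appear in the excerpt above.
PATTERNS = {
    "scsi_underrun": "scsi underrun",
    "hba_abort": "qla24xx_abort_command",
    "reservation_conflict": "reservation conflict",
}


def tally(lines):
    """Count log lines per category; each line is counted at most once."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for label, needle in PATTERNS.items():
            if needle in lowered:
                counts[label] += 1
                break
    return counts


# Four lines copied from the excerpt above.
sample = [
    "Mar  5 23:02:59 vmsrv06 vmkernel: 28:09:39:57.271 cpu6:1215)<6>Debug scsi underrun",
    "Mar  5 23:06:49 vmsrv06 vmkernel: 28:09:43:46.898 cpu15:1074)<6>qla24xx_abort_command(1): handle to abort=1497",
    "Mar  5 23:10:06 vmsrv06 vmkernel: 28:09:47:04.375 cpu2:1051)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts",
    "Mar  5 23:10:06 vmsrv06 vmkernel: 28:09:47:04.375 cpu2:1051)WARNING: FS3: 4785: Reservation error: SCSI reservation conflict",
]

for label, count in tally(sample).items():
    print(label, count)
```

To run it against a real file, pass an open file handle for /var/log/vmkernel to `tally` instead of the sample list.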

Next we will look at the NetApp. The output below is from a console session on it. As you can see, the CPUs are hit hard.

NAS>stats show system
system:system:nfs_ops:536/s
system:system:cifs_ops:0/s
system:system:http_ops:0/s
system:system:dafs_ops:0/s
system:system:fcp_ops:481/s
system:system:iscsi_ops:97/s
system:system:net_data_recv:136KB/s
system:system:net_data_sent:2323KB/s
system:system:disk_data_read:63752KB/s
system:system:disk_data_written:51640KB/s
system:system:cpu_busy:50%
system:system:avg_processor_busy:50%
system:system:total_processor_busy:100%
system:system:num_processors:2
system:system:time:1236313212s
system:system:uptime:9164385s
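To make those counters easier to compare, here is a small Python sketch that parses `stats show system` output of the shape shown above and derives a few figures from it. The function names are hypothetical, and reading `total_processor_busy` as the sum across processors is my assumption; the numbers above are at least consistent with it, since 100% over 2 processors matches the 50% average.

```python
import re

PREFIX = "system:system:"


def parse_stats(text):
    """Turn 'system:system:name:value' lines into a {name: value} dict."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            name, _, value = line[len(PREFIX):].partition(":")
            stats[name] = value
    return stats


def leading_int(value):
    """Extract the leading integer from values such as '63752KB/s' or '50%'."""
    match = re.match(r"\d+", value)
    if match is None:
        raise ValueError(f"no leading number in {value!r}")
    return int(match.group())


# A subset of the counters copied from the console output above.
STATS_TEXT = """\
system:system:net_data_recv:136KB/s
system:system:net_data_sent:2323KB/s
system:system:disk_data_read:63752KB/s
system:system:disk_data_written:51640KB/s
system:system:total_processor_busy:100%
system:system:num_processors:2
"""

stats = parse_stats(STATS_TEXT)
per_cpu_busy = (leading_int(stats["total_processor_busy"])
                / leading_int(stats["num_processors"]))
disk_kb_per_s = (leading_int(stats["disk_data_read"])
                 + leading_int(stats["disk_data_written"]))
net_kb_per_s = (leading_int(stats["net_data_recv"])
                + leading_int(stats["net_data_sent"]))

print("busy per processor (%):", per_cpu_busy)
print("disk KB/s:", disk_kb_per_s)
print("network KB/s:", net_kb_per_s)
```

Putting disk and network throughput side by side shows how much of the load is back-end disk work rather than traffic on the Ethernet side.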

The conclusion is as follows. When VMware creates a snapshot, it creates a new delta vmdk that references the original (parent) disk and records all subsequent writes, and the VM's vmx file is modified to point at that delta disk. Deleting a snapshot is therefore not just deleting a file from the file system: VMware reads the data held in the snapshot's delta files, commits it into the main vmdk file, updates the vmx file, and only then deletes the snapshot files. So, in essence, going down the list of all the VMs that had snapshots (9-12 of them) and deleting them in quick succession caused the NetApp to overload. While this was going on, the ESX hosts intermittently lost connectivity to the VMFS LUNs. This happened so quickly that no alerts were generated, but it left the VMs that were in the middle of writing their snapshot data into the main vmdk files with corrupted configurations.
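Had I done the deletions one at a time, waiting for each merge to finish and for the storage to calm down before starting the next, the load would have been spread out. The Python sketch below shows that idea only. `delete_snapshots` and `is_storage_busy` are hypothetical callables you would have to supply yourself (for example, a wrapper around whatever tool you use to remove snapshots, and a check against the filer's CPU counters); neither is a real VMware or NetApp API.

```python
import time


def delete_serially(vms, delete_snapshots, is_storage_busy,
                    settle_seconds=60.0, sleep=time.sleep):
    """Delete snapshots for one VM at a time, pausing while storage is busy.

    delete_snapshots(vm) is assumed to block until the merge has finished.
    is_storage_busy() is assumed to return True while the array is overloaded.
    Both are hypothetical hooks supplied by the caller.
    """
    finished = []
    for vm in vms:
        # Do not start another merge while the array is still struggling.
        while is_storage_busy():
            sleep(settle_seconds)
        delete_snapshots(vm)
        finished.append(vm)
    return finished


# Dry run with stand-in callables, so nothing real is touched.
busy_readings = iter([True, False, False])
deleted = []
result = delete_serially(
    ["dev-vm-01", "dev-vm-02"],        # made-up VM names
    delete_snapshots=deleted.append,   # records the call instead of deleting
    is_storage_busy=lambda: next(busy_readings),
    settle_seconds=0,
    sleep=lambda seconds: None,        # skip real waiting in the dry run
)
print(result)
```

The important design choice is that the loop never has more than one merge in flight, which is the opposite of what I did from the client that evening.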

Moral of the story: Rapidly deleting snapshots can DOS older/slower storage systems.


~ by Kevin Goodman on March 9, 2009.

One Response to “VMWare/NAS/SAN: How To DOS Your Old NetApp With Snapshots”

  1. Update: Forgot to mention that this also took down one of the fiber-connected database servers on the NetApp. It was a fun night!
