iSCSI Boot Disk disconnect fix

In one of our labs we boot our ESXi 5.1 hosts over iSCSI from a VNXe. If the NIC that carries the boot LUN loses connectivity (switch reboot, cable disconnect, etc.), you will see the following error: “Lost connectivity to the device backing the boot filesystem. As a result, host configuration changes will not be saved to persistent storage.”

This error is displayed because iSCSI boot does not support multipathing. If connectivity is lost between the Storage Processor (SP) on the VNXe and the NIC on the host, there is no alternate path, so the host can no longer access its boot LUN and cannot write logs, configuration changes, etc.

The good news is that the whole ESXi OS is loaded into memory so there is no outage for the VMs. Once the connectivity is restored the host can access the storage again.

The bad news is that the error does not clear automatically. As no one likes to see errors or warnings in their production environment, I needed to find a solution to this issue.

The simplest solution is to put the host into maintenance mode and reboot it, which clears the error. This takes time, however, depending on the number of VMs on the host and how busy the environment is.

The next option is to restart the Management Agents on the host. Here are the steps to complete this.

Log into the console of the ESXi server; in my case it is a UCS server. On the main page make sure you are on the Server tab and then click “Launch KVM Console”.

The KVM Java applet is loaded and the KVM will be displayed.

Press F2 to log in, then press F2 again to open Customize System/View Logs.
Once there, select Troubleshooting Options.

Here we will find the option we want: select Restart Management Agents.

A confirmation dialog opens.
Please read the warning: the host will disconnect from vCenter and reconnect automatically, but the VMs will continue to run without any outage. Press F11 to continue.

The Management Agents will be stopped and started again.

If we look at vCenter while the agents restart, we will see that the ESXi host is in a “not responding” state.

Back on the KVM console, once the restart is completed we can press Enter to close the window and log out of the console by pressing the ESC key twice.

vCenter will now show the host back in a normal state.

How about a faster way?

The fastest option is to restart the Management Agents on the host over SSH. Of course we need to enable SSH first, as it is disabled by default.

Click on the host, click on the Configuration tab, then click on Security Profile under the Software section.
Click on Properties, scroll down to SSH and click on Options. Click on Start to start the SSH service, then click on OK until all the properties windows are closed.
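
If you happen to have the local ESXi Shell open already, the SSH service can also be started and stopped from there. The following is only a sketch under a few assumptions: it relies on the vim-cmd hostsvc/start_ssh and hostsvc/stop_ssh calls, the ssh_service function and the DRY_RUN switch are hypothetical names made up for this example, and by default it only prints what it would run.

```shell
#!/bin/sh
# Sketch only: start or stop the SSH service from the local ESXi Shell.
# DRY_RUN and ssh_service are hypothetical names added for this example.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

ssh_service() {
    # Accept only the two actions this sketch knows about.
    case "$1" in
        start|stop) ;;
        *) echo "usage: ssh_service start|stop" >&2; return 2 ;;
    esac
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: vim-cmd hostsvc/${1}_ssh"
    else
        vim-cmd "hostsvc/${1}_ssh"
    fi
}

ssh_service start
```

Calling ssh_service stop once the error has been cleared puts the host back in its default state, so the SSH warning in vCenter goes away without having to suppress it.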

After you have enabled SSH, use your favorite SSH client to connect to the ESXi host.
Once you are logged in as root, run the following command:

/etc/init.d/hostd restart && /etc/init.d/vpxa restart

This will restart the hostd and vpxa agents one after the other. Because of the &&, vpxa is only restarted if the hostd restart reports success.

Once completed, close the SSH session. Done!
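
If you would rather keep the restart in a small script on the host, the same sequence can be sketched as below. This is an illustration, not a tested tool: the restart_agents function and the DRY_RUN and INIT_DIR settings are hypothetical names introduced here, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: restart hostd first, then vpxa, and stop at the first failure.
# DRY_RUN, INIT_DIR and restart_agents are hypothetical names for this example.
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands, 0 = run them
INIT_DIR="${INIT_DIR:-/etc/init.d}"  # where the agent init scripts live

restart_agents() {
    for agent in hostd vpxa; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: $INIT_DIR/$agent restart"
        else
            # Same behaviour as '&&': give up if a restart fails.
            "$INIT_DIR/$agent" restart || return 1
        fi
    done
}

restart_agents
```

Defaulting to a dry run is a deliberate choice: you can check what will be executed before the host drops out of vCenter for a moment.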

NOTE: If you want to take it one step further, you can suppress the SSH-enabled warning within vCenter (not recommended!) and use kitty.exe to log in to the ESXi host automatically and run the command listed above. That way you only have to open KiTTY and double-click the host entry, and it all happens automatically.
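
For those who prefer a command-line client, the same idea can be sketched with the OpenSSH ssh client instead of KiTTY. Everything specific here is an assumption: the host names are placeholders, restart_agents_on and DRY_RUN are hypothetical names, SSH must already be enabled on each host, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: run the restart command on one or more ESXi hosts over SSH.
# Host names, DRY_RUN and restart_agents_on are hypothetical.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them
REMOTE_CMD='/etc/init.d/hostd restart && /etc/init.d/vpxa restart'

restart_agents_on() {
    for host in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: ssh root@$host \"$REMOTE_CMD\""
        else
            # Report a failure but carry on with the remaining hosts.
            ssh "root@$host" "$REMOTE_CMD" || echo "restart failed on $host" >&2
        fi
    done
}

restart_agents_on esx01.lab.example esx02.lab.example
```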

Posted in iSCSI, VMware
7 comments on “iSCSI Boot Disk disconnect fix”
  1. Patrick says:

    Thanks! This worked for my issue, I used the restart management services and it was connected to a boot lun via Fibre Channel (FC).

  2. Johnny says:

    Huge help! VMware support blamed it on the fact that we never had the correct SD media, so we dropped another $300 on industrial-strength SD media and the error occurred again on the same host 2 weeks later. We had to evacuate all the VMs on this host because we don’t have vMotion, and it was a pain each time to reboot the server.

  3. For how long does this workaround keep the error from occurring? I’m running UCS B-series with iSCSI boot to NetApp. I also am using iSCSI multipathing but I would hate to continuously have to restart mgmt agents. Thanks for posting this!

    • It is not really a workaround, but a true fix. Once the error has occurred for one of the reasons spelled out in the post, restarting the mgmt. agents makes it go away, and unless connectivity is lost again it will not come back. Now if you lose connectivity on a regular basis, or when you did not cause it to happen, there might be something bigger going on that requires additional troubleshooting. Hope this answers your question.

  4. Frank says:

    Is there a way to increase the timeouts for the lost connectivity check on the boot volume?
    We are using iSCSI boot with UCS b-series and NetApp. e.g. on a Controller Failover we get the warnings – but the LUN is just disconnected for a very short time…
