Persistence pays off
Posted by: Faisal Farooqui in Virtualization on January 24th, 2010
Last night, after I got home from visiting my parents, I realized that I hadn’t received an email on my iPhone since the afternoon. I thought nothing much of it and just restarted my phone. Still nothing. Then I decided to try logging into OWA to see if the server was even running. It was, so that was a relief, until I tried logging in: the Exchange server wasn’t authenticating my credentials. I figured it was a small glitch, and that I’d just VPN into the network, restart a few services or servers if needed, and be on my way. But when I couldn’t even log into the VPN, that’s when I started getting concerned.
Now a million scenarios started going through my head about what could be causing the issues I was experiencing. I just needed to get onto the network somehow to see what the servers were doing. Did one of my ESX hosts just fail and HA didn’t kick in? Why wasn’t even one of the three domain controllers accepting logins? Was there a network switch failure? And tons more questions kept lingering.
Since I couldn’t VPN into the network, I decided to create a local login on our firewall so that it could at least authenticate me and let me pass through to see what was going on. That worked! I got through and was able to ping a few servers, but I still had to investigate the rest of the issue.
I tried to SSH into all my ESX hosts to restart the management services, because I figured the hosts might have locked up and frozen the VMs sitting on them. That didn’t work; the hosts weren’t recognizing the commands I sent them. Then I tried logging into Virtual Center to see what the cluster was up to and, of course, that wasn’t possible either. Now I was just thinking, what the heck is going on? It was 11:30 at night, I really didn’t want to drive back to the office (since I had just got home), and I was exhausted. My next natural instinct was to log into the ESX hosts directly via the vSphere client. My findings were pretty interesting.
I was able to log into ESX2 and ESX3, but not ESX1. That gave me some hope, because I could see some of the servers running on these hosts, but none of the three DCs were on the two available hosts. They had to be sitting on the failed ESX1 host, and so was Virtual Center. Now I realized that I couldn’t vMotion these downed servers over to the running hosts without Virtual Center, so how was I going to make this happen? I figured I’d call VMware support and they’d help me get back up and running in no time. They were really no help, because apparently I only had Basic support, which was good for Mon-Fri, and the earliest a support tech would call me was Monday morning. VMware didn’t seem to care that this was a critical situation and wouldn’t make an exception. They said they could offer me one-time paid support for $1200. I tried calling back and hoped for a different response, but unfortunately got the same answer.
Moving on, I tried to figure out what I could do to get these servers up and running. I tried starting another Virtual Center VM, a clone I had created last week and left powered down. I powered it up and was able to see all three ESX hosts running, but the VMs sitting on ESX1 were all powered down! That is why I couldn’t authenticate. I vMotioned all the servers over to ESX2 and ESX3, powered them up, rebooted ESX1 just to make sure all was good, and we were back in business!
I think what I got out of the experience is, first, that I am truly blessed to be able to figure these kinds of things out by being challenged with unusual issues like these; second, that I should have created a DRS anti-affinity rule so that all three DCs could never sit on the same ESX host at the same time; and third, that I need to look into why the heck I didn’t sign up for 24/7 support from VMware. I really think I did, but maybe our vendor put in the wrong order. I’ll look on Monday.
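The placement lesson can be sketched as a small check. This is only an illustration, not a vSphere configuration: it assumes the VM-to-host placement has already been exported into a plain dictionary, and the VM names (DC1, DC2, DC3, Exchange) and the function name shared_hosts are made up for the example. It reports any host that holds more than one member of a group that should be spread out.

```python
from collections import defaultdict


def shared_hosts(placement, group):
    """Return {host: [vms]} for every host holding more than one VM from `group`.

    placement: dict mapping VM name -> host name (assumed to be exported by hand
               or by some inventory script; not read from vCenter here).
    group:     VM names that should never share a host.
    """
    by_host = defaultdict(list)
    for vm in group:
        by_host[placement[vm]].append(vm)
    # Keep only hosts where the group is doubled up.
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Hypothetical inventory resembling the outage: all three DCs on ESX1.
placement = {"DC1": "ESX1", "DC2": "ESX1", "DC3": "ESX1", "Exchange": "ESX2"}
domain_controllers = ["DC1", "DC2", "DC3"]

print(shared_hosts(placement, domain_controllers))
# {'ESX1': ['DC1', 'DC2', 'DC3']}
```

An empty result means no two members of the group share a host, so losing any single host would still leave a domain controller running.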
In the end, I am just glad I didn’t give up and that I took responsibility for the crisis. Even after 20 years in IT, I’m still learning every day, so to me, persistence pays off.