I am the founder of a startup called Cotega and also a Microsoft employee in the SQL Azure group, where I work as a Program Manager. This is a series of posts where I talk about my experience building a startup outside of Microsoft. I do my best to take my Microsoft hat off and share both the good and the bad of my experience using Azure.
On Feb 29th, I woke up to learn that Windows Azure (where the Cotega monitoring service is hosted) was experiencing a service outage. Surprisingly, it wasn't an email from Windows Azure or an angry customer that alerted me to the issue; I learned about it from reading Hacker News. When I first saw the post I immediately checked the Cotega worker role, and sure enough it had stopped monitoring users' databases.

This was more than a little embarrassing, since I had tried to follow the golden rule of building services: plan for failure. I thought I had done everything possible to deal with failures. I had multiple worker roles to handle the case where one machine fails. As I discussed earlier, I had even moved my queueing system from SQL Azure to Windows Azure queues because of the issues I would have monitoring a user's database if SQL Azure failed. What I did not plan for was the case where Windows Azure itself became unavailable.

When I first learned of the outage, my initial thought was that maybe it was a bad idea to rely so heavily on the cloud for my service. But soon after, I realized that this was really a failure on my part, not the cloud's. All services will have issues at some point in time, whether they run in the cloud or on premises, and I had not considered all of the critical failure scenarios. Not only did I have no solution to handle a failure like this, I did not even know about it until I read the news. In the back of my mind I had always assumed that either Windows Azure or a customer would notify me of an issue like this, but even if they had, it would not have helped me handle the failure automatically.

What I took away from this is that I need to automate failover of the monitoring service. To do that, I will set up a separate application to watch my worker roles. Since pinging worker roles is disabled by default and I prefer not to open that port, the next best option seems to be to check the logging table from a machine outside the Windows Azure data centers. If there are no entries in the log table within the last 5 minutes, something must be wrong, and the watchdog would then (see the sketch after this list):
- Send me an email and SMS notification of the failure
- Start running the monitoring from a separate location
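To make this concrete, here is a minimal sketch of what that external watchdog could look like, run from a machine outside the Windows Azure data centers. The connection string, SMTP host, addresses, and the MonitorLog/LogTime table and column names are all hypothetical placeholders, not the real Cotega schema:

```python
import smtplib
from email.message import EmailMessage

import pyodbc  # assumes the logging table lives in a SQL Azure database

# Placeholder configuration -- not the real Cotega settings.
LOG_DB_CONN_STR = (
    "DRIVER={SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=cotega;UID=user;PWD=secret"
)
SMTP_HOST = "smtp.example.com"
ALERT_FROM = "watchdog@example.com"
ALERT_TO = "me@example.com"  # an email-to-SMS gateway address covers the SMS alert


def heartbeat_is_fresh():
    """Return True if the worker roles wrote to the log table in the last 5 minutes."""
    conn = pyodbc.connect(LOG_DB_CONN_STR)
    try:
        cursor = conn.cursor()
        # MonitorLog and LogTime are hypothetical names for the logging table.
        cursor.execute(
            "SELECT COUNT(*) FROM MonitorLog "
            "WHERE LogTime > DATEADD(minute, -5, GETUTCDATE())"
        )
        (recent_rows,) = cursor.fetchone()
        return recent_rows > 0
    finally:
        conn.close()


def notify(subject, body):
    """Email the failure alert (and, via the gateway address, an SMS)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as server:
        server.send_message(msg)
```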
Then, once Windows Azure is back up and running, I can reset this tool to its normal state of checking on the worker roles and disable monitoring from the separate location (a rough sketch of this failover/failback loop closes out this post). Luckily, the monitoring service is built to handle multiple instances running at the same time, so when Windows Azure comes back up, the separate monitoring service will just work in parallel with it until I shut it down. The one problem I still need to handle is the case where Windows Azure queues become unavailable, since that is a central service my own service depends on; I'll have to think about how to handle that failure a little more. In any case, this was a really good lesson from a major service failure, and I think it will ultimately make the final service much more solid.
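Building on the helpers in the sketch above, the failover and failback behavior could be a small loop that flips between two states: fail over (and alert) when the heartbeat goes stale, fail back once log entries start arriving again. The start_local_monitoring and stop_local_monitoring calls are stand-ins for however the parallel monitor actually gets launched and shut down; running two monitors at once is only safe here because, as noted above, the service tolerates multiple instances:

```python
import time

CHECK_INTERVAL_SECONDS = 300  # re-check the log table every 5 minutes


def start_local_monitoring():
    """Placeholder: launch the monitoring service from this machine."""


def stop_local_monitoring():
    """Placeholder: shut the parallel monitor back down."""


failed_over = False
while True:
    fresh = heartbeat_is_fresh()  # from the watchdog sketch above
    if not fresh and not failed_over:
        notify("Cotega monitoring is down",
               "No entries in the log table for the last 5 minutes.")
        start_local_monitoring()
        failed_over = True
    elif fresh and failed_over:
        # Windows Azure is writing log entries again; fail back to it.
        stop_local_monitoring()
        failed_over = False
    time.sleep(CHECK_INTERVAL_SECONDS)
```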