Azure is Down for A.I. Companies

Microsoft's cloud computing service Azure is having critical failures in the South Central Region (their San Antonio datacenter). This is the primary datacenter where Microsoft launched and hosts it's GPU virtual machines (VMs, others might know these as 'instances' if you're an AWS or GCP user).

By checking Azure's Status, you can still see widespread failures, even over 12 hours later. (outage started at around 2am PST, the following screenshot was taken around 2pm PST -- there is still massive outage.)


Microsoft's communication about this problem was very lacking. They did not notify us at Deepgram. Only when we went to file a support ticket (when our own 'canary' type systems that live in our own datacenter were telling us something was wrong with our cloud GPU machines) were we given information on the problem.

The notice read:

"[Microsoft Azure customers] may experience difficulties connecting to resources hosted in this region. Engineers have isolated an issue with cooling in one part of the data center, which caused a localized spike in temperature, as the preliminary root-cause, which has now been mitigated. Automated data center procedures to ensure data and hardware integrity went into effect when temperatures hit a specified threshold and critical hardware entered a structured power down process.  Engineers are now in the process of restoring power to affected devices as part of the ongoing mitigation process."


Attempts to contact Azure by phone have not been fruitful.

This hits A.I. companies particularly hard because Azure was the first to offer good GPUs (like the Tesla P100) available for general use in a large cloud compute provider. Companies looking to train and host their inferencing engines on GPUs flocked to Azure during their release and are still hosting critical infrastructure there. 

Now companies use Azure for training or inference as a key piece of infrastructure. Deepgram uses Azure cloud GPUs for large scale inferencing [that's the part where you make a machine guess "what did someone say in an audio recording?", or for images "what is in an image, or for translation "what would these words be translated to in another language?".

We are starting to see relief from the outage as I write this post, but we aren't out of the woods yet (as of: Tue Sep  4 14:31:22 PDT 2018). More updates to come.


Update Tue Sep  4 15:03:02 PDT 2018:
We have restored partial service. Things have been stable for 15 minutes. Looking good for getting ramped back up but not out of the woods yet.

Update Tue Sep  4 15:03:02 PDT 2018: 
Partial service still restored but outages still prevalent. Azure's status page even has problems.


Wed Sep  5 14:30:49 PDT 2018:
Partial service still restored but short outages still prevalent.

It's been over 36 hours and South Central is still in bad shape.