Microsoft expands ‘outage mode’ for Azure Active Directory • The Register
Microsoft hopes to improve the resiliency of its cloud services by extending an “outage mode” for Azure Active Directory to cover web and desktop applications.
Azure Active Directory (AAD) is Microsoft’s cloud directory that handles authentication for Office 365 and can be linked to on-premises Active Directory. Additionally, developers can write applications that use the service. However, if something goes wrong, customers experience multiple failures, including being unable to access the Azure portal to manage other cloud services.
In December last year, Microsoft raised its Service Level Agreement (SLA) for AAD from 99.9% to 99.99% uptime, but with a sleight of hand: it also removed “administrative functionality” from its definition of availability.
Now the company has given more details on its efforts, focusing on a fallback authentication service that replicates authentication data during normal operations and, if the main service fails, goes into “failure mode”, in which it can verify requests and issue tokens to customers.
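For a sense of scale, the gap between the two SLA figures can be worked out directly. The sketch below assumes a 30-day measurement period, which is our assumption for illustration rather than a term quoted from the SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, that an uptime percentage permits over `days` days.

    The 30-day default is an illustrative assumption, not taken from the SLA.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        print(f"{sla}% uptime allows {allowed_downtime_minutes(sla):.1f} minutes of downtime")
```

On those assumptions, 99.9% allows roughly 43 minutes of downtime in a month, while 99.99% allows a little over four.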
According to Microsoft, this has worked for Outlook Web Access and SharePoint Online since 2019, although we note that during the September 2020 outage Outlook and SharePoint were affected. The reason given at the time was that “a recent configuration change impacted a core storage layer”, an issue that was compounded by another issue caused by “a change put in place to mitigate the impact”. So it seems that the backup service was not enough in this case.
There is also a limitation: authentications are only processed by the backup service if the user has accessed an “application or resource” within the last three days, the so-called “storage window”. The company found this to be acceptable for most users, who “access their most important apps daily from a consistent device”, but it’s easy to think of cases where users will be blocked, for example if they buy a new device.
It’s better than nothing though, and Microsoft has been busy expanding its applicability. Earlier this year, support for desktop and mobile apps was added, and next year more web apps will follow, including Teams Online and the rest of Office 365. Client applications using OpenID Connect will come shortly after.
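The three-day rule can be pictured as a simple recency check. What follows is an illustrative sketch only, not Microsoft’s implementation: the function name, the idea of a per-user, per-app, per-device access record, and the treatment of an access exactly three days old as still valid are all our assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# "within the last three days", as described in the article
STORAGE_WINDOW = timedelta(days=3)


def backup_can_serve(last_access: Optional[datetime], now: datetime) -> bool:
    """Return True if a previous access falls inside the storage window.

    `last_access` is a hypothetical record of when this user last reached the
    app from this device, or None if no such access exists (a new device, say).
    """
    if last_access is None:
        return False
    return now - last_access <= STORAGE_WINDOW
```

A user who signed in yesterday passes the check, while one returning from a week’s holiday, or signing in from a freshly bought laptop, does not.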
More questions than answers
In some ways, Microsoft’s latest article raises more questions than it answers. A quick look at the Azure status page shows “Azure Active Directory – Problems trying to authenticate”, possibly limited to customers using external Azure Active Directory identities, with the root cause attributed to “outbound port exhaustion”. It’s unclear where this sits on the enterprise architecture diagram.
In March of this year, there was an extended AAD outage caused by the mistaken deletion of a key used for cryptographic signing. Microsoft referenced the backup service at the time and said that “Unfortunately it didn’t help in this case as it provided coverage for the token issuance but didn’t provide coverage for the token validation as it depended on the impacted metadata endpoint.”
It therefore appears that the extension of the backup service will not solve all the problems that may impact AAD even if it is beneficial.
In August of this year, Gartner analysts reported that customers “remain concerned about the real-world impacts” of Azure’s reliability, even though its performance is not bad in absolute terms. Gartner considers some Azure regions to be less resilient than they should be, possibly due to capacity issues – but note that the pandemic has caused an increase in demand for all cloud providers.
Microsoft also has questions to answer regarding the Cosmos DB vulnerability described by Wiz security researchers earlier this month. The vulnerability has been patched, but the researchers identified what look like extraordinary architectural errors. Firewall rules were in place to prevent breach escalation, but “these firewall rules were configured locally on the container where we were currently running as root. So we simply removed the rules (by issuing iptables -F), paving the way for those banned IPs and even more interesting discoveries.”
It’s a good thing when Azure CTO Mark Russinovich pops up to talk to us and colleagues about Azure’s reliability improvements, and the extended AAD backup service is welcome even if not always effective, but we would like to know more about these other pressing situations. ®