Incident Date: 17th November 2022
On November 17, 2022, the Autodesk Identity Authorization service experienced a service disruption that may have impacted customers’ ability to sign in to Autodesk cloud products and use cloud-connected workflows from within our desktop products between 5:14 AM PST and 12:01 PM PST.
- Autodesk cloud products and services, as well as desktop applications with cloud-based features, were impacted.
- Customers experienced intermittent issues where they were unable to sign in or could not stay signed in to impacted products and services.
- As part of a planned upgrade for the Autodesk Identity Authorization service, we updated a third-party vendor database component and added a new replication target to an existing replication of the authorization service database cluster. Unfortunately, this led to unexpected database contention and caused latency spikes for database queries.
- The database latencies resulted in sign-in and authorization timeouts in impacted Autodesk products and services. The timeouts triggered “retry” behavior in the impacted products and services, resulting in a significant increase in traffic to the system, which caused a service disruption.
- To resolve this issue and support the increased load, we introduced multiple new server clusters and new server traffic handling. With the new, scaled infrastructure in place, we started restoring service at 8:17 AM PST. We restored service gradually, reaching 100% restoration at 12:01 PM PST.
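The retry amplification described above is a common failure pattern: when many clients retry immediately after a timeout, the extra requests add load to a system that is already slow. As an illustration only, the following minimal sketch shows one widely used client-side mitigation, capped exponential backoff with full jitter. It is not a description of Autodesk's implementation; the function name `call_with_retries` and the parameters `base`, `cap`, and `max_attempts` are hypothetical and chosen for this example.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call `operation`, retrying on TimeoutError with capped exponential
    backoff and full jitter. All names and defaults here are illustrative
    assumptions, not values taken from the incident report."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            # Give up after the final attempt instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            # The upper bound doubles on each attempt, up to `cap` seconds.
            ceiling = min(cap, base * (2 ** attempt))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients do not retry at the same instant.
            sleep(rng.uniform(0.0, ceiling))
```

Spreading retries out in time and bounding the number of attempts limits how much additional traffic a latency spike can generate, at the cost of a longer wait for an individual client.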
Autodesk conducted a post-incident analysis of the event and identified actions we plan to take to prevent a recurrence of this issue. Some of these actions include:
- Engaging with our third-party vendor on remediating the database latency issue.
- Introducing improved high-availability and disaster recovery infrastructure, volume scaling, and policies for the supporting sign-in and authorization services. These changes will enhance the overall resiliency profile of the sign-in and authorization services, with higher-confidence failover and recovery.
- Expanding our service monitoring and observability capabilities to detect issues earlier and to support faster triage and recovery.
- Improving application traffic routing for managing infrastructure and server load. This will improve our services’ overall scale and availability when an exponential increase in traffic occurs.
- Introducing new load and production traffic simulation practices that will further validate and strengthen our resiliency and recovery measures.
Autodesk recognizes our responsibility to ensure maximum reliability and redundancy of our products and services, and we remain committed to consistently delivering reliable and world-class experiences for our customers. We thank you for your patience and understanding as we work to resolve this issue.