Beginning at 11:20 UTC on August 10, the Fraudnet US-East, US-West, and Canada data center regions experienced increased error rates. This prevented Events from being processed through our Marketplace Risk Check and Update REST APIs in those regions. The incident was triggered by a software bug in release 1.9.228.0.
Our engineers detected the issue and began an incident response at the first signs of impact, then determined the root cause and started remediation.
The root cause was a race condition in our production deployment flow that occurs when new sectors are added; it caused internal routing systems to crash.
Fraudnet engineers rolled back the release in the impacted regions to fix this issue. Isolating the problem was the most time-consuming part of the response. Once it was isolated, the team also fixed the bug in the affected systems for the long term.
Timeline of Events (UTC)
11:20 - Software release 1.9.228.0 in US-EAST, US-WEST, and CA-WEST began causing errors in the Risk Check and Update APIs.
11:30 - Issue detected and reported internally; incident response began.
11:35 - The engineering team gained access to production and began its investigation.
11:51 - The engineering team identified an issue isolated to the marketplace sector in the current release.
12:02 - Deployment rollback to software release 1.9.227.0 started.
12:06 - The engineering team confirmed that the deployment rollback was complete.
12:09 - All customer-impacting errors confirmed resolved.
Future Measures
As a result of this incident, our deployment process now includes measures that reduce the time needed to report issues during a deployment. Additionally, the code that caused the race condition has been refactored, fixing the bug in the previous deployment flow.