On February 12th, 2021, the hit rate of the “django” service in Datadog crept up to abnormal levels over the course of a few hours, and remained high for a total of 26 hours. This coincided with a release of the new Django API’s Bitbucket webhook handler (the resource driving the traffic increase, according to Datadog). About 1.5 hours after the release of this code, we received a report that a customer was experiencing upload timeouts and was unable to find their reports in our archives. Using dashboards and Watchdog alerts from Datadog, the team determined that the increased Bitbucket webhook traffic was overloading the “new tasks” celery queue with extraneous “notify” tasks.
February 12, morning UTC: Codecov releases the Bitbucket webhook handler via Ambassador to 100% of traffic.
February 12, 2:00pm: Traffic to the Bitbucket webhook handler is not yet causally linked to the issue, but is suspected because the release fits the issue timeline. Traffic is reduced to 0% via Ambassador.
February 12, 2:20pm: Watchdog reports the traffic spike easing to 0 requests/s.
February 12, 3:13pm: A bug in the released Bitbucket webhook handler source code is found and identified as the root cause in a live war-room Zoom.
February 12, 3:42pm: A fix for the bug is created in a PR to the new API.
February 12, 7:40pm: The number of tasks awaiting processing in the “new tasks” queue returns to its normal level (0 tasks). User behavior returns to normal.
February 16: The fix is approved and deployed.
The heart of the issue was an erroneous condition in the new webhook handler. The condition caused the handler to create an infinite loop of “notify” tasks: the resulting influx of tasks flooded our task queue and hogged resources, causing problems for uploads.
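The offending code isn’t reproduced in this post, but a minimal sketch of how a webhook handler can produce this kind of feedback loop may help. Everything below is illustrative: the names, the event types, and the broker URL are assumptions, not Codecov’s actual code.

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

# Illustrative Bitbucket webhook event keys this handler reacts to.
KNOWN_EVENT_TYPES = {"repo:push", "repo:commit_status_updated"}

def post_status_to_bitbucket(repo_id):
    """Stub: POSTing a commit status makes Bitbucket emit a
    repo:commit_status_updated webhook back to this service."""

@app.task
def notify(repo_id):
    # Side effect: Bitbucket answers this with another webhook event.
    post_status_to_bitbucket(repo_id)

def handle_bitbucket_webhook(event):
    # BUG (illustrative): this condition also matches the status events
    # produced by our own notify() task, so every notification triggers
    # a new webhook, which enqueues another notify task: an unbounded
    # feedback loop that floods the "new tasks" queue.
    if event["type"] in KNOWN_EVENT_TYPES:
        notify.delay(event["repo_id"])
    # A fix in this sketch would return early for events that our own
    # notifications produce, breaking the cycle.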
The reason this error was not caught during deployment is that our Bitbucket example traffic in staging was insufficient to trigger long-term issues. Codecov’s staging environment only has access to Codecov’s own repositories, and since Codecov doesn’t have any Bitbucket repositories, no tests could be performed against Bitbucket in the staging environment.
We have many preventative measures in place to tease out and fix bugs before they’re released. In this case, the offending lines of code were covered by unit tests and passed review from another developer.
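To illustrate why passing unit tests didn’t surface the problem, here is a hypothetical test in that style, reusing the handler sketched above. It exercises the handler in isolation, so it passes: the system-level feedback (Bitbucket echoing a webhook for every status we post) is invisible at this level.

```python
from unittest import mock

def test_push_event_enqueues_notify():
    # The enqueue happens exactly as written, so coverage and
    # assertions are satisfied even though the deployed system loops.
    event = {"type": "repo:push", "repo_id": "example/repo"}
    with mock.patch.object(notify, "delay") as delay:
        handle_bitbucket_webhook(event)
    delay.assert_called_once_with("example/repo")
```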
We also take other preventative measures at the time of release to help us manage issues, such as gradual traffic rollouts via Ambassador and basic post-release health checks.
This issue was tricky because it only showed itself when the service was handling enough traffic to cause the task queue to back up. For that reason, it doesn’t seem like additional testing in staging would have caught it.
What’s missing from our workflow is a process for releasing new services to progressively larger groups of customers, and for checking the health of those services after release. This service, for example, was released at 5% of traffic, then went straight to 100%. Additionally, only very basic checks were performed upon release to verify the health of the service (“is the API still up”, “are there any new Sentry errors”, etc.). It would be better to have a comprehensive checklist of metrics and dashboards to verify that a recently released service is healthy.
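As one concrete example of such a check, a post-release probe could watch the depth of the “new tasks” queue, since its normal level in this incident was 0 tasks. The sketch below assumes a Redis broker (where a Celery queue is a Redis list) and uses illustrative names and thresholds.

```python
import redis

QUEUE_NAME = "new_tasks"      # illustrative; the real queue name may differ
BACKLOG_THRESHOLD = 1000      # illustrative; tune against the normal baseline (~0)

def check_queue_backlog() -> bool:
    # With a Redis broker, the length of the queue's list is the
    # number of tasks waiting to be processed.
    client = redis.Redis(host="localhost", port=6379, db=0)
    depth = client.llen(QUEUE_NAME)
    healthy = depth <= BACKLOG_THRESHOLD
    status = "ok" if healthy else "UNHEALTHY"
    print(f"{status}: {depth} tasks waiting in {QUEUE_NAME!r}")
    return healthy

if __name__ == "__main__":
    check_queue_backlog()
```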
We apologize for the inconvenience caused to you, our users. We hope that by sharing the above transparently, we can pass on what we learned and hold ourselves to a higher standard next time.