Codecov Uploader failing intermittently
Incident Report for Codecov
Postmortem

Issue Summary

On February 12th, 2021, the hit rate of the “django” service in Datadog crept up to abnormal levels over the course of a few hours and remained high for a total of 26 hours. This coincided with a release of the new Django API’s Bitbucket webhook handler (the resource driving the traffic increase, according to Datadog). About 1.5 hours after the release of this code, we received a report that a customer was experiencing upload timeouts and was unable to find their reports in our archives. Using dashboards and Watchdog alerts from Datadog, the team determined that the increased Bitbucket webhook traffic was overloading the “new tasks” Celery queue with extraneous “notify” tasks.

Timeline (EST)

February 12th, morning UTC: Codecov released the Bitbucket webhook handler via Ambassador to 100% of traffic

February 12, 2:00pm: Traffic to the Bitbucket webhook handler has not yet been causally linked to the issue, but is suspected since the release fits the issue timeline. Traffic is reduced to 0% via Ambassador.

February 12, 2:20pm: Watchdog reports traffic spike eases to 0 requests/s

February 12, 3:13pm: A bug in the released Bitbucket webhook handler source code is found and identified as the root cause in a live war-room Zoom

February 12, 3:42pm: A fix for the bug is created in a PR to the new API

February 12, 7:40pm: Number of tasks awaiting processing in the “new tasks” queue returns to normal level (0 tasks). User behavior returns to normal.

February 16th: Fix is approved and deployed

Root Cause

The heart of the issue was an erroneous condition in the new webhook handler.

The implementation of this webhook handler created an infinite loop of task creation. The resulting influx of tasks flooded our task queue and hogged worker resources, causing upload timeouts.
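As a simplified, hypothetical sketch (this is not Codecov’s actual code, and the event fields and bot name are assumptions), the failure mode looks like a webhook handler that enqueues a “notify” task for every incoming event, including the events its own notifications generate. A guard condition that ignores self-generated events breaks the cycle:

```python
from collections import deque

def handle_webhook(event, queue, loop_guard=True):
    """Enqueue a notify task unless the event was caused by our own bot."""
    if loop_guard and event.get("actor") == "codecov-bot":
        return  # break the cycle: ignore events we generated ourselves
    queue.append({"task": "notify", "commit": event["commit"]})

def drain(queue, loop_guard, max_steps=10):
    """Process tasks; each notify posts a status, which fires a new webhook."""
    steps = 0
    while queue and steps < max_steps:
        task = queue.popleft()
        steps += 1
        # posting the commit status triggers another webhook delivery
        handle_webhook({"actor": "codecov-bot", "commit": task["commit"]},
                       queue, loop_guard)
    return steps

q = deque()
handle_webhook({"actor": "alice", "commit": "abc123"}, q, loop_guard=False)
print(drain(q, loop_guard=False))  # hits max_steps: the queue never drains

q = deque()
handle_webhook({"actor": "alice", "commit": "abc123"}, q, loop_guard=True)
print(drain(q, loop_guard=True))   # 1: the single notify task completes
```

Without the guard, every processed task enqueues a successor, so the queue grows without bound as real traffic arrives; with the guard, each external event produces exactly one task.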

The reason this error was not caught during deployment is that our Bitbucket example traffic in staging was insufficient to trigger long-term issues. Since Codecov’s staging environment only has access to Codecov repositories, and Codecov doesn’t have any Bitbucket repositories, tests against real Bitbucket webhook traffic were not performed in the staging environment.

Resolution and Recovery

  1. Upon becoming suspicious of the new Bitbucket webhook code, the team disabled traffic to this endpoint via Ambassador.
  2. Once the offending bug was identified, a fix was pushed up.
  3. It took several hours for the “new tasks” queue to drain after traffic to the affected code was reduced, since the backup was so large.

Corrective and Preventative Measures

We have many preventative measures in place to tease out and fix bugs before they’re released. In this case, the offending lines of code were covered by unit tests and passed review from another developer.

We also take other preventative measures at the time of release to help us manage issues:

  1. We utilize an agile continuous delivery strategy that allows us to iterate on our production system extremely quickly. This helps us quickly address bugs that only manifest once they’re rolled into production.
  2. We use Ambassador to implement canary-style releases of our code, so we can slowly dial up the number of users that might experience an issue.
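For illustration, a canary split in Ambassador can be expressed as two Mappings sharing a prefix, with the `weight` field on the canary Mapping controlling its share of traffic. The resource and service names below are assumptions for the sketch, not Codecov’s actual configuration:

```yaml
# Hypothetical canary configuration -- names are illustrative only.
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: bitbucket-webhook-canary
spec:
  prefix: /webhooks/bitbucket/
  service: django-api-new      # canary receives this fraction of traffic
  weight: 5                    # start at 5%, then dial up gradually
---
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: bitbucket-webhook-stable
spec:
  prefix: /webhooks/bitbucket/
  service: django-api-stable   # unweighted Mapping receives the remainder
```

Raising `weight` in small increments, with a health check between each step, gives the gradual rollout described below.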

This issue was tricky because it only showed itself when the service was handling enough traffic to cause the task queue to back up. For that reason, it doesn’t seem like additional testing in staging would have caught it.

What we’re missing in our workflow is a process for releasing new services to progressively larger groups of customers, and for checking the health of those services after they’re released. This service, for example, was released at 5% of traffic, then went straight to 100%. Additionally, only very basic checks were performed upon release to verify the health of the service (“is the API still up?”, “are there any new Sentry errors?”, etc.). It would be better to have a comprehensive checklist of metrics and dashboards to verify that a recently released service is healthy.

Action Items

  1. Mock (a) Bitbucket data in our staging environment, and (b) progressively larger data sets within staging to more closely align with production-level traffic.
  2. Draft a spec for how to gradually release a service via Ambassador to a progressively wider audience, and a checklist to use when verifying the health of the recently released service.

We apologize for the inconvenience caused to you, our users. We hope that by sharing the above transparently, we can pass along what we learned and hold ourselves to a higher standard next time.

Posted Mar 12, 2021 - 16:32 UTC

Resolved
The patch to the uploader has resolved the delays and failures experienced on Thursday and Friday. If your upload is still in an error state, please consider re-running CI on the commit in error.

We apologize for any inconvenience caused.

We would note that only users of the legacy v2 uploader were affected. If you are using the legacy v2 Codecov uploader, we strongly recommend upgrading to the v4 uploader via Bash: https://docs.codecov.io/docs/about-the-codecov-bash-uploader

You can see if you are using the legacy v2 uploader by searching your Codecov uploader output in CI, and/or by checking whether you are using uploaders like Codecov Node v3.7.2.
Posted Feb 13, 2021 - 16:17 UTC
Update
The patch to the uploader has resolved the delays and failures experienced on Thursday and Friday. If your upload is still in an error state, please consider re-running CI on the commit in error.

We apologize for any inconvenience caused.

We would note that only users of the legacy v2 uploader were affected. If you are using the legacy v2 Codecov uploader, we strongly recommend upgrading to the v4 uploader via Bash: https://docs.codecov.io/docs/about-the-codecov-bash-uploader

You can see if you are using the legacy v2 uploader by searching your Codecov uploader output in CI, and/or by checking whether you are using uploaders like Codecov Node v3.7.2.
Posted Feb 13, 2021 - 16:16 UTC
Monitoring
A fix has been implemented and our processing queues have returned to normal. We continue to monitor current and past uploads.

If your upload is still in an error state, please consider re-running CI on the commit in error.
Posted Feb 13, 2021 - 00:56 UTC
Identified
We've identified a backup in our processing queues that was causing delayed or failed uploads. We are implementing changes to streamline the function of the uploader.
Posted Feb 12, 2021 - 19:20 UTC
Investigating
Codecov is investigating intermittent failures of the Uploader
Posted Feb 12, 2021 - 18:46 UTC
This incident affected: Backend.