[cont.] Issues with slow SQL causing system outage
Incident Report for Codecov


SQL queries backed up, which crippled the system. The query that started the snowball effect was one retrieving large report data.


We reviewed logs and watched queries to identify the culprit repositories. This was, however, painful because the log details were not specific enough to query properly.


Once we identified the culprit repositories, we first temporarily blocked the uploading and processing of reports. We then changed the storage strategy to use a new scaling technique we have been working on.

  • added new logging data to help identify large projects and slow queries
  • switched to the new storage strategy for large projects
    • this improves overall performance of frontend page builds and SQL queries
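
A minimal sketch of the kind of slow-query logging described above, assuming a Python service. The names `timed_query` and `SLOW_QUERY_SECONDS`, the threshold value, and the log fields are hypothetical and chosen for illustration; they are not taken from Codecov's codebase.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("slow_query")

# Hypothetical threshold in seconds; tune to the workload.
SLOW_QUERY_SECONDS = 0.5


@contextmanager
def timed_query(repo_id, label, threshold=SLOW_QUERY_SECONDS, clock=time.perf_counter):
    """Time the wrapped block and log a warning tagged with the repository if it is slow."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed >= threshold:
            # Including the repository in the message makes the logs
            # searchable per project, which is the point of the change.
            logger.warning(
                "slow query repo=%s label=%s elapsed=%.3fs",
                repo_id, label, elapsed,
            )
```

Wrapping a database call as `with timed_query(repo_id, "fetch_report"): ...` would then emit one log line per slow query, so culprit repositories can be found by searching on the `repo=` field.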


We appreciate the love and respect from the community. Above all, your patience humbles us.

Thank you for the #hugops :)

<3 The Codecov Team

Posted over 2 years ago. Mar 22, 2017 - 12:44 UTC

This incident has been resolved.
Posted over 2 years ago. Mar 21, 2017 - 23:25 UTC
Worker queue is under 500 and dropping quickly. We have implemented new procedures to store reports.

System has been stable for ~2 hours now.
Posted over 2 years ago. Mar 21, 2017 - 23:22 UTC
Working on our job queue. Thank you for your patience.
Posted over 2 years ago. Mar 21, 2017 - 22:46 UTC
Servers continue to battle SQL query lag. We are dedicated to resolving this issue ASAP.
Posted over 2 years ago. Mar 21, 2017 - 21:32 UTC
A fix has been implemented and we are monitoring the results.
Posted over 2 years ago. Mar 21, 2017 - 20:54 UTC
Sorry, but this continues to come up. We are working hard to manage the system and identify the culprit repositories.

One or more projects have massive reports that are causing the outage. More information will come soon. Thank you for your patience!
Posted over 2 years ago. Mar 21, 2017 - 20:37 UTC