[cont.] Issues with slow SQL causing system outage
Incident Report for Codecov
Postmortem

Issue

SQL queries backed up, which crippled the system. The first query to trigger the snowball effect was one retrieving large report data.

Identification

We reviewed logs and watched queries to identify the culprit repositories. This was painful, however, because the log entries were not specific enough to search effectively.

Resolution

Once we identified the culprit repositories, we first temporarily blocked the uploading and processing of reports. We then changed the storage strategy to use a new scaling technique we have been working on.
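
To illustrate the first step, here is a minimal sketch of how uploads could be paused for a limited time. It assumes the pause is applied per repository and expires on its own; the report does not say how the block was actually implemented, and `UploadGate`, `pause`, and `is_accepting` are hypothetical names.

```python
import time


class UploadGate:
    """Tracks repositories whose uploads are temporarily paused (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._paused_until = {}  # repo_id -> time at which the pause expires

    def pause(self, repo_id, seconds):
        """Reject uploads for repo_id for the next `seconds` seconds."""
        self._paused_until[repo_id] = self._clock() + seconds

    def is_accepting(self, repo_id):
        """Return True if uploads for repo_id should currently be processed."""
        expires = self._paused_until.get(repo_id)
        if expires is None:
            return True
        if self._clock() >= expires:
            # The pause has lapsed; forget it so uploads resume.
            del self._paused_until[repo_id]
            return True
        return False
```

An upload handler could call `is_accepting(repo_id)` before queueing a report and ask the client to retry later when it returns `False`. Letting the pause expire by itself means a forgotten block cannot outlive the incident.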

Changes
  • Added new logging data to help identify large projects and slow queries
  • Switched to a new storage strategy for large projects
    • Improves overall performance of frontend page builds and SQL queries
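
As a rough illustration of the logging change, the sketch below times a query and logs it together with the repository it concerns when it exceeds a threshold. The function name `timed_query`, the `repo_id` field, the `SLOW_QUERY_SECONDS` threshold, and the use of SQLite are all assumptions made for the example, not a description of our actual code.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("slow_queries")

# Hypothetical threshold; queries at or above it are logged as warnings.
SLOW_QUERY_SECONDS = 0.5


def timed_query(conn, repo_id, sql, params=(), threshold=SLOW_QUERY_SECONDS):
    """Run a query and log its duration along with the repository it concerns."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed >= threshold:
        # Including the repository makes it possible to search logs per project.
        logger.warning(
            "slow query repo=%s seconds=%.3f rows=%d sql=%s",
            repo_id, elapsed, len(rows), sql,
        )
    return rows, elapsed
```

Recording the repository and row count next to the duration is what makes the log searchable by project, which is the detail that was missing during identification.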

Thanks

We appreciate the love and respect from the community. Above all, your patience humbles us.

Thank you for the #hugops :)

<3 The Codecov Team

Posted Mar 22, 2017 - 12:44 UTC

Resolved
This incident has been resolved.
Posted Mar 21, 2017 - 23:25 UTC
Update
Worker queue is under 500 and dropping quickly. We have implemented new procedures to store reports.

System has been stable for ~2 hours now.
Posted Mar 21, 2017 - 23:22 UTC
Monitoring
Working on our job queue. Thank you for your patience.
Posted Mar 21, 2017 - 22:46 UTC
Investigating
Servers continue to battle SQL query lag. We are dedicated to resolving this issue ASAP.
Posted Mar 21, 2017 - 21:32 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 21, 2017 - 20:54 UTC
Investigating
Sorry, but this continues to come up. We are working hard to manage the system and identify the culprit repositories.

One or more projects have massive reports that are causing the outage. More information will come soon. Thank you for your patience!
Posted Mar 21, 2017 - 20:37 UTC