Issue

SQL queries backed up, which crippled the system. The first query to trigger the snowball effect was one that retrieved large report data.

Identification

We reviewed logs and watched queries to identify the culprit repositories. This was painful, however, because the log entries were not specific enough to search effectively.

Resolution

Once we identified the culprit repositories, we first temporarily blocked the uploading and processing of reports. We then changed the storage strategy to use a new scaling technique we have been working on.
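As a rough illustration of the temporary block described above, here is a minimal sketch. It assumes the block is applied per repository and expires on its own after a set time; the `UploadBlocklist` class, its method names, and the repository identifiers are hypothetical and are not taken from our codebase.

```python
import time


class UploadBlocklist:
    """Temporarily refuses uploads for selected repositories (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the expiry logic can be exercised without sleeping.
        self._clock = clock
        self._blocked_until = {}

    def block(self, repo_id, seconds):
        """Block uploads for repo_id for the given number of seconds."""
        self._blocked_until[repo_id] = self._clock() + seconds

    def is_blocked(self, repo_id):
        """Return True while the block is active; expired blocks are dropped."""
        expires = self._blocked_until.get(repo_id)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._blocked_until[repo_id]
            return False
        return True
```

An upload handler could call `is_blocked(repo_id)` before queueing a report and reject the upload while the block is active, so the worker queue can drain.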

Changes

- Added new logging data to help identify large projects and slow queries
- Now using the new storage strategy for large projects
  - Improves overall performance of frontend page builds and SQL queries
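To give a feel for the kind of logging that helps here, the sketch below times a block of work and logs a warning with the repository and report size when it runs long. It is an illustrative assumption, not our actual implementation: `timed_query`, the one-second threshold, and the logged field names are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("slow_queries")

# Hypothetical threshold; tune to whatever counts as "slow" for the workload.
SLOW_QUERY_SECONDS = 1.0


@contextmanager
def timed_query(label, repo_id, report_bytes,
                threshold=SLOW_QUERY_SECONDS, clock=time.monotonic):
    """Log a warning with repository context if the wrapped block is slow."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed >= threshold:
            # Include enough context to search logs by repository and size.
            logger.warning(
                "slow query label=%s repo_id=%s report_bytes=%d elapsed=%.3fs",
                label, repo_id, report_bytes, elapsed,
            )
```

Wrapping a database call in `with timed_query("fetch_report", repo_id, size):` would then leave a searchable log line naming the repository whenever the call exceeds the threshold.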

Thanks

We appreciate the love and respect from the community. Above all, your patience humbles us.

Thank you for the #hugops :)

<3 The Codecov Team

Posted Mar 22, 2017 - 12:44 UTC

Resolved

This incident has been resolved.

Posted Mar 21, 2017 - 23:25 UTC

Update

Worker queue is under 500 and dropping quickly. We have implemented new procedures to store reports.

System has been stable for ~2 hours now.

Posted Mar 21, 2017 - 23:22 UTC

Monitoring

Working on our job queue. Thank you for your patience.

Posted Mar 21, 2017 - 22:46 UTC

Investigating

Servers continue to battle SQL query lag. We are dedicated to resolving this issue ASAP.

Posted Mar 21, 2017 - 21:32 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 21, 2017 - 20:54 UTC

Investigating

Sorry, but this continues to come up. We are working hard to manage the system and identify the culprit repositories.

One or more projects have massive reports that are causing the outage. More information will come soon. Thank you for your patience!

Posted Mar 21, 2017 - 20:37 UTC