Degraded

Delivery API data is delayed

Jan 03 at 09:34am CET
Affected services
Delivery API

Resolved
Jan 05 at 09:31pm CET

We want to follow up on Tuesday's incident. Our investigation has shown that the incident was caused by a combination of events.

The first event was a series of schema deploys on a large tenant. When working with large data sets, it is not uncommon for a schema deployment to take a few minutes, and in some cases 10-20 minutes is to be expected. In this particular case, a total of 15 schemas were deployed, and each needed a few minutes. At some point, the developer did not see the expected views in the Delivery API and decided to re-deploy some of the schemas.

Normally this is not a cause for concern, as the processing layer simply works through the queue of jobs. But one of the schemas with multiple deploys triggered the processing of a particular source entity type with a high number of large source entities. The number of entities, combined with the size of each entity, triggered a bug in our code. Together with the added impact of the multiple schema deploys, the bug caused our processing layer to lock up and essentially stop processing.

We reacted by deploying a new version of the platform with a "kill switch" feature that can clear out messages for a particular tenant. This had an immediate effect and stopped the processing queue from growing. Unfortunately, the new version had a performance regression bug that significantly reduced processing capacity.
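To illustrate the idea, here is a minimal sketch of what such a tenant kill switch could look like, written as a self-contained TypeScript example. The ProcessingJob shape and ProcessingQueue class are hypothetical stand-ins; the actual platform internals are not described in this report.

```typescript
interface ProcessingJob {
  tenantId: string;
  schemaId: string;
  payload: unknown;
}

class ProcessingQueue {
  private jobs: ProcessingJob[] = [];

  enqueue(job: ProcessingJob): void {
    this.jobs.push(job);
  }

  // The "kill switch": drop every pending job for the given tenant so the
  // rest of the queue can keep flowing.
  purgeTenant(tenantId: string): number {
    const before = this.jobs.length;
    this.jobs = this.jobs.filter((job) => job.tenantId !== tenantId);
    return before - this.jobs.length; // number of jobs dropped
  }

  dequeue(): ProcessingJob | undefined {
    return this.jobs.shift();
  }
}

// Usage: stop a runaway tenant from starving everyone else.
const queue = new ProcessingQueue();
queue.enqueue({ tenantId: "tenant-a", schemaId: "product-page", payload: {} });
queue.enqueue({ tenantId: "tenant-b", schemaId: "blog-post", payload: {} });
console.log(`Purged ${queue.purgeTenant("tenant-a")} job(s) for tenant-a`);
```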

We then deployed another new version of the platform with the "kill switch", but without the performance regression bug. After that, we were quickly able to clear out the queue.

We want to underline that the incident is not the fault of the developer who triggered the multiple deploys. The re-deploys may have contributed to the size of the problem, but they were not the root cause.

We still have some unanswered questions, especially why this bug only now caused the processing layer to become backlogged. As far as our investigation has shown, both the buggy code and the schema version with the specific source entity type have been in the platform for multiple months without the processing layer becoming backlogged. Our main theory is that other optimisations have generally allowed the processing layer to process more data, faster, thereby increasing the overall load. But we have chosen not to investigate further.

We have learned enough to start working on initiatives to make sure that a similar incident can't happen again. This incident shares some of the characteristics of the previous incident in December. It is beyond frustrating to again experience the processing layer becoming backlogged and taking hours to clear the queue. The cause of the problem was different, but the symptoms and effects for our users were the same. We are glad to see that one of the actions from the last incident helped mitigate the situation, but we are not satisfied with the time it took to deploy a functioning version.

The first action is of course to fix the bug that was triggered in this incident. Secondly, we continue to work on mitigating actions around the isolation of a tenant's processing queues, detection of potential cascading events and detection of potential "over processing".
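As an illustration of the kind of detection we have in mind, the sketch below flags a tenant when the same schema is re-deployed repeatedly within a short window. The thresholds, names, and logic are hypothetical examples, not the actual platform implementation.

```typescript
interface DeployEvent {
  tenantId: string;
  schemaId: string;
  timestamp: number; // epoch milliseconds
}

class RedeployDetector {
  private events: DeployEvent[] = [];

  constructor(
    private readonly windowMs = 15 * 60 * 1000, // 15-minute window
    private readonly maxDeploys = 3             // flag on the 4th deploy
  ) {}

  // Record a deploy and report whether this tenant/schema pair has now been
  // deployed more than maxDeploys times within the window.
  record(event: DeployEvent): boolean {
    const cutoff = event.timestamp - this.windowMs;
    this.events = this.events.filter((e) => e.timestamp >= cutoff);
    this.events.push(event);

    const recent = this.events.filter(
      (e) => e.tenantId === event.tenantId && e.schemaId === event.schemaId
    );
    return recent.length > this.maxDeploys;
  }
}

// Usage: feed deploy events in as they happen and alert when one is flagged.
const detector = new RedeployDetector();
const now = Date.now();
for (let i = 0; i < 5; i++) {
  const flagged = detector.record({
    tenantId: "tenant-a",
    schemaId: "product-page",
    timestamp: now + i * 1000,
  });
  if (flagged) {
    console.log("Potential over-processing: repeated re-deploys detected");
  }
}
```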

We are also working on providing more feedback to developers using Enterspeed, so they can better understand processing timings.

The performance regression bug introduced in the first version of the kill switch also bears mentioning. We deployed it without performance tests having been done. We had done code reviews as well as automated and manual tests, so our confidence was high. But despite our confidence in the code, a performance regression bug was deployed.

This has led us to improve our load and performance test setup. With the experience of this incident, we can conclude that setting up these tests requires too much manual work, and therefore they are only done when we work directly with performance-critical areas. This leaves too much room for regression bugs in seemingly unrelated areas.
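As a sketch of the direction we are thinking in, an automated throughput check along these lines could run on every build and fail when processing a fixed batch takes longer than a known-good baseline. The processBatch function and the numbers below are hypothetical placeholders, not our actual test suite.

```typescript
// processBatch is a stand-in for the real processing layer; it just does
// trivial async work per job so the example runs on its own.
async function processBatch(jobs: number[]): Promise<void> {
  for (const job of jobs) {
    await Promise.resolve(job * 2);
  }
}

async function throughputCheck(): Promise<void> {
  const jobs = Array.from({ length: 10_000 }, (_, i) => i);
  const baselineMs = 2_000; // upper bound measured on a known-good build

  const start = Date.now();
  await processBatch(jobs);
  const elapsedMs = Date.now() - start;

  if (elapsedMs > baselineMs) {
    throw new Error(
      `Throughput regression: batch took ${elapsedMs}ms (baseline ${baselineMs}ms)`
    );
  }
  console.log(`Batch processed in ${elapsedMs}ms (baseline ${baselineMs}ms)`);
}

throughputCheck();
```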

Updated
Jan 03 at 03:34pm CET

The queue is now completely processed and we are back to normal processing times.

We are very sorry for the delays. We are still working through the data to understand what triggered this. We will post a full incident report when we know more.

Updated
Jan 03 at 03:25pm CET

We have now deployed a fix to the kill switch and are seeing the expected performance. We expect the queue to be processed in ~10 minutes (15.35 CET).

We will update again when the processing is done.

Updated
Jan 03 at 01:44pm CET

As previously reported, we have triggered our tenant "kill switch" on the specific tenant where this issue is occurring. The newly introduced kill switch is not performing the purging as fast as we expected.

Our current estimate for completion is sometime between 19.00 and 20.00 CET.

We are currently working on significantly improving the performance of the kill switch.

Please remember that we are prioritising data consistency for all other tenants; we would rather live with some delays than risk data inconsistencies.

Updated
Jan 03 at 11:26am CET

We have identified the issue and isolated it to a single tenant. We are now working to remove the backlogged messages for this specific tenant.

Created
Jan 03 at 09:34am CET

Our monitoring has picked up on delays in our processing layer. We are currently investigating the issue.