Intermittent errors in processing
Resolved
May 08 at 04:00pm CEST
Enterspeed Processing Post-Mortem: 7th May 2024
On the 7th of May 2024, an incident occurred that resulted in some views not being generated for users. This impacted their ability to access updates from their CMS or other sources on their websites. The incident stemmed from a bug introduced during a change aimed at providing more insight into the jobs queue on the 6th of May. Notably, Enterspeed's Ingest and Delivery API remained unaffected.
If you have been affected by this bug, we recommend deploying all schemas. Re-ingesting the affected source entities only has an effect if their data differs from the version that was already ingested. Additionally, feel free to reach out on your Slack support channel or via support@enterspeed.com for assistance.
What happened?
Around noon on the 7th of May, reports surfaced from users stating that some newly ingested source entities were not processed, leading to missing views. Although initially described with uncertainty, a clear pattern emerged from these reports.
At 13:42 CEST, we officially declared an incident on our status page. The engineering team promptly convened to identify and resolve the bug. Initial investigations, including log reviews and monitoring, showed no logged errors, and processing appeared functional.
A subsequent review of recent deployments revealed a change to our fairness queue. This led us to notice that our deduplication feature was removing an excessive number of processing jobs, so we focused our investigation on this part of the system.
Although a code review found no glaring issues, we were able to reproduce the bug around 15:30 CEST and identified a pattern: the issue occurred only when multiple environments were being processed simultaneously.
By 18:41 CEST, a fix was deployed to production, and the deduplication monitoring graph returned to normal levels.
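As an illustration only (this is not Enterspeed's actual code, and the `Job` fields and key functions are hypothetical), the sketch below shows one way a deduplication step can discard valid jobs when several environments have jobs queued at the same time: if the deduplication key does not include the environment, jobs from different environments look like duplicates of each other. Whether the actual bug took this exact form is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical fields; the real job model is not described in this post-mortem.
    environment_id: str
    entity_id: str


def dedupe(jobs, key):
    """Keep the first job seen for each key and drop later ones."""
    seen = set()
    kept = []
    for job in jobs:
        k = key(job)
        if k not in seen:
            seen.add(k)
            kept.append(job)
    return kept


def key_without_environment(job):
    # Too coarse: jobs from different environments collide on the same key.
    return job.entity_id


def key_with_environment(job):
    # Scoped: only jobs within the same environment can be duplicates.
    return (job.environment_id, job.entity_id)


jobs = [
    Job("env-a", "page-1"),
    Job("env-b", "page-1"),  # same entity id, different environment
    Job("env-a", "page-1"),  # genuine duplicate of the first job
]

too_aggressive = dedupe(jobs, key_without_environment)
scoped = dedupe(jobs, key_with_environment)

print(len(too_aggressive), len(scoped))
```

With a single environment in the queue, both keys behave identically, which is consistent with a bug of this kind only appearing when multiple environments are processed at once.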
Root Cause
For an incident like this to happen, multiple steps need to fail, including peer reviews, manual testing, and automated checks. Neither the engineer nor the code reviewer fully grasped the consequences of the code change. Although we rely on automated testing, the scenario leading to the bug was not covered. We attribute the root cause to human error: a misunderstanding of how the deduplication feature functions.
Lessons Learned
Our key takeaways are that our automated tests must cover scenarios like this one and that we should consider expanding our logging so that errors of this kind are easier to identify. In the short term, we will expand our automated tests accordingly. We also plan to add more environment-based rules to the fairness queue, and we will incorporate these lessons into that work.
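A test for this kind of scenario could look roughly like the sketch below. It is a minimal example under stated assumptions: jobs are modelled as hypothetical `(environment, entity)` tuples, and `dedupe` is a stand-in for the real deduplication step, which is not shown in this post-mortem. The test interleaves jobs from two environments and checks that every distinct job survives deduplication.

```python
from itertools import chain, zip_longest


def interleave(*queues):
    """Mix jobs from several environments, as when they are processed at once."""
    mixed = chain.from_iterable(zip_longest(*queues))
    return [job for job in mixed if job is not None]


def lost_jobs(queued, kept):
    """Distinct (environment, entity) jobs that were queued but not kept."""
    return sorted(set(queued) - set(kept))


def dedupe(jobs):
    # Stand-in for the deduplication step under test (hypothetical).
    # dict.fromkeys drops exact repeats while preserving order.
    return list(dict.fromkeys(jobs))


env_a = [("env-a", "page-1"), ("env-a", "page-2"), ("env-a", "page-1")]
env_b = [("env-b", "page-1"), ("env-b", "page-3")]

queued = interleave(env_a, env_b)
kept = dedupe(queued)
lost = lost_jobs(queued, kept)

# Only the exact repeat within env-a may be removed; nothing may be lost.
assert lost == [], f"deduplication discarded valid jobs: {lost}"
```

The invariant checked here is independent of how deduplication is implemented, so the same assertion can be run against any replacement for `dedupe`.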
Final Words
We recognise the severity of this incident and acknowledge its impact on user trust. We apologise for any inconvenience caused and are committed to avoiding similar issues in the future. For any questions or concerns, please reach out on your Slack support channel or via support@enterspeed.com.
You can learn more about the technical details of our fairness queue in our blog post.
Affected services
Processing
Updated
May 07 at 08:14pm CEST
The fix continues to show the expected results. We will monitor the platform and update this page if anything changes.
Please reach out on your Slack support channel or via support@enterspeed.com if you have any questions.
Affected services
Processing
Updated
May 07 at 06:48pm CEST
We have deployed a fix, and we will continue to monitor the platform to confirm that it has the desired outcome.
Affected services
Processing
Updated
May 07 at 03:49pm CEST
We have identified the source of the problem and are now working to restore 100% processing.
The issue occurs when multiple environments have jobs in the processing queue at the same time, so avoiding that situation may help if you are currently experiencing problems.
Next update at 19:00 CEST at the latest.
Affected services
Processing
Updated
May 07 at 02:59pm CEST
We are still investigating the root cause of the issue. We have identified that a limited number of view processing jobs are being wrongly discarded.
Affected services
Processing
Created
May 07 at 01:42pm CEST
We are currently investigating reports of missing view generations in the processing layer.
We will update this page as we know more.
Affected services
Processing