A detailed report on the incident that occurred on 9/26 (Wed) and our subsequent response

This is a detailed report regarding the incident that occurred on Wednesday (9/26) and our response following the event.

Time of the incident
Event timeline (JST)
Cause of the incident
- Verifying the theory
Future support
Summary

Time of the incident

Time of the incident: 2018/09/26 10:51-15:20 (JST)
Events that occurred: Malfunction of the Mackerel system and suspension of connectivity monitoring

Event timeline (JST)

10:51 Redis failover and malfunction

As we’ve seen a trend of increased memory use in Redis, which is used for monitoring data storage, we implemented an operation to build a replication in order to scale up Redis. However, building the replication took time, and because Redis was unresponsive, it was detected as a node failure by clustering software (keepalived), and this caused an unintended failover.

As a result, connecting to Redis from the application server was unavailable and we weren’t able to respond properly.

10:55 Recovery and continued failure

We switched over to a more appropriate Redis and this temporarily restored the application.

However, the application was unstable, and along with complex factors such as network errors, Redis memory increasing, etc. that occurred before the incident happened, the application server latency deteriorated and a timeout occurred. Afterward, we restarted the application server, but the symptoms did not improve.

We’ve determined this to be the cause of the prolonged failure. The details surrounding this cause are unconfirmed and further investigation into the matter has been canceled.

11:00-14:50 Incident response

The following specific actions were made in response.

Detected organizations posting inappropriate metrics and cut off these requests
Adjusted the application server timeout interval
Temporarily reinforced Diamond(TSDB)
- Increased the Kinesis shard count / Redis Cluster shard count
Scale out the API use application server

Following these actions, we temporarily switched to maintenance mode, and after confirmation was made within the company, we gradually returned requests starting from the outside.

15:20 Recovery confirmation

After confirmation was made that the application server response was stable, metric retransmission from mackerel-agent had completed, and the delay in reflecting metrics data to the TSDB was resolved, then we declared the restoration.

Cause of the incident

This incident was caused by an unexpected failover accompanying the implementation of an operation in Redis. However, we have not been able to accurately identify the reasons behind the subsequent prolonged restoration. The following is a theory that was raised when looking back on the incident.

When a specific access pattern continues, a thread pool or connection pool with the same wait period or lock contention may cause latency to deteriorate in a Scala application.

Verifying the theory

In order to confirm the above theory, we attempted to reproduce the situation at the time of the incident in an isolated environment. Unfortunately, we were unsuccessful in reproducing an identical situation.

Future support

As previously mentioned, we are unable to specify the details surrounding the cause of this incident, but we do believe that we can prevent similar long-term failures in the future by implementing the following countermeasures.

Reviewing Redis failover behavior (implemented)

To cope with the failover malfunction which was the cause of all this trouble in the first place, we’ve responded by increasing the number of Redis keepalived health checks so as not to cause a failover with such a short period load increase.

Reinforcing the application (implemented)

We created more of a buffer for application performance. Specifically, the following was done.

Scale up work for Redis
Increased the number of application servers

Improving the efficiency of monitoring data stored in Redis (implemented)

As a basic response to the amount of memory used in Redis, the application was refurbished, efficiently saving only the necessary monitoring data in Redis, and the Redis memory usage was reduced.

Counteracting improper requests (implemented)

Building a mechanism to quickly detect improper API requests
- visualized in the dashboard
Creating a feature to quickly block organizations making improper requests

In the future, we will continue to try and improve the accuracy of detecting improper requests and we are also considering building up other features such as the API Limit etc.

Reinforcing application monitoring

We plan to strengthen the application server’s internal process monitoring and prepare a system that can respond before another similar incident occurs.

Summary

We understand that this incident and its extended was an inconvenience and we sincerely apologize. As we work hard to prevent recurrence, please trust that we will do our best to localize the problem even if a similar incident occurs.

Thank you for choosing Mackerel.