Mackerel blog #mackerelio

The Official Blog of Mackerel

A detailed report on the incident that occurred on 9/26 (Wed) and our subsequent response

This is a detailed report regarding the incident that occurred on Wednesday (9/26) and our response following the event.

Time of the incident

  • Time of the incident: 2018/09/26 10:51-15:20 (JST)
  • Events that occurred: Malfunction of the Mackerel system and suspension of connectivity monitoring

Event timeline (JST)

10:51 Redis failover and malfunction

Because memory use had been trending upward in the Redis instance that stores monitoring data, we began building a replica in order to scale Redis up. However, building the replica took time, and because Redis was unresponsive during that period, the clustering software (keepalived) detected it as a node failure and triggered an unintended failover.

As a result, the application servers could not connect to Redis, and the application was unable to respond properly.

10:55 Recovery and continued failure

We switched over to the appropriate Redis instance, and this temporarily restored the application.

However, the application remained unstable. A combination of factors, including network errors and the growth in Redis memory usage that predated the incident, caused application server latency to deteriorate and requests to time out. We then restarted the application servers, but the symptoms did not improve.

We consider this to be the cause of the prolonged failure. However, the details have not been confirmed, and further investigation into the matter has been discontinued.

11:00-14:50 Incident response

We took the following specific actions in response.

  • Detected organizations posting inappropriate metrics and cut off these requests
  • Adjusted the application server timeout interval
  • Temporarily reinforced Diamond (TSDB)
    • Increased the Kinesis shard count / Redis Cluster shard count
  • Scaled out the application servers used for the API

Following these actions, we temporarily switched to maintenance mode, and after verifying operation internally, we gradually resumed accepting external requests.

15:20 Recovery confirmation

We declared the service restored after confirming that application server responses were stable, that metric retransmission from mackerel-agent had completed, and that the delay in reflecting metric data in the TSDB had been resolved.

Cause of the incident

This incident was caused by an unexpected failover that accompanied an operation on Redis. However, we have not been able to accurately identify why the subsequent recovery took so long. The following theory was raised when looking back on the incident.

  • When a specific access pattern continues, waiting on a thread pool or connection pool, or lock contention, may cause latency to deteriorate in a Scala application.

Verifying the theory

In order to confirm the above theory, we attempted to reproduce the situation at the time of the incident in an isolated environment. Unfortunately, we were unsuccessful in reproducing an identical situation.

Future countermeasures

As previously mentioned, we are unable to pinpoint the details surrounding the cause of this incident, but we do believe that we can prevent similar prolonged failures in the future by implementing the following countermeasures.

Reviewing Redis failover behavior (implemented)

To address the failover malfunction that triggered this incident in the first place, we have increased the number of keepalived health checks for Redis so that a brief increase in load no longer causes a failover.
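
The effect of requiring more health checks before failing over can be illustrated with a small model. This is only a sketch under stated assumptions, not keepalived's configuration or implementation: the function name `should_fail_over`, the threshold values, and the sample check results are all hypothetical.

```python
from typing import Iterable


def should_fail_over(check_results: Iterable[bool], required_failures: int) -> bool:
    """Return True once `required_failures` consecutive checks have failed.

    `check_results` holds one boolean per health check (True = healthy).
    This is a hypothetical model, not keepalived's actual logic.
    """
    consecutive = 0
    for healthy in check_results:
        # A healthy check resets the streak; a failed one extends it.
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= required_failures:
            return True
    return False


# A short stall: three failed checks in a row, then recovery.
short_stall = [True, False, False, False, True, True]

low_threshold = should_fail_over(short_stall, required_failures=2)
high_threshold = should_fail_over(short_stall, required_failures=5)
print(low_threshold)   # True: the short stall triggers a failover
print(high_threshold)  # False: the short stall is tolerated
```

The trade-off is that a higher threshold also delays failover when a node really is down, so the value has to balance the two cases.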

Reinforcing the application (implemented)

We created more of a buffer for application performance. Specifically, the following was done.

  • Scaled up Redis
  • Increased the number of application servers

Improving the efficiency of monitoring data stored in Redis (implemented)

As a fundamental response to the amount of memory used in Redis, we modified the application to store only the necessary monitoring data in Redis, which reduced Redis memory usage.

Counteracting improper requests (implemented)

  • Building a mechanism to quickly detect improper API requests
  • Creating a feature to quickly block organizations making improper requests

Going forward, we will continue improving the accuracy with which improper requests are detected, and we are also considering additional features such as API limits.
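
One common way to detect a sudden burst of requests from a single organization is a sliding-window counter. The sketch below is illustrative only and does not describe Mackerel's actual mechanism: the class name `RequestRateDetector`, the limit, and the window length are hypothetical.

```python
from collections import defaultdict, deque


class RequestRateDetector:
    """Flags organizations that exceed `limit` requests within `window` seconds.

    Hypothetical sketch; not Mackerel's actual detection logic.
    """

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._timestamps = defaultdict(deque)  # org id -> recent request times

    def record(self, org_id: str, now: float) -> bool:
        """Record one request; return True if the organization is over the limit."""
        times = self._timestamps[org_id]
        times.append(now)
        # Drop requests that have fallen out of the window.
        while times and times[0] <= now - self.window:
            times.popleft()
        return len(times) > self.limit


detector = RequestRateDetector(limit=3, window=60.0)
flags = [detector.record("org-a", t) for t in (0.0, 1.0, 2.0, 3.0)]
later = detector.record("org-a", 120.0)
print(flags)  # [False, False, False, True]
print(later)  # False: the earlier requests are outside the window
```

A blocking feature could then reject requests from any organization for which `record` returns True.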

Reinforcing application monitoring

We plan to strengthen the application server’s internal process monitoring and prepare a system that can respond before another similar incident occurs.

Summary

We understand that this incident and its extended duration caused inconvenience, and we sincerely apologize. We are working hard to prevent recurrence, and should a similar incident occur, we will do our best to contain its impact.

Thank you for choosing Mackerel.

AWS Integration now supports CloudFront

Hello! Mackerel team CRE Miura (id:missasan) here.

Following the recent release of DynamoDB support, a new feature has been added for AWS Integration!

AWS Integration now supports CloudFront

Refer to the help page below for more on obtainable metrics.

mackerel.io

This is the second feature co-developed with iret Inc., following the recent release of our DynamoDB integration!

Billable targets are determined using the conversion 1 Distribution = 1 Host. Additionally, since CloudFront is a global service, integration with CloudFront is possible regardless of the region selected in the AWS integration settings.

If you use CloudFront, be sure to enable this feature and give it a try. We welcome your feedback!

Check monitoring plugin for AWS CloudWatch Logs added etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

Mackerel will be attending Cloud Impact 2018 which is scheduled to run from October 17th (Wednesday) through October 19th (Friday) at Tokyo Big Sight. Be sure to stop by the Mackerel booth!

Now on to this week’s update information.

Check monitoring plugin for AWS CloudWatch Logs added

The mackerel-check-plugins package has been updated to v0.23.0. With this update, we’ve added check-aws-cloudwatch-logs, a check plugin for AWS CloudWatch Logs! For details such as how to use it, check out the help page below.

mackerel.io

Listed below are some additional notes about the plugin.

If you have any concerns, be sure to submit an issue / pull request to the official GitHub repository or contact our support team!

BurstBalance metrics have been added for AWS Integration (RDS)

BurstBalance metrics for General Purpose SSD (gp2) volumes can now be obtained with AWS Integration for RDS. Be sure to give it a try.

Mackerel at Cloud Impact 2018! October 17-19 (Wed. - Fri.)

Details are as follows. We are looking forward to seeing everyone at our booth!

  • Date: October 17 (Wednesday) to October 19 (Friday)
  • Place: Tokyo Big Sight East Hall 1-3
  • Admission fee: 3,000 yen (tax included, free for those invited/pre-registered)

AWS Integration now supports DynamoDB etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

This week, a new feature has been added for AWS integration. The long-awaited DynamoDB is now supported.

Now on to the latest update information.

AWS Integration now supports DynamoDB

Refer to the help page below for more on obtainable metrics.

mackerel.io

This feature was co-developed with iret Inc., a development firm with abundant AWS operational knowledge. iret Inc. offers the cloudpack service, which provides fully managed services for a variety of AWS products. iret, thank you for all your help!

Webhook can now be registered with notification channel APIs

In addition to email and Slack notifications, it is now possible to register Webhook notification channels using the API. For more details, check out the notification channel API document below.

mackerel.io

API key clipboard added to the GUI installation procedure for Windows servers when registering a new host

In the “Register a new Host” screen that can be accessed from the left sidebar menu, an API key clipboard was added to the GUI installation procedure for Windows servers.

You can copy the API key from the same screen as seen below.

If you have any ideas or points of improvement regarding how to install mackerel-agent on Windows servers, we would gladly welcome your feedback.

Regarding the incident that occurred on September 26, 2018 (Wed.)

Thank you for choosing Mackerel.

This is an announcement to report on the incident that occurred today (9/26).

Today at 10:51 am (JST), the API server's error rate increased and the server became unstable.

During this period, requests to the API server may have failed and returned a 5xx status code.

As the API server error rate increased, connectivity monitoring was suspended in order to prevent false reports.

After that, the unstable conditions continued for an extended period of time. At 4:20 pm (JST), recovery measures were taken by adjusting application parameters and reinforcing the server.

We were not able to identify the direct cause and will continue to further investigate this issue. Additionally, starting tomorrow, operations will be implemented to prevent secondary issues from occurring. Please note that depending on the operation, we may temporarily switch to maintenance mode (restricted access to the server).

We apologize for any inconvenience this issue may have caused.

Thank you for your cooperation.

loadavg1 and loadavg15 added to system metrics etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

ISUCON8 was held last weekend and Mackerel team members Matsuki (id:Songmu) and Shibasaki (id:shiba_yu36) both made it through the qualifying round. I’m looking forward to the main event.

Now on to the latest update information.

loadavg1 and loadavg15 added to system metrics

With the release of mackerel-agent v0.57.0, loadavg1 and loadavg15 have been added to the loadavg graph which previously only supported loadavg5. Now you can conveniently compare loadavg5 and loadavg15 to check whether the CPU load is increasing or decreasing. When updating mackerel-agent to the latest version, two system metric items will be added to the target host.
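
As a rough illustration of that comparison, a shorter-term average above the longer-term one suggests the load is rising, and the reverse suggests it is falling. This is a minimal sketch, not part of mackerel-agent: the function `load_trend` and its `tolerance` parameter are hypothetical, and it uses Python's standard `os.getloadavg()` rather than values fetched from Mackerel.

```python
import os


def load_trend(short_term: float, long_term: float, tolerance: float = 0.1) -> str:
    """Classify the load trend by comparing a short- and a long-term average."""
    if short_term > long_term + tolerance:
        return "increasing"
    if short_term < long_term - tolerance:
        return "decreasing"
    return "steady"


rising = load_trend(2.4, 0.9)
falling = load_trend(0.3, 1.8)
print(rising)   # increasing
print(falling)  # decreasing

# On Unix-like systems the current 1-, 5-, and 15-minute values can be read
# from the OS; the call is unavailable on Windows and may fail elsewhere.
try:
    _one, five, fifteen = os.getloadavg()
    print(load_trend(five, fifteen))
except (AttributeError, OSError):
    pass
```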

This update is compatible with all platforms except Windows Server.

The log rotation tracking accuracy for check-log plugin has been improved

With the release of go-check-plugins v0.22.1, the check-log plugin now uses the inode number when tracking log files. This improves tracking accuracy when logs are rotated.
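
The idea behind tracking by inode number can be sketched as follows. This is not the check-log implementation (which is written in Go); the helper `was_rotated` is hypothetical, and the example assumes a POSIX file system where renaming a file keeps its inode.

```python
import os
import tempfile


def was_rotated(path: str, last_inode: int) -> bool:
    """Return True if `path` now refers to a different file than before."""
    return os.stat(path).st_ino != last_inode


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "w") as f:
        f.write("first line\n")
    seen_inode = os.stat(log_path).st_ino

    # Simulate rotation: move the old file aside and create a fresh one.
    os.rename(log_path, log_path + ".1")
    with open(log_path, "w") as f:
        f.write("new file\n")

    current_rotated = was_rotated(log_path, seen_inode)
    old_file_changed = was_rotated(log_path + ".1", seen_inode)

print(current_rotated)   # True: app.log is now a different file
print(old_file_changed)  # False: app.log.1 is the file that was being tracked
```

Because the path alone stays the same across a rotation, comparing inode numbers lets a tracker notice that the remaining unread lines are in the renamed file.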

You can now specify the number of redirects with check-http

With the release of go-check-plugins v0.22.1, you can now specify the maximum number of redirects to follow with the option --max-redirects (the default is 10).

Thank you for your contributions!

Null values changed to 0 for AWS ALB/ELB RequestCount metrics

In environments where ALB / ELB metrics are obtained with the AWS Integration feature, a change was made to now post 0 if the RequestCount metric value obtained from CloudWatch is null. This fixes the problem of alerts not closing automatically, which was previously caused by the null value.

Code signing certificate for Windows Server installer updated

With the release of mackerel-agent v0.57.0, the code signing certificate for the Windows Server installer has been updated. Please note that if you use the previous version of the installer, a certificate expiration warning will occur.

【Summer student intern feature release!】An Organizations list screen has been added etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

Today is the last day of the 2018 Hatena Summer Internship program. This month went by so fast!

developer.hatenastaff.com (Japanese only)

For the second half of the program, the student interns are assigned to individual teams and work on tasks and feature development that will actually be incorporated into the service. Two student interns were assigned to the Mackerel team and tackled many issues.

The feature that was introduced last Monday (9/3) titled "Roles can now be registered/deleted from the API" was implemented and released by our two student interns.

mackerel.io

Since it’s the last day of the summer internship, we are going to introduce a complete set of the features implemented by our student interns.

Here is a message from Mackerel team director Katsuya (id:daiksy) who helped mentor the student interns over this last month.

This is the fourth year that the Mackerel team has received student interns, and the speed of development has been outstanding this year. On more than one occasion, I was surprised when checking GitHub after a meeting and thought, "What!? This feature is already up for review?"

After two weeks of development, today is the last day of this year’s internship. I think that this was a good experience for the student interns, but they also inspired the team as well. It was a very fulfilling two weeks.

It truly was a surprise that so many new features were developed and released in such a short period of time.

Now on to the update information.

An Organizations list screen has been added

View a list of the organizations that you belong to by accessing the URL below.

https://mackerel.io/orgs/

You can also access the same screen by clicking [▼] next to the [Organization Name] on the left side menu and clicking [Organizations].

From this page you can see the number of services, hosts, members, and currently open alerts for each organization. If you belong to multiple organizations, you can use this list to see the whole picture, for example to check which organization an alert is occurring in.

API added to post metadata for Services/Roles

Up until now, you could register metadata to hosts, but with this release, you can now register arbitrary JSON data as metadata for services/roles. For more details, refer to the Mackerel API document for Metadata.

API added to obtain monitor settings by specifying an ID

We’ve added an API that allows you to specify the target monitor setting ID and obtain settings information. The specification method is /api/v0/monitors/<monitorId>.

For more details, refer to the Mackerel API document on Monitors.
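
A request to this endpoint might be built as in the sketch below. Only the path /api/v0/monitors/<monitorId> comes from this post; the base URL, the `X-Api-Key` header, and the placeholder ID and key are assumptions that should be checked against the API document. The request is constructed but not sent.

```python
from urllib.request import Request

# Assumed base URL for the Mackerel API; confirm against the API document.
API_BASE = "https://api.mackerelio.com"


def build_monitor_request(monitor_id: str, api_key: str) -> Request:
    """Build (but do not send) a GET request for one monitor's settings."""
    url = f"{API_BASE}/api/v0/monitors/{monitor_id}"
    # The X-Api-Key header name is an assumption; see the API document.
    return Request(url, headers={"X-Api-Key": api_key}, method="GET")


# Placeholder values for illustration only.
req = build_monitor_request("EXAMPLE_MONITOR_ID", "EXAMPLE_API_KEY")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response body is expected to be JSON describing the monitor.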

API added to register/delete notification channels

This API currently supports only email and Slack notification channels. With this release, the notification channel list API also returns more detailed information for email and Slack channels.

For more information, refer to the Mackerel API document on Notification Channels.

Multiple services can now be selected by filtering the Alerts list screen

It is now possible to specify multiple services with OR conditions when filtering displayed alerts in the Alerts list screen.

Multiple services can now be selected by filtering the Hosts list screen

As with the Alerts list, it is now possible to specify multiple services with OR when filtering the display in the Hosts list screen.

Thank you to all our summer interns for your hard work!