Mackerel blog #mackerelio

The Official Blog of Mackerel

API Gateway now supported with AWS Integration and more

Hello! Mackerel team CRE Miura (id:missasan) here.

November has finally come and winter is steadily approaching. With this time of year comes advent calendar season. I’m sure everyone is watching and waiting to decide which events to participate in this year. Here at Mackerel, our annual advent calendar is in the works as well.

qiita.com

Here are a few of Mackerel’s advent calendars from past years.

The advent calendar was originally a calendar used to count down the days until Christmas, with little windows to open each day that reveal small candies or chocolates. I’m excited that this year’s Mackerel advent calendar will be one that gives nice little gifts to all our users.

If you’ve used Mackerel over the past year, we’d love to hear what made you happy or what you struggled with. If you’ve never written for an advent calendar before, why not make your debut with Mackerel? You’re more than welcome!

Now on to this week’s update information.

API Gateway now supported with AWS Integration

Following CloudFront, AWS Integration now supports API Gateway. Refer to the following link for details regarding obtainable metrics.

mackerel.io

AWS Integration features are being released one after another. If you haven’t changed your integration settings in a while, now is a good time to review them.

ALB now supported with mackerel-plugin-aws-waf

Up until now, mackerel-plugin-aws-waf could only obtain metrics for AWS WAF deployed to CloudFront. With this update, it can also obtain metrics for AWS WAF deployed to ALB.

github.com

This is a feature made possible by user contributions. Thank you very much!

A posting limit has been set for API service metric posting

Mackerel’s time series data has a granularity of 1 minute, and any metrics posted at a higher frequency than that are overwritten. With this update, we’ve set a limit on the number of service metric postings by API.

If the posting limit for an endpoint is exceeded, status 429 is returned. Be careful not to post more frequently than once per minute.
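Because data points that fall within the same minute overwrite one another, a client that samples more often than once per minute can collapse its samples before posting. The following is a minimal sketch of that idea, not Mackerel’s own implementation. The function name `coalesce_per_minute` and the `(name, epoch seconds, value)` tuple layout are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sample layout: (metric name, epoch seconds, value)
Sample = Tuple[str, int, float]


def coalesce_per_minute(samples: List[Sample]) -> List[Sample]:
    """Keep only the latest sample per metric name per one-minute bucket."""
    latest: Dict[Tuple[str, int], Sample] = {}
    for name, ts, value in samples:
        bucket = ts - ts % 60  # start of the minute this sample falls in
        key = (name, bucket)
        # The newest sample in a bucket wins, mirroring the overwrite behavior.
        if key not in latest or ts >= latest[key][1]:
            latest[key] = (name, ts, value)
    # Sort by metric name, then by time, for a stable order.
    return sorted(latest.values(), key=lambda s: (s[0], s[1]))


samples = [
    ("queue.length", 1541030400, 3.0),
    ("queue.length", 1541030420, 5.0),  # same minute as above, newer
    ("queue.length", 1541030460, 4.0),  # next minute
]
coalesced = coalesce_per_minute(samples)
```

Here the first two samples share a minute, so only the newer one remains and two data points are left to post.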

In addition, the service metric posting API can send multiple metrics at once. By sending multiple metrics together instead of one by one, you can stay within the API’s posting limit. For more details, refer to the help page.
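As a rough sketch of sending several metrics in one request, the snippet below builds a single POST whose body is a JSON array of data points. The endpoint path, the `X-Api-Key` header, and the `name` / `time` / `value` field names reflect our understanding of the service metric API and should be checked against the help page. The service name, metric names, and the helper `build_batch_request` are placeholders for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import List, Tuple


def build_batch_request(
    service: str, api_key: str, metrics: List[Tuple[str, int, float]]
) -> urllib.request.Request:
    """Build one POST request carrying every metric in `metrics`."""
    body = json.dumps(
        [{"name": name, "time": ts, "value": value} for name, ts, value in metrics]
    ).encode("utf-8")
    # Assumed endpoint; confirm against the API documentation.
    url = f"https://api.mackerelio.com/api/v0/services/{service}/tsdb"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
    )


request = build_batch_request(
    "my-service",      # placeholder service name
    "<YOUR_API_KEY>",  # placeholder, never hard-code a real key
    [
        ("queue.length", 1541030400, 3.0),
        ("queue.latency", 1541030400, 0.25),
    ],
)
# urllib.request.urlopen(request) would send it; omitted in this sketch.
```

Two metrics travel in one request here, so they count as one post rather than two.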

Monitoring solution seminar featuring Cloud Portal x SIOS Coati x Mackerel!

Sony Network Communications (Cloud Portal), SIOS Technology (SIOS Coati), and your very own Hatena (Mackerel) are holding a seminar together.

Event outline

  • Date and time: Wednesday, December 12th, 15:00-17:30
  • Location: Akihabara UDX 4F Next-2 (2 min walk from Akihabara Station)
  • Capacity: 80 people

Sign up here

www.bit-drive.ne.jp (Japanese only)

A new feature has been added to display maintenance and incident information from the management screen etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

We recently released a new case study featuring SEGA Games and their use of Mackerel in social network gaming environments. This article goes over the introduction of Mackerel motivated by transitioning to the cloud and how it’s used in daily operation. Be sure to check it out!

Now on to this week’s update information.

A new feature has been added to display maintenance and incident information from the management screen

When Mackerel is under maintenance or an incident has occurred, a message will now be displayed at the top of the screen as shown in the images below.

During maintenance

During an incident

Now it’s possible to see information from Mackerel’s status page in a more convenient way. Use this to eliminate false alarms. For more detailed information, continue to check the status page and follow our official Twitter timeline.

A plugin name can now be set as the User-Agent for plugins that send HTTP requests

For plugins that send HTTP requests, a plugin name such as mackerel-plugin-plack can now be set as the User-Agent of the request. Up until now, Go’s standard User-Agent was used.

This makes it easy to tell which plugin issued a request by looking at the User-Agent in the access log. If you use the User-Agent for access restrictions or similar, please be aware of this change.
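As an illustration, a few lines of Python can tally requests per User-Agent in an access log. This is a minimal sketch that assumes the common "combined" log format, where the User-Agent is the last double-quoted field. The sample lines and the exact User-Agent strings are made up for the example, and the real value sent by a plugin may carry extra details such as a version.

```python
import re
from collections import Counter
from typing import Iterable

# In the combined log format the User-Agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_by_user_agent(lines: Iterable[str]) -> Counter:
    """Count access log lines per User-Agent string."""
    counts: Counter = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines for illustration only.
log_lines = [
    '127.0.0.1 - - [01/Nov/2018:10:00:00 +0900] "GET /server-status HTTP/1.1" 200 512 "-" "mackerel-plugin-plack"',
    '127.0.0.1 - - [01/Nov/2018:10:00:05 +0900] "GET / HTTP/1.1" 200 1024 "-" "curl/7.61.0"',
    '127.0.0.1 - - [01/Nov/2018:10:01:00 +0900] "GET /server-status HTTP/1.1" 200 512 "-" "mackerel-plugin-plack"',
]
counts = count_by_user_agent(log_lines)
```

With the sample lines above, requests from the plugin can be separated from other clients at a glance.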

SEGA Games case study!

Check out our latest case study through the link below!

mackerel.io (Japanese only)

A detailed report on the incident that occurred on 9/26 (Wed) and our subsequent response

This is a detailed report regarding the incident that occurred on Wednesday (9/26) and our response following the event.

Time of the incident

  • Time of the incident: 2018/09/26 10:51-15:20 (JST)
  • Events that occurred: Malfunction of the Mackerel system and suspension of connectivity monitoring

Event timeline (JST)

10:51 Redis failover and malfunction

Because memory use had been trending upward in Redis, which is used for monitoring data storage, we carried out an operation to build a replica in order to scale up Redis. However, building the replica took time, and because Redis was unresponsive during that time, the clustering software (keepalived) detected it as a node failure, which caused an unintended failover.

As a result, the application servers could not connect to Redis and were unable to respond properly.

10:55 Recovery and continued failure

We switched over to the appropriate Redis instance and this temporarily restored the application.

However, the application remained unstable. Combined with a complex set of factors that predated the incident, such as network errors and increasing Redis memory usage, application server latency deteriorated and timeouts occurred. Afterward, we restarted the application servers, but the symptoms did not improve.

We’ve determined this to be the cause of the prolonged failure. The details surrounding this cause are unconfirmed and further investigation into the matter has been canceled.

11:00-14:50 Incident response

The following specific actions were taken in response.

  • Detected organizations posting inappropriate metrics and cut off these requests
  • Adjusted the application server timeout interval
  • Temporarily reinforced Diamond (TSDB)
    • Increased the Kinesis shard count / Redis Cluster shard count
  • Scaled out the application servers used for the API

Following these actions, we temporarily switched to maintenance mode, and after confirming normal operation internally, we gradually resumed accepting requests from outside.

15:20 Recovery confirmation

After confirming that the application server response was stable, that metric retransmission from mackerel-agent had completed, and that the delay in reflecting metrics data to the TSDB was resolved, we declared the service restored.

Cause of the incident

This incident was caused by an unexpected failover that accompanied an operation on Redis. However, we have not been able to accurately identify why restoration subsequently took so long. The following is a theory that was raised when looking back on the incident.

  • When a specific access pattern continues, a thread pool or connection pool with the same wait period or lock contention may cause latency to deteriorate in a Scala application.

Verifying the theory

In order to confirm the above theory, we attempted to reproduce the situation at the time of the incident in an isolated environment. Unfortunately, we were unsuccessful in reproducing an identical situation.

Future countermeasures

As previously mentioned, we have been unable to identify the details surrounding the cause of this incident, but we do believe that we can prevent similar prolonged failures in the future by implementing the following countermeasures.

Reviewing Redis failover behavior (implemented)

To address the failover malfunction that started all this trouble in the first place, we’ve increased the number of keepalived health checks for Redis so that a short-lived increase in load no longer causes a failover.
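The general idea of tolerating a brief stall can be sketched as follows. This is a simplified model of a "fail over only after several consecutive failed checks" rule, not keepalived’s actual logic or our production configuration. The function name and the threshold values are assumptions chosen for illustration.

```python
from typing import Iterable


def should_fail_over(check_results: Iterable[bool], required_failures: int) -> bool:
    """Return True once `required_failures` health checks fail in a row."""
    consecutive = 0
    for healthy in check_results:
        # A healthy check resets the streak; a failed one extends it.
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= required_failures:
            return True
    return False


# A short stall: two failed checks in a row, then recovery.
short_stall = [True, False, False, True, True]

strict = should_fail_over(short_stall, required_failures=1)    # reacts to the first miss
tolerant = should_fail_over(short_stall, required_failures=3)  # rides out the stall
```

With a threshold of one, the brief stall triggers a failover. With a threshold of three, it does not, at the cost of reacting more slowly to a genuine node failure.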

Reinforcing the application (implemented)

We gave the application more performance headroom. Specifically, the following was done.

  • Scaled up Redis
  • Increased the number of application servers

Improving the efficiency of monitoring data stored in Redis (implemented)

As a fundamental response to the amount of memory used in Redis, we reworked the application to store only the necessary monitoring data in Redis, which reduced Redis memory usage.

Counteracting improper requests (implemented)

  • Built a mechanism to quickly detect improper API requests
  • Created a feature to quickly block organizations making improper requests

Going forward, we will continue to improve the accuracy of detecting improper requests, and we are also considering other features such as API limits.

Reinforcing application monitoring

We plan to strengthen the application server’s internal process monitoring and prepare a system that can respond before another similar incident occurs.

Summary

We understand that this incident and its extended duration caused inconvenience, and we sincerely apologize. We are working hard to prevent recurrence, and we will do our best to contain the impact should a similar incident occur.

Thank you for choosing Mackerel.

AWS Integration now supports CloudFront

Hello! Mackerel team CRE Miura (id:missasan) here.

Following the recent release of DynamoDB, a new feature has been added for AWS Integration!

AWS Integration now supports CloudFront

Refer to the help page below for more on obtainable metrics.

mackerel.io

This is the second feature co-developed with iret Inc., following the recent release of our DynamoDB integration!

Billable targets are determined using the conversion 1 Distribution = 1 Host. Additionally, since CloudFront is a global service, integration with CloudFront is possible regardless of the region selected in the AWS integration settings.

If you use CloudFront, be sure to enable this feature and give it a try. We welcome your feedback!

Check monitoring plugin for AWS CloudWatch Logs added etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

Mackerel will be attending Cloud Impact 2018 which is scheduled to run from October 17th (Wednesday) through October 19th (Friday) at Tokyo Big Sight. Be sure to stop by the Mackerel booth!

Now on to this week’s update information.

Check monitoring plugin for AWS CloudWatch Logs added

The mackerel-check-plugins package has been updated to v0.23.0. With this update, we’ve added check-aws-cloudwatch-logs, a check plugin for AWS CloudWatch Logs! For more details, such as how to use it, check out the help page below.

mackerel.io

Listed below are some additional notes about the plugin.

If you have any concerns, be sure to submit an issue or pull request to the official GitHub repository or contact our support team!

BurstBalance metrics have been added for AWS Integration for RDS

Burst Balance metrics for General Purpose SSD (gp2) volumes can now be obtained with AWS Integration for RDS. Be sure to give it a try.

Mackerel at Cloud Impact 2018! October 17-19 (Wed. - Fri.)

Details are as follows. We are looking forward to seeing everyone at our booth!

  • Date: October 17 (Wednesday) to October 19 (Friday)
  • Place: Tokyo Big Sight East Hall 1-3
  • Admission fee: 3,000 yen (tax included, free for those invited/pre-registered)

AWS Integration now supports DynamoDB etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

This week, a new feature has been added for AWS integration. The long-awaited DynamoDB is now supported.

Now on to the latest update information.

AWS Integration now supports DynamoDB

Refer to the help page below for more on obtainable metrics.

mackerel.io

This feature was co-developed with iret Inc., a development firm with abundant AWS operational knowledge. iret Inc. offers the cloudpack service, which provides fully managed services for a variety of AWS products. iret, thank you for all your help!

Webhook can now be registered with notification channel APIs

In addition to email and Slack notifications, it is now possible to register Webhook notification channels using the API. For more details, check out the notification channel API document below.
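As a rough illustration, registering a Webhook channel amounts to posting a small JSON document to the notification channel API. In the sketch below, the endpoint path, the `X-Api-Key` header, and the `type` / `name` / `url` / `events` fields are assumptions based on our reading of the API document, which remains the authoritative reference. The channel name, webhook URL, and the helper `build_webhook_channel_request` are placeholders. The request is only built, not sent.

```python
import json
import urllib.request
from typing import List


def build_webhook_channel_request(
    api_key: str, name: str, webhook_url: str, events: List[str]
) -> urllib.request.Request:
    """Build a POST request that would register a Webhook notification channel."""
    body = json.dumps(
        {
            "type": "webhook",  # assumed channel type identifier
            "name": name,
            "url": webhook_url,
            "events": events,   # assumed list of event names to notify on
        }
    ).encode("utf-8")
    return urllib.request.Request(
        "https://api.mackerelio.com/api/v0/channels",  # assumed endpoint
        data=body,
        method="POST",
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
    )


channel_request = build_webhook_channel_request(
    "<YOUR_API_KEY>",                     # placeholder, never hard-code a real key
    "ops-webhook",                        # placeholder channel name
    "https://example.com/mackerel-hook",  # placeholder receiver URL
    ["alert"],
)
# urllib.request.urlopen(channel_request) would send it; omitted in this sketch.
```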

mackerel.io

API key copy-to-clipboard button added to the GUI installation procedure for Windows servers when registering a new host

On the “Register a new Host” screen, which can be accessed from the left sidebar menu, a button for copying the API key to the clipboard was added to the GUI installation procedure for Windows servers.

You can copy the API key from the same screen as seen below.

If you have any ideas or points of improvement regarding how to install mackerel-agent on Windows servers, we would gladly welcome your feedback.

Regarding the incident that occurred on September 26, 2018 (Wed.)

Thank you for choosing Mackerel.

This is an announcement to report on the incident that occurred today (9/26).

Today at 10:51 am (JST), the API server’s error rate increased and the server became unstable.

During this time, requests to the API server failed, most likely returning a 5xx status code.

As the API server error rate increased, connectivity monitoring was suspended in order to prevent false reports.

After that, the unstable conditions continued for an extended period of time. At 4:20 pm (JST), recovery measures were taken by adjusting application parameters and reinforcing the server.

We were not able to identify the direct cause and will continue to further investigate this issue. Additionally, starting tomorrow, operations will be implemented to prevent secondary issues from occurring. Please note that depending on the operation, we may temporarily switch to maintenance mode (restricted access to the server).

We apologize for any inconvenience this issue may have caused.

Thank you for your cooperation.