Mackerel blog #mackerelio

The Official Blog of Mackerel

【Advance release notice】A feature to retrieve closed alerts will be added, along with backward-incompatible specification changes to the alert retrieval API

Mackerel Director id:daiksy here.

This week, we will be adding a highly requested feature to the alert retrieval API: the ability to retrieve closed alerts. Along with this new feature, we will be making backward-incompatible changes to the API's specifications.

mackerel.io

The scope of impact of the specification changes is as follows.

  • Users who currently retrieve the list of open alerts via the API
    • If 101 or more alerts are open at the same time, a single request will no longer return all of them.
    • If you retrieve the alert list with mkr, updating to mkr 0.34.2 or later beforehand takes care of this internally, so you will not be affected.

The release date and contents of the specification changes are as follows.

  • Release date: November 29, 2018 (Thursday)
  • Release content: Addition of the withClosed, limit, and nextId parameters to the alert retrieval API so that closed alerts can be retrieved
  • Change content: The default value of limit is 100, so if limit is not specified, a maximum of 100 alerts will be returned per request. This also applies to retrieving open alerts with the same requests that were possible before the specification change.

Since closed alerts accumulate, their number can grow enormous for organizations that have been operating for a long time. It therefore became necessary to add a paging mechanism for retrieving closed alerts. Because adding this mechanism affects the behavior of the existing API, we are notifying you of these specification changes in advance.
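For illustration, here is a minimal Go sketch of how all alerts, including closed ones, could be fetched page by page once the change is live. The query parameters come from this announcement; the response field names (alerts, nextId) and the X-Api-Key header follow our API documentation, but please treat the details as provisional until the release.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
    "os"
)

// alertsResp mirrors the expected response of GET /api/v0/alerts:
// one page of alerts plus a nextId cursor for the following page.
type alertsResp struct {
    Alerts []json.RawMessage `json:"alerts"`
    NextID string            `json:"nextId"`
}

func main() {
    apiKey := os.Getenv("MACKEREL_APIKEY")
    total := 0
    nextID := ""
    for {
        q := url.Values{}
        q.Set("withClosed", "true") // include closed alerts (new parameter)
        q.Set("limit", "100")       // the default maximum per request
        if nextID != "" {
            q.Set("nextId", nextID) // cursor returned by the previous page
        }
        req, err := http.NewRequest("GET", "https://api.mackerelio.com/api/v0/alerts?"+q.Encode(), nil)
        if err != nil {
            log.Fatal(err)
        }
        req.Header.Set("X-Api-Key", apiKey)
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        var page alertsResp
        if err := json.NewDecoder(resp.Body).Decode(&page); err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()
        total += len(page.Alerts)
        if page.NextID == "" { // no more pages
            break
        }
        nextID = page.NextID
    }
    fmt.Printf("retrieved %d alerts in total\n", total)
}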

Regarding mkr and mackerel-client-go

Regarding mkr, please be aware that if you use mkr 0.33.0 or earlier after November 29th, you will only be able to retrieve up to 100 alerts. Versions of mkr and mackerel-client-go that support the new API are already available, so please update as necessary.

Thank you for choosing Mackerel.

The newly renovated Custom Dashboards and more

Hello! Mackerel team CRE Miura (id:missasan) here.

Last week we announced the Mackerel Advent Calendar, but this year, Inoue (id:a-know) and I are also trying our hands at a CRE team-specific calendar, the Mackerel Advent Calendar 2018 (CRE)!

Now you have not only the Mackerel Advent Calendar to look forward to, but our calendar as well!

qiita.com (Japanese only)

Now on to this week’s update information.

The newly renovated Custom Dashboards

Custom Dashboards is a feature that lets you freely arrange the graphs you want to see for different usage scenarios.

With this update, Custom Dashboards has been reborn. Creating and editing dashboards is now easier and more flexible: you can add graphs by dragging and dropping, change display sizes and positions, and more.

3 types of widgets

The types of information (widgets) that can be added have increased.

Graph widgets

Various graphs can be displayed. Expression graphs can also be created and added from here.

Value widgets

Display the latest value of various metrics as a number. Expressions can also be selected as the metric type.

Markdown widgets

As with the previous Custom Dashboards feature, you can freely write content in Markdown format.

Use it in various scenarios

Dashboards can be created easily and put to good use in all kinds of server monitoring and operation scenarios!

  • Daily service status checks
  • Reference at weekly/monthly team meetings
  • Use in capacity planning
  • Look back on the system's status and the effects of a campaign after it has run

Automatic registration using the API

The new Custom Dashboards can also be imported and exported using the API, letting you automate the creation and editing of custom dashboards.
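As a rough illustration, a dashboard with a single Markdown widget might be registered via the API like this. The payload shape (title, urlPath, widgets, layout) is a sketch based on the dashboard API; see the document linked below for the authoritative schema.

package main

import (
    "bytes"
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    // One Markdown widget placed on the dashboard grid.
    payload := []byte(`{
      "title": "Service Overview",
      "urlPath": "service-overview",
      "memo": "created via the API",
      "widgets": [
        {
          "type": "markdown",
          "title": "Notes",
          "markdown": "# Operation notes\n- check here before deploys",
          "layout": {"x": 0, "y": 0, "width": 12, "height": 6}
        }
      ]
    }`)
    req, err := http.NewRequest("POST", "https://api.mackerelio.com/api/v0/dashboards", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("X-Api-Key", os.Getenv("MACKEREL_APIKEY"))
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}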

For more details, check out the following document.

mackerel.io

Please note that operations using the mkr command for the new Custom Dashboards are not yet supported.

The old Custom Dashboards

With this change, the old Custom Dashboards feature has been renamed "Legacy Custom Dashboards". Previously created dashboards can still be viewed there.

We recommend that you use the new Custom Dashboards feature when creating a new dashboard.

check-redis subcommand replication added

replication has been added to the subcommands of check-redis. This makes it possible to check whether Redis replication is working properly. The slave subcommand is similar, but note that it is planned to be deprecated in the future.
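As a sketch, the new subcommand might be wired into mackerel-agent.conf as follows (the --host and --port options are assumptions; run check-redis replication --help for the actual flags):

[plugin.checks.redis-replication]
command = ["check-redis", "replication", "--host", "127.0.0.1", "--port", "6379"]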

Organization names can now be obtained with mkr org command

The organization name can now be retrieved by running mkr org. The command returns a result in the following format.

{
  "name": <name>
}

A practical DevOps Hands-on Workshop 〜 Building a safe/secure DevOps environment with AWS and Mackerel!

This is an event announcement.

This event is a hands-on workshop for beginners to Mackerel and the AWS Code series. If you’d like some practical experience with DevOps environments that combine monitoring and CI/CD pipelines, please come and join us!

Mackerel team CRE Inoue (id:a-know) will be presenting at the event!

dev.classmethod.jp (Japanese only)

Event details

  • Date and time: December 10, 2018 (Monday) 2:00 - 4:30 p.m. (Reception starts: 1:30 p.m.)
  • Location: Shibuya Hikarie 11th Floor Sky Lobby Conference Room D [MAP]
  • Capacity: 24 people
  • Cost: Free
  • Co-sponsors: Classmethod Inc., Hatena Co., Ltd.

API Gateway now supported with AWS Integration and more

Hello! Mackerel team CRE Miura (id:missasan) here.

November has finally come and winter is steadily approaching. With this time of year comes advent calendar season. I'm sure everyone is already deciding which calendars to participate in this year. Here at Mackerel, our annual advent calendar is also in the works.

qiita.com

Here are a few of Mackerel’s advent calendars from past years.

The advent calendar was originally a calendar used to count down the days until Christmas, opening a little window each day to find small candies or chocolates inside. I'm excited that this year's Mackerel advent calendar will likewise deliver nice little gifts to all of our users.

If you've been using Mackerel for the past year, we'd love to hear what made you happy or what you struggled with. If you've never written for an advent calendar before, why not make your debut with Mackerel? By all means, join us!

Now on to this week’s update information.

API Gateway now supported with AWS Integration

Following CloudFront, support has been added for API Gateway. Refer to the following link for details on the obtainable metrics.

mackerel.io

AWS Integration features are being released one after another. If you haven't changed your integration settings in a while, this is a good opportunity to review them.

ALB now supported with mackerel-plugin-aws-waf

Until now, mackerel-plugin-aws-waf could only obtain metrics for AWS WAF deployed on CloudFront, but with this update it's now possible to obtain metrics for WAFs targeting ALB as well.

github.com

This is a feature made possible by user contributions. Thank you very much!

A posting limit has been set for the service metric posting API

Mackerel's time series data has a granularity of 1 minute, and any metrics posted at a higher frequency are overwritten. With this update, we've set a limit on the number of service metric posts via the API.

If the posting limit for an endpoint is exceeded, status 429 is returned. Be careful not to post more frequently than once per minute.

In addition, the service metric posting API can accept multiple metrics in a single request. By sending metrics together instead of one by one, you can stay under the API's posting limit. For more details, refer to the help page.
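For example, a batched post might look like the following minimal Go sketch (the service name ExampleService and the metric names are placeholders):

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "time"
)

// metricValue is one service metric data point.
type metricValue struct {
    Name  string  `json:"name"`
    Time  int64   `json:"time"`
    Value float64 `json:"value"`
}

func main() {
    now := time.Now().Unix()
    // Send several metrics in one request instead of one request per metric.
    payload, err := json.Marshal([]metricValue{
        {Name: "checkout.success", Time: now, Value: 120},
        {Name: "checkout.failure", Time: now, Value: 3},
    })
    if err != nil {
        log.Fatal(err)
    }
    url := "https://api.mackerelio.com/api/v0/services/ExampleService/tsdb"
    req, err := http.NewRequest("POST", url, bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("X-Api-Key", os.Getenv("MACKEREL_APIKEY"))
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status) // 429 means the posting limit was exceeded
}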

Monitoring solution seminar featuring: Cloud Portal x SIOS Coati x Mackerel!

Sony Network Communications (Cloud Portal), SIOS Technology (SIOS Coati), and your very own Hatena (Mackerel) are holding a seminar together.

Event outline

  • Date and time: Wednesday, December 12th, 15:00-17:30
  • Location: Akihabara UDX 4F Next-2 (2 min walk from Akihabara Station)
  • Capacity: 80 people

Sign up here

www.bit-drive.ne.jp (Japanese only)

A new feature has been added to display maintenance and incident information on the management screen, and more

Hello! Mackerel team CRE Miura (id:missasan) here.

We recently released a new case study featuring SEGA Games and their use of Mackerel in social gaming environments. The article covers how the move to the cloud motivated the adoption of Mackerel and how it's used in daily operations. Be sure to check it out!

Now on to this week’s update information.

A new feature has been added to display maintenance and incident information on the management screen

When Mackerel is under maintenance or an incident has occurred, a message will now be displayed at the top of the screen, as shown in the images below.

During maintenance

During an incident

Now it's possible to check information from Mackerel's status page in a more convenient way, helping you rule out false alarms on your side. For more detailed information, continue to check the status page and follow our official Twitter timeline.

The plugin name is now set as the User-Agent for plugins that send HTTP requests

Plugins that send HTTP requests now set their plugin name, such as mackerel-plugin-plack, as the User-Agent of the request. Until now, Go's default User-Agent was used.

This makes it easy to tell from the access log which plugin issued a request. Take note if you use the User-Agent for access restrictions, etc.
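To illustrate the behavior with a simplified sketch (this is not the actual plugin source, and the stats URL is hypothetical), a plugin sending an HTTP request now does the equivalent of:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Hypothetical stats endpoint that a plugin might scrape.
    req, err := http.NewRequest("GET", "http://localhost:5000/stats", nil)
    if err != nil {
        log.Fatal(err)
    }
    // Identify the plugin in access logs instead of Go's default
    // "Go-http-client/1.1" User-Agent.
    req.Header.Set("User-Agent", "mackerel-plugin-plack")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}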

SEGA Games case study!

Check out our latest case study through the link below!

mackerel.io (Japanese only)

A detailed report on the incident that occurred on 9/26 (Wed) and our subsequent response

This is a detailed report regarding the incident that occurred on Wednesday (9/26) and our response following the event.

Incident overview

  • Time of the incident: 2018/09/26 10:51-15:20 (JST)
  • Events that occurred: Malfunction of the Mackerel system and suspension of connectivity monitoring

Event timeline (JST)

10:51 Redis failover and malfunction

Memory usage had been trending upward in the Redis instance used to store monitoring data, so we performed an operation to set up a replica in order to scale up Redis. However, building the replica took time, and because Redis became unresponsive, the clustering software (keepalived) detected this as a node failure, causing an unintended failover.

As a result, the application servers could not connect to Redis and were unable to respond properly.

10:55 Recovery and continued failure

We switched over to the appropriate Redis node, which temporarily restored the application.

However, the application remained unstable. Combined with factors that preceded the incident, such as network errors and growing Redis memory usage, application server latency deteriorated and timeouts occurred. We then restarted the application servers, but the symptoms did not improve.

We believe this is what prolonged the failure. The details surrounding this cause remain unconfirmed, and further investigation into the matter has been discontinued.

11:00-14:50 Incident response

The following specific actions were taken in response.

  • Detected organizations posting inappropriate metrics and cut off those requests
  • Adjusted the application server timeout interval
  • Temporarily reinforced Diamond (the TSDB)
    • Increased the Kinesis shard count and the Redis Cluster shard count
  • Scaled out the application servers that serve the API

Following these actions, we temporarily switched to maintenance mode, and after confirming operation internally, we gradually allowed external requests back in.

15:20 Recovery confirmation

Once we had confirmed that the application server response was stable, that metric retransmission from mackerel-agent had completed, and that the delay in reflecting metric data in the TSDB was resolved, we declared the service restored.

Cause of the incident

This incident was triggered by an unexpected failover that accompanied an operation performed on Redis. However, we have not been able to accurately identify why the subsequent restoration took so long. The following theory was raised when looking back on the incident.

  • When a specific access pattern continues, waits or lock contention in a thread pool or connection pool may cause latency to deteriorate in a Scala application.

Verifying the theory

In order to confirm the above theory, we attempted to reproduce the situation at the time of the incident in an isolated environment. Unfortunately, we were unable to reproduce an identical situation.

Future measures

As previously mentioned, we have not been able to pin down the details of the cause of this incident, but we believe the following countermeasures can prevent similar long-term failures in the future.

Reviewing Redis failover behavior (implemented)

To address the failover malfunction that triggered this incident in the first place, we increased the number of failed keepalived health checks required before a Redis failover, so that a short-lived load increase does not trigger one.
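For illustration only (the values below are hypothetical, not our production settings), in keepalived terms this corresponds to raising the fall count of the health check script, so that more consecutive failures are required before a node is marked down:

vrrp_script chk_redis {
    script "/etc/keepalived/check_redis.sh"   # Redis health check script
    interval 2    # run the check every 2 seconds
    fall 5        # mark the node as failed only after 5 consecutive failures
    rise 2        # require 2 consecutive successes to mark it healthy again
}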

Reinforcing the application (implemented)

We created more headroom in application performance. Specifically, the following was done.

  • Scaled up Redis
  • Increased the number of application servers

Improving the efficiency of monitoring data stored in Redis (implemented)

As a fundamental response to Redis memory usage, the application was reworked to efficiently store only the necessary monitoring data in Redis, reducing Redis memory usage.

Counteracting improper requests (implemented)

  • Built a mechanism to quickly detect improper API requests
  • Created a feature to quickly block organizations making improper requests

Going forward, we will continue to improve the accuracy of improper-request detection, and we are also considering additional features such as API rate limits.

Reinforcing application monitoring

We plan to strengthen monitoring of the application servers' internal processes and put a system in place that lets us respond before a similar incident occurs.

Summary

We sincerely apologize for this incident and the extended inconvenience it caused. As we work hard to prevent recurrence, please trust that we will do our best to contain the problem even if a similar incident occurs.

Thank you for choosing Mackerel.

AWS Integration now supports CloudFront

Hello! Mackerel team CRE Miura (id:missasan) here.

Following the recent release of DynamoDB, a new feature has been added for AWS Integration!

AWS Integration now supports CloudFront

Refer to the help page below for more on obtainable metrics.

mackerel.io

This feature is the second to be co-developed with iret Inc., following the recent release of the DynamoDB integration!

Billable targets are counted using the conversion 1 Distribution = 1 Host. Additionally, since CloudFront is a global service, integration with CloudFront is possible regardless of the region selected in the AWS Integration settings.

If you use CloudFront, be sure to enable this feature and give it a try. We welcome your feedback!

Check monitoring plugin for AWS CloudWatch Logs added etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

Mackerel will be attending Cloud Impact 2018, which is scheduled to run from October 17th (Wednesday) through October 19th (Friday) at Tokyo Big Sight. Be sure to stop by the Mackerel booth!

Now on to this week’s update information.

Check monitoring plugin for AWS CloudWatch Logs added

The mackerel-check-plugins package has been updated to v0.23.0. With this update, we've added check-aws-cloudwatch-logs, a check plugin for AWS CloudWatch Logs! For details on how to use it, check out the help page below.

mackerel.io
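As a sketch, a check that alerts on ERROR lines in a log group might be configured in mackerel-agent.conf like this (the flag names are assumptions based on the plugin's conventions; consult the help page above for the actual options):

[plugin.checks.cloudwatch-logs-error]
command = ["check-aws-cloudwatch-logs", "--log-group-name", "/my-app/production", "--pattern", "ERROR", "--warning-over", "1", "--critical-over", "10"]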

One additional note about the plugin: if you have any points of concern, be sure to submit an issue / pull request to the official GitHub repository or contact our support team!

BurstBalance metrics have been added for AWS RDS Integration

General Purpose SSD (gp2) volume BurstBalance metrics can now be obtained with AWS RDS Integration. Be sure to give it a try.

Mackerel at Cloud Impact 2018! October 17-19 (Wed. - Fri.)

Details are as follows. We are looking forward to seeing everyone at our booth!

  • Date: October 17 (Wednesday) to October 19 (Friday)
  • Place: Tokyo Big Sight East Hall 1-3
  • Admission fee: 3,000 yen (tax included; free for those invited/pre-registered)