Mackerel blog #mackerelio

The Official Blog of Mackerel

Communication via TLS 1.1 will stop on January 8th, 2019 (Tue.)

Thank you for choosing Mackerel.

As stated in the title, on Tuesday, January 8th, 2019, encrypted communication via TLS 1.1 will be stopped.

Impact range

mackerel.io will no longer be accessible from older operating systems and browsers that do not support TLS 1.2 or later.

We recommend using our services in an officially supported environment with the latest patches applied. If your environment does not yet support TLS 1.2 or later, please transition to one that does.
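For scripts that talk to Mackerel, one way to check an environment ahead of the cutoff is sketched below. This is only an illustration in Python, not an official Mackerel tool, and the helper name tls12_client_context is hypothetical. It verifies that the local Python/OpenSSL build supports TLS 1.2 and builds a client context that refuses anything older.

```python
import ssl


def tls12_client_context() -> ssl.SSLContext:
    """Return a client context that refuses anything older than TLS 1.2."""
    if not ssl.HAS_TLSv1_2:
        raise RuntimeError("This Python/OpenSSL build does not support TLS 1.2")
    context = ssl.create_default_context()
    # Handshakes that can only negotiate TLS 1.1 or older will now fail.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = tls12_client_context()
print(ssl.OPENSSL_VERSION, context.minimum_version.name)
```

If this runs without raising, the environment can negotiate TLS 1.2. Passing the context to an HTTPS client (for example urllib.request.urlopen(url, context=context)) makes the client behave as it will need to after the cutoff.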

We apologize for any inconvenience and we appreciate your understanding of the importance of providing our services in a secure environment.

Thank you for choosing Mackerel.

AWS Integration now supported with Kinesis Data Streams and more

Hello! Mackerel team CRE Miura (id:missasan) here.

Only a few days are left in November. How time flies. The end of the year is finally upon us. Do you already have plans for the holiday season? Mackerel Drink Up will be held on Tuesday, December 11th. We still have openings for lightning talks (LT), so if you or someone you know has something to present before the New Year, why not do so at Mackerel Drink Up?

Now on to this week’s update information.

【Advance release notice】Incompatible specification changes for the alert acquisition API

With the release on Thursday, November 29th, 2018, the alert acquisition API will gain a feature for obtaining closed alerts, along with backward-incompatible changes to its specification. Please see the blog post below for details of the changes.

mackerel.io

AWS Integration now supported with Kinesis Data Streams

For details on the obtainable metrics, refer to the help page below.

mackerel.io

Host memos can now be included in text alert emails

Host memos can now be included in plain-text alert emails (they were already included in HTML emails). You can add memos from the host detail screen (https://mackerel.io/my/hosts/【hostId】). Put this feature to good use by writing down things to watch out for or handy details that can help during troubleshooting.

Failure to obtain disk metrics on Linux kernel 4.19 or later is fixed in mackerel-agent v0.58.2

Update mackerel-agent to the latest version and give it a try.

--connect-to option added to check-http

The --connect-to option added with this update is useful when an HTTP application server or a TLS-terminating reverse proxy serves HTTPS directly and you want to monitor whether it responds with the correct certificate for the specified domain. Refer to the following link for usage options.
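The idea behind an option like this can be sketched as follows. The host name in the URL is still the one checked against the certificate, while the TCP connection goes to a different address. This is a Python illustration, not the check-http implementation. The HOST1:PORT1:HOST2:PORT2 mapping format is an assumption borrowed from curl's option of the same name, and resolve_connect_to is a hypothetical helper, so check the linked README for the actual syntax.

```python
from urllib.parse import urlsplit


def resolve_connect_to(url: str, connect_to: str) -> dict:
    """Decide where to open the TCP connection for `url`.

    `connect_to` is assumed to look like "HOST1:PORT1:HOST2:PORT2"
    (curl-style). IPv6 literals are not handled in this sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)

    src_host, src_port, dst_host, dst_port = connect_to.split(":")
    # An empty HOST1 or PORT1 is treated as "match anything" (assumption).
    matches = src_host in ("", host) and src_port in ("", str(port))

    return {
        "dial_host": dst_host if matches and dst_host else host,
        "dial_port": int(dst_port) if matches and dst_port else port,
        # The name checked against the certificate stays the URL's host.
        "server_name": host,
    }


target = resolve_connect_to(
    "https://example.com/health", "example.com:443:127.0.0.1:8443"
)
print(target)
```

In this sketch the request for example.com would be sent to 127.0.0.1:8443, so a certificate mismatch on that particular server shows up even when DNS points elsewhere.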

github.com

Improvements made to the mkr command in preparation for the release of the API to obtain closed alerts

We’ve updated mkr to be compatible with the upcoming closed alert acquisition API mentioned in the 【Advance release notice】 above. If you call the alert API through the mkr command, please consider updating to the latest release. The feature to obtain closed alerts (--with-closed option) will be available after the API feature is released.

Mackerel Drink Up #8 Tokyo!

Mackerel Drink Up will be held at Hatena's Tokyo office. And although we’ve already received quite a lot of applications, LT slots are still available. Come join us.

mackerelio.connpass.com (Japanese only)

Event details

  • Date and time: December 11, 2018 (Tue) 7:00 pm - 9:00 pm (Reception starts at 6:45 pm)
  • Location: Hatena Tokyo office (3rd floor, Seminar room)
  • Capacity: 15 people, LT - 3 people
  • Cost: Free

Our support window will be closed for the holidays

Both ways of contacting our support team, either through support@mackerel.io or with the “contact our support team” option displayed in the upper right header when logged into Mackerel, will be closed during the following dates for the New Year’s holidays.

New Year’s holiday period: Tuesday, December 25th, 2018 - Thursday, January 3rd, 2019

All inquiries received during this period will be handled in the order received once the holidays are over.

【Advance release notice】Added feature to obtain closed alerts, along with incompatible specification changes for the alert acquisition API

Mackerel Director id:daiksy here.

This week, we will be adding a highly requested feature to the alert acquisition API: obtaining closed alerts. Along with this new feature, we will be making backward-incompatible changes to the API's specification.

mackerel.io

The extent of impact caused by the specification changes is as follows.

  • Users who currently obtain an open alert list via API
    • If 101 or more alerts are open at the same time, they can no longer all be obtained in a single request.
    • If you use mkr to get the alert list, updating in advance avoids any impact, because mkr 0.34.2 and later handle the change internally.

The release date and contents of the specification changes are as follows.

  • Release date: November 29, 2018 (Thursday)
  • Release content: Addition of the withClosed, limit, and nextId parameters to the alert acquisition API to get closed alerts
  • Change content: The default limit is 100, so if limit is not set, at most 100 alerts are returned. This also applies to requests for open alerts made the same way as before the specification change.

Since the number of closed alerts is cumulative, it can grow to be enormous for organizations that have been operating for a long time. It has therefore become necessary to add some form of paging for obtaining closed alerts. Because adding this will affect the behavior of the existing API, we are notifying you of these specification changes in advance.
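A client-side paging loop over these parameters might look like the sketch below. The parameter names withClosed, limit, and nextId come from the announcement above. The response shape (an "alerts" list plus an optional "nextId") is an assumption, and fetch_page is a hypothetical stand-in for the actual HTTP call, so check the API documentation before relying on it.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_alerts(
    fetch_page: Callable[[Dict[str, str]], dict],
    with_closed: bool = True,
    limit: int = 100,
) -> Iterator[dict]:
    """Yield alerts page by page until no nextId is returned."""
    next_id: Optional[str] = None
    while True:
        params = {"limit": str(limit)}
        if with_closed:
            params["withClosed"] = "true"
        if next_id:
            params["nextId"] = next_id
        page = fetch_page(params)
        yield from page.get("alerts", [])
        # Assumed response field: absent or empty on the last page.
        next_id = page.get("nextId")
        if not next_id:
            break


# Stand-in for the real HTTP call, so the sketch runs offline.
def fake_fetch_page(params: Dict[str, str]) -> dict:
    pages = {
        None: {"alerts": [{"id": "a1"}, {"id": "a2"}], "nextId": "a3"},
        "a3": {"alerts": [{"id": "a3"}]},
    }
    return pages[params.get("nextId")]


alert_ids = [alert["id"] for alert in iter_alerts(fake_fetch_page, limit=2)]
print(alert_ids)
```

Because the loop is a generator, a caller that only needs the first few alerts can stop early without fetching every page.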

Regarding mkr and mackerel-client-go

Regarding mkr, please be aware that if you’re using mkr 0.33.0 or a previous version after November 29th, you will only be able to obtain up to 100 alerts. As versions of mkr and mackerel-client-go that support the new API have already been provided, please update as necessary.

Thank you for choosing Mackerel.

The newly renovated Custom Dashboards and more

Hello! Mackerel team CRE Miura (id:missasan) here.

Last week we announced the Mackerel Advent Calendar, but this year, Inoue (id:a-know) and I are trying our hands at a CRE team specific calendar as well, the Mackerel Advent Calendar 2018 (CRE)!

Now, not only do you have Mackerel Advent Calendar to look forward to, but our calendar as well!

qiita.com (Japanese only)

Now on to this week’s update information.

The newly renovated Custom Dashboards

Custom Dashboards is a dashboard feature that lets you freely arrange the graphs you want to see to suit each use case.

With this update, Custom Dashboards has been rebuilt. It’s now easier and more flexible to create and edit dashboards: you can add graphs by dragging and dropping, change display sizes and positioning, and more.

3 types of widgets

More types of information (widgets) can now be added.

Graph widgets

Various graphs can be displayed. Expression graphs can also be created and added from here.

Value widgets

Displays the latest value of a metric as a number. You can also select an expression as the metric type.

Markdown widgets

As with the previous Custom Dashboards feature, you can freely write content in Markdown format.

Use it in various situations

Dashboards are easy to create and can be put to good use in many server monitoring and operation situations!

  • Daily service status checks
  • Reference at weekly/monthly team meetings
  • Use in capacity planning
  • Reviewing the system’s status and the effects of a campaign after it has run

Automatic registration using the API

With the new Custom Dashboards, it’s still possible to import and export graphs using the API. This lets you automate the creation and editing of custom dashboards.

For more details, check out the following document.

mackerel.io

Please note that operations using the mkr command for the new Custom Dashboards are not yet supported.

The old Custom Dashboards

With this change, the old Custom Dashboards has been renamed "Legacy Custom Dashboards". You can also browse previously created dashboards here.

We recommend that you use the new Custom Dashboards feature when creating a new dashboard.

check-redis subcommand replication added

A replication subcommand has been added to check-redis. This makes it possible to check whether Redis replication is working properly. slave is a similar subcommand, but note that it will be deprecated in the future.

Organization names can now be obtained with mkr org command

The organization name can now be obtained by running mkr org. The output looks like this:

{
  "name": <name>
}

A practical DevOps Hands-on Workshop 〜 Building a safe/secure DevOps environment with AWS and Mackerel!

This is an event announcement.

This event is a hands-on workshop for beginners to Mackerel and the AWS Code series. If you’d like some practical experience with DevOps environments that combine monitoring and CI/CD pipelines, please come and join us!

Mackerel team CRE Inoue (id:a-know) will be presenting at the event!

dev.classmethod.jp (Japanese only)

Event details

  • Date and time: December 10, 2018 (Monday) 2:00 - 4:30 p.m. (Reception start: 1:30 p.m.)
  • Location: Shibuya Hikarie 11th Floor Sky Lobby Conference Room D
  • Capacity: 24 people
  • Cost: Free
  • Co-sponsors: Classmethod Inc., Hatena Co., Ltd

API Gateway now supported with AWS Integration and more

Hello! Mackerel team CRE Miura (id:missasan) here.

November has finally come and winter is steadily approaching. And with this time of the year, comes the advent calendar season. I’m sure that everyone is watching and waiting to decide which events to participate in this year. Even here at Mackerel, our annual advent calendar is in the works.

qiita.com

Here are a few of Mackerel’s advent calendars from past years.

The advent calendar was originally a calendar used to count down the days until Christmas, with each day made enjoyable by opening a little window to find small candies or chocolates. I’m excited that this year’s Mackerel advent calendar will be one that gives nice little gifts to all our users.

If you’ve tried out Mackerel for a year, we'd love to hear what made you happy or some things you struggled with. If you’ve never made an advent calendar before, why not make your debut with Mackerel? By all means!

Now on to this week’s update information.

API Gateway now supported with AWS Integration

Following CloudFront, API Gateway is now supported. Refer to the following link for details regarding obtainable metrics.

mackerel.io

AWS Integration features are being released one after another. If you haven’t changed your integration settings in a while, now is a good time to review them.

ALB now supported with mackerel-plugin-aws-waf

Up until now, mackerel-plugin-aws-waf could only obtain metrics for AWS WAF deployed to CloudFront, but with this update it’s now possible to obtain metrics for AWS WAF deployed to ALB as well.

github.com

This is a feature made possible by user contributions. Thank you very much!

A posting limit has been set for API service metric posting

Mackerel’s time series data has a granularity of 1 minute, and any metrics posted at a higher frequency than that are overwritten. With this update, we’ve set a limit on the number of service metric postings by API.

If the posting limit for an endpoint is exceeded, status 429 is returned. Be careful not to post more often than once per minute.

The service metric posting API can also send multiple metrics at once. By sending multiple metrics together instead of one by one, you can stay within the API's posting limit. For more details, refer to the help page.
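Batching on the client side could look like the sketch below, which collapses collected points to one per metric per minute and builds a single payload. It is an illustration only. The payload shape (a list of objects with name, time, and value) is an assumption to be checked against the help page, and build_batch and the metric names are hypothetical. No HTTP request is made.

```python
from typing import Dict, Iterable, List, Tuple


def build_batch(points: Iterable[Tuple[str, float, float]]) -> List[dict]:
    """Collapse (name, epoch_seconds, value) points to one per metric per minute."""
    latest: Dict[Tuple[str, int], Tuple[float, float]] = {}
    for name, timestamp, value in points:
        key = (name, int(timestamp // 60))
        # Keep only the most recent point in each one-minute slot,
        # since earlier points in the same minute would be overwritten anyway.
        if key not in latest or timestamp >= latest[key][0]:
            latest[key] = (timestamp, value)
    return [
        {"name": name, "time": int(timestamp), "value": value}
        for (name, _minute), (timestamp, value) in sorted(latest.items())
    ]


points = [
    ("queue.length", 1543400000, 10),
    ("queue.length", 1543400030, 12),  # same minute as the point above
    ("queue.latency", 1543400010, 0.5),
]
batch = build_batch(points)
print(batch)
```

The three collected points become a two-element payload that can be sent in one request instead of three.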

Monitoring solution seminar featuring: Cloud Portal x SIOS Coati x Mackerel!

Sony Network Communications (Cloud Portal), SIOS Technology (SIOS Coati), and your very own Hatena (Mackerel) are holding a seminar together.

Event outline

  • Date and time: Wednesday, December 12th, 15:00-17:30
  • Location: Akihabara UDX 4F Next-2 (2 min walk from Akihabara Station)
  • Capacity: 80 people

Sign up here

www.bit-drive.ne.jp (Japanese only)

A new feature has been added to display maintenance and incident information from the management screen etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

We recently released a new case study featuring SEGA Games and their use of Mackerel in social network gaming environments. This article goes over the introduction of Mackerel motivated by transitioning to the cloud and how it’s used in daily operation. Be sure to check it out!

Now on to this week’s update information.

A new feature has been added to display maintenance and incident information from the management screen

When Mackerel is under maintenance or if an incident has occurred, a message will now be displayed at the top of the screen as shown in the below images.

During maintenance

During an incident

Now it’s possible to see information from Mackerel’s status page in a more convenient way. Use this to help rule out false alarms. For more detailed information, continue to check the status page and follow our official Twitter timeline.

A plugin name can now be set as the User-Agent for plugins that send HTTP requests

For plugins that send HTTP requests, a plugin name such as mackerel-plugin-plack can now be set as the User-Agent of the request. Up until now, Go's standard User-Agent was used.

This makes it easy to tell from the User-Agent in an access log which plugin issued a request. Please be careful if you use the User-Agent for access restrictions etc.
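As an illustration of what this means on the wire, the sketch below builds a request that identifies its sender by plugin name. The plugins themselves are written in Go, so this Python version only mirrors the idea. The "name/version" format, the version string, the URL, and build_plugin_request are all assumptions for the example.

```python
import urllib.request


def build_plugin_request(
    url: str, plugin_name: str, version: str
) -> urllib.request.Request:
    """Build a GET request that identifies the calling plugin."""
    # Hypothetical "name/version" format; the post only says the name is set.
    user_agent = f"{plugin_name}/{version}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_plugin_request(
    "http://localhost:5000/server-status", "mackerel-plugin-plack", "0.0.0"
)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

On the server side, the access log would then show the plugin name where a generic client string used to appear.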

SEGA Games case study!

Check out our latest case study through the link below!

mackerel.io (Japanese only)

A detailed report on the incident that occurred on 9/26 (Wed) and our subsequent response

This is a detailed report regarding the incident that occurred on Wednesday (9/26) and our response following the event.

Time of the incident

  • Time of the incident: 2018/09/26 10:51-15:20 (JST)
  • Events that occurred: Malfunction of the Mackerel system and suspension of connectivity monitoring

Event timeline (JST)

10:51 Redis failover and malfunction

Because memory use had been trending upward in Redis, which is used for monitoring data storage, we carried out an operation to build a replica in order to scale up Redis. However, building the replica took time, and because Redis was unresponsive in the meantime, the clustering software (keepalived) detected it as a node failure, which caused an unintended failover.

As a result, the application server could not connect to Redis and was unable to respond properly.

10:55 Recovery and continued failure

We switched over to the appropriate Redis instance and this temporarily restored the application.

However, the application remained unstable. Due to a combination of factors, such as network errors and the Redis memory increase that had begun before the incident, application server latency deteriorated and timeouts occurred. We then restarted the application server, but the symptoms did not improve.

We’ve determined this to be the cause of the prolonged failure. The details surrounding this cause are unconfirmed and further investigation into the matter has been canceled.

11:00-14:50 Incident response

The following specific actions were made in response.

  • Detected organizations posting inappropriate metrics and cut off these requests
  • Adjusted the application server timeout interval
  • Temporarily reinforced Diamond (TSDB)
    • Increased the Kinesis shard count / Redis Cluster shard count
  • Scaled out the application servers used for the API

Following these actions, we temporarily switched to maintenance mode, and after confirming operation internally, we gradually resumed accepting external requests.

15:20 Recovery confirmation

After confirming that the application server response was stable, that metric retransmission from mackerel-agent had completed, and that the delay in reflecting metrics data to the TSDB was resolved, we declared the service restored.

Cause of the incident

This incident was caused by an unexpected failover accompanying the implementation of an operation in Redis. However, we have not been able to accurately identify the reasons behind the subsequent prolonged restoration. The following is a theory that was raised when looking back on the incident.

  • When a specific access pattern continues, waits or lock contention on the same thread pool or connection pool may cause latency to deteriorate in a Scala application.

Verifying the theory

In order to confirm the above theory, we attempted to reproduce the situation at the time of the incident in an isolated environment. Unfortunately, we were unsuccessful in reproducing an identical situation.

Future countermeasures

As previously mentioned, we were unable to pinpoint the details of the cause of this incident, but we do believe that we can prevent similar long-term failures in the future by implementing the following countermeasures.

Reviewing Redis failover behavior (implemented)

To address the unintended failover that triggered this incident, we increased the number of keepalived health checks for Redis so that a short-lived load increase no longer causes a failover.
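The effect of such a change can be sketched as follows, reading "increasing the number of health checks" as requiring more consecutive failed checks before failing over. That reading is an assumption, since the report does not give the actual keepalived settings. The function is a hypothetical illustration and not keepalived configuration.

```python
from typing import Iterable


def should_fail_over(check_results: Iterable[bool], required_failures: int) -> bool:
    """Return True once `required_failures` health checks fail in a row."""
    consecutive = 0
    for healthy in check_results:
        # Any successful check resets the failure streak.
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= required_failures:
            return True
    return False


# A short stall: two failed checks, then Redis answers again.
short_stall = [True, False, False, True, True]
print(should_fail_over(short_stall, required_failures=1))  # True
print(should_fail_over(short_stall, required_failures=3))  # False
```

With a threshold of one, the brief stall triggers a failover. With a threshold of three, it is ridden out, at the cost of reacting more slowly to a genuine node failure.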

Reinforcing the application (implemented)

We created more of a buffer for application performance. Specifically, the following was done.

  • Scaled up Redis
  • Increased the number of application servers

Improving the efficiency of monitoring data stored in Redis (implemented)

To address Redis memory usage at its root, we modified the application to store only the necessary monitoring data in Redis, which reduced Redis memory usage.

Counteracting improper requests (implemented)

  • Built a mechanism to quickly detect improper API requests
  • Created a feature to quickly block organizations making improper requests

Going forward, we will continue to improve the accuracy of detecting improper requests, and we are also considering adding other features such as API limits.

Reinforcing application monitoring

We plan to strengthen the application server’s internal process monitoring and prepare a system that can respond before another similar incident occurs.

Summary

We understand that this incident and its extended duration caused inconvenience, and we sincerely apologize. As we work hard to prevent recurrence, please trust that we will do our best to contain the problem even if a similar incident occurs.

Thank you for choosing Mackerel.