Mackerel blog #mackerelio

The Official Blog of Mackerel

Regarding false connectivity monitoring alerts from Mackerel related to .io domain malfunctions and our response

Mackerel sub-producer id:Songmu here. I’d like to sincerely apologize to all of our users for the frequent inconveniences related to the subject matter.

In this blog, I’ll go over the details of this matter and our response hereafter.

About false connectivity monitoring alerts

In Mackerel, when metric posting from mackerel-agent gets interrupted for a certain period of time, the server goes down and connectivity monitoring alerts are sent out.

Currently, Domain Name Resolution for mackerel.io is unstable. As a result, name resolution is failing and the mackerel-agent is unable to post metrics due to temporary inaccess of Mackerel, thus causing the Mackerel system to recognize the server as having gone down and sending out false connectivity monitoring alerts.

Regarding name resolution instability for mackerel.io

Currently, one section of the DNS server, the io domain authority, is temporarily returning domains which should exist as NXDOMAIN (domain does not exist) by mistake. Name resolution is failing for this reason and we are reviewing the matter.

Additionally, because the negative cache TTL is set to 900 seconds, name resolution fails for up to 15 minutes.

Although this matter is only temporary, we have confirmed that it has occurred more than once and may occur again in the future. Therefore, we are implementing the following tentative solutions.

Temporary solutions

In order to keep false alerts to a minimum, the following solutions are currently being implemented.

  • extending the mackerel.io domain TTL
  • temporarily suspending connectivity monitoring component when mackerel.io name resolution is unstable
  • configuring the connectivity monitoring judgment interval for longer than the negative cache TTL

We have managed to reduce the amount of false alerts through these measures, but the reliability of connectivity monitoring is still low. Especially when the server actually goes down, reporting takes 15 minutes or more. So, we recognize that separate fundamental measures are also required.

Fundamental measures

Over the next month or so, we will acquire a new TLD domain that is more reliable and switch mackerel-agent to use that domain. For backward compatibility, the current mackerel.io will continue running concurrently. In order to use the new domain, you’ll need to upgrade mackerel-agent.

Additionally, although still under review, we are considering setting up a mechanism to use an alternate route (another domain etc.) if mackerel-agent fails to post in Mackerel.

Again, I sincerely apologize for any inconvenience this matter may have caused. In the future, we will do our best maintain a stable operation. Thank you for choosing Mackerel.