Mackerel blog #mackerelio

The Official Blog of Mackerel

(Appended)【Urgent Maintenance】The system will temporarily shut down for database maintenance on Feb. 2 (Thur) at 2:30 pm

Thank you for choosing Mackerel.

【Added content】

Regarding the scheduled maintenance, the problem is being resolved and the maintenance has been cancelled. For more details, refer to the following entry.

mackerel.io

【End of added content】

The system will be temporarily suspended for maintenance according to the schedule below. We apologize for any inconvenience that this may cause and appreciate your understanding.

Moreover, we apologize for such a short notice due to the urgency of this maintenance.

Scheduled date and time

  • February 7th (Thur) 2:30 pm - 4:30 pm (JST)
    • This time window is the longest case scenario. A breakdown of the procedure of operations will follow below.

Content

  • Database maintenance

Regarding impact on the day of

  • The above time window is an estimate of the longest case scenario. The actual maintenance period will end once the work has been completed.
    • Once maintenance has begun, the entire Mackerel system will shutdown for a short period of time.
  • Web access to Mackerel, data posting by the agent, API access (including the CLI tool), alert notifications, etc. will be unavailable
  • As soon as the maintenance work is completed and operation is confirmed, the system shutdown will end and an announcement will be made
  • As for mackerel-agent metric posting, data will be buffered from mackerel-agent during the maintenance period and resent after maintenance has been completed
    • If resent properly, graphs during the maintenance period will also be displayed.

Reason for maintenance

In PostgreSQL (RDS), a data store used by Mackerel, autovacuum isn’t running properly. In the current state, service operations have not been affected, but if this state continues, the transaction ID will be exhausted and the update function for the database will be forcibly stopped. For this reason, we’ve decided that urgent maintenance is necessary.

Related to this, a similar Bug Report (https://www.postgresql-archive.org/found-xmin-from-before-relfrozenxid-on-pg-catalog-pg-authid-td6011810.html ) was submitted and found that the problem can be restored by rebooting.

We will try this during the maintenance period. In addition, as a result of preliminary inspection, we’ve confirmed that autovacuum can be run on DB restored from snapshot. And if the problem can’t be resolved by rebooting, that will be the next issue addressed.

Procedure of operations

  • Temporary system shutdown
  • RDS reboot
    • If recovery is confirmed here, maintenance will end
  • Replace with DB restored from snapshot
  • After recovery is confirmed, maintenance will end

The 2 hour maintenance time window is an estimation of recovery time based on the snapshot. If the issue is resolved by rebooting, maintenance will end at that point.

Regarding Announcements

Situation reports will be made from the Mackerel status page (http://status.mackerel.io) as well as from this blog (https://mackerel.io/blog/).

Additionally, we’ll also be using our official Twitter account (https://twitter.com/mackerelio_jp).

For inquiries related to this matter

Please send all inquiries regarding this matter to support@mackerel.io.

Thank you again for your understanding and cooperation. And thank you for choosing Mackerel.