
Thank you for your continued use of Mackerel.
We sincerely apologise for the prolonged inconvenience and concern caused by the historical metric loss incident that occurred on Mackerel on 6 November 2025.
We hereby report the overview, cause, and future recurrence prevention measures for this incident.
Incident Summary
- Period: 6 November 2025, 10:52 to 15 November 2025, 10:50 (JST; same applies below)
  Incident: Metrics for specific periods were not displayed in certain organisations

Additionally, the following secondary service impacts occurred temporarily during recovery operations.

- Period: 7 November 2025, 06:09 to 07:54
  Incident: Partial failures in posting traces
- Period: 7 November 2025, 06:13 to 08:17
  Incident: Metric display and the associated monitoring were unstable
- Period: 7 November 2025, 16:24 to 17:17
  Incident: Monitoring via queries on labeled metrics was unstable
- Period: 7 November 2025, 16:45 to 17:30
  Incident: Partial failures in posting labeled metrics
All metrics previously missing from display have now been restored. Please note that no customer data was compromised as a result of this incident.
Timeline
- 6 November 2025
- 10:52 Started the migration from the Redis cluster to the Valkey cluster.
- 11:46 While investigating the migration's anomalous status, found that some metrics were not being displayed.
- 11:50 Formed an incident response team.
- 12:15 Announced the metric viewing outage.
- 12:46 Initiated recovery of the past 24 hours of metric data from Amazon Kinesis Data Streams.
- 7 November 2025
- 07:30 Completed recovery of the past 24 hours' worth of one-minute granularity metrics.
- 17:26 Identified the missing period as 30 October 2025 to 5 November 2025, 13:00 (hereafter "the affected period").
- 18:00 Decided to restore the missing data for the affected period by recalculating metrics at 5-minute granularity or coarser, beginning with granularities of 1 hour or more.
- 9 November 2025
- 12:43 Completed restoration of service metrics at 5-minute granularity or coarser for the affected period.
- 14:27 Completed restoration of labeled metrics at 5-minute granularity or coarser for the affected period.
- 10 November 2025
- 14:57 Completed restoration of role metrics at 5-minute granularity or coarser for the affected period.
- 15:24 Announced service stabilisation.
- 15 November 2025
- 10:50 Completed restoration of host metrics at 5-minute granularity or coarser for the affected period.
- 17 November 2025
- 17:11 Announced the restoration of all metrics.
Cause of the Outage
The outage was caused by an unintended failure during the migration of metrics from the Redis cluster to the Valkey cluster, resulting in the loss of some data.
The time-series database storing Mackerel's metrics is composed of multiple services on a public cloud. Among these, Amazon ElastiCache is used to hold posted data points and write them out in batches. ElastiCache had previously run on the Redis engine. Because the Redis version in use was approaching its end of life (EoL), we migrated to Valkey, an equivalent implementation. A live, zero-downtime method exists for migrating from Redis to Valkey, and verification in the staging environment had confirmed that the migration succeeded there.
However, when the migration was executed in the production environment, an unintended failure occurred partway through, and the data points stored in the shards of some nodes were lost.
Future Actions
The migration from Redis to Valkey is now complete, so we do not expect this specific issue to recur. Although recovery took time, we successfully restored the missing metrics from backup data and other sources.
Moving forward, we are implementing measures to prevent data loss from unintended migration failures, to shorten recovery times, and to address the secondary issues that arose during recovery.
- Re-evaluation of the time-series database update process. Particular attention will be paid to preventing data loss, including consideration of non-live migration procedures (e.g., duplicate writes).
- Investigation into migrating metrics stored in Valkey to more resilient storage, such as Amazon DynamoDB, at an earlier stage. This aims to enable faster recovery should data on Valkey become lost.
- Formulation of recovery procedures, enabling quicker restoration when data is lost.
- Monitoring of AWS Lambda concurrent executions. During recovery, exceeding the Lambda concurrency quota caused secondary failures.
- Monitoring of Amazon S3 error counts, and reduction of GetObject request frequency by aggregating objects. During recovery, exceeding Amazon S3 GetObject rate limits caused secondary failures.
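To illustrate the "duplicate writes" idea from the first measure above, the following is a minimal sketch. It assumes redis-py-style clients for the old and new clusters; the class, key, and method names are hypothetical and do not reflect Mackerel's actual implementation. A dict-backed stand-in client is included so the sketch is self-contained.

```python
class DualWriter:
    """During a non-live migration, mirror every write to both clusters.

    Reads continue to be served from the old cluster until the new
    cluster has been verified, so a failed cutover loses no data.
    """

    def __init__(self, old_client, new_client):
        self.old = old_client  # source of truth during migration
        self.new = new_client  # candidate cluster being populated

    def write_point(self, key, timestamp, value):
        member = f"{timestamp}:{value}"
        # Write to the old cluster first; it remains authoritative.
        self.old.zadd(key, {member: timestamp})
        # Mirror the write to the new cluster. A failure here must not
        # break ingestion, so in production it would be logged and the
        # missed range queued for backfill rather than raised.
        try:
            self.new.zadd(key, {member: timestamp})
        except Exception:
            pass

    def read_range(self, key, start, end):
        # Reads stay on the old cluster until cutover.
        return self.old.zrangebyscore(key, start, end)


class _FakeSortedSetStore:
    """Dict-backed stand-in for a redis-py client, for illustration only."""

    def __init__(self):
        self.data = {}

    def zadd(self, key, mapping):
        self.data.setdefault(key, {}).update(mapping)

    def zrangebyscore(self, key, lo, hi):
        return sorted(m for m, s in self.data.get(key, {}).items() if lo <= s <= hi)


old, new = _FakeSortedSetStore(), _FakeSortedSetStore()
writer = DualWriter(old, new)
writer.write_point("metric:host1:loadavg5", 1730246400, 0.42)
```

After the new cluster has caught up and been verified, reads are switched over and the old cluster retired; at no point is either cluster the only copy of recent data points.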
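The two monitoring measures above can be sketched as CloudWatch alarm definitions. This is a hedged example, not Mackerel's actual configuration: the alarm names, thresholds, and headroom ratio are illustrative assumptions; only the metric names and namespaces (`ConcurrentExecutions` in `AWS/Lambda`, and `5xxErrors` in `AWS/S3` request metrics, which must be enabled per bucket and which count the 503 SlowDown responses returned on GetObject rate limiting) are standard CloudWatch metrics.

```python
def lambda_concurrency_alarm(account_limit, headroom=0.8):
    """Alarm parameters for account-wide Lambda concurrency nearing its quota.

    `account_limit` is the account's concurrent-execution quota;
    `headroom` (an assumed ratio) sets how close to the quota we alert.
    """
    return {
        "AlarmName": "lambda-concurrent-executions-near-quota",  # illustrative name
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": account_limit * headroom,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def s3_slowdown_alarm(bucket, filter_id="EntireBucket", threshold=100):
    """Alarm parameters for elevated S3 5xx errors (e.g. 503 SlowDown).

    Requires S3 request metrics to be enabled on the bucket with the
    given metrics-configuration FilterId. The threshold is illustrative.
    """
    return {
        "AlarmName": f"s3-{bucket}-5xx-errors",  # illustrative name
        "Namespace": "AWS/S3",
        "MetricName": "5xxErrors",
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": filter_id},
        ],
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Each dict can be passed to boto3 as `cloudwatch.put_metric_alarm(**params)`; building the parameters as plain data keeps them easy to review and test.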
Summary
We would like to reiterate our sincere apologies for the inconvenience caused to our customers due to this prolonged outage. We are committed to preventing recurrence and enhancing system stability, whilst continuing our efforts to minimise the impact of future incidents.