Apology and Detailed Report on the Overcounting of Cloud Integration Metrics

This is Mackerel’s Assistant Producer, id:wtatsuru.

From February 20, 2019, to April 11, 2022, the number of metrics used to determine the number of billable cloud hosts registered through Mackerel’s Cloud Integration function was overcounted, causing extra charges for customers using the function. After discovering this issue on March 30, 2022, we released two reports detailing the issue and our action plan: the first on April 4 (link to report), and the second on April 28 (link to report). Now that we have completed a thorough investigation and are ready to refund customers for the extra charges incurred, we are providing a full report of the issue below, together with our apologies and the refund details. We sincerely apologize for any inconvenience this issue has caused our customers.

Affected Services and Period Affected

AWS Integration - WAF
- From March 23, 2020, to March 31, 2022
AWS Integration - Redshift
- From February 20, 2019, to April 8, 2022
AWS Integration - Step Functions
- From May 9, 2019, to April 8, 2022
Google Cloud Integration - App Engine
- From September 24, 2020, to April 11, 2022
Google Cloud Integration - Compute Engine
- From September 24, 2020, to April 11, 2022

Issue

Due to defects in Mackerel’s system, the number of metrics acquired through Cloud Integration was overcounted for some customers using specific configurations. When this caused the upper limit of metrics per host to be exceeded, customers were billed for more hosts than they actually used.

On Mackerel’s system, standard hosts have a limit of 200 metrics while micro hosts have a limit of 30 metrics. When the number of metrics posted exceeds the plan limit for the particular type of host, it is counted as a host overage, which incurs additional fees. The number of metrics posted for a host is counted at a fixed interval of one hour or less. Please refer to the FAQ below for details on this calculation.

Handling of host conversion when plan limits are exceeded – Mackerel Support
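
As a rough illustration only, the limit check described above can be sketched as follows. The limits of 200 and 30 come from this report; the function name and the sample numbers are hypothetical, and the conversion of excess metrics into billable hosts is left to the FAQ rather than reproduced here.

```python
# Per-host metric limits stated in this report.
METRIC_LIMITS = {"standard": 200, "micro": 30}


def excess_metrics(host_type: str, counted_metrics: int) -> int:
    """Return how many counted metrics exceed the plan limit for one host.

    Hypothetical helper for illustration; not Mackerel's billing code.
    """
    limit = METRIC_LIMITS[host_type]
    return max(0, counted_metrics - limit)


# Hypothetical standard host: 190 distinct metrics, 15 of them counted twice.
actual = 190
overcounted = actual + 15

print(excess_metrics("standard", actual))       # 0
print(excess_metrics("standard", overcounted))  # 5
```

In this made-up case the host stays within its limit on actual usage, but the duplicate counts push it over the limit and trigger an overage.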

Sequence of events

March 30, 2022 - Issue discovered following an inquiry from a customer; investigation started
April 4, 2022 - First Report: Report on Overcounting of Metrics for Linked AWS WAF Integration - Announcements
April 11, 2022 - Fix applied to prevent overcounting of metrics
April 28, 2022 - Announcement about the completion of investigations on the scope of impact: Follow-up report: Report on the Overcounting of Cloud Integration Metrics - Announcements
June 15, 2022 - Final report (This announcement)

Root cause

When using Mackerel’s Cloud Integration, the customer’s cloud services are managed as Mackerel hosts, and customers can acquire and monitor their metrics. To use this function, customers register their cloud access keys on Mackerel, allowing Mackerel’s system to use the access information to send requests to the cloud platform. The data retrieved from the cloud service is then converted to Mackerel metrics, which are posted to the customer’s organization.

Mackerel automatically generates the names under which metrics acquired through Cloud Integration appear on Mackerel’s graphs. The following issues occurred on Mackerel’s system when these metric names were generated and posted, causing the number of metrics used to determine the number of billable hosts to be overcounted compared to actual usage.

Cases where different metrics were posted under the same metric name

In this case, only one of the metrics was displayed. However, because metrics with the same name were posted multiple times behind the scenes, they were counted multiple times in the number of metrics used to determine the number of billable hosts. The possibility that specific cloud services and customer configurations could cause different metrics to be posted under the same metric name was not anticipated during development, so Mackerel’s system was not equipped to handle such cases.

Cases where the same metric was posted multiple times

Due to defects in the process of calculating the number of metrics retrieved from cloud services on Mackerel’s system, there were cases in which the same metric was retrieved multiple times. In such cases, the duplicates were likewise omitted from the display but included in the number of metrics used to determine the number of billable hosts.
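
The gap between what was displayed and what was counted in both cases can be illustrated with a minimal sketch. It assumes posts are identified by metric name only; the function name and the name suffixes are placeholders, and this is not Mackerel's actual implementation.

```python
from collections import Counter


def count_metrics(posted_names):
    """Return (displayed, counted, duplicated) for one host and one interval.

    Hypothetical helper: 'displayed' treats each distinct name as one metric,
    while 'counted' includes every post, as in the defect described above.
    """
    per_name = Counter(posted_names)
    displayed = len(per_name)
    counted = sum(per_name.values())
    duplicated = {name: n for name, n in per_name.items() if n > 1}
    return displayed, counted, duplicated


# Hypothetical posts; the suffixes after "rule_a" are placeholders.
posted = [
    "waf.web_acl_requests.rule_a.metric_1",
    "waf.web_acl_requests.rule_a.metric_1",  # posted a second time
    "waf.web_acl_requests.rule_a.metric_2",
]
displayed, counted, duplicated = count_metrics(posted)
print(displayed, counted)  # 2 3
```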

Affected configurations for each cloud service

AWS Integration - WAF
- Under either of the following circumstances, the 5 metrics named waf.web_acl_requests.#.* were acquired as many times as they were included in the rules.
  - On WAF v2, a non-managed rule group is used and metrics are published to CloudWatch
  - On WAF Classic, a rule and a rule group are added to the WebACL under the same name
AWS Integration - Redshift
- For the following metrics, when multiple Amazon Redshift clusters exist on an AWS account, the metrics within the same cluster could be retrieved multiple times.
  - custom.redshift.query_runtime_breakdown.*
  - custom.redshift.wlm_query_throughput.*
  - custom.redshift.wlm_query_duration.*
  - custom.redshift.wlm_queue_length.*
AWS Integration - Step Functions
- When Lambda is referenced using a qualified ARN, the names of the following 5 metrics overlapped and were retrieved multiple times.
  - custom.states.lambda_functions.#.scheduled
  - custom.states.lambda_functions.#.started
  - custom.states.lambda_functions.#.timed_out
  - custom.states.lambda_functions.#.failed
  - custom.states.lambda_functions.#.succeeded
Google Cloud Integration - App Engine
- When the App Engine app’s traffic was spread across multiple zones or locations, the following metrics were retrieved in duplicate under the same name for each zone or location.
  - GAE Application
    - Same for both Flexible Environment and Standard Environment
      - appengine.http.server.dos_intercept.count
      - appengine.http.server.quota_denial.count
      - appengine.http.server.response_count.#.count
      - appengine.http.server.response_count.#.loading_count
      - appengine.http.server.response_latencies.#.mean
      - appengine.http.server.response_latencies.#.loading_mean
      - appengine.http.server.response_style.#.count
      - appengine.http.server.response_style.#.count_cached
      - appengine.memcache.#.centi_mcu_count.count
      - appengine.memcache.#.operation.count
      - appengine.memcache.#.bytes.received.*
      - appengine.memcache.#.bytes.sent.*
    - Flexible Environment
      - appengine.flex.connections.current.count
      - appengine.flex.cpu.reserved_cores.count
      - appengine.flex.cpu.utilization.utilization
      - appengine.flex.disk.bytes.read
      - appengine.flex.disk.bytes.write
      - appengine.flex.network.bytes.received
      - appengine.flex.network.bytes.sent
    - Standard Environment
      - appengine.system.cpu.usage.*
      - appengine.system.memory.usage
      - appengine.system.network.bytes.received
      - appengine.system.network.bytes.received_cached
      - appengine.system.network.bytes.sent
      - appengine.system.network.bytes.sent_cached
  - GAE Instance
    - Flexible Environment
      - appengine.flex.instance.#.connections.current
      - appengine.flex.instance.#.cpu.utilization.utilization
      - appengine.flex.instance.#.network.bytes.received
      - appengine.flex.instance.#.network.bytes.sent
Google Cloud Integration - Compute Engine
- When an instance is served by a Layer 3 load balancer and some of its traffic goes through the load balancer while other traffic does not, the following metrics were retrieved multiple times.
  - gce.instance.network.received
  - gce.instance.network.sent
  - gce.instance.network_packets.received
  - gce.instance.network_packets.sent
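
To illustrate how a configuration such as the App Engine case above can yield several series under one name, here is a minimal sketch. The data shape, the zone labels, and the naming function are all hypothetical; the only point taken from this report is that one series per zone ended up under the same metric name.

```python
from collections import defaultdict

# Hypothetical time series, one per zone, as a monitoring API might return them.
series = [
    {"metric": "system.memory.usage", "zone": "zone-a", "value": 10.0},
    {"metric": "system.memory.usage", "zone": "zone-b", "value": 12.0},
    {"metric": "system.network.bytes.sent", "zone": "zone-a", "value": 3.0},
]


def to_metric_name(s):
    # Hypothetical naming rule: the zone is not part of the generated name.
    return f"appengine.{s['metric']}"


def group_by_name(series_list):
    """Map each generated metric name to the zones that produced it."""
    groups = defaultdict(list)
    for s in series_list:
        groups[to_metric_name(s)].append(s["zone"])
    return dict(groups)


collisions = {n: z for n, z in group_by_name(series).items() if len(z) > 1}
print(collisions)  # {'appengine.system.memory.usage': ['zone-a', 'zone-b']}
```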

Fixes and future prevention

By April 11, we had applied patches so that a metric with a given name is posted only once on Mackerel even when it is retrieved multiple times. As a result, the number of metrics used to determine the number of billable hosts will not exceed the number of metrics displayed on the web console. To prevent recurrence, we will do two things: first, add a review process that prevents the same issue from being introduced when new metrics are added; and second, monitor the system to detect cases where metrics have overlapping names.
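
The behavior of the fix and of the planned monitoring can be sketched as follows. This is an illustration under assumptions, not the patch itself: the function name is hypothetical, and keeping the first value seen for a name is an arbitrary choice made for the example.

```python
def dedupe_by_name(metrics):
    """Post each metric name once and report names that were seen repeatedly.

    'metrics' is a list of (name, value) pairs. Keeping the first value is an
    assumption for this sketch; the report does not say which one is kept.
    """
    unique = {}
    overlapping = set()
    for name, value in metrics:
        if name in unique:
            overlapping.add(name)  # candidate for an overlap alert
        else:
            unique[name] = value
    return unique, sorted(overlapping)


# Hypothetical retrieval result for one host.
posts = [
    ("gce.instance.network.sent", 10.0),
    ("gce.instance.network.sent", 4.0),
    ("gce.instance.network.received", 7.0),
]
unique, overlapping = dedupe_by_name(posts)
print(len(unique))   # 2
print(overlapping)   # ['gce.instance.network.sent']
```

With this approach the counted number of metrics equals the number of distinct names, and the list of overlapping names gives monitoring something to alert on.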

The issue whereby different metrics are assigned the same name and only one of them is displayed has yet to be fixed. We will continue to explore fixes for each cloud service provider. We apologize for the inconvenience caused.

Refund of usage fee payments

We will refund payments for excessive billings that may have been caused by this defect. We will contact the owners of the affected organizations by email, so please check your inbox. Customers who are not contacted individually are not affected.

The number of hosts applicable for the refund shall be calculated as follows.

- For hosts of the aforementioned cloud services that are using the Cloud Integration and were subject to additional host fees for exceeding metrics limits within the given plan
  - All additional fees charged for exceeding metrics limits within the given plan will be included in the calculation

Depending on the customer’s configurations, some metric names and the number of metrics can change, making it difficult to accurately track the full impact of this defect. Therefore, regardless of whether an overage was actually caused by this defect, every host whose metrics may have been overcounted compared to actual usage will be refunded for all additional hosts charged during the affected period.
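
The refund rule above can be illustrated with a minimal sketch. The affected periods are those listed earlier in this report; the service keys, the record fields, and the inclusive treatment of the end dates are assumptions made for the example, not a description of our billing system.

```python
from datetime import date

# Affected periods from this report; the dictionary keys are hypothetical labels.
AFFECTED_PERIODS = {
    "aws-waf": (date(2020, 3, 23), date(2022, 3, 31)),
    "aws-redshift": (date(2019, 2, 20), date(2022, 4, 8)),
    "aws-step-functions": (date(2019, 5, 9), date(2022, 4, 8)),
    "gcp-app-engine": (date(2020, 9, 24), date(2022, 4, 11)),
    "gcp-compute-engine": (date(2020, 9, 24), date(2022, 4, 11)),
}


def refundable_hosts(charges):
    """Sum additional hosts charged for affected services within their periods.

    Each charge is a dict with hypothetical fields 'service', 'day', and
    'additional_hosts'. All matching overage charges count, whether or not
    they were caused by the defect.
    """
    total = 0
    for charge in charges:
        period = AFFECTED_PERIODS.get(charge["service"])
        if period is None:
            continue  # service not affected
        start, end = period
        if start <= charge["day"] <= end:  # end date treated as inclusive
            total += charge["additional_hosts"]
    return total


# Hypothetical charge records.
charges = [
    {"service": "aws-waf", "day": date(2021, 1, 15), "additional_hosts": 2},
    {"service": "aws-waf", "day": date(2022, 4, 5), "additional_hosts": 1},
    {"service": "gcp-app-engine", "day": date(2020, 9, 24), "additional_hosts": 3},
    {"service": "other-service", "day": date(2021, 1, 15), "additional_hosts": 5},
]
print(refundable_hosts(charges))  # 5
```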

Again, we apologize for any inconvenience caused by this matter. We will continue to make every effort to improve the quality of our service, and we ask for your continued support of Mackerel.