Mackerel blog #mackerelio

The Official Blog of Mackerel

mkr v0.31.1 Docker Image release etc.

Hello! Mackerel team CRE Miura (id:missasan) here.

Since the end of the rainy season, the heat has been relentless. Please take care of yourselves and watch out for heat stroke.

Now on to the update information.

mkr v0.31.1 Docker Image release

Access the Docker Image here. Going forward, we plan to publish a new Docker image with each mkr release. Use it to set up development environments or to simplify managing mkr versions.
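For example, you can pull the image and run an mkr subcommand through Docker roughly as follows. This is only a minimal sketch: the image name mackerel/mkr and passing the API key via the MACKEREL_APIKEY environment variable are assumptions, so check the official image page for the exact usage.

# Pull the mkr image (image name assumed to be mackerel/mkr)
docker pull mackerel/mkr

# Run an mkr subcommand; the API key is assumed to be passed via MACKEREL_APIKEY
docker run --rm -e MACKEREL_APIKEY=<your-api-key> mackerel/mkr hosts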

Mackerel Meetup #12 on August 2nd (Thurs.)!

Mackerel Meetup #12 will be held on Thursday, August 2nd! This time, Drecom was kind enough to let us use their seminar room as the venue.

At the event, two guests will give user sessions, and the Mackerel team will present on the latest developments and the soon-to-be-released anomaly detection feature.

Come and help us celebrate our achievement of 200 consecutive weeks of releases at this event!

mackerelio.connpass.com (Japanese only)

Retry option added to the mkr throw command

Hello! Mackerel team CRE Miura (id:missasan) here.

Just last week, we announced that we would no longer be sticking to our previous schedule of consecutive weekly releases, but we already have a new update to deliver. Going forward, we would like to keep actively sharing update information this way, making announcements via this blog and the newsletter whenever releases are made.

Now on to the update information.

Retry option added to the mkr throw command

In mkr v0.31.1, when posting metrics with the mkr throw command, you can now have failed posts retried by adding the --retry N option, where N is the number of attempts. Using this option, you can lower the chance of data loss when metrics can’t be posted, for example when a request happens to fail or when Mackerel is under maintenance. Please note that retries will not be carried out if an HTTP error such as 403/404 occurs.
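For example, posting a single metric value with retries enabled might look like the following. This is a minimal sketch: the host ID is a placeholder, and the metric line format (name, value, epoch seconds) read from standard input follows the usual mkr throw usage, so check mkr throw --help for the details.

# Post one custom metric value and retry up to 5 times if the request fails
echo "custom.sample.foobar 10 $(date +%s)" | mkr throw --host <hostId> --retry 5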

Update the mkr command and give it a try!

Reaching 200 weeks of consecutive feature releases and about Mackerel’s plans for the future

Hello! Mackerel Team Director id:daiksy here.

Last week, Mackerel finally reached its 200th week of consecutive releases!!!

The road to 200 weeks

The Mackerel service officially launched on September 17th of 2014 and has continued to release new features every week since.

For the Mackerel team, Tuesdays and Thursdays are regular release days, and work goes on in one way or another in preparation for that week’s main release. Every Monday (previously Friday), we publish what was released that week. We limit what we publicize to new features that provide new value to users, and don’t usually include bug fixes and minor modifications. In other words, a new feature of some kind is released every week without fail.

Strictly speaking, "weekly" excludes long national holiday periods such as Golden Week and the New Year's holiday, so "every business week" may be more accurate.

At the time of Mackerel’s launch in September of 2014, the service only had minimal features. For us, starting out as a small product, we believed that it was our mission to deliver new features to users as quickly as possible. And as a result of continuing to do so, we reached 100 consecutive weekly releases in June of 2016.

mackerel.io

At that time, the team celebrated and ate Mackerel themed cake!

And now two years later, Mackerel's "weekly feature release" has reached 200 consecutive weeks.

Comments from people involved

Here are a few comments about reaching 200 consecutive weekly releases that we received from some familiar faces.

Current Hatena CTO and founding Mackerel Director, id:motemen

Originally, the idea behind weekly releases started as a lighthearted way to gain momentum during the startup period, but I’m surprised to see that it’s continued for four years now. If you think about it, Mackerel has come a long way!

Previous Hatena CTO and founding Mackerel Product Owner, id:stanaka

Even before Mackerel, one of Hatena's strong points has always been its speed of service development. Likewise, I think that Mackerel has met those expectations with its weekly feature release. I hope that Mackerel continues to meet the expectations of users and that its service can grow by not only focusing on speed, but also by taking on more challenging developments in the future.

Current Product Owner, id:Songmu

I’ve been involved with Mackerel since the beginning of its official release and have been engaged in development as a product owner / manager for a long time, and I am deeply impressed by our achievement of reaching 200 consecutive weekly releases. As a team, we are determined to continue to keep up with the evolution of infrastructure technology around the world, to contribute to service development and operation processes, and provide services that will aid in business growth by continuing to develop new features that can “provide new experiences" as stated in Hatena’s mission statement.

Thank you so much for the comments everyone!

Also, here are a few pictures of the commemorative cakes that were made to celebrate at both the Kyoto and Tokyo office parties last week.

The future of Mackerel

To celebrate last week’s achievement of reaching 200 consecutive releases, we redesigned the Mackerel top page (https://mackerel.io).

This is also to show the development team’s determination to deliver "A new Mackerel for tomorrow".

Mackerel’s development roadmap, which we previously introduced at Meetups, marked some big updates for the future such as "anomaly detection" and "container support". Now, development of these new features has taken on a new significance for the team.

Over the last 200 weeks, the development team has steadily released the features that the original product owner had pictured for the service at its launch in September 2014. With that initial feature set now complete, we are moving on to features that were not part of the initial plan but will make up the feature group of the next generation, guided by a new development roadmap that will lead the next generation of Mackerel.

From here on, the development team will focus on developing these next generation features to further push Mackerel forward.

With the achievement of 200 consecutive weekly releases, we believe that Mackerel has finally finished its first stage of development, in which our goal, ever since its launch in September 2014, was to grow the small service as quickly as possible. In order to deliver a new Mackerel to users, we believe that the next step is to take a break from our "Weekly Feature Release" and to focus all of our team’s effort on developing the next big features.

Our regular release days will continue every Tuesday and Thursday, and we will continue to make announcements when new features are released; however, the pace at which releases are made may slow down a bit from now on.

We welcome your requests, appreciate your support, and promise to continue putting our full efforts into building the new Mackerel.

Thank you for choosing Mackerel.

200 weeks of consecutive releases! Alert groups feature released!

Hello! Mackerel team CRE Miura (id:missasan) here.

Mackerel's record for consecutive releases has finally reached its 200th week. And we owe it all to you guys. As always, thank you so much for all the pull requests and feedback!

To celebrate this release, we’ve redesigned the Mackerel top page.

Now on to this week’s update information.

Alert groups feature has been released

Until now, quite a few people have probably had trouble with large quantities of alerts, for example when a failure occurs somewhere with a wide range of impact, such as the network or storage. Important alerts can easily be overlooked when they get buried in such large quantities, and it becomes difficult to grasp their chronological order.

This week we’ve released the “Alert groups” feature, which consolidates multiple alerts that occur at the same time into groups, reduces the number of notifications, and makes the chronological order of alerts easier to understand.

By configuring an alert group, you can consolidate related alerts into one group by specifying the service, role, or monitor settings.

For the specifics, take a look at the following screen from the help page.

[Screenshot from the help page: alerts consolidated into an alert group called Example-Service]

In this example, alerts for multiple servers (app1.sample.com, app3.sample.com) and multiple monitoring items (connectivity, URL outline monitoring, custom.sample.foobar) are consolidated into a single group called Example-Service. Creating an alert group makes it easier to see the sequence of problems in chronological order and whether or not they’ve been closed. On top of that, since notifications are sent per alert group, inboxes and chat tools are no longer overrun by notifications. Once you receive a single alert for the alert group, you can simply follow the status of that alert group, which makes operations simpler.

If you’re unsure about what kinds of things to group, first try creating a group for a role that has multiple hosts. If you don’t have such a role, we suggest trying to set up an alert group for one service.

Create an alert group from the "Create new alert group setting" screen and configure notifications for alert groups in Notification channels.
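If you prefer to script this kind of setup, alert group settings can presumably also be managed through the Mackerel API. The following is only a rough sketch: the endpoint and field names are assumptions, so confirm them against the API documentation before use.

# Hypothetical sketch: create an alert group setting scoped to one service via the API
curl -s -X POST https://api.mackerelio.com/api/v0/alert-group-settings \
  -H "X-Api-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"name": "Example-Service group", "serviceScopes": ["Example-Service"]}'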

Be sure to check out the help page below and give this new feature a try!

mackerel.io

On June 28th (Thurs.), certain documents will be unavailable for browsing for a short period of time between 3:00 and 3:30 p.m.

Thank you for choosing Mackerel.

Mackerel Team Director id:daiksy here.

Due to scheduled maintenance, several Mackerel documents will be temporarily unavailable for browsing during the following time period.

Please note that this maintenance will not affect Mackerel's main service.

We apologize for the inconvenience and thank you for your understanding in our efforts to provide a more stable service.

User authorityType can now be obtained with the API / maintenance notice for Wednesday, July 18th etc.

This is our 199th week of consecutive releases! Only 1 more week until the big 200th!

Hello! Mackerel team CRE Miura (id:missasan) here.

As announced on this blog the other day, database maintenance will take place on Wednesday, July 18th. Please review the announcement for further details.

Now on to this week’s update information.

User authorityType can now be obtained with the API

It is now possible to get user authority information with the API that obtains the user list. Check out the API document below for more details.
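For example, the user list can be retrieved as follows; the exact name of the field that carries the authority information is best confirmed in the API document above.

# List the users in the organization; each entry includes the user's authority
curl -s -H "X-Api-Key: <your-api-key>" https://api.mackerelio.com/api/v0/users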

mackerel.io

Improvements made to mackerel-plugin-aws-kinesis-streams metrics

Problems related to obtaining the following metrics have been fixed.

  • ReadProvisionedThroughputExceeded
  • WriteProvisionedThroughputExceeded

Temporary system shutdown for database maintenance on July 18th (Wed.)

As previously announced, database maintenance, including a temporary shutdown of the system, will be carried out on Wednesday, July 18th. For more details on what is involved and the extent of the impact on the day, please see the blog post below.

mackerel.io

Understanding why the Linux loadavg rises every 7 hours

Mackerel team engineer id:itchyny here. (Mackerel is a server monitoring service, and mackerel-agent is a daemon program that collects and posts a server's metrics.)

”When mackerel-agent is installed, the loadavg rises every 7 hours”

Recently, we’ve received several inquiries like the one above from multiple customers. So I tried it out for myself, and sure enough, the issue was reproduced. I installed mackerel-agent on an EC2 t2.micro instance, configured basic log and process monitoring, and left it for a few days.

[Graph: loadavg of the test host, with peaks roughly every 7 hours]

Indeed, the loadavg rose approximately every 7 hours. I had not configured any cron job on this cycle, and nothing inside mackerel-agent runs on such a schedule either. However, the loadavg's peak value increased as more plugins were added.

This entry explains the cause of this phenomenon.

To understand the reasons why the loadavg rises, we first need to understand how the loadavg itself is calculated. So let’s first go over how Linux calculates loadavg.

The Linux loadavg is an exponential moving average of the total number of “runnable” processes in the run queue and processes waiting on disk I/O (uninterruptible sleep). The following are the Linux process states.

 % man ps | grep -A 10 "^PROCESS STATE"
PROCESS STATE CODES
       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display to describe the state of a process:

               D    uninterruptible sleep (usually IO)
               R    running or runnable (on run queue)
               S    interruptible sleep (waiting for an event to complete)
               T    stopped, either by a job control signal or because it is being traced
               W    paging (not valid since the 2.6.xx kernel)
               X    dead (should never be seen)
               Z    defunct ("zombie") process, terminated but not reaped by its parent

The loadavg is the value obtained by smoothing the number of processes in states R and D with an exponentially weighted average.
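You can observe both sides of this on a running system, for example with standard tools like the following.

# Count the processes currently in state R (runnable) or D (uninterruptible sleep);
# this instantaneous count is what the kernel smooths into the loadavg
ps -eo stat= | grep -c '^[RD]'

# The smoothed 1-, 5- and 15-minute averages can be read directly from /proc/loadavg
cat /proc/loadavg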

Let's write the total number of running and uninterruptible processes subject to the loadavg as p(t), a function of time t. If the sampling interval is \Delta t and the time constant is T, the exponential moving average of the number of processes, that is, the exponentially decaying weighted average, can be written as follows [1].

$$ \begin{align} L(t) &= \big(1 - e^{-\Delta t / T}\big) \big( p(t) + e^{-\Delta t / T} p(t - \Delta t) + e^{-2\Delta t / T} p(t - 2 \Delta t) + \cdots \big) \\ &= \big(1 - e^{-\Delta t / T}\big) p(t) + e^{-\Delta t / T} L(t - \Delta t) \end{align} $$

In other words, the exponential moving average of p(t) accumulated over time can be expressed as a weighted average of the previously calculated value L(t-\Delta t) and the current value p(t). Even without keeping a history of the number of processes, we can calculate the new loadavg from the previous loadavg and the current number of processes.
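As a small sketch of this recurrence (not the kernel's implementation), the 1-minute average can be simulated by feeding one process-count sample every \Delta t = 5 seconds into the formula with T = 60; the input file name here is just a placeholder.

# Apply L = (1 - e)*p + e*L with e = exp(-5/60) to one process-count sample per line
awk 'BEGIN { e = exp(-5/60); L = 0 } { L = (1 - e) * $1 + e * L; print L }' process_counts.txt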

Based on this recurrence formula, selecting appropriate constants and comparing the simulated values (\Delta t = 5, T = 60, 300, 900) against the actual values from Linux (run yes > /dev/null & on an EC2 t2.micro and kill that process after 10 minutes) gave the following results.

[Graph: simulated loadavg values compared with the actual values reported by Linux]

Let’s confirm that the loadavg is actually being calculated based on the above recurrence formula while referring to the Linux source code. First, let’s look into kernel/sched/loadavg.c.

/* Variables and functions for calc_load */
atomic_long_t calc_load_tasks;
unsigned long calc_load_update;
unsigned long avenrun[3];

void calc_global_load(unsigned long ticks)
{
    unsigned long sample_window;
    long active, delta;

    sample_window = READ_ONCE(calc_load_update);
    if (time_before(jiffies, sample_window + 10))
        return;

    // ...

    active = atomic_long_read(&calc_load_tasks);
    active = active > 0 ? active * FIXED_1 : 0;

    avenrun[0] = calc_load(avenrun[0], EXP_1, active);
    avenrun[1] = calc_load(avenrun[1], EXP_5, active);
    avenrun[2] = calc_load(avenrun[2], EXP_15, active);

    WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ);

    // ...
}

The loadavg values are stored in avenrun. We can see that, in order to update this value periodically, calc_load_update is compared against jiffies (a variable representing time that increases by 1 every tick) and is advanced by LOAD_FREQ each time. In other words, the loadavg is updated every LOAD_FREQ. As described in include/linux/sched/loadavg.h, this value is 5*HZ+1. Since HZ is the number of ticks per second [2], we can see that the loadavg is updated every 5 seconds plus one tick. Now let’s look at calc_load, which performs the actual calculation.

/*
 * a1 = a0 * e + a * (1 - e)
 */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    unsigned long newload;

    newload = load * exp + active * (FIXED_1 - exp);
    if (active >= load)
        newload += FIXED_1-1;

    return newload / FIXED_1;
}

Ignoring the branching statement, we can write it as one expression [3].

 avenrun[0] = ((FIXED_1 - EXP_1) * active + EXP_1 * avenrun[0]) / FIXED_1;

Here, according to include/linux/sched/loadavg.h, EXP_1 = 1884 and FIXED_1 = (1<<11). Substituting \Delta t = 5 and T = 60 into the recurrence formula obtained above, we get the following.

$$ \begin{align} L(t) = \big(1 - e^{-1/12}\big) p(t) + e^{-1/12} L(t - \Delta t) \end{align} $$

Taking into account that Linux uses values shifted left by 11 bits (i.e. scaled by FIXED_1) so that the calculation can be done with integer operations only, and that e^{-1/12} 2^{11} is roughly 1884, we can see that the calculation in Linux indeed matches the recurrence formula. The 5- and 15-minute loadavg values can be checked in the same way.
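The constant can be checked numerically with a one-liner: e^{-1/12} scaled by FIXED_1 = 2^11 = 2048 comes out to roughly 1884, matching EXP_1.

awk 'BEGIN { print exp(-1/12) * 2048 }'   # prints about 1884.25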

The loadavg is updated approximately every 5 seconds using the number of processes in the run queue at that moment (calc_load_tasks) and the recurrence formula. To be exact, it is updated every 5*HZ+1 ticks, and this +1 is the cause of the loadavg rising roughly every 7 hours.

mackerel-agent spawns the plugin processes to collect metrics every minute. Since the loadavg is recalculated approximately every 5 seconds, the timing at which mackerel-agent starts collecting metrics and the timing of the loadavg recalculation periodically fall in sync.

In an environment where HZ is 1000, the loadavg is updated every 5.001 seconds. After 5000 updates, that is, 25005 seconds (6 hours, 56 minutes and 45 seconds), the accumulated +1 ticks add up to a full 5*HZ. This cycle is not a multiple of 60 seconds, which is mackerel-agent's metric collection interval; however, every 6 hours and 57 minutes the timing of spawning the plugin processes and the loadavg update roughly overlap (off by only 0.003 seconds).
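As a quick check of these figures, the 5003rd update falls only 0.003 seconds after a multiple of 60 seconds:

$$ 5003 \times 5.001\,\mathrm{s} = 25020.003\,\mathrm{s} \approx 417 \times 60\,\mathrm{s} = 6\ \mathrm{hours}\ 57\ \mathrm{minutes} $$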

In an environment where HZ is 250, repeating 5*HZ+1 ticks 1250 times comes out to 6255 seconds (1 hour 44 minutes 15 seconds). The loadavg tends to rise on this cycle as well, but the offset from mackerel-agent's metric collection timing shifts by 0.012 seconds each cycle. Four of these cycles, that is, 6 hours and 57 minutes, is a multiple of 60 seconds and overlaps perfectly with mackerel-agent's metric collection timing. In my own case HZ was 250. If you look at the graph at the beginning of this entry, you can see peaks of uniform height with a cycle of 6 hours 57 minutes, and also smaller peaks 1 hour 44 minutes before and after them. This is the behavior that results from HZ being 250.
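The HZ = 250 numbers check out the same way: four of the 6255-second cycles land exactly on a multiple of 60 seconds.

$$ 1250 \times \frac{5 \times 250 + 1}{250}\,\mathrm{s} = 6255\,\mathrm{s}, \qquad 4 \times 6255\,\mathrm{s} = 25020\,\mathrm{s} = 417 \times 60\,\mathrm{s} $$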

So why isn’t the loadavg update interval exactly 5 seconds? In Linux, various processes run on their own cycles, and if a process with a 5-second interval exists, it could sync up with the loadavg recalculation timing. To prevent the loadavg from rising unintentionally due to this kind of timing coincidence, the cycle is deliberately shifted slightly away from 5 seconds. For more details, take a look at the mailing list or the commit message of the patch that added 1 to 5*HZ.

The loadavg update interval in Linux is intentionally made slightly different from 5 seconds. This slight deviation accumulates and lines up with a multiple of 60 seconds roughly every 6 hours 57 minutes. mackerel-agent spawns plugin processes every 60 seconds, and this timing periodically coincides with the loadavg recalculation. This is the cause of the phenomenon in which the loadavg periodically spikes when mackerel-agent is installed. The same phenomenon has also been reported with other monitoring tools such as collectd and Telegraf.

References


  1. You can verify that the sum of the coefficients is 1. This is a necessary condition for being a weighted average.

  2. In this entry HZ refers to the kernel timer frequency CONFIG_HZ, which must be distinguished from the userland frequency USER_HZ. USER_HZ is the value obtained from getconf CLK_TCK and /proc/stat, which is defined as 100 on x86. The kernel timer frequency can be checked with grep "CONFIG_HZ=" /boot/config-$(uname -r) or by watching the increment of watch -n1 "grep ^jiffies /proc/timer_list | head -n1".

  3. The branching procedure is required so that a system with no processes has a loadavg of 0.0 and a system with one process has 1.0. If this procedure does not exist, the loadavg will never reach 1.0, and if added without branching, it will not reach 0.0 on idle. It used to be that the value was rounded off, but the problem of not reaching 0.0 in an idle state was identified and it now the value is rounded up when the number of processes increases. For details, refer to the patch and the mailing list.