2012-04-12 21:32:52 Failing Forward While Stumbling, Eventually You Regain Your Balance
by Brian Clapper
First, I want to sincerely apologize to anyone adversely affected by yesterday's false alerts. False alerts are something we are very conscious of when rolling out new changes, and clearly something I hope never to repeat.
How did it happen? First, a quick rundown of the systems involved. As data streams into the system from the brokers, it is sent over RabbitMQ to a group of Complex Event Processors (CEPs) running Esper. In addition, the last collected value for each unique metric is stored in Redis for quick lookups. The CEPs are responsible for identifying when a value has triggered an alert and then telling the notification system about it.
Yesterday we were working on a bug in the CEP system: under certain conditions, if a value went from bad to good while we were restarting the service, we might never trigger an "all clear" event, and your alert would never clear. After vigorous testing in our development environment, we thought we had it fixed and all our (known) corner cases tested.
So the change was deployed to one of the CEP systems to verify it in production. For the first few minutes all was well: stale alerts were clearing and I was a happy camper. Then, roughly 5 minutes after the restart, all hell broke loose. Every "on absence" alert fired and then cleared within 1 minute, pagers went off around the office, happiness aborted.
Digging into the code, we thought we spotted the problem: when we load the last values into the CEP from Redis, we need to do so in a particular order. Because we used multiple threads to load the data asynchronously, some of it was loaded in the proper order, but the vast majority was loaded too late. Strike one for our dev environment: it doesn't have nearly the volume of data, so everything was loaded in order by chance. We fixed the concurrency issue, tested, redeployed, and BOOM, same behavior as before.
The next failure was a result of the grouping that we do in the Esper queries. We were grouping by the check id, the name of the metric, and the target host being observed, but the preload data was missing the target field. As a result, the initial preload event was inserted fine, and new data arriving afterwards was also inserted fine, but the two were grouped differently. Our absence windows currently have a 5 minute timeout, so 5 minutes after boot all the preload data would exit its window, the window would be empty, and we would trigger an alert. Then, as the newly collected data filled its own window, we would send an all clear for that metric. At that point we would be running normally, albeit with a lot of false alerts getting cleaned up.
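To make the grouping failure concrete, here is a small Python sketch. It is not our Esper query, just a simulation under stated assumptions: the event fields (check_id, metric, target, ts) and the helper names are hypothetical, and "absence" is modeled simply as a group whose newest event is at least 5 minutes old.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5 minute absence window described above


def group_key(event):
    # Group by check id, metric name and target host.
    # A missing target becomes None, so the event lands in a different group.
    return (event["check_id"], event["metric"], event.get("target"))


def absent_groups(events, now):
    """Return the group keys whose newest event is older than the window."""
    latest = defaultdict(lambda: float("-inf"))
    for event in events:
        key = group_key(event)
        latest[key] = max(latest[key], event["ts"])
    return sorted(
        (key for key, ts in latest.items() if now - ts >= WINDOW_SECONDS),
        key=repr,
    )


# Preload event from the cache, missing the target field (hypothetical data).
preload = {"check_id": 1, "metric": "cpu", "ts": 0}
# Live events for the same metric, carrying the target.
live = [
    {"check_id": 1, "metric": "cpu", "target": "web1", "ts": t}
    for t in (60, 120, 180, 240)
]

events = [preload] + live
print(absent_groups(events, now=299))  # window still covers the preload
print(absent_groups(events, now=300))  # the preload-only group has gone quiet
```

Adding the target to the preload event puts it in the same group as the live data, so that group never goes quiet and no absence fires.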
Unfortunately, at this point the Redis servers didn't have the target information in their store, so a quick change was made to push that data into them. That rollout was a success, and a little happiness was restored since something went right. After they had enough time to populate all the check data, changes were again rolled out to the CEP to add the target to the preload. Hopes were high. We had still only rolled the changes to the first CEP machine, so that was updated again and rebooted. After 5 minutes things still looked solid, so the other systems were updated. BOOM.
The timing of this failure didn't make sense. CEP one had been running for 15 minutes by now, and there are no timers in the system that would explain this behavior. The code was reviewed and looked correct. Upon review of the log files, we saw failures and recoveries on each CEP system; however, they were being generated by different machines.
The cause was a recent scale-out of the CEP infrastructure. Each CEP is connected to RabbitMQ to receive events, and to split the processing amongst them, each binds a set of routing keys for the events it cares about. This splitting of events wasn't mimicked in the preload code, so each CEP was preloaded with all events. Since each system only receives its own share, the events it wasn't receiving would trigger an absence alert: it would see them in the preload and then never again. Since the CEP systems are decoupled, an event A on CEP one wouldn't be relayed to any other system. The CEPs that were actually receiving the data never knew they needed to send a clear event, since as far as they were concerned, everything was ok. Strike two for dev: we don't use that distributed setup there.
Once again the CEP was patched; this time the preloader was given the intelligence to construct the routing keys for each metric. At boot it would pull the list of keys it cared about from its config. Then, as it pulled the data from Redis, it would compare what each metric's key would be against its list and preload the data only if the key was present. One last time: roll changes, restart, wait, wait, longest 5 minutes in recent memory, wait some more... no boom!!!
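The preload filter can be sketched roughly as follows. This is a Python illustration, not the actual CEP code: the "metrics.<shard>" key format, NUM_SHARDS, and the function names are assumptions made for the example, since the real key construction isn't described here.

```python
NUM_SHARDS = 4  # hypothetical; the real partitioning is not described


def routing_key(metric):
    """Build the routing key a live event for this metric would carry.

    The "metrics.<shard>" format is an assumption for illustration.
    """
    return "metrics.{}".format(metric["check_id"] % NUM_SHARDS)


def select_preload(cached_metrics, bound_keys):
    """Keep only cached metrics whose routing key this CEP is bound to."""
    bound = set(bound_keys)
    return [m for m in cached_metrics if routing_key(m) in bound]


# Last known values as they might come out of the cache (hypothetical data).
cached = [
    {"check_id": 1, "metric": "cpu", "target": "web1", "value": 0.4},
    {"check_id": 2, "metric": "cpu", "target": "web2", "value": 0.7},
    {"check_id": 5, "metric": "mem", "target": "web1", "value": 0.2},
]

# This CEP's config says it only receives events routed to "metrics.1".
to_load = select_preload(cached, bound_keys=["metrics.1"])
print([m["check_id"] for m in to_load])
```

The important property is that the preloader and the live event path agree on how a key is derived, so a CEP never preloads a metric it will not hear about again.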
At this point, though, one of the initial problems I set out to solve was still an issue. The data streaming in looked good, and the CEP won't emit an all clear for no reason (a value has to be bad first), so we had a lot of false alerts hanging around and people being reminded about them. To rectify this, I went into the primary DB, cleared all the alerts with a few updates, and rebooted the notification system so it would no longer see them as an issue. This stopped the reminders and brought us back to a state of peace. And this is where we sit now.
What are the lessons learned, and how do we hope to prevent this in the future? Step 1 is, of course, always making sure dev matches production, not just in code but in data volume and topology. Outside of the CEP setup it does, so we need a few new zones brought into the mix today, and that will resolve that. Next is a better staging and rollout procedure for this system. We can bring up a new CEP in production and give it a workload but have its events not generate real errors. Going forward, we will verify changes against production traffic like this before a rollout.
Once again, sorry for the false positives. Disaster porn is a wonderful learning experience, and if any of the problems mentioned in this post hit home, I hope it gets you thinking about what changes you might need to be making. For updates on outages or general system information, remember to follow circonusops on Twitter.
2011-10-14 19:11:38 Template API
by Brian Clapper
Setting up a monitoring system can be a lot of work, especially if you are a large corporation with hundreds or thousands of hosts. Regardless of the size of your business, it still takes time to figure out what you want to monitor, how you are going to get at the data, and then to start collecting, but in the end it is very rewarding to know you have insight.
When we launched Circonus, we had an API to do nearly everything that could be done via the web UI (within reason), and we expected it to be easy for people to program against so they could get their monitoring off the ground quickly. Quite a few customers did just that, but still wanted an easier way to get started.
Today we are releasing the first version of our templating API to help you get going (templating will also be available via the web UI in the near future). With this new API you can create a service template by choosing a host and a group of check bundles as "masters." Then you simply attach new hosts to the template, and the checks are created for you and deployed on the agents. Check out the documentation for full details.
Once a check is associated with a template, it cannot be changed on its own…you must alter the master check first and then re-sync the template. To re-sync, you just need to GET the current template definition and then POST it back; the system will take care of it from there.
To remove bundles or hosts, just remove them from the JSON payload before POSTing, and choose a removal method. Likewise, to add a host or bundle back to a template, just add it into the payload and then POST. We offer a few different removal and reactivation methods to make it easy to keep or remove your data and to start collecting it again. These methods are documented in the notes section of the documentation.
Future plans for templates include syncing rules across checks and adding templated graphs so that adding a new host will automatically add the appropriate metrics to a graph. Keep an eye on our change log for enhancements.
2010-10-22 15:55:05 WebHook Notifications
by Brian Clapper
This week we added support for webhook notifications in Circonus. For those who are unsure what a webhook is, it's simply an HTTP POST with all the information about an alert you would normally get via email, XMPP or AIM.
Webhooks can be added to any contact group. Unlike other methods, you can't add one to an individual user and then add that user to a group; however, this might be supported in the future based on feedback. Simply go to your account profile, click on the field "Type to Add New Contact" on the group you would like to add the hook to, and enter the URL you would like us to contact. The contact type will then display as your URL with the method of HTTP (for brevity).
Now that your hook is set up, what will it look like when the data is posted to you? Here is a Perl Data::Dumper example, grouped by alert for readability, of the parameters posted for 2 alerts:
%post = (
    'alert_id' => [ '21190', '21191' ],
    'account_name' => 'My Account',

    'severity_21190' => '1',
    'metric_name_21190' => 'A',
    'check_name_21190' => 'My Check',
    'agent_21190' => 'Ashburn, VA, US',
    'alert_value_21190' => '91.0',
    'clear_value_21190' => '0.0',
    'alert_time_21190' => 'Thu, 21 Oct 2010 16:35:49',
    'clear_time_21190' => 'Thu, 21 Oct 2010 16:36:49',
    'alert_url_21190' => 'https://circonus.com/account/my_account/fault-detection?alert_id=21190',

    'severity_21191' => '1',
    'metric_name_21191' => 'B',
    'check_name_21191' => 'My Other Check',
    'agent_21191' => 'Ashburn, VA, US',
    'alert_value_21191' => '91.0',
    'alert_time_21191' => 'Thu, 21 Oct 2010 16:36:21',
    'alert_url_21191' => 'https://circonus.com/account/my_account/fault-detection?alert_id=21191',
);
So let's look at what we have here. The first thing to notice is that we pass multiple alert_id parameters, giving you the ID of each alert in the payload. From there, every other alert-specific parameter is suffixed with _<alert_id> so you know which alert that parameter is associated with (account_name is sent once for the whole payload). In this example 21190 is a recovery and 21191 is an alert; recoveries get the additional parameters clear_value and clear_time.
Webhooks open up all sorts of possibilities both inside and outside of Circonus. Maybe you have a crazy complicated paging schedule, or prefer a contact method that we don't natively support yet. Fair enough: let us post the data to you and you can integrate it however you like. Want to graph your alerts? We are in the process of working on a way to overlay alerts on any graph, but in the meantime, set up your webhook and feed the data back to Circonus via Resmon XML, and now you have data for your graphs.
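If you are receiving the hook in something other than Perl, regrouping the parameters by alert takes only a few lines. Here is a minimal Python sketch. It assumes the POST body arrives form-encoded (application/x-www-form-urlencoded); parse_webhook and the is_recovery flag are names made up for this example, not part of Circonus.

```python
from urllib.parse import parse_qs, urlencode


def parse_webhook(body):
    """Regroup flat webhook parameters into one dict per alert id."""
    params = parse_qs(body)
    account = params.get("account_name", [None])[0]
    alerts = {}
    for alert_id in params.get("alert_id", []):
        suffix = "_" + alert_id
        # Collect every parameter ending in _<alert_id>, dropping the suffix.
        fields = {
            name[: -len(suffix)]: values[0]
            for name, values in params.items()
            if name.endswith(suffix)
        }
        # Recoveries carry clear_value and clear_time; alerts do not.
        fields["is_recovery"] = "clear_time" in fields
        alerts[alert_id] = fields
    return account, alerts


# A trimmed-down body built from the example above.
sample = urlencode([
    ("alert_id", "21190"),
    ("alert_id", "21191"),
    ("account_name", "My Account"),
    ("metric_name_21190", "A"),
    ("clear_time_21190", "Thu, 21 Oct 2010 16:36:49"),
    ("metric_name_21191", "B"),
])

account, alerts = parse_webhook(sample)
print(account)
for alert_id, fields in alerts.items():
    state = "recovery" if fields["is_recovery"] else "alert"
    print(alert_id, fields["metric_name"], state)
```

Matching on the full "_<alert_id>" suffix, underscore included, keeps an id such as 190 from accidentally picking up the parameters of 21190.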
If you are curious about other features and would like to see an in depth post on them, please contact us at hello@circonus.com.