2012-04-12 21:32:52 Failing Forward While Stumbling, Eventually You Regain Your Balance

by Brian Clapper

First I want to start by saying I sincerely apologize to anyone adversely affected by yesterday's false alerts. That is something that we are very conscious of when rolling out new changes and clearly something I hope never to repeat.

How did it happen? First, a quick rundown of the systems involved. As data is streamed into the system from the brokers, it is sent over RabbitMQ to a group of Complex Event Processors (CEPs) running Esper; additionally, the last collected value for each unique metric is stored in Redis for quick lookups. The CEPs are responsible for identifying when a value has triggered an alert and then telling the notification system about it.

Yesterday we were working on a bug in the CEP system where under certain conditions, if a value went from bad to good, and we were restarting the service, it was possible we would never trigger an "all clear" event and as such your alert would never clear. After vigorously testing in our development environment, we thought we had it fixed and all our (known) corner cases tested.

So the change was deployed to one of the CEP systems to verify it in production. For the first few minutes all was well, stale alerts were clearing, I was a happy camper. Then roughly 5 minutes after the restart, all hell broke loose, every "on absence" alert fired, and then cleared within 1 minute, pagers went off around the office, happiness aborted.

Digging into the code, we thought we spotted the problem: when we loaded the last value into the CEP from Redis, we needed to do so in a particular order. Because we used multiple threads to load the data and let them do so asynchronously, some of it was being loaded in the proper order, but the vast majority was being loaded too late. Strike one for our dev environment. It doesn't have nearly the volume of data, so everything was loaded in order by chance. We fixed the concurrency issue, tested, redeployed, BOOM, same behavior as before.
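One way to keep a threaded preload in order is sketched below. This is not our actual loader; `preload_in_order`, `fetch`, and `insert` are hypothetical names, and the sketch assumes the required order is simply the order of `keys`. It relies on the fact that `Executor.map` in Python's `concurrent.futures` yields results in input order, no matter which fetch finishes first.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def preload_in_order(keys, fetch, insert, workers=8):
    """Fetch values concurrently, but insert them strictly in `keys` order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which fetch happens to complete first.
        for key, value in zip(keys, pool.map(fetch, keys)):
            insert(key, value)


# Demo with a fake, unevenly slow cache lookup (stand-in for Redis).
DELAYS = {"a": 0.03, "b": 0.0, "c": 0.01}


def slow_fetch(key):
    time.sleep(DELAYS[key])
    return key.upper()


inserted = []
preload_in_order(["a", "b", "c"], slow_fetch, lambda k, v: inserted.append((k, v)))
```

The fetches still overlap, so most of the speedup from threading is kept; only the insertion step is serialized.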

The next failure was a result of the grouping that we do in the Esper queries: we were grouping by the check id, the name of the metric, and the target host being observed. The preload data was missing the target field. As a result, the initial preload event was inserted ok, and new data coming in was also inserted just fine, but it was being grouped differently. Our absence windows currently have a 5 minute timeout, so 5 minutes after boot, all the preload data would exit the window, which would now be empty, and we triggered an alert. Then, as the newly collected data filled its window, we would send an all clear for that metric, and at this point we would be running normally, albeit with a lot of false alerts getting cleaned up.
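The grouping mismatch is easier to see in a toy model. The sketch below is not Esper and not our real queries; it is a plain Python stand-in with hypothetical field names (`check_id`, `metric`, `target`) and a hypothetical `AbsenceWindows` class. It only shows how a preload event without a `target` lands in a different group than the live data for the same metric.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5 minute absence timeout


class AbsenceWindows:
    """Toy model: one time window per (check_id, metric, target) group."""

    def __init__(self):
        self.windows = defaultdict(list)  # group key -> event timestamps

    @staticmethod
    def group_key(event):
        # A missing field becomes None, which is still a distinct key.
        return (event.get("check_id"), event.get("metric"), event.get("target"))

    def insert(self, event):
        self.windows[self.group_key(event)].append(event["ts"])

    def absent_groups(self, now):
        """Expire old events, then report groups whose window is empty."""
        absent = []
        for key, stamps in self.windows.items():
            stamps[:] = [ts for ts in stamps if now - ts < WINDOW_SECONDS]
            if not stamps:
                absent.append(key)
        return absent


cep = AbsenceWindows()
# Preload from the cache: no "target" field.
cep.insert({"check_id": 1, "metric": "cpu", "ts": 0})
# Live data for the same metric keeps arriving, with a target.
for ts in (60, 120, 180, 240, 300):
    cep.insert({"check_id": 1, "metric": "cpu", "target": "web1", "ts": ts})

false_alerts = cep.absent_groups(now=300)
```

Five minutes in, the only empty window is the one keyed with `None` for the target, even though the metric itself never stopped reporting.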

Unfortunately at this point, the Redis servers didn't have the target information in their store, so a quick change was made to push that data into them. That rollout was a success, and a little happiness was restored since something went right. After they had enough time to populate all the check data, changes were again rolled out to the CEP to add the target to the preload, and hopes were high. We still at this point had only rolled the changes to the first CEP machine, so that was updated again and rebooted; after 5 minutes things still looked solid, so the other systems were updated. BOOM.

The timing of this failure didn't make sense. CEP one had been running for 15 minutes now, and there are no timers in the system that would explain this behavior. Code was reviewed and looked correct. Upon review of the log files, we saw failures and recoveries on each CEP system; however, they were being generated by different machines.

The reason was a recent scale-out of the CEP infrastructure. Each CEP is connected to RabbitMQ to receive events; to split the processing amongst them, each binds a set of routing keys for the events it cares about. This splitting of events wasn't mimicked in the preload code: each CEP would be preloaded with all events. Since each system only cared about its share, the events it wasn't receiving would trigger an absence alert, as it would see them in the preload and then never again. Since the CEP systems are decoupled, an event A on CEP one wouldn't be relayed to any other system, so the others would not know that they needed to send a clear event since, as far as they were concerned, everything was ok. Strike two for dev: we don't use that distributed setup there.

Once again the CEP was patched; this time the preloader was given the intelligence to construct the routing keys for each metric. At boot it would pull the list of keys it cared about from its config, and then as it pulled the data from Redis, it would compare what that metric's key would be to its list; if the key was on the list, it would preload the data. One last time, roll changes, restart, wait, wait, longest 5 minutes in recent memory, wait some more... no boom!!!
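In rough terms, the patched preloader does something like the following. This is a minimal sketch and not the production code: the key scheme in `routing_key` is made up for illustration (this post does not describe the real key format), and `preload` and the record fields are hypothetical names.

```python
def routing_key(metric):
    """Hypothetical key scheme: bucket metrics by check id into 4 keys."""
    return "check.%d" % (metric["check_id"] % 4)


def preload(cached_metrics, my_keys):
    """Keep only the cached metrics whose routing key this CEP has bound."""
    wanted = set(my_keys)
    return [m for m in cached_metrics if routing_key(m) in wanted]


cached = [
    {"check_id": 1, "metric": "cpu", "target": "web1", "value": 0.4},
    {"check_id": 2, "metric": "cpu", "target": "web2", "value": 0.9},
    {"check_id": 5, "metric": "disk", "target": "db1", "value": 71},
]
cep_one_keys = ["check.0", "check.1"]  # would come from this CEP's config
loaded = preload(cached, cep_one_keys)
```

The point is that the preloader and the live event stream have to agree on the same key function; any metric a CEP preloads but never receives will eventually look absent.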

At this point though, one of the initial problems I set out to solve was still an issue. Because data streaming in looked good, the CEP won't emit an all clear for no reason, it has to be bad first, so we had a lot of false alerts hanging out and people being reminded about them. To rectify this, I went into the primary DB, cleared all the alerts with a few updates, and rebooted the notification system so it would no longer see them as an issue. This stopped the reminders and brought us back to a state of peace. And this is where we sit now.

What are the lessons learned and how do we hope to prevent this in the future? Step 1 is, of course, always making sure dev matches production; not just in code, but in data volume and topology. Outside of the CEP setup it does, so we need a few new zones brought into the mix today and that will resolve that. Next, better staging and rollout procedure for this system. We can bring up a new CEP in production, give it a workload but have its events not generate real errors, going forward we will be verifying production traffic like this before a roll out.

Once again, sorry for the false positives. Disaster porn is a wonderful learning experience, and if any of the problems mentioned in this post hit home, I hope it gets you thinking about what changes you might need to be making. For updates on outages or general system information, remember to follow circonusops on Twitter.

2012-02-21 21:52:32 Graph Annotations and Events

by Charlie Fiskeaux II

This feature has been a long time in coming: the ability to annotate your graphs! With the new annotations timeline sitting over the graph, not only can you create custom events to mark points in time, but you can also view alerts and see how they fit (or don't fit) your metric data.

Annotations Timeline

First, let's go to a graph and take a look at the annotations timeline to see how it works. When you choose a graph and view it, you will immediately see the new Annotation controls to the left side of the date tools, and the timeline itself will render in between the date tools and the graph itself. The timeline defaults to collapsed mode and by default will only show alerts from metrics on the current graph, so you may have an empty timeline at first. If you take a look at the controls, however, you will see three items: the Annotation menu, the show/hide toggle button, and the expand/collapse toggle button. The show/hide button does just what it says: it shows or hides the timeline. The expand/collapse button toggles between the space-saving collapsed timeline view and the more informative expanded timeline view.

If you open the Annotation menu, you will see a list of all the items you can possibly show in your timeline (or hide from it). Any selections you make here (as well as your show/hide and expand/collapse state changes) will be saved as site-wide user preferences in your current browser. All the items are separated into three groups:

Event Categories
This is a list of all the Event categories under the current account (these are seen and managed in the Events section of the site…we'll get to that new section in a minute). If you have uncategorized events (due to deleting a category that was still in use), they will appear grouped under the "--" pseudo-category label.
Alerts
By default, the only alerts that will be shown will be alerts of all severity (sev) levels triggered by metrics on the current graph. If you wish, you may also show all alerts, and both categories of alerts may be filtered by sev levels. To do so, click one of the alert labels to expand a sev filter row with more checkboxes.
Text Metrics
This third group is not shown by default, but is represented by the checkbox at the bottom labeled "Include text metrics." If you check this box, the page will refresh, and any text metrics on the current graph will then be rendered as a part of the timeline (and will be excluded from the graph plot and legend).

Once you have some annotations rendering on the timeline, take a look at the timeline itself. Hovering over a point will show a detail tooltip with the annotation title, date, and description, and hovering over either a point or a line segment will highlight the corresponding date range on the graph itself.

Now for the question on everyone's minds: "Can I create events here, or do I have to go to the Events section to do that?" The answer is, yes, you can create events straight from the view graph page! To do so, simply use your right mouse button to drag-select a time range on the graph itself. A dialog will then pop up for you to input your info and create the event.

Events Section

Now let's head over to the Events section where you can manage your events and event categories. Simply click on the new Events tab (below the Graphs tab) and you're there! To create an event, click the standard "+" tab at the upper left of the page. This will give you the New Event dialog. Most of the dialog inputs are pretty straightforward, with the exception of the category dropdown. This is a new hybrid "editable" dropdown input. You may select any of its options if you'd like, or you can add new ones. To add a new option, simply select the last option (it's labeled "+ ADD Category"). Your cursor will immediately be placed in a standard text input where you can enter your new category. When you're finished, hit enter to create the new option and have it selected as your category of choice.

After you have created your event, you may need to edit it later. To edit any of its details, simply click on the pertinent detail of the event (when changing the event category, you will see it also has the new hybrid "editable" dropdown input which works exactly like the one in the New Event dialog).

In addition to start and end points (which may be the same date if you don't want more than a single point), you may also add midpoints to your event. Click the Show details button for an event (the arrow button at the right end of an event row), and you will see the Midpoints list taking up the right half of the event details panel. Simply click the Add Midpoint button to get the New Midpoint dialog where you enter a title, description and choose a date for your point.

The one last element of the Events section that's good to know about is the Categories menu at the upper right of the page. This allows you to delete categories as well as filter the Events list to only show a single category of events at a time. To do this, just click the name of a category in the Categories menu.

2011-12-28 21:09:42 Insights from a Data Center Conference

by James Sivis

At the beginning of this month, I attended the Gartner Data Center Conference in Las Vegas, and wanted to share with you some of my impressions and insights from the event.

First, I have to say that I have seldom seen a group of more conscientious conference attendees (aside from Surge, of course, and a physics conference I once attended). Networking breakfasts were busy, sessions were well attended, and both lunch and topic-specific networking gatherings had lively discussions. Each of the Solution Center hours, going well into the evening, was full of people who were not only partaking of the food or giveaways but primarily and voraciously soaking up information from the various exhibitors. Even in hallways during the day, while people were sitting or standing, there was a steady exchange of opinions and information. This is what I saw throughout the conference; attendees there were very serious about learning…from the speakers, vendors, and from their peers. Relatedly, it’s interesting that many organizations outright bar their employees from attending any events in Vegas – while boondoggle may be an appropriate term for some shows in that or any other location, it certainly wasn’t the case with this conference.

Now let’s get to what frequently was foremost on the mind of attendees. I was somewhat surprised to find that this was not something that is usually on the top-ten lists of CIO/IT initiatives. Rather, what repeatedly came out first in terms of attendees’ pressing interest were the interrelated topics of avoiding IT outages and increasing speed of service recovery, along with monitoring to help with both of these.

Granted, this was a datacenter-specific conference so it’s natural that avoidance of and recovery from operational failures is of paramount importance. But note that there are lots of other overarching datacenter initiatives we all hear much more about, such as virtualization, cloud migration, datacenter consolidation. Many of these headline-grabbing topics are certainly both important, and getting done. However, what affects datacenter operations leaders’ daily lives and careers, and so is of primary importance, has not received much if any notice or press.

Why is that? It’s pretty simple. Some of these other initiatives are new. Monitoring has been around seemingly forever, plus (to an extent) outages are taken as being somewhat unavoidable. Yet, while zero failures is indeed not possible, markedly increased reliability is certainly attainable. Look at the historical telecom service provider side, where five-9’s reliability is the expected level of service. When expectations are high, and commensurate investment is made, higher levels are not at all out of reach.
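To put a number on what five-9’s means, here is a small back-of-the-envelope calculation. It assumes a 365-day year and treats availability as simple uptime percentage; `allowed_downtime_minutes` is just an illustrative helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, pct in [("three 9's", 99.9), ("four 9's", 99.99), ("five 9's", 99.999)]:
    print(f"{label}: {allowed_downtime_minutes(pct):.1f} minutes/year")
```

At five 9's the budget works out to a little over five minutes of downtime in an entire year, which is why that level takes deliberate investment rather than luck.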

As for monitoring solutions themselves, nowadays you don’t have to be limited to old-school systems. There are young companies, like Circonus, who have a fresh approach that breaks down the silos of stand-alone toolsets of the past.

Let’s take a step back now and visualize what outages look like from a datacenter ops team’s perspective, i.e. what happens when things “blow up” in a datacenter. It’s not external constituents such as clients that directly impact the datacenter for the most part. External clients touch the business units and it’s then the business units which put the heat on the datacenter leaders.

And what about SLA’s for keeping business units apprised of the benefit IT delivers to them? As I heard loud and clear in the conference, internal SLA’s are for the most part useless. Why? Because they don’t mean much to the business units—they’re only interested in “When are you going to get my service back up?!” In other words, this is a variation on, “What have you done for me lately?”

So let’s look at an option for resolution. If the problem occurs on a virtual machine, you just spin up a new instance, right? Wrong, but that’s what usually happens. When a hammer dangling off a shelf hits you on the head, do you replace it with another dangling hammer and think you’ve solved the problem? Obviously, the thing to do in a datacenter is do the work to avoid repetition of the issue—we’re talking root-cause-analysis—otherwise you’re putting out fires repeatedly…the same fires.

Now a good monitoring system is going to help, and in several ways. First, as just mentioned, it’s going to assist in identifying the underlying issue, including its location: is it in the app, the database, the server, etc.? You don’t want to do that by testing blindly; you’ll want the capability to create graphs on the fly, and you similarly want to be able to very easily and quickly do correlations of your metrics.

Okay, so that’s good for remediating a problem along with reducing the chance of it recurring, but you’ll also want to do anticipatory actions like capacity planning to forestall avoidable bottlenecks. For this you also want an easy-to-use tool so that you don’t have to muck around with spreadsheets. And you’ll want to be able to have a “play” function so that when you do things such as code-pushes, you’ll be able to see in real-time the effect of these changes. This way, if the effect of the code-push is negative, you can quickly reverse the action without impacting your internal or external clients.

The good news is that new solutions with all these functionalities are out there in the marketplace. Of course, before you buy one be sure to insist on testing the solution in a trial to see how it performs, in your current and anticipated (read: hybrid physical and virtual/Cloud) environments. This includes seeing how the solution handles your scale, both backend and from a UI perspective. Such an evaluation will require an investment in your time, but the result will be well worth it, in the increased avoidance of outages and speeding up of recovery from them.

2011-12-13 19:21:15 Monitoring your Vitals During the Critical Holiday Retail Season

by Robert Treat

As with Brick & Mortar stores, the Holiday season is a critical time for many E-Commerce sites. Like their off-line brethren, these sites also see large increases in both traffic and revenue, sometimes substantially so. Of course these changes in user behavior don't just affect E-Commerce sites; consider a social-networking site like Foursquare, where a person might normally check into 3 or 4 places a week, during the Holiday season that might double as they visit more stores and end up eating out more often while rushing between those stores. On an individual basis it doesn't sound that significant, but if a large percentage of your user base doubles their traffic, you better hope you have planned accordingly.

On the technical side, many sites will actually change their regular development process in order to handle these changes in user behavior. Starting early in November, many sites will stop rolling out new features and halt large projects that might be disruptive to the site or the underlying infrastructure. As focus shifts away from features, most often it turns back towards infrastructure and optimization. Adding new monitoring, from improved logging to new metrics and graphs, becomes critical as you seek to have a comprehensive view of your site's operations so that you can better understand the changes in traffic that are happening, and hopefully be proactive about solving problems before they turn into outages.

Profiling and optimization work also receives more attention during this time; studies continue to show correlations between page load speeds and website responsiveness to increased revenue, and being able to improve these areas is something that can typically be done without having to change the behavior of how things work. Bugfixes are also a popular target during these times as those corner cases are more likely to show up as traffic increases, especially if you tend to see new users as well as an increase in use by existing users.

This brings us to a good question: just what are you monitoring? For most shops there tend to be standard graphs that get generated for things like disk space or memory usage. These things are good to have, but they only scratch the surface. Your operations staff probably knows all kinds of metrics about the systems they need to monitor, but how about your application developers? They should know the code that runs your site inside and out, so challenge them to find key metrics in your application stack that are important for their work. Maybe that's messages delivered to a queuing system, or the time it takes to process the shipping costs module, or measuring the responsiveness of a 3rd party API like Facebook or Twitter. But don't stop there; everyone in your company should be asking themselves "what analytics could I use to make better informed decisions?" For example, do you know if your increased traffic is due to new users or existing users? If you are monitoring new user sign ups, this will start to give you some insight. If you are doing E-Commerce, you should also be tracking revenue related numbers. Those types of monitors are more business focused but they are critical to everyone at your company. So much so that at Etsy, a top 100 website commonly known as "the world's handmade marketplace", they project these types of metrics right out in public.

Ideally once you have this type of information being logged, you can collect the information for analytical reports and historical trending via graphs. You want to be able to take the data you are collecting and correlate between metrics. Given a 10% increase in new users in the past week, we've seen a 15% spike in web server traffic. If we project those numbers out, can we make it through Black Friday? Cyber Monday? Will we make it all the way to New Year's, or do we need to start provisioning new machines *NOW*? Or what happens if our business model changes, and we are required to live through a "black friday" event every day? That's the kind of challenge that social shopping site Gilt faces, with its daily turnover of inventory. It's worth saying that you won't need all of this information in real time, but ideally you'll be able to get a mix of real time, near-time (5 minutes aggregated data is common), as well as daily analytical reports. Additionally you should talk with your operations staff about which of these metrics are mission critical enough that you should be alerting on them, to make sure you have the operational and organizational focus that is appropriate.
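As a rough illustration of projecting those numbers out, the sketch below asks how many weeks of 15% weekly traffic growth it takes to reach capacity. The load and capacity figures are invented for the example, and steady compound growth is a simplifying assumption; real holiday traffic is far spikier.

```python
import math


def weeks_until_capacity(current_load, capacity, weekly_growth):
    """Weeks of compound growth before load reaches capacity."""
    if current_load >= capacity:
        return 0
    if weekly_growth <= 0:
        return None  # load never reaches capacity
    return math.ceil(math.log(capacity / current_load) / math.log(1 + weekly_growth))


# Hypothetical figures: 4,000 requests/sec today, 6,000 requests/sec of capacity.
weeks_left = weeks_until_capacity(4000, 6000, 0.15)
```

With these made-up figures the answer is 3 weeks (4000 × 1.15³ ≈ 6,084), which is the kind of number that tells you whether provisioning can wait until after the holidays.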

While nothing beats preparation, even the best laid plans need good feedback loops to be successful. Measuring, collecting, analyzing, and acting upon data as it comes into your organization is critical in today's online environments. You may not be able to predict the future, but having solid monitoring systems in place will help you to recognize problems before they become critical, and help give you a "snowball's chance" during the holiday season.

2011-11-22 18:44:38 Template Web UI

by Charlie Fiskeaux II

Back in October we released the first version of our new Templating API, allowing you to easily replicate sets of bundles across multiple hosts. Now we bring you the time-saving sweetness of Templates in the web interface as well; if you have multiple servers that you want to monitor in exactly the same way, Templates are your friend. The idea behind them is pretty simple: you choose your master host, and select one or more of its check bundles to be used as master bundles. Then when you select your target hosts, the master bundles are copied and applied to the target hosts.

Creating A Template

So let's look at how the Templating process works. Before you create a template, you first need to ensure that you have your master check bundles set up and active on your master host. Once that's the case, start by going to the new "data" section of Circonus and visiting the "Templates" tab at the left. Create a new template via the "+" tab (or the "Create A Template" button in the middle of the page if you have no templates yet). In the resulting dialog, type a name for your template and choose a master host, and when you click "OK" you will see the templates table appear with a row for your new template (as usual, click the summary row to view the expanded details of the template).

When you first create a template, it's in "draft" mode. This means that it's only saved in your browser's memory until you apply it. Nothing has been saved to the system yet, and the master bundles haven't been replicated. This allows you to lay out templates and modify them or discard them before making any changes to the system. If you wish to save changes to a draft you may do so via the "Save" button; the draft is not applied as a regular Template until you click the "Apply" button. To aid in visually scanning the list of Templates for drafts, drafts will always appear at the top of your list, and will always be green. (If at any point you wish to change your template name or master host, you may click them in the summary row to edit them in-place. Please note: when changing your master host, you may only choose among the target hosts currently saved in the Template.)

Choosing Bundles

Once you've created your draft, you need to choose your master check bundles. Under the "Check Bundles" section at the left, click "Add Bundle" to bring up the new bundle dialog. All the bundles available for your master host will be shown here in a scrollable list. This is a selectable list, so when you select a bundle, it's shown as selected in the list until you remove it from the Template. If you have a long list of bundles and are having a difficult time finding the ones you want, you may use the field above the list to filter the shown bundles by a filter string or regular expression (if you're using a regular expression, don't include the leading and trailing slashes, just use the desired RegEx syntax). After you have chosen a bundle, you may change its name by clicking on it in the list of chosen bundles. (Please Note: the reserved string "{target}" will be replaced by the current hostname/IP as the bundle is replicated across the target hosts.)
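The filtering and the "{target}" substitution can be pictured with a small sketch. This is only an approximation of the behavior described above, not the UI's actual code: the function names are hypothetical, and falling back to a substring match when the filter is not a valid regular expression is an assumption.

```python
import re


def filter_bundles(names, pattern):
    """Keep names matching the pattern (regex without slashes, or plain text)."""
    try:
        rx = re.compile(pattern, re.IGNORECASE)
    except re.error:
        # Not a valid regular expression: treat it as a plain filter string.
        return [n for n in names if pattern.lower() in n.lower()]
    return [n for n in names if rx.search(n)]


def bundle_name_for(template_name, target):
    """Replace the reserved "{target}" string with a host name or IP."""
    return template_name.replace("{target}", target)


bundles = ["HTTP {target}", "ping {target}", "PostgreSQL {target}"]
starts_with_p = filter_bundles(bundles, "^p")
replicated = [bundle_name_for(b, "10.0.0.5") for b in bundles]
```

Here `^p` is written without the leading and trailing slashes, and each replicated name ends in the target's address.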

Choosing Hosts

Choosing your target hosts works mostly the same way as choosing the master check bundles. The "Add Host" button brings up a dialog with a scrollable, selectable, filterable list of available hosts on your account, and you may choose one or more of those hosts. There is an additional feature, however, which is the "Enter a new host" field below the list. This allows you to enter new hosts (either IP addresses or domain names are acceptable) that aren't currently used on your account. When you enter a new host and hit return/enter, the new host will be subject to a DNS check to ensure that it really exists; if it passes the DNS check, it will then be added to your list of target hosts.
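A DNS check of this kind can be approximated with the standard library. The sketch below is an illustration only; it is not how the Circonus check is implemented, and `host_exists` is a hypothetical name. The `resolve` parameter exists so the lookup can be swapped out in tests.

```python
import socket


def host_exists(host, resolve=socket.getaddrinfo):
    """Return True if the host name or IP address can be resolved."""
    try:
        resolve(host, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name did not resolve; UnicodeError: malformed name.
        return False
    return True
```

An IP address literal such as `127.0.0.1` passes without any lookup, while a domain name is accepted only if the resolver can find it.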

Once you're satisfied with your bundle and host choices, clicking the "Apply" button will replicate your master bundles across each target host and will save the template in the database.

Modifying A Template

Once your Template is saved you will see several things change in the details panel. Each bundle and host will get checkboxes, and two "Action" dropdown selects will appear, one above the check bundles list and one above the target hosts list. Now that the bundles and hosts are a part of the template, if you wish to modify or remove them, you will need to check their checkboxes and choose an action from the appropriate dropdown before saving. There are four actions available:

Remove
When used on a bundle, it will delete the target bundles and remove them from the Template. When used on a host, it will delete the host's bundles and show the host as inactive in the host list.
Unbind
When used on a bundle, it will leave the target bundles in place but will break their synchronization with the template and show them as inactive in the bundle list. When used on a host, it will leave the host's bundles in place but will break their synchronization with the template and show the host as inactive in the host list.
Deactivate
When used on a bundle, it will deactivate the target bundles and show them as inactive in the bundle list. When used on a host, it will deactivate the host's bundles and show the host as inactive in the host list.
Restore
When used on a bundle, it will reactivate, rebind, or recreate target bundles as necessary, to restore them to active status and synchronization with the template. When used on a host, it will reactivate, rebind, or recreate the host's bundles as necessary, to restore them to active status and synchronization with the template.
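One way to keep the four actions straight is to think of each target bundle as being in one of four states. The toy model below is only a reading aid; the state names and the `apply_action` helper are hypothetical and say nothing about how the system stores this internally.

```python
from enum import Enum


class BundleState(Enum):
    ACTIVE = "active"            # in sync with the template
    DEACTIVATED = "deactivated"  # still bound, but not collecting
    UNBOUND = "unbound"          # left in place, no longer synchronized
    REMOVED = "removed"          # deleted from the target host


# Resulting state for each action, whatever the previous state was.
ACTIONS = {
    "remove": BundleState.REMOVED,
    "unbind": BundleState.UNBOUND,
    "deactivate": BundleState.DEACTIVATED,
    "restore": BundleState.ACTIVE,
}


def apply_action(states, selected_hosts, action):
    """Return a new host -> state mapping after applying an action."""
    new_state = ACTIONS[action]
    return {
        host: (new_state if host in selected_hosts else state)
        for host, state in states.items()
    }


states = {"web1": BundleState.ACTIVE, "web2": BundleState.ACTIVE}
states = apply_action(states, {"web2"}, "unbind")
states_after_restore = apply_action(states, {"web2"}, "restore")
```

Restore is the only action that leads back to the active state, which matches its description of reactivating, rebinding, or recreating as necessary.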

Staying In-Sync

After creating and applying a Template, you are still allowed to edit the master check bundles. If you do so, any Templates using those check bundles as master bundles will be out-of-sync. When you go to your Templates page, the out-of-sync Templates will have their sync buttons activated and the buttons will say "Re-Sync." Simply click the "Re-Sync" button to replicate the bundle changes across all the target bundles, and the Template will be in-sync again.

(Please Note: if at any point you wish to delete the template, any active bundles that are still a part of the template will be deleted from the target hosts. If you wish to keep the bundles on the target hosts but just delete the template, you will need to unbind all the bundles you wish to keep on the target hosts and then delete the template.)

2011-10-14 19:11:38 Template API

by Brian Clapper

Setting up a monitoring system can be a lot of work, especially if you are a large corporation with hundreds or thousands of hosts. Regardless of the size of your business, it still takes time to figure out what you want to monitor, how you are going to get at the data, and then to start collecting, but in the end it is very rewarding to know you have insight.

When we launched Circonus, we had an API to do nearly everything that could be done via the web UI (within reason) and expected it to make it easy for people to program against and get their monitoring off the ground quickly. Quite a few customers did just that, but still wanted an easier way to get started.

Today we are releasing the first version of our templating API to help you get going (templating will also be available via the web UI in the near future). With this new API you can create a service template by choosing a host and a group of check bundles as "masters." Then you simply attach new hosts to the template, and the checks are created for you and deployed on the agents. Check out the documentation for full details.

Once a check is associated with a template, it cannot be changed on its own…you must alter the master check first and then re-sync the template. To re-sync, you just need to GET the current template definition and then POST it back; the system will take care of it from there.

To remove bundles or hosts, just remove them from the JSON payload before POSTing, and choose a removal method. Likewise, to add a host or bundle back to a template, just add it into the payload and then POST. We offer a few different removal and reactivation methods to make it easy to keep or remove your data and to start collecting it again. These methods are documented in the notes section of the documentation.
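The GET-then-POST workflow might look roughly like the sketch below. Please treat every name in it as a placeholder: the `hosts` and `removal_method` fields, the `"unbind"` value, and the `get`/`post` callables are assumptions made for illustration, so check the documentation for the real payload format and endpoints.

```python
import copy


def resync(template_id, get, post):
    """Re-sync: fetch the current definition and post it back unchanged."""
    definition = get(template_id)
    return post(template_id, definition)


def without_hosts(definition, hosts_to_drop, removal_method):
    """Return a copy of the payload with hosts removed (hypothetical fields)."""
    updated = copy.deepcopy(definition)
    updated["hosts"] = [h for h in updated["hosts"] if h not in hosts_to_drop]
    updated["removal_method"] = removal_method
    return updated


# In-memory stand-in for the API so the sketch runs without a network.
store = {"42": {"name": "web tier", "hosts": ["web1", "web2", "web3"]}}


def fake_get(template_id):
    return copy.deepcopy(store[template_id])


def fake_post(template_id, definition):
    store[template_id] = definition
    return definition


resync("42", fake_get, fake_post)
trimmed = without_hosts(fake_get("42"), {"web3"}, "unbind")
fake_post("42", trimmed)
```

Passing the transport in as `get` and `post` keeps the payload handling separate from whatever HTTP client you prefer.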

Future plans for templates include syncing rules across checks and adding templated graphs so that adding a new host will automatically add the appropriate metrics to a graph. Keep an eye on our change log for enhancements.

2011-09-23 19:10:38 One Dashboard to Rule Them All

by Charlie Fiskeaux II

Ever dream of having a systems monitoring dashboard that was actually useful? One where you could move things around, resize them, and even choose what information you wanted to display? Large enterprise software packages may have decent dashboards, but what if you’re not a large enterprise or you don’t want to pay an arm and a leg for bloatware? Perhaps you have a good dashboard that came with a specific server or piece of hardware, but it’s narrowly-focused and inflexible. You’ve probably thought about (or even tried) creating your own dashboard, but it’s a significant undertaking that’s not for the faint-of-heart. What’s the solution? Should we just learn to live with sub-optimal monitoring tools?

Here at Circonus, we decided that this was one problem we could eliminate. Since we’ve built a SaaS offering that’s flexible enough to handle multiple different data sources, why shouldn’t we build a dashboard that’s flexible enough to display them? So we created a configurable dashboard that lets you monitor your data however you want. Do you want to show graphs side-by-side but at different sizes? Done. Want an up-to-date list of alerts beside those graphs? Easy. How about some real-time metric charts that automatically refresh? No problem. Our new configurable dashboards allow you to add all these items and more. Let’s dig in and see how these new dashboards work.

Dashboard Basics

Start by going to the standard “Dashboard” and clicking the new “My Dashboards” tab. These dashboards are truly yours; any dashboards you create are only visible to you (by default) and are segregated by account. If you want to share a custom dashboard with everyone else on an account, check that dashboard’s “share” checkbox in your list of custom dashboards.

After you have created a custom dashboard, you may set it to be your default dashboard by using the radio buttons down the left side of your custom dashboards list. If you do this, you will be greeted with your selected dashboard when you log in to Circonus. By selecting the “Standard Circonus Dashboard” as your default dashboard, you will revert to being greeted with the old dashboard you’re already used to seeing.

part of the interface for creating a new dashboard layout

To create a new custom dashboard, click the “+” tab and choose a layout. At first you will see only a couple of predefined layouts available, but after you create a dashboard, its layout will then be available to choose when creating other new dashboards.

Now a note about working with these dashboards: every action auto-saves so you never have to worry about losing changes you’ve made. However, if you haven’t given your dashboard a title, the dashboard isn’t permanently saved yet. If you forget to title your dashboard and go off to do other things, don’t worry, the dashboard you created is saved in your browser’s memory. All you have to do is visit the “My Dashboards” page and your dashboard will be listed there. With two clicks you can give your dashboard a title and save it permanently. (Please note our minimum browser requirements—Firefox 4+ or Chrome—which are especially applicable for these new custom dashboards, since we’re using some features which are not available in older browsers.)

So let's create a dashboard. Choose a layout, click “Create Dashboard,” and you will be taken to the new dashboard with the “Add A Widget” panel extended. To begin, let’s check out the title area. Notice that when you hover over the title, a dropdown menu appears. This lists your other dashboards on the current account (as well as dashboards shared by other account members) and is useful for quickly switching between dashboards.

the dashboard interface showing the dashboard controls icons

To the right of the title are some icons. The first icon opens the grid options dialog, which lets you change the dimensions of the dashboard grid, hide the grid (it’s still active and usable, though), enable or disable text scaling, and choose whether or not to auto-hide the title bar in fullscreen mode. The second icon toggles fullscreen mode on and off. Once you enter fullscreen mode a third icon will appear, and this icon toggles the “Black Dash” theme (this theme is only available in fullscreen mode). The current states of both fullscreen mode and the “Black Dash” theme are saved with your dashboard.

One other note about the dashboard interface: if you leave a dashboard sitting for more than ten or fifteen seconds and notice that parts of the interface disappear (along with the mouse cursor), don’t worry…it’s just gone to sleep! A move of the mouse will make everything visible again. (If there are any widget settings panels open, though, the sleep timer will not activate.)

Widgets

Now for the meat of it all: widgets. We currently have ten widgets which can be added to the dashboard grid to show various types of data, and we’ll be adding more widget types and contents in the future. Following is a quick rundown of the currently available widgets:

Graph
Graph widgets let you add existing graphs to your dashboard. You may choose any graph from the “My Graphs” section under your current account. Graph widgets are refreshed every few minutes to ensure they’re always up-to-date.
Beacon Map
Map widgets let you add existing Beacon maps to your dashboard. You may choose any map query from the “Beacons” page (under the “Checks” section of your current account). Map widgets are updated in real-time.
Beacon Table
Table widgets let you add existing Beacon tables to your dashboard. You may choose any table query from the “Beacons” page (under the “Checks” section of your current account). Table widgets are updated in real-time.
Chart
Chart widgets let you select multiple metrics to monitor and compare in a bar or pie chart. Chart widgets are updated in real-time.
Gauge
Gauge widgets let you monitor the current state of a single numeric metric in a graphical manner, displaying the most recent value on a bar gauge (dial gauges are coming soon). Gauge widgets are updated in real-time.
Status
Status widgets let you monitor the current state of one or more metrics, displaying the most recent value with custom formatting. This is most useful for text metrics, but it may be used for numeric metrics as well. Status widgets are updated in real-time.
HTML
HTML widgets let you embed arbitrary HTML content on your dashboard. They can be used for just about anything, from displaying a logo or graphic to using an iframe to embed more in-depth content. Everything is permissible except JavaScript. HTML widgets are refreshed every few minutes to ensure they’re always up-to-date.
List
List widgets let you add lists of graphs and worksheets to your dashboard, ordered by their last modified date. You may specify how many items to list and (optionally) a search string to limit the list. List widgets are refreshed every few minutes to ensure they’re always up-to-date.
Alerts
Alerts widgets let you monitor your checks by showing the most recent alerts on your current account. You may filter the alerts by their age (how long ago they occurred), by particular search terms, by severity levels, or other status criteria. Alerts widgets are refreshed every few minutes to ensure they’re always up-to-date.
Admin
Admin widgets let you monitor selected administrative information, including the status of all Circonus agents on your current account. Admin widgets are refreshed every few minutes to ensure they’re always up-to-date.

icons representing some of the current widget types

To add widgets to the dashboard grid, there are two methods: you may use the “drag-and-drop” method (dragging from the “Add a Widget” panel), or you may first click the target grid cell and then select the widget you want to place there. (Note: in fullscreen mode only the latter method is available.) After a widget has been added, some types of widgets will automatically activate with default settings, but most will be inactive. If the widget is inactive, click it to open the settings panel and get started. Once the widget is activated, the settings panel is available by clicking the settings icon in the upper right corner of the widget. In the lower right corner of the widget is the resize handle, so you can resize the widget as frequently as you want. And let’s not forget being able to rearrange the widgets—every widget has a transparent “title bar” at its top which you can use to drag it around. I won’t get into the details of settings for every type of widget, because they should be self-explanatory (and that would make this one super-long blog post). But suffice it to say, there are plenty of options for everyone.

We've been working hard to create a configurable dashboard that will be as flexible as Circonus itself is, and we believe we’ve hit pretty close to the mark. Here’s a sample dashboard showing the power of these new dashboards:

dashboard grid with several rectangular graph, chart, alerts and status widgets arranged in a grid

2011-08-16 23:53:35 What's in a number?

by Theo Schlossnagle

Numbers, numbers, numbers; we're all about numbers here at Circonus. We have trillions of data points which we feed into a slew of algorithms and processes to help our users identify problems with their data. But what are these numbers? It turns out that isn't an easy question to answer.

Like most monitoring systems, Circonus performs an action from which it extracts one or more "metrics." A common example is running a database query and measuring both the correctness of the result (as a boolean: good vs. bad) and the latency with which the answer was delivered. Similarly, it could load a web page, ensure that some specified content is successfully returned and measure the time it took. More concretely, when performing an HTTP transaction, it could obtain the following useful metrics: time to establish the TCP connection, time until the first byte of data is received, and time until the last byte of data is received. These measurements can reveal a variety of problems both on the surface of your architecture as well as provide indications of issues deep within.
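
To make those three timings concrete, here is a minimal sketch of one synthetic HTTP check using only Python's standard library. It is an illustration, not how Circonus implements its checks: the function name `measure_http` and the keys it returns are made up for this example, and it speaks plain HTTP/1.0 so that the server closes the connection once the response is complete.

```python
import socket
import time

def measure_http(host, port=80, path="/", timeout=10.0):
    """Run one GET and return connect, first-byte and last-byte times.

    All three values are seconds elapsed since the check started.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connected = time.monotonic()
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))

        first_byte = None
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: response done
                break
            if first_byte is None:
                first_byte = time.monotonic()
        finished = time.monotonic()

    if first_byte is None:
        raise ConnectionError("server closed the connection without a response")

    return {
        "connect": connected - start,
        "first_byte": first_byte - start,
        "last_byte": finished - start,
    }

# Example call (needs network access):
# print(measure_http("example.com"))
```

Each call produces exactly one sample per metric, which is the "single synthetic event" nature of this kind of measurement.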

While most monitoring systems (and parts of Circonus) work this way, the nature of these metrics is most interesting in what it is missing. In other words, it is vital to understand what they do not tell you. You are not observing real information; instead you are producing a single synthetic event and measuring it. The data are not real (and worse, may be far from representative). Before I dive in and talk about why these data aren't "good," I'll talk a bit about why they are "good enough" for many things.

Synthetic measurements work very well for components that can be measured in terms of quantities or rates. How many of something do you have? How quickly is it increasing or decreasing? Simple examples include disk space, I/O operations per second, the number of HTTP requests serviced, CPU usage, and memory usage. The most important factor is that these things are one-dimensional.

Data like these are both easy to visualize and critically important for things like anomaly detection and capacity planning. Being of a single dimension, understanding patterns in the data is easier for both humans and computers. However, as we start combining these data points, the world goes quickly out of focus.

For the moment, let's assume we measure total money spent on an e-commerce site (you'd be crazy not to measure this). In addition to that, we measure total transactions performed (number of sales). With these metrics, we have some clear data: total dollars and dollars/hour (by taking the derivative of the samples) and total sales and sales/hour (again by taking the derivative). These numbers are pretty clear and we can make some good judgments about what to expect from day to day. However, you might ask, "What is the average transaction size?" The answer to this question is simple: total money spent divided by total sales. Unfortunately, the average is not a useful number; just ask any statistician.

When you start looking at averages, you start losing information. We use averages to zoom out on graphs; you might notice that when you have a sudden spike (let's say in traffic) you will see a much higher spike when zoomed in than when zoomed out. Why? If you were serving between 2900 and 3300 requests per second between 7pm and 8pm except for a sudden spike of 5400 requests per second between 7:40 and 7:45, you would see that on a graph showing 5-minute averages. However, on a graph zoomed out far enough to show only 20-minute averages, you'd see a deceptively small spike of about 3700 rps at that time period (one 5-minute bucket at 5400 averaged with three at roughly 3100). As long as you can zoom in on the time series, it can be an acceptable compromise to reduce the data volume down to something consumable by a mere human being. Then the obvious question is: when does this go horribly wrong?
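
The flattening effect is easy to reproduce. The sketch below uses made-up data for that hour: twelve 5-minute averages with an assumed baseline of 3100 rps (a value picked from inside the 2900 to 3300 range) and a single 5400 rps bucket for the spike. The helper name `downsample` is just for this example.

```python
from statistics import mean

def downsample(values, bucket):
    """Average consecutive groups of `bucket` samples."""
    return [mean(values[i:i + bucket]) for i in range(0, len(values), bucket)]

# Twelve 5-minute averages covering 7:00pm to 8:00pm, in requests/second.
# The baseline of 3100 is an assumed value inside the 2900-3300 range.
five_minute = [3100] * 12
five_minute[8] = 5400  # the 7:40-7:45 spike

# Zoom out: each point now covers four 5-minute buckets (20 minutes).
twenty_minute = downsample(five_minute, 4)

print("zoomed in peak: ", max(five_minute))
print("zoomed out peak:", max(twenty_minute))
```

The zoomed-out peak lands far closer to the baseline than to 5400, because the spike is averaged with three ordinary buckets.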

Let's look at something like web page load times. If you run a synthetic transaction, always from the same location, you can track measurements in that single dimension. Things should be somewhat consistent and these numbers are useful. However, they do not tell you how fast your site is. Only your users know that. Interestingly, since your users access your web site, you can actually have them report that information back to you. In fact, this is how most web analytics systems work. The interesting part here is that you have a wide variety of data coming in representing a distribution of perceived load times. Some people load your pages quickly and others load them slowly. That's the nature of the Internet: inconsistency. The key is that they don't "trend" as a single datapoint that is the average of all.

The inconsistency in these data is interesting: it can be leveraged for improvements and advantage. Understanding (and eventually changing) the distribution of these data can radically change your business. There have been many articles written about web page load times, so in order to keep this fresh, I'll discuss database transactions. The reason I'm jumping around here is because data are just data -- this applies to every metric you can observe.

Understanding that your average database query takes 1.92ms to complete is, I'm sorry to say, useless. The problem is that you are likely running thousands or tens of thousands of queries per second and none of them are average. To illustrate this, here are three (contrived) database query latency histograms each of 39 samples.

The interesting (and perhaps deceptive) part is that all three have an average latency across all queries of 1.92ms. Quite clearly, all depict radically different situations. The truth is, when you have a lot of data (thousands to hundreds of thousands of data points), the histogram reveals the information you seek and the average hides it.
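
Here is a small sketch of the same idea with invented numbers. These are not the histograms described above: each data set has only 8 samples and the shared average is 2.0ms rather than 1.92ms, purely to keep the arithmetic easy to check. The `histogram` helper and its 1ms bucket width are assumptions for this example.

```python
from collections import Counter
from statistics import mean

def histogram(latencies_ms, width_ms=1.0):
    """Count samples per bucket; each key is the lower edge of a bucket."""
    return Counter(int(x // width_ms) * width_ms for x in latencies_ms)

# Three contrived sets of query latencies (milliseconds), all invented.
steady = [2.0] * 8                  # every query takes the same time
bimodal = [1.0] * 4 + [3.0] * 4     # half fast, half slow
outlier = [0.5] * 7 + [12.5]        # mostly fast, one very slow query

for name, data in [("steady", steady), ("bimodal", bimodal), ("outlier", outlier)]:
    print(name, "mean:", mean(data), "histogram:", dict(histogram(data)))
```

All three report the same mean, yet the bucket counts describe three very different systems; in the last one, not a single query is anywhere near the average.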

Why is this so interesting? In computing, there are a lot of things we can witness by actively measuring them; this is what the Circonus you know and love has done. We figured it was time to change the game a bit and help you visualize, in real-time, the things that happen in your business: enter BizEKG.

BizEKG allows you to analyze events (like webpage loads, database queries, customer service telephone calls, etc.). Not just some, not just a sample, but all the events. From there, you can break them apart, run statistical analysis (including histograms, of course) and understand your data. There are a handful of real-time web analytics companies out there, but answering these questions in "Circonus style" changes the game entirely. What's Circonus style?

We at Circonus believe that all data are important, not just web data. We believe that if you can't see what's happening right now, you are as good as blind. So take this real-time, multi-dimensional statistical analysis engine, feed it any data you want, and see it all in real-time.

With our snazzy new BizEKG service you can actually do what some might consider a sufficient level of black magic. You can decompose these events in real time and visualize these histograms in real time. Not only is this pretty cool... it's pretty damn enlightening. BizEKG is a new service we've launched and it deserves its own announcement; we'll get to that soon.

The above histogram shows the last 60 seconds of page load times, in milliseconds, for a subsection of a current Alexa top 1000 site. Yes, 10,000ms is 10 seconds of page load time. Even on today's Internet, loading a complex site over wireless from another country is... slow.

2011-03-23 18:45:36 A Lotta Love for Keyboard Users

by Charlie Fiskeaux II

All web users who bemoan the general lack of support for keyboard accessibility in web apps, take heart! Circonus has some great features for keyboard lovers. We know there are many web users out there for whom keyboard shortcuts are a quicker and easier way to use applications, particularly web apps. This is especially true if you use a specific app heavily, or are a full-time computer user in general.

Anywhere in Circonus, you can always see the keyboard help screen by typing “?” so you’ll have an ever-present “cheat sheet” as you learn the shortcuts. As soon as new keyboard functionality is added, the keyboard help screen will be updated immediately to reflect the new shortcuts, thanks to the magic of continuous deployment.

Jump Navigation

To jump to a particular section in Circonus, all you have to do is type the proper keyword and you will jump there immediately. For example, type the keyword “dash” (d-a-s-h) and you will jump to the current account’s dashboard. It’s that easy! Here’s a list of the current jump keywords:

  • “dash” (jump to the dashboard)
  • “alerts” (jump to the fault dashboard)
  • “rules” (jump to rules)
  • “checks” (jump to checks)
  • “metrics” (jump to metrics)
  • “trends” (jump to the trending dashboard)
  • “graphs” (jump to graphs)
  • “worksheets” (jump to worksheets)

The shortcut for opening the feedback dialog also works the same way: simply type “feedback” and the feedback dialog will open for you. Another quick shortcut is the forward-slash (/), which focuses on any search field that may be on the page.
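
For the curious, keyword jumps like these can be built with nothing more than a short buffer of recent keystrokes. The sketch below is one possible approach written in Python for illustration; it is not the actual Circonus front-end code, and the `JumpListener` class and the target names are invented for this example.

```python
# Keywords from the list above, mapped to illustrative target names.
JUMP_TARGETS = {
    "dash": "dashboard",
    "alerts": "fault dashboard",
    "rules": "rules",
    "checks": "checks",
    "metrics": "metrics",
    "trends": "trending dashboard",
    "graphs": "graphs",
    "worksheets": "worksheets",
}

class JumpListener:
    """Remember recent keystrokes and report when they spell a keyword."""

    def __init__(self, targets=JUMP_TARGETS):
        self.targets = targets
        self.longest = max(len(word) for word in targets)
        self.buffer = ""

    def key(self, char):
        """Feed one typed character; return a target name or None."""
        # Keep only as many characters as the longest keyword needs.
        self.buffer = (self.buffer + char)[-self.longest:]
        for word, target in self.targets.items():
            if self.buffer.endswith(word):
                self.buffer = ""  # start fresh after a jump
                return target
        return None

listener = JumpListener()
results = [listener.key(c) for c in "dash"]
```

Only the final keystroke of a keyword returns a target; every other keystroke returns `None`, so stray typing is simply ignored.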

Graph & Worksheet Shortcuts

Here’s where we get to the good stuff. We’ve added some great shortcuts to work with graphs and an enhanced zoom tool which is only available via keyboard shortcuts.

To start off, you can now see the legend on any thumbnail graph view (on “My Graphs,” “Trending Dashboard,” and all worksheets) not only on the large graph views as before. To do so, simply hold down the shift key, and the legend will appear for whichever graph you’re hovering over. On a worksheet, the shift key also inverts the legend hover option. So if you have enabled the new worksheet option to show legends upon graph hovering, holding down the shift key will disable the hovering legends.

Back in January we launched an enhanced graph zoom toolbar that relies on keyboard shortcuts. Normally the zoom toolbar is labeled “Past” because its buttons will set the graph zoom level to view data from the past one week, two weeks, etc. However, if you hold down either comma or period, the zoom tool will be enhanced and the label will change to “shift.” You will also see an orange bar at the end of the graph(s) which indicates the end that will be shifted (and if you hold both keys, you will get two orange bars, indicating that you can pan the entire graph date range into the past or the future). While holding one or both keys, click one of the new arrow icons that appear inside the “shift” buttons; the graph date(s) will be shifted by the specified amount in the specified direction. Not only does this work when viewing or editing graphs, it works almost everywhere there are one or more graphs, whether large or thumbnail sized.
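
The shifting behavior boils down to moving one or both ends of a date range by a fixed amount. Here is a minimal sketch of that logic, assuming a hypothetical `shift_range` helper; the two boolean flags stand in for the two keys, and nothing here reflects the real implementation.

```python
from datetime import datetime, timedelta

def shift_range(start, end, amount, move_start=False, move_end=False):
    """Shift one or both ends of a graph's date range by `amount`.

    A negative `amount` moves toward the past. The two flags stand in
    for the two held keys; moving both ends pans the whole range.
    """
    new_start = start + amount if move_start else start
    new_end = end + amount if move_end else end
    if new_start >= new_end:
        raise ValueError("shift would leave an empty date range")
    return new_start, new_end

start = datetime(2011, 3, 1)
end = datetime(2011, 3, 15)
one_week = timedelta(weeks=1)

# Both keys held: pan the whole two-week window one week into the past.
panned = shift_range(start, end, -one_week, move_start=True, move_end=True)

# One key held: only the start moves, so the window grows to three weeks.
widened = shift_range(start, end, -one_week, move_start=True)
```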

One last set of useful shortcuts applies when viewing a worksheet. Among the newly added worksheet options is the ability to resize worksheet graphs to one of three sizes. In addition to being able to do this by clicking the buttons in the worksheet options dialog, you can instantly change the size of your worksheet graphs by pressing alt+1, alt+2, or alt+3.

Being avid keyboard users ourselves, we are excited to build keyboard support into more areas of Circonus as we are able to do so. Keep watching for more keyboard info and if you have ideas for some useful shortcuts, please let us know!

2011-03-22 19:00:36 Lost In Translation

by Theo Schlossnagle

For more than ten years, OmniTI has been making large-scale critical Internet infrastructure work. It is, obviously, not black magic or voodoo. Perhaps not so obviously, it is not technical competence that leads to success here. I like to think our team has technical competence in spades as we have an impeccable track record, authored books and a laundry list of speaking engagements to justify it. However, technical competence alone would fall short of the mark, far short.

Without exception, it is expected that proper monitoring and trending are as much a part of the process as setting up networking, backups, and more recently, change management. And yet, when you ask someone to explain why monitoring and trending are vital, you'd be lucky to get a response other than "to be sure things are working." Something here is lost in translation.

Disconnected Viewpoints

Every business owner knows that watching the books is part of the job. You need to know P&L, you need to understand the outputs and costs of your various business units and you track efficiencies everywhere. All of these metrics play a part in both strategic and tactical decisions made every day. Each business unit reports these things and while in good organizations each manager knows what is important to each other manager, something is still lost in translation. Far too often, managers don't understand that what they produce, what they consume and how they work changes the game for other business units. While the word is overused and abused, every business is an ecosystem. It is obvious that a new marketing campaign will increase resource utilization on the sales teams. It should be obvious that a new marketing campaign will increase resource utilization on IT infrastructure as well.

Every systems administrator knows (or should know) that monitoring your architecture is fundamental. On the other hand, very few can explain in any detail why this is so important. "Because you lose money when systems are offline", they'll quote disparagingly. Ask how much and you might catch them at a loss. From my own experience in operations, as well as countless conversations with customers and vendors, very few individuals recognize the relationship between IT and Business. Systems people know that they have to keep systems and services running to support their business, but rarely do they understand that relationship completely.

Owners that foster a transparent and cohesive organization around key performance indicators in every business unit (even those that are cost centers) will change their organizations in two critically useful ways:

  • Efficiencies between business units. With increased transparency, staff in all positions will see the effects of their actions across the business as a whole. This produces an atmosphere of self-reinforcing efficiency.
  • Accountability to the overall business. The hokey old question: "Is what you're doing good for the company?" changes form. With increased cohesiveness, the answer to that question is a more obvious outcome to every action and no one can call it hokey, because it is always answered without being asked.

A Call To Arms

Technology is no longer underneath the products you sell and the process in which you deliver them. It is, for at least the immediate future, intertwined. Creativity on the technology side doesn't only deliver cost savings, it creates new audiences and increases interaction with your customers. You have to do more than embrace technology, you need to leverage it and let new opportunities catapult your business forward.

As intertwined as technology is, we can no longer afford to have its operational details hidden away in the bowels of the "tech ops" or "web ops" group. We need visibility and we need cohesion. Infrastructure/application engineering and other business units are now, more than ever before, on the same team marching towards success. Communication and accountability are critical to success.

Here is where I leave you and hope that you will think about the metrics you monitor in a different light. They represent something more. They are there to make the business run, increase shareholder value, and make your customers happier and more prosperous.