2011-01-19 18:27:31 Past Performance: does this look right to you?
by Theo Schlossnagle
If you are like me, you look at a lot of data. I look at data in spreadsheets, I look at data on P&L statements, I look at term sheets, I look at systems data — a lot of systems data. I find the best way to look at data is to visualize it because it is the fastest way to get data into the amazing pattern matcher that is the human brain.
The human brain is quite good at saying “this is abnormal” and can usually even articulate why. This curve has a periodicity, that one a monotonic behavior, another is simply always flat… then they “change.” When we say “this visualization looks wrong,” we are almost always onto something real in the numbers. I'll give you a simple visual example:
While there is obviously something starting at 8pm, we are only left with another question: “is it out of the ordinary?” It doesn't look like that today, and it doesn't appear to resemble the day before. What about last week? Let's start the graph one week earlier:
This tells us a lot. It looks like we had a very similar event last week at this time. With most analysis tools, you stop here (or you hover with your mouse and try to correlate start/end times and magnitude to better understand how these two events resemble each other).
With Circonus, we don't leave it here. Instead, we provide tools to help compare time-separated events using our data overlay feature. We can take our original two-day view and overlay the data from last week right on top of it (or, in this case, underneath it).
Just two clicks and we've got a one-week offset data overlay and the visualization lends a little insight into what is going on. We can see the start times are identical, but the event from this week ends about 30 minutes before the one from last week — largely the same though.
Again, we find that visuals help. Understanding how these graphs differ even when they are right on top of each other can be a bit challenging. Never fear! We've added help in the legend.
The legend takes on some new features when data overlays are in use. You now get a very clear, side-by-side read-out of the data in the graph including percentage differences. Additionally, the arrows that say “you're higher than you were last week” become more saturated (redder) as the difference in the data increases and fade to light grey as the two values become more similar. This makes it simple to quickly understand how current performance really compares to past performance. So, the interesting part of this graph is actually the subsequent spike of inbound traffic that is up 95% over last week. That's something to look into.
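To make the percentage read-out concrete, here is a minimal sketch of a week-over-week comparison. It is not how Circonus computes the legend; the function names and the sample values are made up for illustration, and it assumes the change is expressed relative to the older value.

```python
def percent_change(current, previous):
    """Change from previous to current, as a percentage of previous."""
    if previous == 0:
        raise ValueError("previous value is zero; percentage change is undefined")
    return (current - previous) * 100.0 / previous


def compare_with_offset(series, offset):
    """Pair each sample with the one `offset` samples earlier and compute the change."""
    return [
        percent_change(series[i], series[i - offset])
        for i in range(offset, len(series))
    ]


# Hypothetical samples: two readings from last week, then the same two slots this week.
samples = [100.0, 200.0, 195.0, 390.0]
changes = compare_with_offset(samples, offset=2)
for change in changes:
    print(f"{change:+.1f}% versus last week")
```

With real data the offset would be however many samples make up one week at your collection interval.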
2011-01-05 20:11:49 Capacity Planning Made Easy
by Theo Schlossnagle
Okay, so capacity planning will never be foolproof. You simply cannot predict the future. However, some of the time you have a darn good idea of what the future will hold. Since someone knows what is likely to happen, why is it so hard to plan marketing initiatives, funnels and IT provisioning?
The reason is that things aren't always linearly correlated. What does that mean? Linear correlation goes something like this: if A depends upon B and I want twice as much A, I'll need twice as much B. While correlating non-linear systems is what I like to call BFM, a lot can be done with linear regressions. The problem with any regression is that you need to put real numbers in, get real numbers out and understand how good they are.
When we look at how something grows, one of the most common tools in the statistics arsenal is a least-squares linear regression. That is: given a set of datapoints, what line best fits them? So, let's say we have a lot of datapoints (boy do we have a lot of datapoints!). Now what does a linear regression tell us?
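For readers who want to see the mechanics, here is a small sketch of an ordinary least-squares fit in plain Python. It is the textbook closed-form solution, not Circonus code, and the sample data is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("ys are constant; R^2 is undefined")
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


# Hypothetical noisy samples, one per day.
days = [0, 1, 2, 3, 4]
traffic = [1.0, 3.1, 4.9, 7.2, 8.8]
slope, intercept = least_squares_line(days, traffic)
print(f"slope={slope:.3f} intercept={intercept:.3f} "
      f"R^2={r_squared(days, traffic, slope, intercept):.4f}")
```

The slope answers "how fast are we growing?" and the R^2 value gives a rough sense of how well a straight line describes the data.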
Let's assume we're looking at some traffic data over the month of December.
In this graph, it can be very hard to answer questions about the nature of the data. Two common questions are:
- are we growing or shrinking and by how much?
- if we stay on the current growth path, where will we be some point in the future?
Enter the linear regression:
Answering the first question is pretty simple now. We can look at the value on the left side of the graph and the value on the right side and do the math. You can't see it in the screenshot, but the values at the left and right edges are 5.49M and 5.88M, which is roughly 7% growth over 4 weeks. Now, any statistician will scream bloody murder about confidences in the data and model and any engineer will simply ask: "does that make sense?" Maybe we'll look over 8 weeks and 12 weeks also to make sure that we build our confidence (this can be easier, though far less scientific, than understanding R² values, which are, of course, available as well). Honestly, I personally find that reconciling this with my expectations is one of the better methods of trusting the model.
Let's assume that we expected some increase in resource usage during this time frame and that 7% is reasonable. Now on to the next question: where will we be in the future? In Circonus, we just jump up and extend our view window out one year and we can see what our model looks like in the future:
Next December we'll be using 10.91M (this just happens to be MBits/s of network bandwidth to serve origin dynamic content on one of the sites managed over at OmniTI). We'll revisit this month by month to ensure that we are indeed heading where we expected. It allows engineers, marketers and executives alike to put real numbers into (what we call) napkin math, which adds peace and clarity and makes what-if pontification easier for most people. I can tell you one thing... we sleep better at night knowing specific numbers about a probable future.
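The same napkin math can be sketched in a few lines. This assumes the fitted line passes through the two rounded endpoint values quoted above and that "next December" means 52 weeks past the right edge of the graph; the function name is made up for illustration.

```python
def project(start_value, end_value, window_weeks, weeks_ahead):
    """Extend the straight line through two endpoint values into the future."""
    slope_per_week = (end_value - start_value) / window_weeks
    return end_value + slope_per_week * weeks_ahead


# Endpoints of the four-week regression line, in MBits/s.
projected = project(5.49, 5.88, window_weeks=4, weeks_ahead=52)
print(f"projected: {projected:.2f}M")
```

This lands within a few hundredths of the 10.91M read off the graph; the small gap is what you would expect from working with rounded endpoints rather than the full regression.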
2010-12-10 21:14:35 Enterprise Agents
by Theo Schlossnagle
If you're like me, your first response to SaaS monitoring was: "You can't see my machines/services/metrics from your cloud. That won't be too useful." With a little bit of thought, it's pretty easy to arrive at the conclusion that you must run something on your infrastructure to bridge the divide. It was a fun and exciting project here to build that magic something called the Circonus Enterprise Agent.
The Circonus Enterprise Agent (we'll call it our "EA" from here on) is all of our magic monitoring software bundled into a maintainable VMware virtual appliance that can be run on your internal networks to track stuff that the public shouldn't be seeing. We had some interesting choices to make during development and I thought I'd share what they were and why we made them.
Choosing a platform
Most of our internal infrastructure runs on some variant of OpenSolaris technology. We chose this for a variety of reasons. Most importantly, storing your precious data on ZFS seemed like the right thing to do. After that, the fault management architecture (FMA) available in OpenSolaris allows us to keep our machines and services running more reliably. Reliability and data permanence are the two most important factors in technology selection here at Circonus (a fact our customers respect).
So, with all this talk about OpenSolaris and its advantages you'd imagine we built our EA on the same technology, right? Not so simple. For a virtual appliance image that is easy to administer and easy to upgrade in the field you need a good package management system. OpenSolaris simply falls on its face there. Oracle's promises of IPS (the new and coming package management system for Solaris 11) are quite compelling, but that is just a promise today. Instead, we turned to the tried and true CentOS Linux-based platform for our EA.
CentOS provides all the features we need to run our agent software, manage package upgrades and distribution seamlessly and simply, and the core operating system is both stable and secure. In an interesting later development, we provide Joyent customers the ability to run an EA on one of their Joyent SmartMachines. Joyent's operating architecture is actually derived from OpenSolaris — so we ended up porting our EA back to our core platform as well.
Today, the EA is available in two forms: a CentOS 5 VMware-based appliance and a Joyent SmartMachine.
From where do you manage the appliance
While most appliances have a web console that allows a variety of management tasks, we made a simple choice to have the appliance administrable via the main circonus.com web application. This is where Circonus users interface with all their data and set up their monitors, so it only made sense to also administer their EA from the same place.
After using the system for a while now, I can say that I'm really pleased with this decision. The cohesiveness of scheduling checks on your private EA and/or the world-wide Circonus agents through the same check creation interface is a simple pleasure. One single world-wide view of all the agents on which you can schedule checks makes it simple to understand how the monitoring system works.
What to automate
Generally speaking, when you think appliance, you think self-maintaining. That's not an unreasonable expectation. However, this directly conflicts with our experience in operations. In operations, automatic upgrades of software are strictly taboo. Typically, the operations crew wants to schedule precisely when an upgrade will occur, be present and have a bulletproof evaluation and rollback plan. When you start talking about critical infrastructure like monitoring, "typically" becomes "always."
With this in mind, we made the upgrade process on the EA completely automated, but not automatic. One click and the appliance will self-upgrade. Currently, this is the only ongoing task that is done from the appliance itself (rather than the circonus.com portal), but we're looking to make some nice enhancements there as well. Soon, you'll be able to trigger remote EA upgrades directly from the web application.
What you get
With an EA you get to leverage the power of Circonus against all of your private data. Networks, systems, applications and business systems that are only accessible via internal infrastructure can be monitored via an Enterprise Agent. The data is fed back to the Circonus cloud in real-time. All of that data can be alerted on, and is available for correlation, trending and planning purposes through the excellent Circonus tools you already know and love.
2010-12-08 18:49:23 Finding Needles in a Worksheet
by Jason Dixon
Traditional graphing tools can help you plan for growth or even narrow down root causes after a failure. But they have a reputation for being difficult to set up, navigate or customize. It's nice to be able to just point Cacti at some switches or routers and have it gracefully poll each device for SNMP data. Yet when you need a custom perspective of the data (or collections of data), it can be an arduous experience setting up templates and graphs.
When we started to engineer Reconnoiter into a SaaS offering, one of the major driving forces was a desire to not suck like the others. Like you, we don't understand why it has to be so damn hard (or require a dedicated IT staff) to take a handful of datapoints and correlate them into graphs that make sense of the noise. I like to think we've been successful. Customers have been overwhelmingly positive about our efforts, calling it "a graph nerd's paradise". Even so, we eat our own dog food and are constantly revisiting the service to look for better ways to get our work done. This is why we're working hard on upcoming features like Graph Overlays and Timeline Annotations. And it's also why we made recent changes to the workflow for graphs and worksheets.
If you're a Circonus user, you already know how easy it is to create and view graphs. Adding them to worksheets gives you a page full of data to compare and relate. Choose a zoom preset (2 days, 2 weeks, etc) or select a date range, and all of the thumbnails are instantly redrawn in unison. It might sound basic, but it can be very useful if you're not sure what you're looking for. Unexpected patterns jump out at you pretty quickly.
However, most of the time you want to work with a single graph. Clicking on a thumbnail previously loaded a graph in "lightbox" view, hiding all other graphs from sight and letting you focus on the work at hand. This worked well most of the time, but had one big drawback... you couldn't (easily) bookmark it. So we've moved the default view into its own page, sans lightbox, that can be bookmarked and shared with others. Miss the lightbox view? No worries, we've kept that as the new preview mode. Try it out in a worksheet for "flickr-style" navigation.
Here's a short video I threw together to demonstrate some of these changes. There was some audio lag introduced by the YouTube processing, but it should be easy enough to follow along. If you'd like to see more examples like this one, shoot us an email and we'll try to keep them coming.
2010-11-03 14:23:22 Access Tokens with the Circonus API
by Jason Dixon
When we rolled out our initial API months ago, we took a first stab at getting the most useful features exposed to help customers get up to speed with the service. A handful of our users expressed displeasure with having to use their login credentials for basic access to the management API. Starting today, we're pleased to announce support for access tokens within the Circonus API.
Tokens offer fine-grained access for each user to a specific service account, at your permission role or lower. For example, if Bob is a normal user on the Acme Inc. account, he can create tokens allowing normal or read-only access. Multiple applications can use the same token, but each application has to be approved by Bob in the token management page, diabolically named My Tokens. To get started, browse over to this page inside your user profile, select your account from the drop-down and click the "plus tab" to create your first token.
The first time you try to connect with a new application using your token, the API service will hand back an HTTP/1.1 401 Authorization Required. When you visit the My Tokens page again you'll see a button to approve the new application-token request. Once this has been approved you'll be able to connect to the API with your new application-token.
Using the token is even easier. Just pass the token as X-Circonus-Auth-Token and your application name as X-Circonus-App-Name in your request headers. Here's a basic example using curl from the command-line:
$ curl -H "X-Circonus-Auth-Token: ec45e8a2-d6d9-624c-c21c-a83f573731c1" \
-H "X-Circonus-App-Name: testapp" \
https://circonus.com/api/json/list_accounts
[{
"account":"social_networks",
"account_description":"Monitoring for The Social Network.",
"account_name":"Social Networks",
"circonus_metric_limit":500,
"circonus_metrics_used":124
}]
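If you would rather script this than shell out to curl, the same request might look like the following in Python. Treat it as a sketch: the two header names and the URL come from the curl example above, while the function names and the error message are my own, and the 401 handling simply reflects the approval step described earlier.

```python
import json
import urllib.error
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://circonus.com/api/json/list_accounts"


def auth_headers(token, app_name):
    """Build the two headers the API expects for token authentication."""
    return {
        "X-Circonus-Auth-Token": token,
        "X-Circonus-App-Name": app_name,
    }


def list_accounts(token, app_name, url=API_URL):
    """Fetch the account list, or explain the likely cause of a 401."""
    request = urllib.request.Request(url, headers=auth_headers(token, app_name))
    try:
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # A new application-token pair must be approved before it works.
            raise PermissionError(
                "401 from the API: approve this application on the My Tokens page"
            ) from err
        raise
```

Calling `list_accounts("your-token", "testapp")` performs the request; nothing is sent until you call it.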
One of the more convenient features of our tokens is how well they integrate with user roles. A token will never have higher access permissions than its owner. In fact, if you lower a user's role on your account, their tokens automatically reflect this as well. Changing a "normal" user to "read-only" will drop their tokens to the same access level. But if you restore their original role, the tokens will also have their original privileges restored. Secure and convenient.
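One way to picture this rule is as a simple cap: the access a token actually gets is the lower of the role it was created with and its owner's current role. The sketch below is only an illustration of that idea, not the service's implementation, and it models just the two roles named in this post.

```python
# Roles mentioned in this post, ordered from least to most privileged.
ROLE_RANK = {"read-only": 0, "normal": 1}


def effective_role(token_role, owner_role):
    """Return the lower of the token's own role and its owner's current role."""
    return min(token_role, owner_role, key=ROLE_RANK.__getitem__)


# The token keeps its configured role; only the effective access changes.
print(effective_role("normal", "read-only"))  # owner was demoted
print(effective_role("normal", "normal"))     # owner's role restored
```

Because the token's own role is never rewritten, restoring the owner's role restores the token's original privileges without any extra step.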
If you have any questions about our new API tokens or would like to see more examples with the Circonus API, drop us a line at hello@circonus.com.
2010-11-02 19:40:02 Annotating Alerts and Recoveries
by Jason Dixon
In the last couple of posts, Brian introduced our new WebHook notifications feature and I demonstrated how Circonus can graph text metrics for Visualizing Regressions. Both of these features are interesting enough on their own, but let's not stop there. Today I have an easy demonstration showing how you can re-import your alert information to your trends. The end goal is an annotation on our graph that can be used to help identify, at a glance, which alert(s) correspond with anomalies on your graphs.
First, let's set a WebHook Notification in our Circonus account profile. Choose the contact group that it should belong to, or create a new contact group specifically for this exercise. Type the URL where you want to POST your alert details in the custom contact field and hit enter to save the new contact.
Now we need something for our webhook to act as a recipient. For this example I have a simple Perl CGI script that listens for the POST notification, parses the contents, and writes out Circonus-compatible XML. It doesn't matter which language you use, as long as you can extract the necessary information and write it back out in the correct XML format (Resmon DTD).
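As a rough idea of what such a recipient does, here is a Python sketch of the conversion step. It assumes the webhook body arrives form-encoded with the per-alert parameter names described in the WebHook Notifications post; the module name, the choice of fields, and the function name are hypothetical, and the element names follow the Resmon example in the Visualizing Regressions post.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs


def alerts_to_resmon(body, module="Site::Alerts", now=None):
    """Turn a form-encoded webhook body into Resmon-style XML text."""
    fields = parse_qs(body)
    timestamp = str(int(time.time() if now is None else now))
    root = ET.Element("ResmonResults")
    for alert_id in fields.get("alert_id", []):
        # One result per alert; the module name here is a made-up placeholder.
        result = ET.SubElement(root, "ResmonResult", module=module, service=alert_id)
        ET.SubElement(result, "last_update").text = timestamp
        for key in ("check_name", "metric_name", "severity"):
            values = fields.get(f"{key}_{alert_id}")
            if values:
                # type="s" marks the value as a string (text) metric.
                metric = ET.SubElement(result, "metric", name=key, type="s")
                metric.text = values[0]
    return ET.tostring(root, encoding="unicode")


sample_body = "alert_id=21190&check_name_21190=My+Check&severity_21190=1"
print(alerts_to_resmon(sample_body))
```

A real recipient would wrap this in whatever web framework or CGI handler you prefer and serve the XML at a URL that Circonus can poll.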
2010-10-25 22:50:48 Visualizing Regressions
by Jason Dixon
We've heard a lot of talk about Continuous Deployment strategies over the last 12-18 months. Timothy Fitz was one of the earliest proponents, publishing stories of their success over at IMVU last year. One of the greatest benefits to continually pushing your changes to production is that it takes less time and effort to find bugs when something goes wrong, since you have fewer commits in-between to navigate. But even with this style of release management, it helps to know which versions of code are running live on your components at any point. What happens when your newest code is enough to alter the normal behavior of the system, but not so drastic as to trigger an alert?
One of the nicer trending features in Circonus (or its open-source relative, Reconnoiter) is the ability to correlate unrelated datasets. I can take any collection of metrics on my account and group them together on a single graph. But what if you could view isolated events on the same graph, as an orthogonal data point? Check out these two graphs displaying some recent activity on one of our fault detection systems. The vertical lines represent the point at which a text metric's value changed. Circonus renders them this way so you can easily recognize that specific moment in time.
In the first graph I'm hovering over a dip in performance caused by the most recent release to that component (svn r6230). In the second graph we're running a fix (svn r6232) for the regression introduced in the previous commit. Could I have done the same level of correlation manually? Of course, but it's nice to be able to zoom out and study the long-term effects of our release strategy on our overall stability. This is an enormously helpful tool for performing Root Cause Analysis on our live systems, especially if you perform releases many times in a week (like we do). If you're one of many using automation and Configuration Management suites like Puppet, Chef and the Marionette Collective, no doubt you'll find it even more useful.
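The idea behind those vertical lines is easy to sketch: walk the samples of a text metric and note each moment the value differs from the one before it. This is only an illustration, not how Circonus renders them; the timestamps are invented, and the revision strings are the ones mentioned above.

```python
def change_points(samples):
    """Return (timestamp, old_value, new_value) wherever a text metric changes."""
    return [
        (timestamp, old, new)
        for (_, old), (timestamp, new) in zip(samples, samples[1:])
        if new != old
    ]


# (epoch seconds, deployed revision) pairs; the timestamps are hypothetical.
version_samples = [
    (1288040000, "6230"),
    (1288040060, "6230"),
    (1288040120, "6232"),
    (1288040180, "6232"),
]
for timestamp, old, new in change_points(version_samples):
    print(f"{timestamp}: r{old} -> r{new}")
```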
If you'd like to start trending your own text metrics, check out the Resmon DTD. Circonus can pull in your custom metrics in this format. Although the version numbers I mentioned earlier look like integers (well, they are integers), I can explicitly cast them as a string metric using the Resmon DTD. Here is what that might look like:
<ResmonResults>
<ResmonResult module="Site::CircProd" service="vers">
<last_runtime_seconds>0.000274</last_runtime_seconds>
<last_update>1288044642</last_update>
<metric name="ernie" type="s">6297</metric>
</ResmonResult>
</ResmonResults>
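To check that a document like this says what you think it says, you can read it back with any XML parser. The sketch below uses Python's standard library on the exact example above; the function name is my own. Note that the metric value comes back as the string "6297", which is the point of declaring it with type "s".

```python
import xml.etree.ElementTree as ET

RESMON_XML = """<ResmonResults>
<ResmonResult module="Site::CircProd" service="vers">
<last_runtime_seconds>0.000274</last_runtime_seconds>
<last_update>1288044642</last_update>
<metric name="ernie" type="s">6297</metric>
</ResmonResult>
</ResmonResults>"""


def read_metrics(xml_text):
    """Map (module, service, metric name) to (type, value) for every metric."""
    metrics = {}
    root = ET.fromstring(xml_text)
    for result in root.iter("ResmonResult"):
        for metric in result.iter("metric"):
            key = (result.get("module"), result.get("service"), metric.get("name"))
            metrics[key] = (metric.get("type"), metric.text)
    return metrics


print(read_metrics(RESMON_XML))
```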
As you might imagine, you can get pretty creative with the sort of data you can pull into Circonus. In our next post I plan to look at how you can combine WebHook Notifications (that Brian announced last week) with these text metrics to start trending your alert history. Stay tuned!
2010-10-22 15:55:05 WebHook Notifications
by Brian Clapper
This week we added support for webhook notifications in Circonus. For those who are unsure what a webhook is, it's simply an HTTP POST with all the information about an alert you would normally get via email, XMPP or AIM.
Webhooks can be added to any contact group. Unlike other methods, you can't add one to an individual user and then add that user to a group; this might be supported in the future based on feedback. Simply go to your account profile, click on the field "Type to Add New Contact" on the group you would like to add the hook to, and enter the URL you would like us to contact. The contact type will then display as your URL with the method of HTTP (for brevity).
Now that your hook is set up, what will it look like when the data is posted to you? Here is a Perl Data::Dumper example, grouped by alert for readability, of the parameters posted for 2 alerts:
%post = (
  'alert_id' => [ '21190', '21191' ],
  'account_name' => 'My Account',

  'severity_21190' => '1',
  'metric_name_21190' => 'A',
  'check_name_21190' => 'My Check',
  'agent_21190' => 'Ashburn, VA, US',
  'alert_value_21190' => '91.0',
  'clear_value_21190' => '0.0',
  'alert_time_21190' => 'Thu, 21 Oct 2010 16:35:49',
  'clear_time_21190' => 'Thu, 21 Oct 2010 16:36:49',
  'alert_url_21190' => 'https://circonus.com/account/my_account/fault-detection?alert_id=21190',

  'severity_21191' => '1',
  'metric_name_21191' => 'B',
  'check_name_21191' => 'My Other Check',
  'agent_21191' => 'Ashburn, VA, US',
  'alert_value_21191' => '91.0',
  'alert_time_21191' => 'Thu, 21 Oct 2010 16:36:21',
  'alert_url_21191' => 'https://circonus.com/account/my_account/fault-detection?alert_id=21191',
);
So let's look at what we have here. The first thing to notice is that we pass multiple alert_id parameters, giving you the ID of each alert in the payload. From there, every other per-alert parameter is suffixed with _<alert_id> so you know which alert that parameter is associated with. In this example 21190 is a recovery and 21191 is an alert; recoveries get the additional parameters of clear_value and clear_time.
Webhooks open up all sorts of possibilities both inside and outside of Circonus. Maybe you have a crazy complicated paging schedule, or prefer a contact method that we don't natively support yet. Fair enough: let us post the data to you and you can integrate it however you like. Want to graph your alerts? We are in the process of working on a way to overlay alerts on any graph, but in the meantime, set up your webhook and feed the data back to Circonus via Resmon XML. Now you have data for your graphs.
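On the receiving end, the suffix convention makes it straightforward to regroup the flat parameter list by alert. Here is a minimal Python sketch, assuming the POST body is ordinary form-encoded data; the function name and the is_recovery flag are my own additions.

```python
from urllib.parse import parse_qs


def group_alerts(body):
    """Split a form-encoded webhook body into one dict per alert ID."""
    fields = parse_qs(body)
    alerts = {}
    for alert_id in fields.get("alert_id", []):
        suffix = "_" + alert_id
        alert = {
            key[: -len(suffix)]: values[0]
            for key, values in fields.items()
            if key.endswith(suffix)
        }
        # Recoveries carry clear_value and clear_time; new alerts do not.
        alert["is_recovery"] = "clear_time" in alert
        alerts[alert_id] = alert
    return alerts


sample_body = (
    "alert_id=21190&alert_id=21191&account_name=My+Account"
    "&severity_21190=1&clear_value_21190=0.0&clear_time_21190=Thu"
    "&severity_21191=1&check_name_21191=My+Other+Check"
)
for alert_id, alert in group_alerts(sample_body).items():
    print(alert_id, "recovery" if alert["is_recovery"] else "alert")
```

Parameters without a suffix, such as account_name, apply to the whole payload and are left out of the per-alert dicts here.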
If you are curious about other features and would like to see an in depth post on them, please contact us at hello@circonus.com.
2010-06-29 03:41:34 Monitoring for Agile Operations
by Jason Dixon
One of the big announcements for us at Velocity 2010 last week was the formal release of our Developer site and Management API. Built as a RESTful service, the Circonus API allows users to programmatically adjust monitors and alerts as their architecture evolves. Currently it supports all basic functionality for managing Checks, Metrics, Contacts and Contact Groups, Rules and Metric Dependencies. Support for managing Graphs and Worksheets will be released in a future version.
But publishing a Web Services API is only the first part of the puzzle. You really have to cultivate the community using it, by demonstrating just how easy and powerful it really is. We're planning to publish tons of useful examples here and over at the Developer site in the days and weeks to come. You might even see examples in the form of Chef recipes or Puppet modules.
Coincidentally, the guys over at Opscode have been doing their part to help out too. Adam Jacob, the CTO of Opscode and creator of Chef, took it upon himself to extend our API and make it even easier for Ruby and Rails users. Check out his ruby-circonus project over at GitHub.
Needless to say, the disciplines of Agile Operations and Infrastructure as Code rely on the sort of programmatic elasticity that our new API makes possible. Deploying systems and services is just one small part of the solution; it's vital to track the performance of your IT systems and be able to correlate their effects on your Business systems. Automating your monitoring system to evolve in step with your architecture is a great way to avoid the human factor which will inevitably result in missing monitors and alerts.
2010-06-20 04:57:36 Good Times in Charm City
by Jason Dixon
It's been a while since I had time to enjoy the technical conference scene. Thanks to my involvement with Circonus, I have plenty of action scheduled between RailsConf, Velocity and the Surge Scalability Conference. We attended RailsConf in Baltimore a couple weeks ago and had a great time. Circonus had an exhibition booth and we gave out tons of demonstrations, free swag and t-shirts. But the best part of any con is catching up with old friends and making new ones.
I finally met Mark Imbriaco of 37signals in person. Mark has been a valued user for us, giving plenty of awesome feedback during the beta and after our production launch. If you haven't seen it already, check out Mark's interview on webpulp.tv. He offers a lot of insight into 37signals' operations and architecture. Good stuff.
Last but not least, a nice relationship blossomed out of our participation at RailsConf. I've been aware of the RPM service over at NewRelic for a while now. Although they sometimes market it as monitoring software for Rails, a more apt description would be to call it a kickass profiling tool for Ruby and Java applications. It's very useful for tracking down performance issues within your application code. But what happens when the problem isn't in your source code... or maybe you're just not sure? Fortunately for NewRelic RPM users, the solution just became very clear.