2010-12-10 21:14:35 Enterprise Agents

by Theo Schlossnagle

If you're like me, your first response to SaaS monitoring was: "You can't see my machines/services/metrics from your cloud. That won't be too useful." With a little bit of thought, it's pretty easy to arrive at the conclusion that you must run something on your infrastructure to bridge the divide. It was a fun and exciting project here to build that magic something called the Circonus Enterprise Agent.

The Circonus Enterprise Agent (we'll call it our "EA" from here on) is all of our magic monitoring software bundled into a maintainable VMware virtual appliance that can be run on your internal networks to track stuff that the public shouldn't be seeing. We had some interesting choices to make during development and I thought I'd share what they were and why we made them.

Choosing a platform

Most of our internal infrastructure runs on some variant of OpenSolaris technology. We chose this for a variety of reasons. Most importantly, storing your precious data on ZFS seemed like the right thing to do. After that, the fault management architecture (FMA) available in OpenSolaris allows us to keep our machines and services running more reliably. Reliability and data permanence are the two most important factors in technology selection here at Circonus (a fact our customers respect).

So, with all this talk about OpenSolaris and its advantages you'd imagine we built our EA on the same technology, right? Not so simple. For a virtual appliance image that is easy to administer and easy to upgrade in the field you need a good package management system. OpenSolaris simply falls on its face there. Oracle's promises of IPS (the new and coming package management system for Solaris 11) are quite compelling, but that is just a promise today. Instead, we turned to the tried and true CentOS Linux-based platform for our EA.

CentOS provides all the features we need to run our agent software, manage package upgrades and distribution seamlessly and simply, and the core operating system is both stable and secure. In an interesting later development, we provide Joyent customers the ability to run an EA on one of their Joyent SmartMachines. Joyent's operating architecture is actually derived from OpenSolaris — so we ended up porting our EA back to our core platform as well.

Today, the EA is available in two forms: a CentOS 5 VMware-based appliance and a Joyent SmartMachine.

From where do you manage the appliance

While most appliances have a web console that allows a variety of management tasks, we made a simple choice to have the appliance administrable via the main circonus.com web application. This is where Circonus users interface with all their data and set up their monitors, so it only made sense to also administer their EA from the same place.

After using the system for a while now, I can say that I'm really pleased with this decision. The cohesiveness of scheduling checks on either your private EA and/or the world-wide Circonus agents through the same check creation interface is a simple pleasure. One single world-wide view of all the agents on which you can schedule checks makes it simple to understand how the monitoring system works.

What to automate

Generally speaking, when you think appliance, you think self-maintaining. That's not an unreasonable expectation. However, this directly conflicts with our experience in operations. In operations, automatic upgrades of software are strictly taboo. Typically, the operations crew wants to schedule precisely when an upgrade will occur, be present and have a bulletproof evaluation and rollback plan. When you start talking about critical infrastructure like monitoring, "typically" becomes "always."

With this in mind, we made the upgrade process on the EA completely automated, but not automatic. One click and the appliance will self-upgrade. Currently, this is the only ongoing task that is done from the appliance itself (rather than the circonus.com portal), but we're looking to make some nice enhancements there as well. Soon, you'll be able to trigger remote EA upgrades directly from the web application.

What you get

With an EA you get to leverage the power of Circonus against all of your private data. Networks, systems, applications and business systems that are only accessible via internal infrastructure can be monitored via an Enterprise Agent. The data is fed back to the Circonus cloud in real-time. All of that data can be alerted on, and is available for correlation, trending and planning purposes through the excellent Circonus tools you already know and love.

2010-12-08 18:49:23 Finding Needles in a Worksheet

by Jason Dixon

Traditional graphing tools can help you plan for growth or even narrow down root causes after a failure. But they have a reputation for being difficult to set up, navigate or customize. It's nice to be able to just point Cacti at some switches or routers and have it gracefully poll each device for SNMP data. Yet when you need a custom perspective of the data (or collections of data), it can be an arduous experience setting up templates and graphs.

When we started to engineer Reconnoiter into a SaaS offering, one of the major driving forces was a desire to not suck like the others. Like you, we don't understand why it has to be so damn hard (or require a dedicated IT staff) to take a handful of datapoints and correlate them into graphs that make sense of the noise. I like to think we've been successful. Customers have been overwhelmingly positive about our efforts, calling it "a graph nerd's paradise". Even still, we eat our own dog food and are constantly revisiting the service to look for better ways to get our work done. This is why we're working hard on upcoming features like Graph Overlays and Timeline Annotations. And it's also why we made recent changes to the workflow for graphs and worksheets.

If you're a Circonus user, you already know how easy it is to create and view graphs. Adding them to worksheets gives you a page full of data to compare and relate. Choose a zoom preset (2 days, 2 weeks, etc) or select a date range, and all of the thumbnails are instantly redrawn in unison. It might sound basic, but it can be very useful if you're not sure what you're looking for. Unexpected patterns jump out at you pretty quickly.

However, most of the time you want to work with a single graph. Clicking on a thumbnail previously loaded a graph in "lightbox" view, hiding all other graphs from sight and letting you focus on the work at hand. This worked well most of the time, but had one big drawback... you couldn't (easily) bookmark it. So we've moved the default view into its own page, sans lightbox, that can be bookmarked and shared with others. Miss the lightbox view? No worries, we've kept that as the new preview mode. Try it out in a worksheet for "flickr-style" navigation.

Here's a short video I threw together to demonstrate some of these changes. There was some audio lag introduced by the YouTube processing, but it should be easy enough to follow along. If you'd like to see more examples like this one, shoot us an email and we'll try to keep them coming.

2010-11-03 14:23:22 Access Tokens with the Circonus API

by Jason Dixon

When we rolled out our initial API months ago, we took a first stab at getting the most useful features exposed to help customers get up to speed with the service. A handful of our users expressed displeasure with having to use their login credentials for basic access to the management API. Starting today, we're pleased to announce support for access tokens within the Circonus API.

Tokens offer fine-grained access for each user to a specific service account, at your permission role or lower. For example, if Bob is a normal user on the Acme Inc. account, he can create tokens allowing normal or read-only access. Multiple applications can use the same token, but each application has to be approved by Bob in the token management page, diabolically named My Tokens. To get started, browse over to this page inside your user profile, select your account from the drop-down and click the "plus tab" to create your first token.

The first time you try to connect with a new application using your token, the API service will hand back an HTTP/1.1 401 Authorization Required. When you visit the My Tokens page again you'll see a button to approve the new application-token request. Once this has been approved you'll be able to connect to the API with your new application-token.

Using the token is even easier. Just pass the token as X-Circonus-Auth-Token and your application name as X-Circonus-App-Name in your request headers. Here's a basic example using curl from the command-line:

$ curl -H "X-Circonus-Auth-Token: ec45e8a2-d6d9-624c-c21c-a83f573731c1" \
       -H "X-Circonus-App-Name: testapp" \
           https://circonus.com/api/json/list_accounts

[{
   "account":"social_networks",
   "account_description":"Monitoring for The Social Network.",
   "account_name":"Social Networks",
   "circonus_metric_limit":500,
   "circonus_metrics_used":124
}]
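For anyone scripting against the API rather than using curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the base URL and header names come from the curl example, while the function names build_request and call_api are invented for illustration, and turning the first-use 401 into a PermissionError is just one way to surface the approval step.

```python
import json
import urllib.error
import urllib.request

API_BASE = "https://circonus.com/api/json"  # base URL from the curl example


def build_request(endpoint, token, app_name):
    """Build (but do not send) a request carrying the two auth headers."""
    return urllib.request.Request(
        f"{API_BASE}/{endpoint}",
        headers={
            "X-Circonus-Auth-Token": token,
            "X-Circonus-App-Name": app_name,
        },
    )


def call_api(endpoint, token, app_name):
    """Send the request and decode the JSON response body."""
    request = build_request(endpoint, token, app_name)
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # Expected the first time a new application name uses this token.
            raise PermissionError(
                f"approve application {app_name!r} on the My Tokens page, then retry"
            ) from err
        raise
```

Calling call_api("list_accounts", token, "testapp") would then return the decoded list shown above once the application has been approved.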

One of the more convenient features with our tokens is how well they integrate with user roles. A token will never have higher access permissions than its owner. In fact, if you lower a user's role on your account, their tokens automatically reflect this as well. Changing a "normal" user to "read-only" drops their tokens to the same access level. But if you restore their original role, the tokens get their original privileges back too. Secure and convenient.
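One way to picture this behavior is to treat a token's effective access as the lower of its own role and its owner's current role. The sketch below is only a model of the rule described here, not how Circonus implements it. The ROLE_RANK table and effective_role function are hypothetical, and only the two roles named in this post are included.

```python
# Hypothetical ordering of the two roles mentioned in this post.
ROLE_RANK = {"read-only": 0, "normal": 1}


def effective_role(token_role, owner_role):
    """Return the lower-ranked of the token's role and its owner's role."""
    return min(token_role, owner_role, key=ROLE_RANK.__getitem__)


# The token keeps its own role, so restoring the owner restores the token.
token_role = "normal"
print(effective_role(token_role, "read-only"))  # owner demoted
print(effective_role(token_role, "normal"))     # owner restored
```

Because the token's own role is never rewritten, nothing needs to be repaired when the owner's role is raised again.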

If you have any questions about our new API tokens or would like to see more examples with the Circonus API, drop us a line at hello@circonus.com.

2010-11-02 19:40:02 Annotating Alerts and Recoveries

by Jason Dixon

In the last couple of posts, Brian introduced our new WebHook notifications feature and I demonstrated how Circonus can graph text metrics for Visualizing Regressions. Both of these features are interesting enough on their own, but let's not stop there. Today I have an easy demonstration showing how you can re-import your alert information to your trends. The end goal is an annotation on our graph that can be used to help identify, at a glance, which alert(s) correspond with anomalies on your graphs.

First, let's set a WebHook Notification in our Circonus account profile. Choose the contact group that it should belong to, or create a new contact group specifically for this exercise. Type the URL where you want to POST your alert details in the custom contact field and hit enter to save the new contact.

Now we need something for our webhook to act as a recipient. For this example I have a simple Perl CGI script that listens for the POST notification, parses the contents, and writes out Circonus-compatible XML. It doesn't matter which language you use, as long as you can extract the necessary information and write it back out in the correct XML format (Resmon DTD).
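As a rough illustration of the same idea in another language, here is a Python sketch of the conversion step. It rests on several assumptions: that the POST body is form-encoded with the alert_id, check_name_<id> and clear_time_<id> parameters described in the WebHook Notifications post, and that the Resmon element names match the sample in the Visualizing Regressions post. The module and service names, the function name, and the choice of one string metric per check are all made up for the example.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs


def webhook_to_resmon(body, module="Site::Alerts", service="webhook"):
    """Turn a form-encoded alert POST into a Resmon XML document."""
    started = time.perf_counter()
    params = parse_qs(body)
    root = ET.Element("ResmonResults")
    result = ET.SubElement(root, "ResmonResult", module=module, service=service)
    runtime = ET.SubElement(result, "last_runtime_seconds")
    ET.SubElement(result, "last_update").text = str(int(time.time()))
    for alert_id in params.get("alert_id", []):
        check = params.get(f"check_name_{alert_id}", ["unknown"])[0]
        # Only recoveries carry a clear_time parameter.
        state = "recovery" if f"clear_time_{alert_id}" in params else "alert"
        # A string metric whose value changes with each new alert or recovery.
        metric = ET.SubElement(result, "metric", name=check, type="s")
        metric.text = f"{state} {alert_id}"
    runtime.text = f"{time.perf_counter() - started:.6f}"
    return ET.tostring(root, encoding="unicode")
```

A real recipient would write this document somewhere Circonus can fetch it, which is the part the CGI script handles.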

Read the rest of this story...

2010-10-25 22:50:48 Visualizing Regressions

by Jason Dixon

We've heard a lot of talk about Continuous Deployment strategies over the last 12-18 months. Timothy Fitz was one of the earliest proponents, publishing stories of their success over at IMVU last year. One of the greatest benefits to continually pushing your changes to production is that it takes less time and effort to find bugs when something goes wrong, since you have fewer commits in-between to navigate. But even with this style of release management, it helps to know which versions of code are running live on your components at any point. What happens when your newest code is enough to alter the normal behavior of the system, but not so drastic as to trigger an alert?

One of the nicer trending features in Circonus (or its open-source relative, Reconnoiter) is the ability to correlate unrelated datasets. I can take any collection of metrics on my account and group them together on a single graph. But what if you could view isolated events on the same graph, as an orthogonal data point? Check out these two graphs displaying some recent activity on one of our fault detection systems. The vertical lines represent the point at which a text metric's value changed. Circonus renders them this way so you can easily recognize that specific moment in time.

In the first graph I'm hovering over a dip in performance caused by the most recent release to that component (svn r6230). In the second graph we're running a fix (svn r6232) for the regression introduced in the previous commit. Could I have done the same level of correlation manually? Of course, but it's nice to be able to zoom out and study the long-term effects of our release strategy on our overall stability. This is an enormously helpful tool for Root Cause Analysis on our live systems, especially if you perform releases many times in a week (like we do). If you're one of many using automation and Configuration Management suites like Puppet, Chef and the Marionette Collective, no doubt you'll find it even more useful.

If you'd like to start trending your own text metrics, check out the Resmon DTD. Circonus can pull in your custom metrics in this format. Although the version numbers I mentioned earlier look like integers (well, they are integers), I can explicitly cast them as a string metric using the Resmon DTD. Here is what that might look like:

<ResmonResults> 
  <ResmonResult module="Site::CircProd" service="vers"> 
    <last_runtime_seconds>0.000274</last_runtime_seconds> 
    <last_update>1288044642</last_update> 
    <metric name="ernie" type="s">6297</metric> 
  </ResmonResult> 
</ResmonResults> 
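If you would rather generate that document than write it by hand, a few lines of Python will do. This is a sketch that mirrors the sample above. The function name and its parameters are invented for the example, and the element layout is copied from the sample rather than checked against the DTD.

```python
import time
import xml.etree.ElementTree as ET


def resmon_text_metric(module, service, name, value, runtime=0.0, now=None):
    """Return a Resmon XML document holding one string-typed metric."""
    root = ET.Element("ResmonResults")
    result = ET.SubElement(root, "ResmonResult", module=module, service=service)
    ET.SubElement(result, "last_runtime_seconds").text = f"{runtime:.6f}"
    timestamp = int(now if now is not None else time.time())
    ET.SubElement(result, "last_update").text = str(timestamp)
    # type="s" marks the value as a string even when it looks like an integer.
    metric = ET.SubElement(result, "metric", name=name, type="s")
    metric.text = str(value)
    return ET.tostring(root, encoding="unicode")


print(resmon_text_metric("Site::CircProd", "vers", "ernie", 6297))
```

Run from a deploy hook, something like this would publish the new revision number each time code is pushed.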

As you might imagine, you can get pretty creative with the sort of data you can pull into Circonus. In our next post I plan to look at how you can combine WebHook Notifications (that Brian announced last week) with these text metrics to start trending your alert history. Stay tuned!

2010-10-22 15:55:05 WebHook Notifications

by Brian Clapper

This week we added support for webhook notifications in Circonus. For those who are unsure what a webhook is, it's simply an HTTP POST with all the information about an alert you would normally get via email, XMPP or AIM.

Webhooks can be added to any contact group. Unlike other methods, you can't add one to an individual user and then add that user to a group, though this might be supported in the future based on feedback. Simply go to your account profile, click on the field "Type to Add New Contact" on the group you would like to add the hook to, and enter the URL you would like us to contact. The contact type will then display as your URL with the method of HTTP (for brevity).

Now that your hook is set up, what will it look like when the data is posted to you? Here is a Perl Data::Dumper example, grouped by alert for readability, of the parameters posted for 2 alerts:

%post = (
   'alert_id' => [
   '21190',
   '21191'
   ],
   'account_name' => 'My Account',
   'severity_21190' => '1',
   'metric_name_21190' => 'A',
   'check_name_21190' => 'My Check',
   'agent_21190' => 'Ashburn, VA, US',
   'alert_value_21190' => '91.0',
   'clear_value_21190' => '0.0',
   'alert_time_21190' => 'Thu, 21 Oct 2010 16:35:49',
   'clear_time_21190' => 'Thu, 21 Oct 2010 16:36:49',
   'alert_url_21190' =>
   'https://circonus.com/account/my_account/fault-detection?alert_id=21190',
   'severity_21191' => '1',
   'metric_name_21191' => 'B',
   'check_name_21191' => 'My Other Check',
   'agent_21191' => 'Ashburn, VA, US',
   'alert_value_21191' => '91.0',
   'alert_time_21191' => 'Thu, 21 Oct 2010 16:36:21',
   'alert_url_21191' =>
   'https://circonus.com/account/my_account/fault-detection?alert_id=21191',
);

So let's look at what we have here. The first thing to notice is that we pass multiple alert_id parameters, giving you the ID of each alert in the payload. From there, every other parameter is suffixed with _<alert_id> so you know which alert that parameter is associated with. In this example 21190 is a recovery and 21191 is an alert; recoveries get the additional parameters clear_value and clear_time.
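To make the suffix convention concrete, here is a small Python sketch that regroups the flat parameters into one dictionary per alert. It assumes the POST body arrives form-encoded, and the function name and the is_recovery flag are invented for the example.

```python
from urllib.parse import parse_qs


def group_alerts(body):
    """Split a form-encoded webhook body into one dict per alert ID."""
    params = parse_qs(body)
    alerts = {}
    for alert_id in params.get("alert_id", []):
        suffix = f"_{alert_id}"
        # Strip the _<alert_id> suffix to recover the plain field name.
        fields = {
            key[: -len(suffix)]: values[0]
            for key, values in params.items()
            if key.endswith(suffix)
        }
        # Only recoveries carry clear_value and clear_time.
        fields["is_recovery"] = "clear_time" in fields
        alerts[alert_id] = fields
    return alerts
```

Unsuffixed parameters such as account_name apply to the whole payload, so they are left out of the per-alert dictionaries and can be read from the parsed body directly.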

Webhooks open up all sorts of possibilities both inside and outside of Circonus. Maybe you have a crazy complicated paging schedule, or prefer a contact method that we don't natively support yet. Fair enough: let us post the data to you and you can integrate it however you like. Want to graph your alerts? We are in the process of working on a way to overlay alerts on any graphs, but in the meantime, set up your webhook and feed the data back to Circonus via Resmon XML, and now you have data for your graphs.

If you are curious about other features and would like to see an in depth post on them, please contact us at hello@circonus.com.

2010-06-29 03:41:34 Monitoring for Agile Operations

by Jason Dixon

One of the big announcements for us at Velocity 2010 last week was the formal release of our Developer site and Management API. Designed as a RESTful service, the Circonus API allows users to programmatically adjust monitors and alerts as their architecture evolves. Currently it supports all basic functionality for managing Checks, Metrics, Contacts and Contact Groups, Rules and Metric Dependencies. Support for managing Graphs and Worksheets will be released in a future version.

But publishing a Web Services API is only the first part of the puzzle. You really have to cultivate the community using it, by demonstrating just how easy and powerful it really is. We're planning to publish tons of useful examples here and over at the Developer site in the days and weeks to come. You might even see examples in the form of Chef recipes or Puppet modules.

Coincidentally, the guys over at Opscode have been doing their part to help out too. Adam Jacob, the CTO of Opscode and creator of Chef, took it upon himself to extend our API and make it even easier for Ruby and Rails users. Check out his ruby-circonus project over at GitHub.

Needless to say, the disciplines of Agile Operations and Infrastructure as Code rely on the sort of programmatic elasticity that our new API makes possible. Deploying systems and services is just one small part of the solution; it's vital to track the performance of your IT systems and be able to correlate their effects on your Business systems. Automating your monitoring system to evolve in step with your architecture is a great way to avoid the human factor which will inevitably result in missing monitors and alerts.

2010-06-20 04:57:36 Good Times in Charm City

by Jason Dixon

It's been a while since I had time to enjoy the technical conference scene. Thanks to my involvement with Circonus, I have plenty of action scheduled between RailsConf, Velocity and the Surge Scalability Conference. We attended RailsConf in Baltimore a couple weeks ago and had a great time. Circonus had an exhibition booth and we gave out tons of demonstrations, free swag and t-shirts. But the best part of any con is catching up with old friends and making new ones.

I finally met Mark Imbriaco of 37signals in person. Mark has been a valued user for us, giving plenty of awesome feedback during the beta and after our production launch. If you haven't seen it already, check out Mark's interview on webpulp.tv. He offers a lot of insight into 37signals' operations and architecture. Good stuff.

Last but not least, a nice relationship blossomed out of our participation at RailsConf. I've been aware of the RPM service over at NewRelic for a while now. Although they sometimes market it as monitoring software for Rails, a more apt description would be to call it a kickass profiling tool for Ruby and Java applications. It's very useful for tracking down performance issues within your application code. But what happens when the problem isn't in your source code... or maybe you're just not sure? Fortunately for NewRelic RPM users, the solution just became very clear.

Read the rest of this story...

2010-06-03 16:02:52 Circonus at Velocity 2010

by Jason Dixon

Hot on the heels of our RailsConf ticket giveaway, we have another contest for a free pass to Velocity 2010! I'm really excited to attend this year's Velocity. It's the Web Performance event to attend, and a great place to see the sharpest whips in the industry.

Like before, the rules of this giveaway are simple. Just tweet a message about Circonus being at Velocity and ask your friends to retweet it. The original "twitterer" with the most retweets by Friday, June 14 at noon (12pm EDT) wins. Here's an example:

The @Circonus stuff is hot and it looks like they'll be at #velocityconf this year: http://l42.org/dA

That's an easy way to earn a free 2010 Velocity sessions pass ($1295 value). Feel free to get creative with your tweet message. Our only requirements are that it's a positive message that mentions @Circonus and #velocityconf, and that it includes the http://l42.org/dA link.

Yay, free stuff!

2010-05-27 21:51:12 Circonus at RailsConf 2010

by Jason Dixon

We're anxious to meet and greet everyone at RailsConf next month in Baltimore. This will be our first conference appearance since the production launch. Some of our customers, including 37signals, will be visiting Charm City for this big event. I'm excited to see so many talented Web developers and operations folk in one conference. Having it in our hometown is icing on the cake.

As if that wasn't enough, we have a couple of fun things to announce. First, Circonus will be giving away a free RailsConf sessions pass! All you have to do is tweet a message about Circonus at RailsConf to your friends and ask them to retweet it for you. The individual with the most retweets by noon (12pm EDT) on Monday, May 31, 2010 wins. Here's an example tweet:

The @Circonus stuff is hot and it looks like they'll be at #railsconf this year: http://l42.org/bw

If you're keeping score at home, that's a free 2010 RailsConf sessions pass ($795 value) for the price of a few clicks. Feel free to get creative with your tweet message. Our only requirements are that it's a positive message that mentions @Circonus and #railsconf, and that it includes the http://l42.org/bw link.

Why are you still reading this? Go off and start tweeting for your free RailsConf pass (Conference Sessions Only).

See you in Baltimore!

2010-05-10 14:37:26 Your Visitors Don't Matter

by Jason Dixon

Consider me old-fashioned, but I remember a time when an alert notification meant something. Drives failed, servers ran short on memory, or a cage monkey pulled the wrong cable at 3 A.M. Regardless of the circumstance, it demanded attention. Those were the days.

Today, operations is all about doing more with less. No more dedicated hardware or late-night maintenance windows. Everything is virtual, cloud-based, or filling up squares in the grid. Automation reigns supreme, limitless scalability at our disposal. Abstraction at its finest.

But woe unto you, the flapping anomaly.

That visitor who tried to load your website was turned away, timed out and left to wither. Poor Jane wanted to view your site. She needed to view your site. She'd already submitted her order, only to be ignored. Forgotten. Disconnected with nary a trace to route nor a cookie to favor.

Jane was a victim of a numbers game. Someone, somewhere, decided that some problems don't matter. Which ones? Who cares? They don't matter. And because she happened to visit when this problem reared its head, you ignored her request. Who would ever make such a silly presumption that one failure is less important than another? What criteria are used to determine the worthiness of this alert or that one? Pure random circumstance, it would appear.

Many "uptime" services and monitoring suites promote the concept of selective or flapping failures. Vendors sell these features as a convenience, ostensibly as a sleep aid. The administrator's snooze-bar. I can't think of any other reason that ignoring a faulty condition would be considered a good thing. Perhaps they reason that only the check is affected. If it responds after the third attempt, it was probably ok for visitors all along. Right?

It's disappointing how many vendors embrace this broken methodology. It probably seemed innocent at a glance. But the damage has been done; recklessness has taken root. We've been conditioned to accept these transient malfunctions as mere operational speed bumps. Rather than address the problem, we nudge the threshold a tad higher. Throw additional nodes into the cluster. Increase capacity, while decreasing exposure.

But there is a more responsible alternative. What ever happened to purposeful, iterative corrections and Root Cause Analysis? Notifications may be annoying at times, but they serve a crucial function in a healthy production architecture. Ignored alerts lead to stagnant bugs, lost traffic and missed opportunities. Stop treating your visitors like they don't matter. There's no such thing as a flapping customer.

2010-04-19 14:51:05 Disrupting the Status Quo

by Jason Dixon

As a hobbyist programmer and full-time operations geek, I've been involved in my share of odd software projects. More often than not I've had to explain the purpose of the thing, answering numerous questions about the why, what or whowuzzit. I can say without any reservation that Circonus is that rare venture that breaks through the trappings of application design and me-too engineering principles to become something truly revolutionary. To use the product is to highlight Circonus' strengths. User reactions tell the story.

Bryan Allen, chief server wrangler over at Pobox, has been one of our earliest and most active Beta participants. These folks have been doing email services for longer than I've been using email. In a field this competitive, there is zero room for slack, and they know it. Bryan is a very sharp guy, so we were very pleased to read his thoughts on Circonus.

Monitoring, trending and fault analysis are tedious. So much so, most shops get them wrong, or don't bother at all. Circonus is already poised to be a disruptive player; making the tedious easy, fast and accurate.

I was grateful to meet Bryan in person during my visit to Philly for PostgreSQL Conference, U.S. 2010. I've learned that Pobox and OmniTI share a number of common technical interests and philosophies, so it should come as no large surprise that they'd see some value in our efforts.

On the other end of the spectrum, you have the team at 37signals. They are an established leader in web design and SaaS solutions. Their specific forte is with simple (yet powerful) productivity services like Basecamp, Backpack, Campfire and Highrise. Heck, they created Ruby on Rails. If anyone knows good web applications, you better believe they do. We were fortunate to have Mark Imbriaco, Operations Manager for 37signals, run Circonus through the paces during our Beta program.

Circonus' trending functions are incredibly powerful. The ability to consolidate metrics across a variety of services into a single graph makes it much easier to spot bottlenecks in one area that may correlate to performance problems in another. It's a graph nerd's paradise!

I'll have to take Mark's word on the last part. Many geeks' idea of paradise lies somewhere on a beach with a frosty beverage and a strong wireless signal. But if you're like Mark, and you need something to monitor your systems, you probably owe it to yourself to add Circonus to your shopping list.

There's one word that I've heard repeated a few times from users: disruptive. Occasionally you'll hear the word bandied about to describe a new social media outlet or computing device. It's usually associated with a revolutionary technology. There's nothing new about monitoring, trending or fault detection. But there is something refreshingly insightful about the synergy of monitoring services on a single unified metric collection.

Enjoy the Revolution.

2010-03-06 22:48:37 Introducing Circonus

by Jason Dixon

Great ideas always begin with a catalyst. They can ignite in a flash of brilliance, or grow slowly like an ember hidden in the ashes of failure. Inspiration comes from different places, and is only ever cultivated into success with the right combination of talent, timing and fortitude.

And sometimes it just happens because you get fed up with inferior products.

The beginnings of Circonus land somewhere in-between. Created by the engineers at OmniTI, we've been dealing with the pains of performance monitoring and trending in highly scalable environments for years. We've tried various combinations of Open Source and COTS software packages, all of which left us with a sour taste and wanting for more.

Over the last couple of years, our team of highly skilled engineers, led by OmniTI's own Theo Schlossnagle, has been crafting and refining a truly convergent monitoring platform. Circonus started off as the Reconnoiter project, attempting to address the disconnect between existing monitoring and trending solutions.

Circonus is currently in a closed beta, receiving valuable feedback from customers and partners. We expect to launch publicly in April 2010. In the meantime, we'll use this blog as an outlet to discuss the upcoming release and divulge all the cool stuff in the pipeline. I hope you visit here often to find out what we're working on.

Jason Dixon
Product Manager
Circonus