Showing: articles tagged with "Theo Schlossnagle"
2011-08-16 23:53:35 What's in a number?
by Theo Schlossnagle
Numbers, numbers, numbers; we're all about numbers here at Circonus. We have trillions of data points which we feed into a slew of algorithms and processes to help our users identify problems with their data. But what are these numbers? It turns out that isn't an easy question to answer.
Like most monitoring systems, Circonus performs an action from which it extracts one or more "metrics." A common example is running a database query and measuring both the correctness of the result (as a boolean: good vs. bad) and the latency with which the answer was delivered. Similarly, it could load a web page, ensure that some specified content is successfully returned and measure the time it took. More concretely, when performing an HTTP transaction, it could obtain the following useful metrics: time to establish the TCP connection, time until the first byte of data is received, and time until the last byte of data is received. These measurements can reveal a variety of problems on the surface of your architecture and provide indications of issues deep within.
While most monitoring systems (and parts of Circonus) work this way, the nature of these metrics is most interesting in what it is missing. In other words, it is vital to understand what they do not tell you. You are not observing real information; instead you are producing a single synthetic event and measuring it. The data are not real (and worse, may be far from representative). Before I dive in and talk about why these data aren't "good," I'll talk a bit about why they are "good enough" for many things.
Synthetic measurements work very well for components that can be measured in terms of quantities or rates. How many of something do you have? How quickly is it increasing or decreasing? Simple examples are disk space, I/O operations per second, the number of HTTP requests serviced, CPU usage, memory usage, etc. The most important factor is that these things are one-dimensional.
Data like these are both easy to visualize and critically important for things like anomaly detection and capacity planning. Being of a single dimension, understanding patterns in the data is easier for both humans and computers. However, as we start combining these data points, the world goes quickly out of focus.
For the moment, let's assume we measure total money spent on an e-commerce site (you'd be crazy not to measure this). In addition to that, we measure total transactions performed (number of sales). With these metrics, we have some clear data: total dollars and dollars/hour (by taking the derivative of the samples) and total sales and sales/hour (again by taking the derivative). These numbers are pretty clear and we can make some good judgments about what to expect from day to day. However, you might ask, "What is the average transaction size?" The answer to this question is simple: total money spent divided by total sales. Unfortunately, the average is not a useful number; just ask any statistician.
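To make the derivation step concrete, here is a minimal sketch in Python. The sample values and the names `samples` and `rates` are invented for illustration; they are not Circonus data or a Circonus API.

```python
# Hypothetical cumulative counters sampled once an hour:
# (hour, total dollars so far, total sales so far). Values are invented.
samples = [
    (0, 1000.0, 40),
    (1, 1600.0, 60),
    (2, 2500.0, 100),
]

def rates(points):
    """Turn cumulative totals into per-hour rates by differencing neighbors."""
    out = []
    for (t0, d0, s0), (t1, d1, s1) in zip(points, points[1:]):
        hours = t1 - t0
        out.append(((d1 - d0) / hours, (s1 - s0) / hours))
    return out

hourly = rates(samples)  # [(dollars/hour, sales/hour), ...]

# The "simple" average: total money spent divided by total sales.
_, total_dollars, total_sales = samples[-1]
average_transaction = total_dollars / total_sales
```

Note that the single `average_transaction` figure says nothing about how individual sales are spread around it, which is the point of the paragraphs that follow.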
When you start looking at averages, you start losing information. We use averages to zoom out on graphs; you might notice that when you have a sudden spike (let's say in traffic) you will see a much higher spike when zoomed in than when zoomed out. Why? If you were serving between 2900 and 3300 requests per second between 7pm and 8pm except for a sudden spike of 5400 requests per second between 7:40 and 7:45, you would see that spike on a graph showing 5-minute averages. However, on a graph zoomed out far enough to show only 20-minute averages, you'd see a deceptively small spike of roughly 3500 to 3800 rps at that time period, because the one spiked window is averaged with three ordinary ones. As long as you can zoom in on the time series, it can be an acceptable compromise to reduce the data volume down to something consumable by a mere human being. Then the obvious question is: when does this go horribly wrong?
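The zoom-out effect is easy to reproduce. The sketch below assumes a flat baseline of 3100 rps (an arbitrary value inside the stated range) and a hypothetical `downsample` helper; it is an illustration, not how Circonus rolls up data.

```python
from statistics import fmean

# Twelve 5-minute averages (requests/second) covering 7:00pm to 8:00pm.
# The 3100 baseline is an assumed value inside the 2900-3300 range;
# index 8 is the 7:40-7:45 window holding the spike.
five_minute = [3100] * 12
five_minute[8] = 5400

def downsample(values, factor):
    """Average each run of `factor` consecutive values into one bucket."""
    return [fmean(values[i:i + factor]) for i in range(0, len(values), factor)]

twenty_minute = downsample(five_minute, 4)  # three 20-minute averages
peak_fine = max(five_minute)
peak_coarse = max(twenty_minute)
```

With this assumed baseline the 20-minute bucket containing the spike comes out in the mid-3000s, far below the 5400 peak visible at 5-minute resolution.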
Let's look at something like web page load times. If you run a synthetic transaction, always from the same location, you can track measurements in that single dimension. Things should be somewhat consistent and these numbers are useful. However, they do not tell you how fast your site is. Only your users know that. Interestingly, since your users access your web site, you can actually have them report that information back to you. In fact, this is how most web analytics systems work. The interesting part here is that you have a wide variety of data coming in representing a distribution of perceived load times. Some people load your pages quickly and others load them slowly. That's the nature of the Internet: inconsistency. The key is that they don't "trend" as a single datapoint that is the average of all.
The inconsistency in these data is interesting: it can be leveraged for improvements and advantage. Understanding (and eventually changing) the distribution of these data can radically change your business. There have been many articles written about web page load times, so in order to keep this fresh, I'll discuss database transactions. The reason I'm jumping around here is that data are just data -- this applies to every metric you can observe.
Understanding that your average database query takes 1.92ms to complete is, I'm sorry to say, useless. The problem is that you are likely running thousands or tens of thousands of queries per second and none of them are average. To illustrate this, here are three (contrived) database query latency histograms each of 39 samples.
The interesting (and perhaps deceptive) part is that all three have an average latency across all queries of 1.92ms. Quite clearly, all depict radically different situations. The truth is, when you have a lot of data (thousands to hundreds of thousands of data points), the histogram reveals the information you seek and the average hides it.
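The original figures are not reproduced here, so the following Python sketch builds three stand-in data sets. The latencies are invented (they are not the data behind the author's histograms) and were chosen only so that each set has 39 samples and a mean of 1.92ms; `histogram` is a hypothetical helper.

```python
from collections import Counter
from statistics import fmean

# Three invented sets of 39 query latencies (ms); each sums to 74.88,
# so each has a mean of 1.92 ms.
steady = [1.92] * 39                 # every query takes the same time
bimodal = [0.96] * 26 + [3.84] * 13  # a fast group and a slow group
long_tail = [1.0] * 38 + [36.88]     # one pathological outlier

def histogram(latencies, bucket_ms=1.0):
    """Count samples per bucket; keys are the lower edge of each bucket."""
    return Counter(int(x // bucket_ms) * bucket_ms for x in latencies)

for name, data in [("steady", steady), ("bimodal", bimodal), ("long_tail", long_tail)]:
    print(name, round(fmean(data), 2), sorted(histogram(data).items()))
```

The printed mean is identical for all three sets, while the bucket counts differ completely: one bucket, two separated buckets, and a tight cluster with a single sample far out in the tail.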
Why is this so interesting? In computing, there are a lot of things we can witness by actively measuring them; this is what the Circonus you know and love has done. We figured it was time to change the game a bit and help you visualize, in real-time, the things that happen in your business: enter BizEKG.
BizEKG allows you to analyze events (like webpage loads, database queries, customer service telephone calls, etc.). Not just some, not just a sample, but all the events. From there, you can break them apart, run statistical analysis (including histograms, of course) and understand your data. There are a handful of real-time web analytics companies out there, but answering these questions in "Circonus style" changes the game entirely. What's Circonus style?
We at Circonus believe that all data are important, not just web data. We believe that if you can't see what's happening right now, you are as good as blind. So take this real-time, multi-dimensional statistical analysis engine, feed it any data you want, and see it all in real-time.
With our snazzy new BizEKG service you can actually do what some might consider a sufficient level of black magic. You can decompose these events in real-time and visualize these histograms in real-time. Not only is this pretty cool... it's pretty damn enlightening. BizEKG is a new service we've launched and deserves its own announcement; we'll get to that soon.
The above histogram shows the last 60 seconds of page load times, in milliseconds, for a subsection of a current Alexa top 1000 site. Yes, 10,000ms is 10 seconds of page load time. Even on today's Internet, loading a complex site over wireless from another country is... slow.
2011-03-22 19:00:36 Lost In Translation
by Theo Schlossnagle
For more than ten years, OmniTI has been making large-scale critical Internet infrastructure work. It is, obviously, not black magic or voodoo. Perhaps not so obviously, it is not technical competence alone that leads to success here. I like to think our team has technical competence in spades: we have an impeccable track record, authored books and a laundry list of speaking engagements to justify it. However, technical competence alone would fall short of the mark, far short.
Without exception, it is expected that proper monitoring and trending are as much a part of the process as setting up networking, backups, and more recently, change management. And yet, when you ask someone to explain why monitoring and trending are vital, you'd be lucky to get a response other than "to be sure things are working". Something here is lost in translation.
Disconnected Viewpoints
Every business owner knows that watching the books is part of the job. You need to know P&L, you need to understand the outputs and costs of your various business units and you track efficiencies everywhere. All of these metrics play a part in both strategic and tactical decisions made every day. Each business unit reports these things and while in good organizations each manager knows what is important to each other manager, something is still lost in translation. Far too often, managers don't understand that what they produce, what they consume and how they work changes the game for other business units. While the word is overused and abused, every business is an ecosystem. It is obvious that a new marketing campaign will increase resource utilization on the sales teams. It should be obvious that a new marketing campaign will increase resource utilization on IT infrastructure as well.
Every systems administrator knows (or should know) that monitoring your architecture is fundamental. On the other hand, very few can explain in any detail why this is so important. "Because you lose money when systems are offline", they'll quote disparagingly. Ask how much and you might catch them at a loss. From my own experience in operations, as well as countless conversations with customers and vendors, very few individuals recognize the relationship between IT and Business. Systems people know that they have to keep systems and services running to support their business, but rarely do they understand that relationship completely.
Owners that foster a transparent and cohesive organization around key performance indicators in every business unit (even those that are cost centers) will change their organizations in two critically useful ways:
- Efficiencies between business units. With increased transparency, staff in all positions will see the effects of their actions across the business as a whole. This produces an atmosphere of self-reinforcing efficiency.
- Accountability to the overall business. The hokey old question: "Is what you're doing good for the company?" changes form. With increased cohesiveness, the answer to that question is a more obvious outcome to every action and no one can call it hokey, because it is always answered without being asked.
A Call To Arms
Technology is no longer underneath the products you sell and the processes by which you deliver them. It is, for at least the immediate future, intertwined. Creativity on the technology side doesn't only deliver cost savings, it creates new audiences and increases interaction with your customers. You have to do more than embrace technology; you need to leverage it and let new opportunities catapult your business forward.
As intertwined as technology is, we can no longer afford to have its operational details hidden away in the bowels of the "tech ops" or "web ops" group. We need visibility and we need cohesion. Infrastructure/application engineering and other business units are now, more than ever before, on the same team marching towards success. Communication and accountability are critical to success.
Here is where I leave you and hope that you will think about the metrics you monitor in a different light. They represent something more. They are there to make the business run, increase shareholder value, and make your customers happier and more prosperous.
2011-01-19 18:27:31 Past Performance: does this look right to you?
by Theo Schlossnagle
If you are like me, you look at a lot of data. I look at data in spreadsheets, I look at data on P&L statements, I look at term sheets, I look at systems data — a lot of systems data. I find the best way to look at data is to visualize it because it is the fastest way to get data into the amazing pattern matcher that is the human brain.
The human brain is quite good at saying “this is abnormal” and can usually even articulate why. This curve has a periodicity, that one a monotonic behavior, another is simply always flat… then they “change.” When we say “this visualization looks wrong,” we are almost always onto something real in the numbers. I'll give you a simple visual example:
While there is obviously something starting at 8pm, we are only left with another question: “is it out of the ordinary?” It doesn't look like that today, and it doesn't appear to resemble the day before. What about last week? Let's start the graph one week earlier:
This tells us a lot. It looks like we had a very similar event last week at this time. With most analysis tools, you stop here (or you hover with your mouse and try to correlate start/end times and magnitude to better understand how these two events resemble each other).
With Circonus, we don't leave it here. Instead, we provide tools to help compare time-separated events using our data overlay feature. We can take our original two-day view and overlay the data from last week right on top of it (or in this case, underneath it).
Just two clicks and we've got a one-week offset data overlay and the visualization lends a little insight into what is going on. We can see the start times are identical, but the event from this week ends about 30 minutes before the one from last week — largely the same though.
Again, we find that visuals help. Understanding how these graphs differ even when they are right on top of each other can be a bit challenging. Never fear! We've added help in the legend.
The legend takes on some new features when data overlays are in use. You now get a very clear, side-by-side read-out of the data in the graph including percentage differences. Additionally, the arrows that say “you're higher than you were last week” become more saturated (redder) as the difference in the data increases and fade to light grey as the two values become more similar. This makes it simple to quickly understand how current performance really compares to past performance. So, the interesting part of this graph is actually the subsequent spike of inbound traffic, which is up 95% over last week. That's something to look into.
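The percentage read-out in the legend boils down to comparing each point with the point one week earlier. Here is a minimal Python sketch of that comparison; the readings and the `percent_change` helper are invented for illustration and are not the Circonus implementation.

```python
# Hypothetical inbound-traffic readings taken at the same times of day,
# one week apart. Values are invented for illustration.
this_week = [120.0, 118.0, 390.0]
last_week = [115.0, 120.0, 200.0]

def percent_change(current, previous):
    """Percentage by which `current` differs from `previous`."""
    return (current - previous) * 100.0 / previous

# Positive means higher than last week, negative means lower.
changes = [percent_change(now, then) for now, then in zip(this_week, last_week)]
```

In this made-up series the last pair (390 against 200) is the kind of jump the post describes: a value close to double what it was a week before.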
2011-01-05 20:11:49 Capacity Planning Made Easy
by Theo Schlossnagle
Okay, so capacity planning will never be foolproof. You simply cannot predict the future. However, some of the time you have a darn good idea of what the future will hold. Since someone knows what is likely to happen, why is it so hard to plan marketing initiatives, funnels and IT provisioning?
The reason is that things aren't always linearly correlated. What does that mean? Linear correlation goes something like this: if A depends upon B and I want twice as much A, I'll need twice as much B. While correlating non-linear systems is what I like to call BFM, a lot can be done with linear regressions. The problem with any regression is that you need to put real numbers in, get real numbers out and understand how good they are.
When we look at how something grows, one of the most common tools in the statistics arsenal is a least-squares linear regression. That is: given a set of datapoints, what line best fits them? So, let's say we have a lot of datapoints (boy do we have a lot of datapoints!). Now what does a linear regression tell us?
Let's assume we're looking at some traffic data over the month of December.
In this graph, it can be very hard to answer questions about the nature of the data. Two common questions are:
- are we growing or shrinking and by how much?
- if we stay on the current growth path, where will we be some point in the future?
Enter the linear regression:
Answering the first question is pretty simple now. We can look at the value on the left side of the graph and the value on the right side of the graph and do the math. You can't see it in the screenshot, but the values at the left and right edges are 5.49M and 5.88M, which is roughly 7.1% growth over 4 weeks. Now, any statistician will scream bloody murder about confidences in the data and model and any engineer will simply ask: "does that make sense?" Maybe we'll look over 8 weeks and 12 weeks as well to make sure that we build our confidence (this can be easier, though far less scientific, than understanding R² values, which are, of course, available as well). Honestly, I personally find that reconciling this with my expectations is one of the better methods of trusting the model.
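For readers who want to see the mechanics, here is a minimal least-squares fit in plain Python, including the R² value mentioned above. The weekly readings are invented (they are not the data behind the graph) and `least_squares` is a hypothetical helper, not a Circonus function.

```python
def least_squares(xs, ys):
    """Return (slope, intercept, r_squared) of the best-fit line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2: share of the variance in ys that the fitted line explains.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Invented weekly traffic readings (in millions); not the data behind the graph.
weeks = [0, 1, 2, 3, 4]
traffic = [5.5, 5.4, 5.8, 5.7, 5.9]

slope, intercept, r2 = least_squares(weeks, traffic)
start = intercept                    # fitted value at the left edge
end = slope * weeks[-1] + intercept  # fitted value at the right edge
growth_pct = (end - start) / start * 100.0
```

Growth is measured between the two ends of the fitted line rather than between the first and last raw readings, so a single noisy sample at either edge does not distort the answer.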
Let's assume that we expected some increase in resource usage during this time frame and that roughly 7% is reasonable. Now onto the next question: where will we be in the future? In Circonus, we just jump up and extend our view window out one year and we can see what our model looks like in the future:
Next December we'll be using 10.91M (this just happens to be MBits/s of network bandwidth to serve origin dynamic content on one of the sites managed over at OmniTI). We'll revisit this month by month to ensure that we are indeed heading where we expected. It allows engineers, marketers and executives alike to put real numbers into (what we call) napkin math, which adds peace and clarity and makes what-if pontification easier for most people. I can tell you one thing... we sleep better at night knowing specific numbers about a probable future.
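The projection itself is just the fitted line extended to the right. The sketch below uses the two endpoint values quoted earlier and assumes they sit exactly four weeks apart; the `project` helper is hypothetical and stands in for whatever the graphing tool does internally.

```python
# Endpoint values of the fitted line as quoted in the text (in millions),
# assumed here to sit exactly four weeks apart.
start, end = 5.49, 5.88
weeks_between = 4

slope_per_week = (end - start) / weeks_between

def project(weeks_ahead):
    """Extend the fitted line `weeks_ahead` weeks past its right edge."""
    return end + slope_per_week * weeks_ahead

one_year_out = project(52)
```

Under these assumptions the line lands near 11M a year out, in the same neighborhood as the figure read off the extended graph. Any small difference would come from the rounding of the quoted endpoints and from exactly where the view window starts and ends.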
2010-12-10 21:14:35 Enterprise Agents
by Theo Schlossnagle
If you're like me, your first response to SaaS monitoring was: "You can't see my machines/services/metrics from your cloud. That won't be too useful." With a little bit of thought, it's pretty easy to arrive at the conclusion that you must run something on your infrastructure to bridge the divide. It was a fun and exciting project here to build that magic something called the Circonus Enterprise Agent.
The Circonus Enterprise Agent (we'll call it our "EA" from here on) is all of our magic monitoring software bundled into a maintainable VMWare virtual appliance that can be run on your internal networks to track stuff that the public shouldn't be seeing. We had some interesting choices to make during development and I thought I'd share what they were and why we made them.
Choosing a platform
Most of our internal infrastructure runs on some variant of OpenSolaris technology. We chose this for a variety of reasons. Most importantly, storing your precious data on ZFS seemed like the right thing to do. After that, the fault management architecture (FMA) available in OpenSolaris allows us to keep our machines and services running more reliably. Reliability and data permanence are the two most important factors in technology selection here at Circonus (a fact our customers respect).
So, with all this talk about OpenSolaris and its advantages you'd imagine we built our EA on the same technology, right? Not so simple. For a virtual appliance image that is easy to administer and easy to upgrade in the field you need a good package management system. OpenSolaris simply falls on its face there. Oracle's promises of IPS (the new and coming package management system for Solaris 11) are quite compelling, but that is just a promise today. Instead, we turned to the tried and true CentOS Linux-based platform for our EA.
CentOS provides all the features we need to run our agent software, manage package upgrades and distribution seamlessly and simply, and the core operating system is both stable and secure. In an interesting later development, we provide Joyent customers the ability to run an EA on one of their Joyent SmartMachines. Joyent's operating architecture is actually derived from OpenSolaris — so we ended up porting our EA back to our core platform as well.
Today, the EA is available in two forms: a CentOS 5 VMWare-based appliance and a Joyent SmartMachine.
From where do you manage the appliance
While most appliances have a web console that allows a variety of management tasks, we made a simple choice to have the appliance administrable via the main circonus.com web application. This is where Circonus users interface with all their data and set up their monitors, so it only made sense to also administer their EA from the same place.
After using the system for a while now, I can say that I'm really pleased with this decision. The cohesiveness of scheduling checks on your private EA and/or the world-wide Circonus agents through the same check creation interface is a simple pleasure. One single world-wide view of all the agents on which you can schedule checks makes it simple to understand how the monitoring system works.
What to automate
Generally speaking, when you think appliance, you think self-maintaining. That's not an unreasonable expectation. However, this directly conflicts with our experience in operations. In operations, automatic upgrades of software are strictly taboo. Typically, the operations crew wants to schedule precisely when an upgrade will occur, be present and have a bulletproof evaluation and rollback plan. When you start talking about critical infrastructure like monitoring, "typically" becomes "always."
With this in mind, we made the upgrade process on the EA completely automated, but not automatic. One click and the appliance will self-upgrade. Currently, this is the only ongoing task that is done from the appliance itself (rather than the circonus.com portal), but we're looking to make some nice enhancements there as well. Soon, you'll be able to trigger remote EA upgrades directly from the web application.
What you get
With an EA you get to leverage the power of Circonus against all of your private data. Networks, systems, applications and business systems that are only accessible via internal infrastructure can be monitored via an Enterprise Agent. The data is fed back to the Circonus cloud in real-time. All of that data can be alerted on, and is available for correlation, trending and planning purposes through the excellent Circonus tools you already know and love.