Sunday, April 28, 2013

ELSA Resource Utilization

I've recently received a number of questions on the ELSA mailing list, as well as internally at work, regarding hardware sizing and configuration for ELSA deployments.  Creating a good environment for ELSA requires understanding what each component does, what its resource requirements are, and how it interacts with the other components.  Generally speaking, with the new web services architecture, designing an ELSA architecture has become incredibly simple because the ideal layout is for all boxes to have the same components running.  It really is as simple as adding more boxes, with the small nuance of a possible load balancer in front of multiples.  To see why, let's take a closer look at each of the components.

The Components

An ELSA instance consists of three categories of components for receiving logs: parse, write, and index.  These break down into five individual steps:
  1. Syslog-NG receive/parse
  2. parse/write
  3. MySQL load
  4. Sphinx Search index
  5. Sphinx Search consolidate
Logs are available for search after the initial Sphinx Search indexing occurs, but they must be consolidated to remain on the system for extended periods of time ("temp" indexes versus "permanent" indexes).  Each phase in the life of an inbound log requires varying amounts of CPU and IO time from the system which, together, create the overall maximum event rates for the system.

However, the phases differ in how much IO versus CPU they use, and so some of them benefit greatly from having at least two CPUs available to run tasks concurrently.  Specifically, one CPU is used by Syslog-NG to parse logs and another by the process that parses the output from Syslog-NG.  The loading of logs into MySQL and the indexing of logs from MySQL using Sphinx can also occur on separate CPUs, meaning that a total of four CPUs could be used simultaneously, if available.

Properly selecting an ELSA deployment architecture means providing enough CPU to a node (without wasting resources) as well as ensuring that there is enough available IO to feed those CPUs.  Below is a high-level comparison of which components use a lot of IO versus which use a lot of CPU.  It's far from scientific as represented here, but it does paint a helpful picture of what each component requires when sizing a system.

As the diagram shows, receiving and parsing uses a lot of CPU but not much IO, whereas indexing uses more IO than CPU.  This is a big reason why running the indexing on the same system that is receiving logs makes a lot of sense.  If you separate out boxes into just parsers or just indexers, you are likely to waste IO on one and CPU on the other.  As long as the box you're using has four cores, there isn't a situation in which it helps to have a separate box do the parsing from the box doing the indexing. Separating the duties would only add unnecessary complexity.  If you do decide to split the workloads, be sure to load all ELSA components on both using standard ELSA installation procedures to avoid dependency pitfalls.

Search Components

What about the search side of things?  Once the indexes are built and available, the web frontend will query Sphinx to find document IDs which correspond to individual events.  It will then take that list of IDs and retrieve the matching events from MySQL.

Almost all of the heavy lifting is done by Sphinx as it searches its indexes for the full-text query given.  It will delve through billions of records and return a list of result doc IDs.  This list of (one hundred, by default) doc IDs is then passed to MySQL for full event retrieval.  The IDs are the MySQL tables' primary key, so this is a very fast lookup.  From a performance and scaling standpoint, ninety-nine percent of the work is done by Sphinx, with MySQL only performing row storage and retrieval.
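To make the two-step lookup concrete, here is a minimal sketch in Python.  It is not ELSA's code: a plain dictionary stands in for the Sphinx index, sqlite3 stands in for MySQL, and the table layout, function names, and sample events are all made up for illustration.

```python
import sqlite3

def search_doc_ids(index, keywords, limit=100):
    """Step 1 (stand-in for Sphinx): return up to `limit` doc IDs
    that contain every keyword."""
    if not keywords:
        return []
    postings = [set(index.get(k, ())) for k in keywords]
    return sorted(set.intersection(*postings))[:limit]

def fetch_events(conn, doc_ids):
    """Step 2 (stand-in for MySQL): fetch full events by primary key."""
    if not doc_ids:
        return []
    placeholders = ",".join("?" for _ in doc_ids)
    query = f"SELECT id, msg FROM events WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(query, list(doc_ids)).fetchall()

# Hypothetical sample data: three events and a tiny keyword index over them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "sshd failed password for root"),
    (2, "sshd accepted password for alice"),
    (3, "httpd GET /index.html"),
])
index = {"sshd": [1, 2], "failed": [1], "password": [1, 2], "httpd": [3]}

ids = search_doc_ids(index, ["sshd", "password"])
events = fetch_events(conn, ids)
```

The expensive part is the first function, which has to work through the index.  The second is a bounded primary-key lookup of at most one hundred rows.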

Within Sphinx, a query is a list of keywords to search for.  Each keyword in the result represents a pseudo-table of result attributes, which are the ELSA attributes (host, class, source IP, etc.).  A very common keyword will have a very large pseudo-table, and Sphinx will try to find the best matches within that table.  This means that even though the search is using an index, it could take a long time to find the right matches if there are a lot of potential records to filter.  ELSA deals with this by telling Sphinx to time out after a configured amount of time (ten seconds total, by default) and return the best matches it has thus far.  This prevents a "bad" query from taking forever from the user's perspective, and if desired, the user can override this behavior with the timeout directive.
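The timeout behavior can be pictured with a short Python sketch.  This is only an illustration of the idea, not how Sphinx implements it: the function name and parameters are hypothetical, and it simply keeps the first matches found rather than ranking them.

```python
import time

def search_with_timeout(candidates, matches, timeout_s=10.0, limit=100,
                        clock=time.monotonic):
    """Scan candidate rows until done, until `limit` matches are found,
    or until the deadline passes.  Returns (results, timed_out)."""
    deadline = clock() + timeout_s
    results = []
    for row in candidates:
        if clock() >= deadline:
            # Give up and hand back whatever has been found so far.
            return results, True
        if matches(row):
            results.append(row)
            if len(results) >= limit:
                break
    return results, False

# A selective filter finishes well inside the deadline.
results, timed_out = search_with_timeout(range(10), lambda r: r % 2 == 0)
```

The `clock` parameter exists only so the deadline can be exercised in tests without actually waiting ten seconds.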

If a query has to scan a lot of these result rows, then the query will be IO-bound.  If it doesn't, then the query will complete in less than a second with very little CPU or IO usage.  It should be noted that temporary indexes do not contain the pseudo-tables, and therefore queries against temporary indexes (which is almost always the case for alerts) execute in a few milliseconds.  So, the total amount of resources required for queries boils down to how many "bad" queries are issued against the system.  The more queries for common terms, the more IO required, which could cut into IO needed for indexing.

If IO-intensive queries will be frequent, then it might make sense to replicate ELSA data to "slave" nodes using forwarding.  Configuring ELSA to send its already-parsed logs to another instance will allow for that instance to skip the receiving and parsing step and just index records.  It can then serve as a mirror for queries to help share the query load.  This is not normally necessary, but could be desired in certain production environments.

Choosing the Right Hardware

My experience has shown that a single ELSA node will comfortably handle about 10,000 events/second, sustained, even with slow disk.  As shown above,  ELSA will happily handle 50,000 events/second for long periods of time, but eventually index consolidation will be necessary, and that's where the 10,000-30,000 events/second rate comes in.  A virtual machine probably won't handle more than 10,000 events/second unless it has fairly fast disk (15,000 RPM drives, for instance) and the disk is set to "high" in the hypervisor, but a standalone server will be able to run at around 30,000 events/second on moderate server hardware.

I recommend a minimum of two cores, but as described above, there is work enough for four.  RAM requirements are a bit less obvious.  The more RAM you have, the more disk cache you get, which helps performance if an entire index fits in that cache.  A typical consolidated ("permanent") index is about 7 gigabytes on disk (for 10 million events), so I recommend 8 GB of RAM for best performance, though 2-4 GB will work fine.
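The 7 GB per 10 million events figure also gives a rough way to estimate disk needs.  The Python sketch below is back-of-the-envelope arithmetic only: it assumes decimal gigabytes, counts only the consolidated Sphinx index (not the rows stored in MySQL), and uses function names invented for this example.

```python
GB_PER_INDEX = 7               # typical consolidated index size, from above
EVENTS_PER_INDEX = 10_000_000  # events in one consolidated index
SECONDS_PER_DAY = 86_400

def index_bytes_per_event():
    """Average index footprint of one event, in bytes."""
    return GB_PER_INDEX * 10**9 / EVENTS_PER_INDEX

def index_gb_per_day(events_per_second):
    """Index growth per day at a sustained event rate."""
    events_per_day = events_per_second * SECONDS_PER_DAY
    return events_per_day / EVENTS_PER_INDEX * GB_PER_INDEX

def retention_days(disk_gb, events_per_second):
    """How many days of index data fit in `disk_gb` of disk."""
    return disk_gb / index_gb_per_day(events_per_second)

per_event = index_bytes_per_event()     # about 700 bytes per event
per_day = index_gb_per_day(10_000)      # about 605 GB per day at 10,000 events/second
days = retention_days(4_000, 10_000)    # about 6.6 days on 4 TB
```

At a sustained 10,000 events/second, index data alone grows by roughly 600 GB per day under these assumptions, so disk capacity determines retention long before CPU becomes the limit.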

RAM also comes into play in temporary index count.  When ELSA finds that the amount of free RAM has become too small or the amount of RAM ELSA uses has surpassed a configured limit (80 percent and 40 percent, by default, respectively), it will consolidate indexes before hitting its size limit (10 million events, by default).  So, more RAM will allow ELSA to have more temporary indexes and be more efficient about consolidating them.
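The decision described above can be sketched as a small Python function.  This is an interpretation, not ELSA's actual logic: it assumes the 80 percent figure is the overall RAM utilization at which free RAM counts as "too small" and that the 40 percent figure is the share of RAM ELSA itself may use.  The function and parameter names are hypothetical.

```python
def consolidation_reason(system_ram_used_pct, elsa_ram_used_pct, temp_events,
                         max_system_ram_pct=80.0, max_elsa_ram_pct=40.0,
                         size_limit_events=10_000_000):
    """Return why temporary indexes should be consolidated now,
    or None if nothing requires it yet."""
    if temp_events >= size_limit_events:
        return "size limit reached"
    if system_ram_used_pct > max_system_ram_pct:
        # Assumed reading of "free RAM has become too small".
        return "free RAM too low"
    if elsa_ram_used_pct > max_elsa_ram_pct:
        return "ELSA RAM limit exceeded"
    return None

# Plenty of headroom: keep accumulating temporary indexes.
relaxed = consolidation_reason(50.0, 20.0, 2_000_000)
# Memory pressure forces consolidation well before 10 million events.
pressured = consolidation_reason(85.0, 20.0, 2_000_000)
```

On a box with more RAM, the two memory checks trip later, so more temporary indexes accumulate before each consolidation.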

In conclusion, if you are shopping for hardware for ELSA, you don't need more than four CPUs, but you should try to get as much disk and RAM as possible.