Drastic Elastic [Part 2]: The aggregation framework

Following from my earlier article on elasticsearch-as-a-database, we will now take a look at the aggregation framework.

Aggregations

But first the bad news – we have to admit, we abused this feature … big time. The documentation illustrates use of the aggregation framework for things such as summary, or top X, queries („give me the top 10 best-selling cars by country“), but aggregating data over an entire index is something else entirely.

Basic concepts

Aggregations return key/value pairs for each aggregation level, together with one or more aggregated metric (max, min, sum,..). They can be nested (see next section), and cannot be used on analyzed fields. A simple aggregation over all types held in an index would look like this:

which is comparable to a simple group-by query in a relational database:

It is possible to nest such queries in elasticsearch (just as one could do with a RDBMS), so that a group operation such as

can be achieved by nesting aggregations. But your memory could explode if this is done excessively.

So if you should not nest aggregations, but you wish to wish to group on more one column, what are the alternatives? Our approach was to combine fields needed for „nested“ aggregations in a single, composite field in our bulk import step, so that we could aggregate on this single field, but also parse out individual fields before persisting the aggregated result. e.g.

  1. on bulk import, write a composite column made up of three columns like this (using a suitable delimiter)
    • field1: „value1“
    • field2: „value2“
    • field3: „value3“
      • groupField:“value1;value2;value3″
  2. aggregate on groupField
  3. having accessed the aggregation result, split the key (groupField) back into three individual columns (field1, field2, field3) and then write row back into a new index (one for holding aggregated results) in elasticsearch.

In this way we can be flexible in the way we aggregate without placing an impossible burden on memory. In fact, we are able to aggregate over entire indexes (often holding 100s of thousands of documents) at regular intervals without disturbing the bulk imports which take place every 5 minutes. We took the precaution of carrying out aggregations on 2 dedicated client nodes (that are not subject to bulk import loads).

Example:

The aggregation result will then look something like this:

where the key refers to our groupField, made of known, ordered sub-fields (which we defined on insert).

drastic-elastic-importer-aggregator

Aggregate

  • choose either cluster- or external-aggregation mode (see below for the differences)
  • data is aggregated/grouped on identical values of the designated group field in the document, each metric (sum/min/max/values-per-second etc.) is applied to this group
  • the ”key“ (=group value) is then used to define/retrieve all fields and values needed to construct a document with the proper mapping format
  • each ”aggregate“ document is written to ES
  • a number of rows will be chosen randomly from the aggregation list, each aggregated row being then verified by selecting all individual documents belonging to the group and calculating/comparing metrics in line

Cluster vs. External … what does this mean?

Cluster: Here we take advantage of ES-aggregations to return us the grouped results: ES does the heavy lifting for us, but any errors are obfuscated and hard to identify.

External: We fetch all data in packets via scan-and-scroll and build up our own hash table outside of ES: Errors are far easier to identify and debug/fix, we can dispense with the group field alltogether, as it is only needed for cluster based aggregating (saving up to 50% disk space), we have to do the heavy lifting ourselves => memory/CPU stress outside of cluster where processes cannot be run in parallel.

Aggregations: Lessons learned

  • they rock!
  • do not nest them otherwise you could have issues with RAM
  • take care with (but don’t necessarily avoid) setting size=0 (which aggregates over the entire index, as we did) as this was not really the intention of the engineers that designed the aggregation framework
  • it is difficult to validate an aggregation result. We actually built our aggregating tool with an „external“ mode, so that could scan-and-scroll all documents in an index, building up a hash table external to elasticsearch to validate results
  • using „copy values“ was not useful to us as it was not able to maintain a strict order of fields (this feature is intended for searching over multiple fields)
  • use your own grouping column as described above
  • use dedicated nodes for aggregation operations

Read on …

So you’re interested in search based applications, text analytics and enterprise search solutions? Have a look at our website and read about the services we offer to our customers.

Join us!

Are you looking for a job in search or analytics? We’re currently hiring Search Engineers as well as Junior and Senior Big Data Scientists.

Read the complete series

comments powered by Disqus