Drastic Elastic [Part 2]: The aggregation framework

Notice:
This post is older than 5 years – the content might be outdated.

Following from my earlier article on elasticsearch-as-a-database, we will now take a look at the aggregation framework.

Aggregations

But first the bad news: we have to admit that we abused this feature ... big time. The documentation illustrates the aggregation framework with summary or top-X queries ("give me the top 10 best-selling cars by country"), but aggregating data over an entire index is something else entirely.

Basic concepts

Aggregations return key/value pairs for each aggregation level, together with one or more aggregated metrics (max, min, sum, ...). They can be nested (see below), and cannot be used on analyzed fields. A simple aggregation over all types held in an index would look like this:

curl -XGET 'localhost:9200/myIndex/_search?search_type=count&pretty' -d '{
    "aggs": {
        "count_by_type": {
            "terms": {
                "field": "_type"
            }
        }
    }
}'

which is comparable to a simple group-by query in a relational database:

select someColumn1, count(*) from myTable group by someColumn1

It is possible to nest such queries in elasticsearch (just as one could do with an RDBMS), so that a group operation such as

select someColumn1, someColumn2, someColumn3, count(*) from myTable group by someColumn1, someColumn2, someColumn3

can be achieved by nesting aggregations. But your memory could explode if this is done excessively.
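
To make the memory concern concrete, here is a minimal Python sketch. The helper names nested_terms_aggs and worst_case_buckets, and the cardinalities used, are made up for illustration. It builds a request body with one terms aggregation nested per column, and estimates an upper bound on the innermost buckets as the product of the distinct values per column.

```python
from math import prod


def nested_terms_aggs(fields):
    """Build a request body that nests one terms aggregation per field."""
    body = {}
    level = body
    for field in fields:
        # Each level adds a terms aggregation inside the previous one.
        level["aggs"] = {"by_" + field: {"terms": {"field": field}}}
        level = level["aggs"]["by_" + field]
    return body


def worst_case_buckets(cardinalities):
    """Upper bound on innermost buckets: product of distinct values per field."""
    return prod(cardinalities)


body = nested_terms_aggs(["someColumn1", "someColumn2", "someColumn3"])
print(body)
# Hypothetical cardinalities for the three columns.
print(worst_case_buckets([200, 50, 1000]))
```

The real number of buckets depends on which value combinations actually occur in the data, so the product is only a worst case.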

So if you should not nest aggregations, but you wish to group on more than one column, what are the alternatives? Our approach was to combine the fields needed for "nested" aggregations into a single, composite field in our bulk import step, so that we could aggregate on this single field, but also parse out the individual fields before persisting the aggregated result. For example:

- on bulk import, write a composite column made up of three columns, using a suitable delimiter:
  - field1: "value1"
  - field2: "value2"
  - field3: "value3"
  - groupField: "value1;value2;value3"
- aggregate on groupField
- having accessed the aggregation result, split the key (groupField) back into three individual columns (field1, field2, field3) and then write the row back into a new index (one for holding aggregated results) in elasticsearch.
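
The round trip described in these steps can be sketched in Python as follows. This is not our importer code: the function names add_group_field and split_group_key are hypothetical, and the sketch assumes that the delimiter never occurs inside a field value.

```python
DELIMITER = ";"  # assumed never to occur inside a field value
GROUP_FIELDS = ("field1", "field2", "field3")  # fixed order, defined on import


def add_group_field(doc):
    """Return a copy of doc with the composite groupField added."""
    values = [str(doc[name]) for name in GROUP_FIELDS]
    if any(DELIMITER in value for value in values):
        raise ValueError("delimiter found inside a field value")
    return {**doc, "groupField": DELIMITER.join(values)}


def split_group_key(key):
    """Turn an aggregation bucket key back into the individual fields."""
    values = key.split(DELIMITER)
    if len(values) != len(GROUP_FIELDS):
        raise ValueError("unexpected number of sub-fields in key")
    return dict(zip(GROUP_FIELDS, values))


doc = add_group_field({"field1": "value1", "field2": "value2", "field3": "value3"})
print(doc["groupField"])
print(split_group_key(doc["groupField"]))
```

Note that the split returns every sub-field as a string, so numeric or date fields would need converting back before the aggregated row is written.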

In this way we can be flexible in the way we aggregate without placing an impossible burden on memory. In fact, we are able to aggregate over entire indexes (often holding hundreds of thousands of documents) at regular intervals without disturbing the bulk imports, which take place every 5 minutes. We took the precaution of carrying out aggregations on 2 dedicated client nodes (that are not subject to bulk import loads).

Example (the term query on somefield shows that aggregations can be combined with filter terms):

curl -XGET 'localhost:9200/myIndex/myType/_search?pretty' -d '{
    "_source": ["group_field", "fieldToAggregate"],
    "size": 1000,
    "query": {
        "filtered": {
            "query": {
                "term": {
                    "somefield": "value1"
                }
            }
        }
    },
    "aggs": {
        "grp": {
            "terms": {
                "field": "group_field",
                "size": 0
            },
            "aggs": {
                "sum_fieldToAggregate": {
                    "sum": {
                        "field": "fieldToAggregate"
                    }
                },
                "max_fieldToAggregate": {
                    "max": {
                        "field": "fieldToAggregate"
                    }
                },
                "min_fieldToAggregate": {
                    "min": {
                        "field": "fieldToAggregate"
                    }
                }
            }
        }
    }
}'

The aggregation result will then look something like this:

...
"aggregations": {
    "grp": {
        "buckets": [{
            "key": "value1;value2;value3",
            "doc_count": 1,
            "max_fieldToAggregate": {
                "value": 456.0
            },
            "min_fieldToAggregate": {
                "value": 123.0
            },
            "sum_fieldToAggregate": {
                "value": 640000.0
            }
        }]
    }
}

where the key refers to our groupField, made of known, ordered sub-fields (which we defined on insert).
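
As a rough illustration of how such a response could be turned into rows for the aggregate index, here is a Python sketch. The function bucket_to_document is hypothetical, the response is the example from above typed in as a dictionary, and the sub-field names are assumed to be known in the order used on insert.

```python
def bucket_to_document(bucket, field_names, delimiter=";"):
    """Flatten one terms bucket into a document for the aggregate index."""
    document = dict(zip(field_names, bucket["key"].split(delimiter)))
    document["doc_count"] = bucket["doc_count"]
    for name, content in bucket.items():
        # Metric sub-aggregations are objects carrying a single "value".
        if isinstance(content, dict) and "value" in content:
            document[name] = content["value"]
    return document


response = {
    "aggregations": {
        "grp": {
            "buckets": [{
                "key": "value1;value2;value3",
                "doc_count": 1,
                "max_fieldToAggregate": {"value": 456.0},
                "min_fieldToAggregate": {"value": 123.0},
                "sum_fieldToAggregate": {"value": 640000.0},
            }]
        }
    }
}

rows = [
    bucket_to_document(bucket, ("field1", "field2", "field3"))
    for bucket in response["aggregations"]["grp"]["buckets"]
]
print(rows[0])
```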

[Figure: drastic-elastic-importer-aggregator]

Aggregate

- choose either cluster or external aggregation mode (see below for the differences)
- data is grouped on identical values of the designated group field in the document, and each metric (sum, min, max, values per second etc.) is applied to this group
- the "key" (= group value) is then used to define/retrieve all fields and values needed to construct a document with the proper mapping format
- each "aggregate" document is written to ES
- a number of rows are chosen randomly from the aggregation list, and each of these aggregated rows is verified by selecting all individual documents belonging to the group, recalculating the metrics inline and comparing them
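
The verification step at the end of this list could look roughly like the Python sketch below. It works on in-memory data rather than a live cluster, the names recompute and verify_sample are hypothetical, and only sum, min and max are compared.

```python
import math
import random

METRICS = ("sum", "min", "max")


def recompute(documents, group_key, field):
    """Recalculate the metrics for one group from its individual documents."""
    values = [d[field] for d in documents if d["group_field"] == group_key]
    if not values:
        return None  # no source documents found for this group
    return {"sum": sum(values), "min": min(values), "max": max(values)}


def verify_sample(aggregated, documents, field, sample_size, seed=None):
    """Return the keys of randomly sampled aggregate rows that fail a recount."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(aggregated), min(sample_size, len(aggregated)))
    failed = []
    for key in sample:
        actual = recompute(documents, key, field)
        if actual is None or not all(
            math.isclose(aggregated[key][m], actual[m]) for m in METRICS
        ):
            failed.append(key)
    return failed


documents = [
    {"group_field": "a;b;c", "fieldToAggregate": 2.0},
    {"group_field": "a;b;c", "fieldToAggregate": 5.0},
    {"group_field": "x;y;z", "fieldToAggregate": 1.0},
]
aggregated = {
    "a;b;c": {"sum": 7.0, "min": 2.0, "max": 5.0},
    "x;y;z": {"sum": 9.0, "min": 1.0, "max": 1.0},  # deliberately wrong sum
}
print(verify_sample(aggregated, documents, "fieldToAggregate", sample_size=2))
```

Against a real index, the list comprehension in recompute would be replaced by a query that selects the documents of one group.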

Cluster vs. External … what does this mean?

Cluster: Here we take advantage of ES aggregations to return the grouped results. ES does the heavy lifting for us, but any errors are obscured and hard to identify.

External: We fetch all data in packets via scan-and-scroll and build up our own hash table outside of ES. Errors are far easier to identify, debug and fix, and we can dispense with the group field altogether, as it is only needed for cluster-based aggregation (saving up to 50% disk space). The downside is that we have to do the heavy lifting ourselves, which means memory and CPU stress outside of the cluster, where processes cannot be run in parallel.
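
A minimal Python sketch of the external mode follows. It assumes the scan-and-scroll pages have already been fetched and arrive as lists of documents; the function name aggregate_batches and the sample data are made up for illustration. Because the hash table is keyed on a tuple of the individual fields, no composite group field is needed.

```python
def aggregate_batches(batches, group_fields, metric_field):
    """Group documents by a tuple of fields while streaming over batches."""
    table = {}
    for batch in batches:
        for doc in batch:
            key = tuple(doc[name] for name in group_fields)
            value = doc[metric_field]
            entry = table.get(key)
            if entry is None:
                # First document of this group initialises every metric.
                table[key] = {"count": 1, "sum": value, "min": value, "max": value}
            else:
                entry["count"] += 1
                entry["sum"] += value
                entry["min"] = min(entry["min"], value)
                entry["max"] = max(entry["max"], value)
    return table


# Two hypothetical pages, as they might arrive from scan-and-scroll.
batches = [
    [{"field1": "a", "field2": "b", "fieldToAggregate": 2.0},
     {"field1": "a", "field2": "c", "fieldToAggregate": 4.0}],
    [{"field1": "a", "field2": "b", "fieldToAggregate": 5.0}],
]
table = aggregate_batches(batches, ("field1", "field2"), "fieldToAggregate")
print(table[("a", "b")])
```

Only one page of documents plus the hash table has to be held in memory at a time, but the table itself grows with the number of distinct groups.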

Aggregations: Lessons learned

- they rock!
- do not nest them, otherwise you could have issues with RAM
- take care with (but don't necessarily avoid) setting size=0 on a terms aggregation, which returns every bucket and so aggregates over the entire index, as we did: this was not really the intention of the engineers that designed the aggregation framework
- it is difficult to validate an aggregation result. We actually built our aggregating tool with an "external" mode, so that we could scan-and-scroll all documents in an index, building up a hash table external to elasticsearch to validate results
- using "copy values" was not useful to us, as it was not able to maintain a strict order of fields (this feature is intended for searching over multiple fields)
- use your own grouping column as described above
- use dedicated nodes for aggregation operations
