All posts by Andrew Kenworthy

Andrew Kenworthy
Having graduated in Civil Engineering from Cambridge University in the UK, I worked for a multinational construction company, where I had my first taste of software development. I now work as an Architect/Developer at inovex, specialising in Big Data and Business Intelligence technologies, with a particular interest in real-time scenarios.

HBase and Phoenix on Azure: adventures in abstraction

One of my favourite essays by Joel Spolsky (he of Stack Overflow fame) is “The Law of Leaky Abstractions”. In it he describes how the prevalence of layers of abstraction – be it coding languages, libraries or frameworks – has helped us accelerate our productivity. We don’t have to talk directly to a database engine because we can let our SQL do that for us; we don’t have to implement MapReduce jobs in Java anymore because we can use Hive; we don’t have to… well, you get the idea. Read more

Storm in a Teacup

I wanted to call this blog article something like “Storm in a Nutshell” but decided against it as

  1. there is probably a book by that name out there somewhere, and I wanted to avoid any unannounced visits in the dead of night from shady-looking types from the copyright police, and
  2. I really wanted to use a corny pun.

So think of a teacup as conceptually similar to a nutshell, but bigger.

On a recent project, we used Apache Storm as the real-time component of a complex, cloud-based environment used for fraud detection. In this article I would like to offer an introductory overview of Storm, showing how to define a simple spout and bolt, as well as highlighting some of the issues that are important when building Storm topologies. Read more

Drastic Elastic [Part 4]: Aggregations & Plugins

In an earlier post in this mini-series I mentioned that the aggregated data we persist in ElasticSearch has discrete retention times:

  • 5-minute aggregations => (retention time of) one day
  • hourly aggregations => 7 days
  • daily aggregations => 5 years

This means that we reach well over 50% of our total data retention after one week: the only additions after that point are daily aggregations, while data at the other aggregation levels is refreshed or updated. After 4 or 5 weeks we had something like 8 billion documents in ElasticSearch, amounting to 13 TB of data.

In this last article of our four-part series we describe how ElasticSearch plugins help us to address appropriate aggregation levels without having to build in extra round trips (adding to latency), or to fetch more data than we need (which would require filtering in the client). Read more

Drastic Elastic [Part 3]: Cluster Setup

ElasticSearch does not offer support for clusters spanning data centres. However, on our project we had a network latency of only 400 *micro*seconds (0.4 ms) between three separate locations in the same city, and decided to test a cluster spanning all three data centres. Network latency did not prove to be a problem, but a trickier issue was deciding how to set up the cluster to best guard against network partitioning. Read more

Drastic Elastic [Part 1]: ElasticSearch as a Database

In an article for Java Magazin way back in 2012 (only a small section of it seems to have survived online(!), although it is still available from the inovex website as a download) I toyed with the idea of using a search engine as a database, mainly due to cost and usability considerations. This was not such an unconventional idea, it turned out, since Elastic from time to time describes its search engine as being a database too. The idea gained traction with the release of the aggregation framework in early 2015, and a few months later I was involved with a project where we decided to leverage ElasticSearch aggregations for the analysis of internet statistics. In this article I want to share my experience. Read more