Writing a Hive UDF for lookups


In today’s blog I am going to take a look at a fairly mundane and unspectacular use of a Hive UDF (user-defined function): performing lookups against resources residing in the Hadoop Distributed File System (HDFS), specifically other Hive tables. Why would we do this when Hive provides this functionality via joins and the like? Well, non-equi joins (e.g. joins using a range of values) are not allowed in Hive, so in these cases the only options are to join on non-strict criteria (and then filter), or to write your own UDF, which is what we look at here.

Reading Hive resources should be fairly simple: after all, Hive’s metastore knows all about its own HDFS resources, and so we can read the data into some kind of in-memory map and perform lookups to our heart’s content.

No problem, we say to ourselves, we’ll just write a UDF that executes an HCatalog call against the metastore. So we set off on our Hive UDF odyssey, draft and deploy our first HCatalog-enabled lookup-tool, go off to enjoy a coffee, and then return to find that we have killed the metastore: maybe unleashing that job with oh-so-many mappers was not such a hot idea after all. Hmm….

An example Hive UDF

OK, so the HCatalog idea was nice, but let’s rein in our enthusiasm slightly and go a bit more low-level: we will write a UDF (in Java, not in Python) that takes an HDFS path as one of its arguments. This will at least avoid addressing the metastore. Our skeleton (and, for the sake of space, simplified) UDF will look something like this:
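For illustration, assume a hypothetical TaxRateLookupUDF built on the classic org.apache.hadoop.hive.ql.exec.UDF API; it reads a pipe-delimited lookup file straight from HDFS into a map on first use (the class name and file layout are assumptions, and the date-range check is left out for brevity):

```java
package com.example.hive.udf;  // hypothetical package

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.UDF;

public class TaxRateLookupUDF extends UDF {

    // cache the lookup data so the file is read only once per UDF instance
    private Map<String, String> lookup;

    public String evaluate(Integer custNo, Integer taxCode,
                           Integer activeDate, String hdfsPath) {
        if (lookup == null) {
            try {
                lookup = load(hdfsPath);
            } catch (IOException e) {
                throw new RuntimeException("Could not read lookup file " + hdfsPath, e);
            }
        }
        // key layout (customer|tax code) is an assumption; the validity check
        // against activeDate is omitted to keep the sketch short
        return lookup.get(custNo + "|" + taxCode);
    }

    private Map<String, String> load(String hdfsPath) throws IOException {
        Map<String, String> map = new HashMap<>();
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(hdfsPath))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\\|");
                map.put(fields[0] + "|" + fields[1], fields[2]);
            }
        }
        return map;
    }
}
```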

with a callout such as:
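Sticking with hypothetical names – say the class above has been registered as tax_rate_lookup – and a purely illustrative path:

```sql
SELECT tax_rate_lookup(12345, 100, 20180101,
                       'hdfs:///data/lookup/tax_rates/part-00000');
```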

where we supply some customer number (12345), a tax code (100), an active date as an integer (20180101), and the path to the HDFS resource.

What is immediately obvious, though, is that this is not very elegant: we are expecting the end user to know all the grubby details about which resources exist, and where they reside—right down to the part-file name. Surely we can do better—how about hard-coding this into our UDF code? That would work, but it would also introduce a clear dependency, with the UDF being rendered invalid each and every time the resources no longer match; even worse, one would need access to the code just to know which files we (the developers) have to keep available. Is there no way to provide this information at deploy time?

Yes…and no. We can’t provide any variables as part of our CREATE FUNCTION DDL, but we can—as of Hive 0.13 (see link below)—add resources, rather like we do when we define which .jar file to use. The command looks like this:
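With hypothetical locations for the .jar and the lookup file, the definition might read:

```sql
CREATE FUNCTION tax_rate_lookup
  AS 'com.example.hive.udf.TaxRateLookupUDF'
  USING JAR  'hdfs:///apps/hive/udf/lookup-udf.jar',
        FILE 'hdfs:///data/lookup/tax_rates/part-00000';
```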

where we can specify a resource when we create the UDF. In this way, dependencies are at least documented along with the UDF definition, which is progress of sorts. The Hive UDF is created for each session and, upon creation, the HDFS resource is copied to a local folder, from where we reference it like this:
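A sketch of such a reference, say as a helper inside the class above (treating the resource’s base name as the local file name is our assumption):

```java
// sketch: open the session-local copy of the FILE resource with plain java.io,
// addressing it by its base name instead of the full HDFS path
// (needs java.io.File and java.io.FileReader in addition to the imports above)
private BufferedReader openLocalCopy(String hdfsPath) throws IOException {
    String fileName = new Path(hdfsPath).getName();   // e.g. "part-00000"
    return new BufferedReader(new FileReader(new File(fileName)));
}
```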

So we deploy and test the UDF, only to find at times that we are confronted with messages informing us that the file cannot be found.

This happens when we start a job that runs in map-reduce mode on the cluster, where the UDF cannot see the local folder holding our resources. How can we write our resource-lookup code to be flexible enough to cope with both scenarios, local and M/R mode?

We can cover both eventualities by dropping down to a second option if the first cannot detect the file. Note that in M/R mode the file is available via the distributed cache, which is local to where the UDF .jar has been started:
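One way to write that two-step check, again as a helper inside the UDF class (the exact fallback order is our assumption):

```java
// sketch of one way to cover both modes: try the configured path first (useful
// when it already resolves to a local copy of the resource); if nothing is
// found there, fall back to the base name, under which the distributed cache
// makes the file available next to the running task in M/R mode
// (needs java.io.File and java.io.FileNotFoundException)
private File locateResource(String declaredPath) throws FileNotFoundException {
    File direct = new File(declaredPath);
    if (direct.exists()) {
        return direct;
    }
    File cached = new File(new Path(declaredPath).getName());
    if (cached.exists()) {
        return cached;
    }
    throw new FileNotFoundException("Lookup resource not found: " + declaredPath);
}
```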

Reviewing our approach

So now we are done—let’s take a step back and review some of the drawbacks to this approach:

  1. we have to ensure that the single part file referenced in the UDF DDL is always present in HDFS under exactly that name (and in this context it is important to note that different tools – e.g. Hive, Spark – create part files with different naming conventions), and in a format that is simple to parse (e.g. pipe- or comma-delimited data)
  2. the file resource is integral to the function object: functions cannot be dropped if a FILE resource referenced in the CREATE FUNCTION statement no longer exists!
  3. the resource is copied to the local resources folder whenever it is instantiated by a session that invokes the UDF: so the resource can become stale if the lookup data changes within the scope of a session

As mentioned at the start, it may well be that a UDF is superfluous to your requirements. In the case of equi-joins, Hive will normally persist small joined tables in the distributed cache and then reference them in much the same way as we have shown above. For non-equi joins, though, that is not possible, and a lookup against a small-ish dataset via a UDF is worth considering (or you could perform the join on the equi criteria only, excluding the column where a non-equi condition would be used, and then filter in the WHERE clause, as sketched below).
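As a sketch of that last workaround, with made-up table and column names:

```sql
-- join on the equi criteria only, then apply the range condition afterwards
SELECT c.cust_no,
       r.rate
FROM   customers c
JOIN   tax_rates r
  ON   c.tax_code = r.tax_code
WHERE  c.active_date BETWEEN r.valid_from AND r.valid_to;
```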

So to conclude, we should try and balance the following considerations when using a Hive UDF for table lookups:

  1. Do we want to use the metastore? – HCatalog calls from every mapper may cause problems, although this may be the cleanest implementation
  2. Do we require the user to know about HDFS resources? – an alternative to HCatalog is to perform lookups directly against HDFS paths, but this requires the UDF callout (and hence the user) to include the address of the HDFS resource; or we can “embed” the resource in the CREATE FUNCTION definition
  3. Are we performing lookups against dynamic data? –  if so, make sure that it does not change in the course of your session
  4. Avoid assumptions about local or YARN mode – ideally we want our UDF to be insulated from the mode of operation

Links

Apache Hive Wiki: Create/Drop/Reload Function
