Writing a Hive UDF for lookups

Notice:
This post is older than 5 years – the content might be outdated.

In today’s blog I am going to take a look at a fairly mundane and unspectacular use of a Hive UDF (user-defined function): performing lookups against resources residing in the Hadoop file system (HDFS), specifically other Hive tables. Why would we do this when Hive provides this functionality via joins and the like? Well, non-equi joins (e.g. joins using a range of values) were not allowed in Hive at the time of writing (support for complex join conditions only arrived in later releases), so in these cases the only options are to join on non-strict criteria (and then filter), or to write your own UDF, which is what we look at here.

Reading Hive resources should be fairly simple: after all Hive’s metastore knows all about its own HDFS resources, and so we can read the data into some kind of in-memory map and perform lookups to our heart’s content.

No problem, we say to ourselves, we’ll just write a UDF that executes an HCatalog call against the metastore. So we set off on our Hive UDF odyssey, draft and deploy our first HCatalog-enabled lookup-tool, go off to enjoy a coffee, and then return to find that we have killed the metastore: maybe unleashing that job with oh-so-many mappers was not such a hot idea after all. Hmm….

An example Hive UDF

OK, so the HCatalog idea was nice, but let’s rein in our enthusiasm slightly and go a bit more low-level: we will write a UDF (in Java, not in Python) that takes an HDFS path as one of its arguments. This will at least avoid addressing the metastore. Our skeleton (and, for the sake of space, simplified) UDF will look something like this:

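import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Map.Entry;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.ByteObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class LookupTaxCode extends GenericUDF {

	// N.B. the Byte inspectors assume TINYINT arguments; swap in the
	// inspector that matches your own column types
	private ByteObjectInspector customerInspector;
	private ByteObjectInspector taxCodeInspector;
	private IntObjectInspector dateInspector;
	private StringObjectInspector fileInspector;

	/*
	 * initialized in the initMap method: keyed by customer, then tax code,
	 * then lower-range value (assuming there are no gaps between ranges)
	 */
	private Map<Integer, Map<Integer, NavigableMap<Integer, HiveDecimal>>> lookup;

	@Override
	public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
		if (args.length < 3) {
			throw new UDFArgumentLengthException(
					"This function needs a minimum of 3 arguments - customer, taxcode-ID, (active) date "
							+ "plus - optionally - source file (pipe-delimited)!");
		}
		this.customerInspector = (ByteObjectInspector) args[0];
		this.taxCodeInspector = (ByteObjectInspector) args[1];
		this.dateInspector = (IntObjectInspector) args[2];
		if (args.length > 3) {
			this.fileInspector = (StringObjectInspector) args[3];
		}
		return PrimitiveObjectInspectorFactory.writableHiveDecimalObjectInspector;
	}

	@Override
	public HiveDecimalWritable evaluate(DeferredObject[] args) throws HiveException {
		/* initialize lookup, if not yet done */
		if (lookup == null) {
			if (args.length <= 3) {
				// without a file there is nothing to look up against: fail with a
				// clear message rather than a NullPointerException further down
				throw new HiveException("No lookup file supplied: pass the HDFS path as the fourth argument");
			}
			initHdfsLookup(fileInspector.getPrimitiveJavaObject(args[3].get()));
		}
		/* perform lookup */
		int customer = (int) customerInspector.get(args[0].get());
		int taxCodeId = (int) taxCodeInspector.get(args[1].get());
		int dateFrom = dateInspector.get(args[2].get());
		if (lookup.containsKey(customer) && lookup.get(customer).containsKey(taxCodeId)) {
			NavigableMap<Integer, HiveDecimal> rr = lookup.get(customer).get(taxCodeId);
			// floorEntry returns the range whose lower bound is the greatest one <= dateFrom
			Entry<Integer, HiveDecimal> floorEntry = rr.floorEntry(dateFrom);
			return new HiveDecimalWritable(floorEntry == null ? HiveDecimal.create(0) : floorEntry.getValue());
		} else {
			return null;
		}
	}

	private void initHdfsLookup(String lookupFile) throws HiveException {
		try {
			Configuration conf = new Configuration();
			Path filePath = new Path(lookupFile);
			FileSystem fs = FileSystem.get(filePath.toUri(), conf);
			// try-with-resources closes the stream once the map has been built
			try (FSDataInputStream in = fs.open(filePath)) {
				initMap(in);
			}
		} catch (Exception e) {
			throw new HiveException(e + ": when attempting to access: " + lookupFile, e);
		}
	}

	protected void initMap(InputStream in) throws IOException {
		/*
		 * perform some lookup logic from named hdfs file here...
		 */
	}

	@Override
	public String getDisplayString(String[] args) {
		return "Method call: lookup_taxcode(" + args[0] + ", " + args[1] + ", " + args[2] + ")";
	}
}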

with a callout such as:

select lookup_taxcode(12345, 100, 20180101, '/hdfs/path/to/partfile') from ...

where we supply some customer number (12345), a tax code (100), an active date as an integer (20180101), and the path to the HDFS resource.
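To make the lookup step itself concrete, here is a minimal stand-alone sketch of the nested-map range lookup that evaluate performs. The class name RangeLookupSketch, its put and find methods, and the sample rates are hypothetical and not part of the original UDF; java.math.BigDecimal stands in for HiveDecimal so that the example runs without Hive on the classpath.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical stand-alone sketch of the range lookup; not the post's actual class. */
final class RangeLookupSketch {

    /** customer -> tax code -> (valid-from date as yyyyMMdd -> rate) */
    private final Map<Integer, Map<Integer, NavigableMap<Integer, BigDecimal>>> lookup = new HashMap<>();

    /** Registers a rate that applies from validFrom until the next-higher validFrom. */
    void put(int customer, int taxCode, int validFrom, BigDecimal rate) {
        lookup.computeIfAbsent(customer, c -> new HashMap<>())
              .computeIfAbsent(taxCode, t -> new TreeMap<>())
              .put(validFrom, rate);
    }

    /**
     * Returns the rate whose range contains the date, zero if the date lies
     * before the first range, or null for an unknown customer or tax code
     * (the same three outcomes as the evaluate method above).
     */
    BigDecimal find(int customer, int taxCode, int date) {
        Map<Integer, NavigableMap<Integer, BigDecimal>> byTaxCode = lookup.get(customer);
        if (byTaxCode == null) {
            return null;
        }
        NavigableMap<Integer, BigDecimal> ranges = byTaxCode.get(taxCode);
        if (ranges == null) {
            return null;
        }
        // greatest valid-from date that is <= the requested date
        Map.Entry<Integer, BigDecimal> floor = ranges.floorEntry(date);
        return floor == null ? BigDecimal.ZERO : floor.getValue();
    }

    public static void main(String[] args) {
        RangeLookupSketch sketch = new RangeLookupSketch();
        sketch.put(12345, 100, 20170101, new BigDecimal("0.07"));
        sketch.put(12345, 100, 20180101, new BigDecimal("0.19"));
        System.out.println(sketch.find(12345, 100, 20171231)); // prints 0.07
        System.out.println(sketch.find(12345, 100, 20180101)); // prints 0.19
    }
}
```

Storing only the lower bound of each range is what makes floorEntry sufficient here; it relies on the "no gaps" assumption stated in the UDF's own comment, since an expired range with no successor would otherwise still match.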

What is immediately obvious, though, is that this is not very elegant: we are expecting the end-user to know all the grubby details about which resources exist and where they reside, right down to the part-file name. Surely we can do better. How about hard-coding the path into our UDF code? That would work, but it would also introduce a clear dependency: the UDF is rendered invalid each and every time the resources no longer match and, even worse, one needs access to the source code to find out which files we (the developers) have to ensure are available. Is there no way to provide this information at deploy time?

Yes…and no. We can’t provide any variables as part of our CREATE FUNCTION DDL, but we can—as of Hive 0.13 (see link below)—add resources, rather like we do when we define which .jar file to use. The command looks like this:

DROP FUNCTION IF EXISTS lookup_taxcode;

CREATE FUNCTION lookup_taxcode AS 'my.package.LookupTaxCode'
USING JAR 'hdfs://path/to/udf.jar',
FILE 'hdfs://path/to/relevant/partfile';

where we can specify a resource when we create the UDF. In this way, dependencies are at least documented along with the UDF definition, which is progress of sorts. The Hive UDF is created for each session and, upon creation, the HDFS resource is copied to a local folder, from where we reference it like this:

// additionally requires: java.io.File, java.io.FileInputStream,
// org.apache.hadoop.hive.conf.HiveConf, org.apache.hadoop.hive.ql.session.SessionState

@Override
public HiveDecimalWritable evaluate(DeferredObject[] args) throws HiveException {
	/* initialize lookup, if not yet done */
	if (lookup == null) {
		if (args.length > 3) {
			// previously: the lookup file has to be provided as an argument
			initHdfsLookup(fileInspector.getPrimitiveJavaObject(args[3].get()));
		} else {
			// now: look up from the resource defined in CREATE FUNCTION
			initCacheLookup(getResourcePath());
		}
	}
	/* perform lookup */
	...
}

private void initCacheLookup(String lookupFile) throws HiveException {
	// try-with-resources closes the stream once the map has been built
	try (InputStream in = new FileInputStream(getLookupFile(lookupFile))) {
		initMap(in);
	} catch (Exception e) {
		throw new HiveException(e + ": when attempting to access: " + lookupFile, e);
	}
}

protected File getLookupFile(String lookupFile) {
	/* N.B. local resources (non-MR mode) */
	File resourceDir = new File(SessionState.get().getConf().getVar(HiveConf.ConfVars.DOWNLOADED_RESOURCES_DIR));
	return new File(resourceDir, lookupFile);
}

protected String getResourcePath() throws HiveException {
	// PART_FILE: constant naming the resource (its definition is not shown here)
	return PART_FILE;
}

So we deploy and test the UDF, only to find at times that we are confronted with messages informing us that the file cannot be found.

This happens when we start a job that runs in map-reduce mode in the cluster, from where the UDF cannot see the local folder holding our resources. How can we write our resource-lookup code so that it is flexible enough to cope with both scenarios, local mode and M/R mode?

We can cover both eventualities by dropping down to a second option if the first cannot detect a file. Note that in M/R mode the file is made available via the distributed cache, which is local to where the UDF .jar is running:

protected File getLookupFile(String lookupFile) {
	/* distributed cache */
	File f = new File(lookupFile); // file available locally
	if (!f.exists()) {
		/* local resources (non-MR mode) */
		File resourceDir = new File(
				SessionState.get().getConf().getVar(HiveConf.ConfVars.DOWNLOADED_RESOURCES_DIR));
		f = new File(resourceDir, lookupFile);
	}
	return f;
}
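The same fallback idea can be sketched independently of Hive. The version below is only an illustration under stated assumptions: LookupFileResolver and its resolve method are hypothetical names, and the candidate directories are passed in by the caller rather than read from SessionState. Unlike the method above, it fails straight away with a message listing every location tried, instead of returning a file that may not exist.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Arrays;

/** Hypothetical helper: finds a lookup file in the first candidate directory that contains it. */
final class LookupFileResolver {

    /**
     * Checks each directory in order (e.g. the task's working directory for
     * M/R mode first, then the session's downloaded-resources directory).
     */
    static File resolve(String fileName, File... candidateDirs) throws FileNotFoundException {
        for (File dir : candidateDirs) {
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        throw new FileNotFoundException(
                "Lookup file '" + fileName + "' not found in any of: " + Arrays.toString(candidateDirs));
    }

    public static void main(String[] args) {
        // both directories are placeholders chosen for this demonstration
        File workingDir = new File(".");
        File resourcesDir = new File(System.getProperty("java.io.tmpdir"));
        try {
            System.out.println("Using " + resolve("partfile", workingDir, resourcesDir));
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Reporting all attempted locations in one message makes the "file cannot be found" situation described above quicker to diagnose, because the error shows which mode-specific directory was expected to hold the resource.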

Reviewing our approach

So now we are done—let’s take a step back and review some of the drawbacks to this approach:

We have to ensure that the single part file referenced in the UDF DDL is always present in HDFS under exactly that name (and in this context it is important to note that different tools, e.g. Hive and Spark, create part files with different naming conventions), and that it is in a format that is simple to parse (e.g. pipe- or comma-delimited data).
The file resource is integral to the function object: functions cannot be dropped if a FILE resource referenced in the CREATE FUNCTION statement no longer exists!
The resource is copied to the local resources folder whenever the UDF is instantiated by a session that invokes it, so the copy can become stale if the lookup data changes within the scope of a session.
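As an illustration of the first point, the sketch below shows how little parsing a simple delimited format needs. It is one possible body for the initMap stub, not the original implementation: the column layout customer|taxCodeId|validFrom|rate, the class name LookupFileParser and the parse method are all assumptions made for this example, and java.math.BigDecimal again stands in for HiveDecimal so that the code runs without Hive.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical parser for an assumed layout: customer|taxCodeId|validFrom|rate */
final class LookupFileParser {

    static Map<Integer, Map<Integer, NavigableMap<Integer, BigDecimal>>> parse(InputStream in)
            throws IOException {
        Map<Integer, Map<Integer, NavigableMap<Integer, BigDecimal>>> lookup = new HashMap<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (line.trim().isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] fields = line.split("\\|", -1);
                if (fields.length != 4) {
                    throw new IOException("Line " + lineNo
                            + ": expected 4 pipe-delimited fields but found " + fields.length);
                }
                try {
                    int customer = Integer.parseInt(fields[0].trim());
                    int taxCode = Integer.parseInt(fields[1].trim());
                    int validFrom = Integer.parseInt(fields[2].trim());
                    BigDecimal rate = new BigDecimal(fields[3].trim());
                    // customer -> tax code -> sorted map of range start -> rate
                    lookup.computeIfAbsent(customer, c -> new HashMap<>())
                          .computeIfAbsent(taxCode, t -> new TreeMap<>())
                          .put(validFrom, rate);
                } catch (NumberFormatException e) {
                    throw new IOException("Line " + lineNo + ": non-numeric value in: " + line, e);
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        String sample = "12345|100|20170101|0.07\n12345|100|20180101|0.19\n";
        System.out.println(parse(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Rejecting malformed lines with the line number, rather than skipping them silently, means a part file written by a different tool in a different layout is noticed on the first call instead of producing quietly wrong lookups.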

As mentioned at the start, it may well be that a UDF is superfluous to your requirements. In the case of equi-joins, Hive will normally persist small joined tables in the distributed cache, and then reference them in much the same way as we have shown above. For non-equi-joins, though, that is not possible, and a lookup against a small-ish dataset via a UDF is worth considering (or you could perform the join, excluding the column where a non-equi join would be used, and then filter in the WHERE clause).

So to conclude, we should try and balance the following considerations when using a Hive UDF for table lookups:

Do we want to use the metastore? – HCatalog calls from every mapper may cause problems, although this may be the cleanest implementation
Do we require the user to know about HDFS resources? – an alternative to HCatalog is to perform lookups directly against HDFS paths, but this requires the UDF callout (and hence the user) to include the address of the HDFS resource; or we can "embed" the resource in the CREATE FUNCTION definition
Are we performing lookups against dynamic data? – if so, make sure that it does not change in the course of your session
Avoid assumptions about local or YARN mode – ideally we want our UDF to be insulated against the mode of operation

Links

Apache Hive Wiki: Create/Drop/Reload Function
