{"id":21040,"date":"2016-11-07T11:25:39","date_gmt":"2016-11-07T10:25:39","guid":{"rendered":"https:\/\/www.inovex.de\/?p=2384"},"modified":"2022-12-01T11:57:33","modified_gmt":"2022-12-01T10:57:33","slug":"hive-udfs-and-udafs-with-python","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/","title":{"rendered":"Hive UDFs and UDAFs with Python"},"content":{"rendered":"<p>Sometimes the analytical power of <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/LanguageManual+UDF\">built-in Hive functions<\/a> is just not enough. In this case it is possible to write hand-tailored User-Defined Functions (UDFs) for transformations and even aggregations which are therefore called User-Defined Aggregation Functions (UDAFs). In this post we focus on how to write sophisticated UDFs and UDAFs in Python.<!--more--><\/p>\n<p>By sophisticated we mean that our <span class=\"caps\">UD<\/span>(A)Fs should also be able to leverage external libraries like Numpy, Scipy, Pandas etc. This makes things a lot more complicated since we have to provide not only some Python script but also a full-blown virtual environment including the external libraries since they may not be available on the cluster nodes. 
Therefore, in this tutorial we require only that a basic installation of Python is available on the data nodes of the Hive cluster.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#General-information\" >General information<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#Overview-and-a-little-task\" >Overview and a little task<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#1-Setting-up-our-dummy-table\" >1. Setting up our dummy table<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#2-Creating-and-uploading-a-virtual-environment\" >2. Creating and uploading a virtual environment<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#3-Writing-and-uploading-the-scripts\" >3. Writing and uploading the scripts<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#4-Writing-the-actual-HiveQL-query\" >4. 
Writing the actual HiveQL query<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#Finally\" >Finally<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#Get-in-touch\" >Get in touch<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#Were-hiring\" >We&#8217;re hiring<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"General-information\"><\/span>General information<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To keep the idea behind <span class=\"caps\">UD<\/span>(A)Fs short, only some general notes are mentioned here. With the help of the <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/LanguageManual+Transform\">Transform\/Map-Reduce syntax<\/a>, i.e. <code>TRANSFORM<\/code>, it is possible to plug in your own custom mappers and reducers. This is where we are going to hook in our Python script. A <span class=\"caps\">UDF<\/span> is basically only a transformation done by a mapper, meaning that each row should be mapped to exactly one row. A <span class=\"caps\">UDAF<\/span> on the other hand allows us to transform a group of rows into one or more rows, meaning that we can reduce the number of input rows to a single output row by some custom aggregation.<\/p>\n<p>We can control whether the script is run in a mapper or reducer step by the way we formulate our HiveQL query. The statements <code>DISTRIBUTE BY<\/code> and <code>CLUSTER BY<\/code> allow us to indicate that we want to actually perform an aggregation. Hive feeds the data to the Python script, or any other custom script, via its standard input and reads the result from its standard output. 
Everything the script writes to standard error is ignored as far as the result is concerned and can therefore be used for debugging. Since a <span class=\"caps\">UDAF<\/span> is more complex than a <span class=\"caps\">UDF<\/span> and actually can be seen as a generalization of it, the development of a <span class=\"caps\">UDAF<\/span> is demonstrated here.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Overview-and-a-little-task\"><\/span>Overview and a little task<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In order not to get lost in the details, here is what we want to achieve from a high-level perspective.<\/p>\n<ol>\n<li>Set up a small example Hive table within some database.<\/li>\n<li>Create a virtual environment and upload it to Hive\u2019s distributed cache.<\/li>\n<li>Write the actual <span class=\"caps\">UDAF<\/span> as a Python script and a little helper shell script.<\/li>\n<li>Write a HiveQL query that feeds our example table into the Python script.<\/li>\n<\/ol>\n<p>Our dummy data consists of different types of vehicles (car or bike) and a price. For each category we want to calculate the mean and the standard deviation using Pandas to keep things simple. Of course, this task can be handled in HiveQL directly, so this is really only for demonstration purposes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"1-Setting-up-our-dummy-table\"><\/span>1. 
Setting up our dummy table<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>With the following query we generate our sample data:<\/p>\n<pre class=\"lang:mysql decode:true \">CREATE DATABASE tmp;\r\n\r\nUSE tmp;\r\n\r\nCREATE TABLE foo (id INT, vtype STRING, price FLOAT);\r\n\r\nINSERT INTO TABLE foo VALUES (1, \"car\", 1000.);\r\n\r\nINSERT INTO TABLE foo VALUES (2, \"car\", 42.);\r\n\r\nINSERT INTO TABLE foo VALUES (3, \"car\", 10000.);\r\n\r\nINSERT INTO TABLE foo VALUES (4, \"car\", 69.);\r\n\r\nINSERT INTO TABLE foo VALUES (5, \"bike\", 1426.);\r\n\r\nINSERT INTO TABLE foo VALUES (6, \"bike\", 32.);\r\n\r\nINSERT INTO TABLE foo VALUES (7, \"bike\", 1234.);\r\n\r\nINSERT INTO TABLE foo VALUES (8, \"bike\", null);<\/pre>\n<p>Note that the last row even contains a null value that we need to handle later.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"2-Creating-and-uploading-a-virtual-environment\"><\/span>2. Creating and uploading a virtual environment<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In order to prepare a proper virtual environment we need to execute the following steps on an <span class=\"caps\">OS<\/span> that is binary compatible with the <span class=\"caps\">OS<\/span> on the Hive cluster. Typically any recent 64-bit Linux distribution will do.<\/p>\n<p>We start by creating an empty virtual environment with:<\/p>\n<pre class=\"lang:sh decode:true \">virtualenv --no-site-packages -p \/usr\/bin\/python3 venv<\/pre>\n<p>assuming that <code>virtualenv<\/code> was already installed with the help of pip. Recent virtualenv releases isolate the environment from the system site packages by default and no longer accept <code>--no-site-packages<\/code>, so the flag can simply be dropped there. Note that we explicitly ask for Python 3. Who uses Python 2 these days anyhow? We activate the virtual environment and install Pandas in it.<\/p>\n<pre class=\"lang:sh decode:true \">source venv\/bin\/activate\r\n\r\npip install numpy pandas<\/pre>\n<p>This should install Pandas and all its dependencies into our virtual environment. 
Now we package the virtual environment for later deployment in the distributed cache:<\/p>\n<pre class=\"lang:sh decode:true \">cd venv\r\n\r\ntar cvfhz ..\/venv.tgz .\/\r\n\r\ncd ..<\/pre>\n<p>Be aware that the archive was created with the actual content at its root, so unpacking it will not create an enclosing directory. We also used the option <code>h<\/code> so that symbolic links are followed and the linked files themselves are packaged.<\/p>\n<p>Now we push the archive to <span class=\"caps\">HDFS<\/span> so that later Hive\u2019s data nodes will be able to find it:<\/p>\n<pre class=\"lang:sh decode:true \">hdfs dfs -put venv.tgz \/tmp<\/pre>\n<p>The directory <code>\/tmp<\/code> should be adjusted to your own setup. One should also note that in principle the same procedure should be possible with conda environments. In practice though, it might be a bit more involved since the activation of a conda environment (that we need to do later) assumes an installation of at least miniconda which might not be available on the data nodes.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"3-Writing-and-uploading-the-scripts\"><\/span>3. 
Writing and uploading the scripts<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We start by writing a simple Python script <code>udaf.py<\/code>:<\/p>\n<pre class=\"lang:python decode:true \" title=\"Simple Python script\">import sys\r\n\r\nimport logging\r\n\r\nfrom itertools import groupby\r\n\r\nfrom operator import itemgetter\r\n\r\nimport numpy as np\r\n\r\nimport pandas as pd\r\n\r\n# Hive's TRANSFORM passes columns tab-separated and encodes NULL as \\N\r\n\r\nSEP = '\\t'\r\n\r\nNULL = '\\\\N'\r\n\r\n_logger = logging.getLogger(__name__)\r\n\r\ndef read_input(input_data):\r\n\r\n    # Lazily yield one list of column values per input line\r\n\r\n    for line in input_data:\r\n\r\n        yield line.strip().split(SEP)\r\n\r\ndef main():\r\n\r\n    logging.basicConfig(level=logging.INFO, stream=sys.stderr)\r\n\r\n    data = read_input(sys.stdin)\r\n\r\n    # groupby only merges consecutive rows, so the input must be sorted by vtype\r\n\r\n    for vtype, group in groupby(data, itemgetter(1)):\r\n\r\n        _logger.info(\"Reading group {}...\".format(vtype))\r\n\r\n        # Convert to proper types and map Hive's NULL marker to NaN\r\n\r\n        group = [(int(rowid), vtype, np.nan if price == NULL else float(price))\r\n\r\n                 for rowid, vtype, price in group]\r\n\r\n        df = pd.DataFrame(group, columns=('id', 'vtype', 'price'))\r\n\r\n        # mean() and std() skip NaN values by default\r\n\r\n        output = [vtype, df['price'].mean(), df['price'].std()]\r\n\r\n        print(SEP.join(str(o) for o in output))\r\n\r\nif __name__ == '__main__':\r\n\r\n    main()<\/pre>\n<p>The script should be pretty much self-explanatory. We read from the standard input with the help of a generator that strips and splits the lines by the separator <code>\\t<\/code>. At any point we want to avoid holding more data in memory than needed to perform the actual computation. We use the <code>groupby<\/code> function from Python\u2019s <code>itertools<\/code> module to iterate over our two types of vehicles. Note that <code>groupby<\/code> only groups consecutive rows with the same key, so it relies on the rows arriving sorted by <code>vtype<\/code>. 
For each group we convert the values we read to their respective data types and at that point also take care of <code>null<\/code> values which are encoded as <code>\\N<\/code>.<\/p>\n<p>After this preprocessing we finally feed everything into a Pandas dataframe, do our little mean and standard deviation calculations and print the result as a tab-separated line. It should also be noted that we set up a logger at the beginning which writes everything to standard error. This really helps a lot with debugging and should be used. For demonstration purposes the vehicle type of the group currently processed is logged.<\/p>\n<p>At this point we would actually be done if it wasn\u2019t for the fact that we are importing external libraries like Pandas. So if we ran this Python script directly as a <span class=\"caps\">UDAF<\/span> we would see import errors if Pandas is not installed on all cluster nodes. But in the spirit of David Wheeler&#8217;s \u201cAll problems in computer science can be solved by another level of indirection.\u201d we just write a little helper script called <code>udaf.sh<\/code> that activates the virtual environment for us and calls the Python script afterwards.<\/p>\n<pre class=\"lang:sh decode:true \">#!\/bin\/bash\r\n\r\nset -e\r\n\r\n(&gt;&amp;2 echo \"Begin of script\")\r\n\r\nsource .\/venv.tgz\/bin\/activate\r\n\r\n(&gt;&amp;2 echo \"Activated venv\")\r\n\r\n.\/venv.tgz\/bin\/python3 udaf.py\r\n\r\n(&gt;&amp;2 echo \"End of script\")<\/pre>\n<p>Again we use standard error to trace what the script is currently doing. The path <code>.\/venv.tgz<\/code> refers to the directory into which the archive is unpacked on the worker nodes. With the help of <code>chmod u+x<\/code> we make the script executable and now all that\u2019s left is to push both files somewhere on <span class=\"caps\">HDFS<\/span> for the cluster to find:<\/p>\n<pre class=\"lang:sh decode:true \">hdfs dfs -put udaf.py \/tmp\r\n\r\nhdfs dfs -put udaf.sh \/tmp<\/pre>\n<h2><span class=\"ez-toc-section\" id=\"4-Writing-the-actual-HiveQL-query\"><\/span>4. 
Writing the actual HiveQL query<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now that everything is prepared we can write the actual HiveQL query:<\/p>\n<pre class=\"lang:mysql decode:true \">DELETE ARCHIVE hdfs:\/\/\/tmp\/venv.tgz;\r\n\r\nADD ARCHIVE hdfs:\/\/\/tmp\/venv.tgz;\r\n\r\nDELETE FILE hdfs:\/\/\/tmp\/udaf.py;\r\n\r\nADD FILE hdfs:\/\/\/tmp\/udaf.py;\r\n\r\nDELETE FILE hdfs:\/\/\/tmp\/udaf.sh;\r\n\r\nADD FILE hdfs:\/\/\/tmp\/udaf.sh;\r\n\r\nUSE tmp;\r\n\r\nSELECT TRANSFORM(id, vtype, price) USING 'udaf.sh' AS (vtype STRING, mean FLOAT, std FLOAT)\r\n\r\n  FROM (SELECT * FROM foo CLUSTER BY vtype) AS TEMP_TABLE;<\/pre>\n<p>At first we add the zipped virtual environment to the distributed cache, where it is automatically unpacked for us thanks to the <code>ADD ARCHIVE<\/code> command. Then we add the Python script and the helper script. To make sure the version in the cache is actually the latest one in case changes were made, we put a <code>DELETE<\/code> statement before each <code>ADD<\/code>. The actual query now calls <code>TRANSFORM<\/code> with the three input columns we expect in our Python script. After the <code>USING<\/code> statement our helper script is provided as the actual <span class=\"caps\">UDAF<\/span> seen by HiveQL. This is followed by <code>AS<\/code> defining the names and types of the output columns, i.e. the vehicle type, the mean and the standard deviation that our script prints.<\/p>\n<p>At this point we need to make sure that the script is executed in a reducer step. We ensure this by defining a subselect that reads from our <code>foo<\/code> table and clusters by <code>vtype<\/code>. <code>CLUSTER BY<\/code>, which is a shortcut for <code>DISTRIBUTE BY<\/code> followed by <code>SORT BY<\/code>, ensures that rows having the same <code>vtype<\/code> value are also located on the same reducer. Furthermore, the implicit <code>SORT BY<\/code> orders the rows within a reducer with respect to the <code>vtype<\/code> column. 
The overall result is a sequence of consecutive partitions, one per vehicle type (car and bike in our case), where each partition resides on a single reducer. Finally, our script runs in parallel and is fed the whole data of a reducer but needs to figure out itself where one partition ends and another one starts (which we did with <code>itertools.groupby<\/code>).<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Finally\"><\/span>Finally<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Since our little task is now accomplished, it should also be noted that there are some more Python libraries one should know when working with Hive. To execute the HiveQL query we have written from within Python, there is <a href=\"https:\/\/github.com\/cloudera\/impyla\">impyla<\/a> by Cloudera which, at the time of writing, supports Python 3 in contrast to <a href=\"https:\/\/github.com\/dropbox\/PyHive\">PyHive<\/a> by Dropbox. In order to work with <span class=\"caps\">HDFS<\/span> the best library around is <a href=\"https:\/\/hdfs3.readthedocs.io\/\">hdfs3<\/a>, which for instance would allow us to push changes in <code>udaf.py<\/code> automatically with a Python script.<\/p>\n<p>Have fun hacking Hive with the power of Python!<\/p>\n<p>This article first appeared at <a href=\"http:\/\/www.florianwilhelm.info\/2016\/10\/python_udf_in_hive\/\" target=\"_blank\" rel=\"noopener\">www.florianwilhelm.info<\/a>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Get-in-touch\"><\/span>Get in touch<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Check out our <a href=\"https:\/\/www.inovex.de\/de\/leistungen\/analytics\/data-science\/\" target=\"_blank\" rel=\"noopener\">analytics portfolio on our website<\/a>. 
If you have any questions use the comment section below, write an Email to <a href=\"mailto:info@inovex.de\" target=\"_blank\" rel=\"noopener\">info@inovex.de<\/a> or call\u00a0<a href=\"tel:+497216190210\" target=\"_blank\" rel=\"noopener\">+49 721 619 021-0<\/a>.<\/p>\n<div style=\"margin: 7px; padding: 7px; border-left: 6px solid #9CCD00;\">\n<h2><span class=\"ez-toc-section\" id=\"Were-hiring\"><\/span>We&#8217;re hiring<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Are you a data scientist looking for new challenges? <a href=\"https:\/\/www.inovex.de\/de\/karriere\/stellenangebote\/?experience_code=&amp;department=data-management-analytics\" target=\"_blank\" rel=\"noopener\">We&#8217;re currently hiring<\/a>. As a student you might be interested in a position as student trainee or writing your thesis in Data Management &amp; Analytics.<\/p>\n<\/div>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes the analytical power of built-in Hive functions is just not enough. In this case it is possible to write hand-tailored User-Defined Functions (UDFs) for transformations and even aggregations which are therefore called User-Defined Aggregation Functions (UDAFs). 
In this post we focus on how to write sophisticated UDFs and UDAFs in Python.<\/p>\n","protected":false},"author":52,"featured_media":12795,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[206,207],"service":[431],"coauthors":[{"id":52,"display_name":"Florian Wilhelm","user_nicename":"fwilhelm"}],"class_list":["post-21040","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-data-science","tag-hive","service-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Hive UDFs and UDAFs with Python - inovex GmbH<\/title>\n<meta name=\"description\" content=\"In this post we focus on how to write sophisticated User Defined (Aggregated) Functions (UD(A)Fs) for Apache Hive in Python.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Hive UDFs and UDAFs with Python - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"In this post we focus on how to write sophisticated User Defined (Aggregated) Functions (UD(A)Fs) for Apache Hive in Python.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2016-11-07T10:25:39+00:00\" \/>\n<meta property=\"article:modified_time\" 
content=\"2022-12-01T10:57:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2300\" \/>\n\t<meta property=\"og:image:height\" content=\"876\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Florian Wilhelm\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python-1024x390.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Florian Wilhelm\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"10\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Florian Wilhelm\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/\"},\"author\":{\"name\":\"Florian Wilhelm\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/57ad7c24ee7f9ec59ed87598c73fe79e\"},\"headline\":\"Hive UDFs and UDAFs with 
Python\",\"datePublished\":\"2016-11-07T10:25:39+00:00\",\"dateModified\":\"2022-12-01T10:57:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/\"},\"wordCount\":1458,\"commentCount\":1,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/hive-python.png\",\"keywords\":[\"Data Science\",\"Hive\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/\",\"name\":\"Hive UDFs and UDAFs with Python - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/hive-python.png\",\"datePublished\":\"2016-11-07T10:25:39+00:00\",\"dateModified\":\"2022-12-01T10:57:33+00:00\",\"description\":\"In this post we focus on how to write sophisticated User Defined (Aggregated) Functions (UD(A)Fs) for Apache Hive in 
Python.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/hive-python.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/hive-python.png\",\"width\":2300,\"height\":876,\"caption\":\"Hexagons surround the Python logo\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/hive-udfs-and-udafs-with-python\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Hive UDFs and UDAFs with Python\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex 
GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/57ad7c24ee7f9ec59ed87598c73fe79e\",\"name\":\"Florian Wilhelm\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg5db1abe47435abb84b0b7484ce0890e9\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg\",\"caption\":\"Florian Wilhelm\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/fwilhelm\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Hive UDFs and UDAFs with Python - inovex GmbH","description":"In this post we focus on how to write sophisticated User Defined (Aggregated) Functions (UD(A)Fs) for Apache Hive in Python.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/","og_locale":"de_DE","og_type":"article","og_title":"Hive UDFs and UDAFs with Python - inovex GmbH","og_description":"In this post we focus on how to write sophisticated User Defined (Aggregated) Functions (UD(A)Fs) for Apache Hive in Python.","og_url":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2016-11-07T10:25:39+00:00","article_modified_time":"2022-12-01T10:57:33+00:00","og_image":[{"width":2300,"height":876,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python.png","type":"image\/png"}],"author":"Florian Wilhelm","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python-1024x390.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Florian Wilhelm","Gesch\u00e4tzte Lesezeit":"10\u00a0Minuten","Written by":"Florian Wilhelm"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/"},"author":{"name":"Florian Wilhelm","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/57ad7c24ee7f9ec59ed87598c73fe79e"},"headline":"Hive UDFs and UDAFs with 
Python","datePublished":"2016-11-07T10:25:39+00:00","dateModified":"2022-12-01T10:57:33+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/"},"wordCount":1458,"commentCount":1,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python.png","keywords":["Data Science","Hive"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/","url":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/","name":"Hive UDFs and UDAFs with Python - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python.png","datePublished":"2016-11-07T10:25:39+00:00","dateModified":"2022-12-01T10:57:33+00:00","description":"In this post we focus on how to write sophisticated User Defined (Aggregated) Functions (UD(A)Fs) for Apache Hive in 
Python.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2016\/11\/hive-python.png","width":2300,"height":876,"caption":"Hexagons surround the Python logo"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/hive-udfs-and-udafs-with-python\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Hive UDFs and UDAFs with Python"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex 
GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/57ad7c24ee7f9ec59ed87598c73fe79e","name":"Florian Wilhelm","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg5db1abe47435abb84b0b7484ce0890e9","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg","caption":"Florian Wilhelm"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/fwilhelm\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21040","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/52"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=21040"}],"version-history":[{"count":2,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21040\/revisions"}],"predecessor-version":[{"id":33630,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21040\/revisions\/33630"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/12795"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=21040"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=21040"},{"taxonomy":"service","embe
ddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=21040"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=21040"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}