Hive User Defined Aggregate Functions (UDAF) Java Example

posted on Nov 20th, 2016

Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for data summarization, query, and analysis. Hive provides an SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop. Without Hive, traditional SQL queries would have to be implemented with the MapReduce Java API in order to run over distributed data. Hive supplies the necessary SQL abstraction, HiveQL, so that queries do not have to be written against the low-level Java API. Since most data warehousing applications use SQL-based query languages, Hive makes it easier to port SQL-based applications to Hadoop.

Prerequisites

1) A machine with the Ubuntu 14.04 LTS operating system

2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)

3) Apache Hive 2.1.0 pre-installed (How to Install Hive on Ubuntu 14.04)

User Defined Aggregate Functions (UDAF) Java Example

Step 1 - Add these jar files to your Java project.

hive-exec*.jar

$HIVE_HOME/lib/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/*.jar
$HADOOP_HOME/share/hadoop/common/*.jar

Max.java

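import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

// The UDAF and UDAFEvaluator classes are deprecated in Hive, hence the annotation.
@SuppressWarnings("deprecation")
public class Max extends UDAF {
	// Hive calls init() once, iterate() for each input row, terminatePartial()
	// to get a partial result, merge() to combine partial results, and
	// terminate() to get the final result.
	public static class MaxIntUDAFEvaluator implements UDAFEvaluator {
		// Largest value seen so far; null until a non-null value arrives.
		private IntWritable output;

		// Reset the aggregation state.
		public void init()
		{
			output = null;
		}

		// Process one row of the input table.
		public boolean iterate(IntWritable maxvalue)
		{
			if (maxvalue == null)
			{
				return true; // ignore NULL values
			}
			if (output == null)
			{
				output = new IntWritable(maxvalue.get());
			}
			else
			{
				output.set(Math.max(output.get(), maxvalue.get()));
			}
			return true;
		}

		// Partial result: the largest value seen by this evaluator.
		public IntWritable terminatePartial()
		{
			return output;
		}

		// Merging a partial maximum is the same as seeing one more value.
		public boolean merge(IntWritable other)
		{
			return iterate(other);
		}

		// Final result.
		public IntWritable terminate()
		{
			return output;
		}
	}
}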
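
To see how the evaluator methods fit together, here is a minimal sketch that runs without the Hive and Hadoop jars. It is not the class above: MaxLifecycleSketch, Evaluator, partialMax and finalMax are hypothetical names, Integer stands in for IntWritable, and the split of the values into two lists is arbitrary. It only imitates the order in which the methods are used, one evaluator per chunk of rows and then one more to merge the partial results.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for MaxIntUDAFEvaluator (Integer instead of IntWritable).
class MaxLifecycleSketch {

    static class Evaluator {
        private Integer output;

        void init() { output = null; }

        boolean iterate(Integer value) {
            if (value == null) return true; // NULL values are ignored
            output = (output == null) ? value : Math.max(output, value);
            return true;
        }

        Integer terminatePartial() { return output; }

        boolean merge(Integer other) { return iterate(other); }

        Integer terminate() { return output; }
    }

    // First phase: one evaluator sees one chunk of rows and returns a partial maximum.
    static Integer partialMax(List<Integer> rows) {
        Evaluator e = new Evaluator();
        e.init();
        for (Integer row : rows) e.iterate(row);
        return e.terminatePartial();
    }

    // Second phase: a fresh evaluator merges the partial maxima into the final result.
    static Integer finalMax(List<Integer> partials) {
        Evaluator e = new Evaluator();
        e.init();
        for (Integer p : partials) e.merge(p);
        return e.terminate();
    }

    public static void main(String[] args) {
        Integer p1 = partialMax(Arrays.asList(10, 12, 23, 55, 66, 77)); // 77
        Integer p2 = partialMax(Arrays.asList(88, 99, 22, 13, 16));     // 99
        System.out.println(finalMax(Arrays.asList(p1, p2)));            // prints 99
    }
}
```

A chunk with no rows yields a null partial result, which merge ignores through the same null check that iterate uses.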

Step 2 - Compile the project and package it as a jar file (Step 8 uses the name MaxUDAF.jar). Creating the jar file is left to you.

Step 3 - Create a Numbers_List.txt file

Numbers_List.txt

Step 4 - Add the following lines to the Numbers_List.txt file, then save and close it.

10
12
23
55
66
77
88
99
22
13
16

Step 5 - Change the directory to $HIVE_HOME/bin (/usr/local/hive/bin in this setup)

$ cd $HIVE_HOME/bin

Step 6 - Start the Hive shell

$ hive

Step 7 - Create a table Num_list, load the Numbers_List.txt data into it, and verify the contents.

hive> CREATE TABLE Num_list(Num int) ROW FORMAT DELIMITED LINES TERMINATED BY '\n';

hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/Numbers_List.txt' OVERWRITE INTO TABLE Num_list;

hive> SELECT * FROM Num_list;

Step 8 - Add the jar file to the distributed cache, create a temporary function, and run the UDAF.

hive> ADD JAR /home/hduser/Desktop/HIVE/MaxUDAF.jar;

hive> CREATE TEMPORARY FUNCTION max AS 'Max';

hive> SELECT max(Num) FROM Num_list;
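
As a quick cross-check of what the query should return, the sketch below computes the maximum of the Numbers_List.txt contents in plain Java. ExpectedMaxCheck and expectedMax are hypothetical names, and the file contents are pasted in as a string so that nothing depends on a file path. Lines that do not parse as integers are skipped here, on the assumption that Hive reads such values as NULL for an int column and that iterate then ignores them.

```java
import java.util.OptionalInt;

// Hypothetical helper, independent of Hive: computes the expected result locally.
class ExpectedMaxCheck {

    // Returns the largest integer found, one value per line; empty if none parse.
    static OptionalInt expectedMax(String fileContents) {
        OptionalInt max = OptionalInt.empty();
        for (String line : fileContents.split("\n")) {
            try {
                int value = Integer.parseInt(line.trim()); // trimming is a choice of this sketch
                if (!max.isPresent() || value > max.getAsInt()) {
                    max = OptionalInt.of(value);
                }
            } catch (NumberFormatException e) {
                // Not an integer: skip the line.
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String contents = "10\n12\n23\n55\n66\n77\n88\n99\n22\n13\n16\n";
        System.out.println(expectedMax(contents).getAsInt()); // prints 99
    }
}
```

For the sample data in Step 4, the SELECT above should therefore return 99.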
