Hive User Defined Table Generating Functions (UDTF) Java Example

posted on Nov 20th, 2016

Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. Hive gives an SQL-like interface for querying data stored in the various databases and file systems that integrate with Hadoop. Without Hive, traditional SQL queries would have to be implemented in the MapReduce Java API to run over distributed data. Hive provides the necessary abstraction to express SQL-like queries (HiveQL) on top of the underlying Java API, without the need to implement queries in the low-level API directly. Since most data warehousing applications work with SQL-based query languages, Hive makes it easy to port SQL-based applications to Hadoop.

Hive User Defined Table Generating Functions

Step 1 - Add the following jar files to your Java project's build path (hive-exec*.jar is the essential one):

$HIVE_HOME/lib/*.jar
$HADOOP_HOME/share/hadoop/mapreduce/*.jar
$HADOOP_HOME/share/hadoop/common/*.jar

Myudtf.java

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
public class Myudtf extends GenericUDTF {
	private PrimitiveObjectInspector stringOI = null;
	@Override
	public StructObjectInspector initialize(ObjectInspector[] args)
			throws UDFArgumentException {
		if (args.length != 1) {
			throw new UDFArgumentException(
					"Myudtf() takes exactly one argument");
		}
		if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
				|| ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
			throw new UDFArgumentException(
					"Myudtf() takes a string as a parameter");
		}
		// input inspectors
		stringOI = (PrimitiveObjectInspector) args[0];
		// output inspectors -- a struct with two fields: id and phone_number
		List<String> fieldNames = new ArrayList<String>(2);
		List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
		fieldNames.add("id");
		fieldNames.add("phone_number");
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		return ObjectInspectorFactory.getStandardStructObjectInspector(
				fieldNames, fieldOIs);
	}

	public ArrayList<Object[]> processInputRecord(String id) {
		ArrayList<Object[]> result = new ArrayList<Object[]>();
		// ignoring null or empty input
		if (id == null || id.isEmpty()) {
			return result;
		}
		// input lines are comma-separated: id,phone1[,phone2]
		String[] tokens = id.split(",");
		if (tokens.length == 2) {
			result.add(new Object[] { tokens[0], tokens[1] });
		}
		else if (tokens.length == 3) {
			result.add(new Object[] { tokens[0], tokens[1] });
			result.add(new Object[] { tokens[0], tokens[2] });
		}
		return result;
	}
	@Override
	public void process(Object[] record) throws HiveException {
		final String id = stringOI.getPrimitiveJavaObject(record[0]).toString();
		ArrayList<Object[]> results = processInputRecord(id);
		Iterator<Object[]> it = results.iterator();
		while (it.hasNext()) {
			Object[] r = it.next();
			forward(r);
		}
	}
	@Override
	public void close() throws HiveException {
		// do nothing
	}
}
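The row-generation logic in processInputRecord can be exercised outside Hive with a small standalone sketch (the class name MyudtfLogicDemo is ours, not part of the tutorial; it assumes the comma-separated input format shown in Step 4):

```java
import java.util.ArrayList;
import java.util.Arrays;

// Standalone sketch of the splitting logic in Myudtf.processInputRecord():
// each input line "id,phone1[,phone2]" yields one output row per phone number.
public class MyudtfLogicDemo {
	static ArrayList<String[]> processInputRecord(String id) {
		ArrayList<String[]> result = new ArrayList<>();
		// ignore null or empty input
		if (id == null || id.isEmpty()) {
			return result;
		}
		String[] tokens = id.split(",");
		if (tokens.length == 2) {
			result.add(new String[] { tokens[0], tokens[1] });
		} else if (tokens.length == 3) {
			result.add(new String[] { tokens[0], tokens[1] });
			result.add(new String[] { tokens[0], tokens[2] });
		}
		return result;
	}

	public static void main(String[] args) {
		for (String[] row : processInputRecord("123,phone1,phone2")) {
			System.out.println(Arrays.toString(row)); // [123, phone1] then [123, phone2]
		}
	}
}
```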


Step 2 - Compile your Java project and package it as a jar file (named UDTF.jar below). Creating the jar file is left to you.

Step 3 - Create a phn_num.txt file

phn_num.txt

Step 4 - Add the following lines to the phn_num.txt file. Save and close.

123,phone1,phone2
123,phone3
124,phone1,phone2
125,phone1,phone2
125,phone3
126,phone1
126,phone2,phone3

Step 5 - Change the directory to $HIVE_HOME/bin (/usr/local/hive/bin in this setup)

$ cd $HIVE_HOME/bin

Step 6 - Enter the Hive shell

$ hive

Step 7 - Create a table named phone, load the phn_num.txt data into it, and verify. Because fields are terminated by '\n', each whole line is stored in the single id column.

hive> CREATE TABLE phone(id String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n';

hive> LOAD DATA LOCAL INPATH '/home/hduser/Desktop/HIVE/phn_num.txt' OVERWRITE INTO TABLE phone;

hive> SELECT * FROM phone;

Step 8 - Add the jar file to the distributed cache, create a temporary function, and execute the UDTF.

hive> ADD JAR /home/hduser/Desktop/HIVE/UDTF.jar;

hive> CREATE TEMPORARY FUNCTION fun2 AS 'Myudtf';

hive> SELECT fun2(id) FROM phone;
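The query above should emit one (id, phone_number) row per phone token on each input line. As a sanity check, this hypothetical standalone sketch (the class name ExpectedUdtfOutput is ours; the sample data is held inline rather than read from HDFS) applies the same comma-splitting to the Step 4 data:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the rows SELECT fun2(id) FROM phone should produce:
// each line "id,phone1[,phone2]" expands to one row per phone token.
public class ExpectedUdtfOutput {
	static List<String> expand(String[] lines) {
		List<String> rows = new ArrayList<>();
		for (String line : lines) {
			String[] t = line.split(",");
			// generalizes the UDTF's explicit 2-/3-token cases
			for (int i = 1; i < t.length; i++) {
				rows.add(t[0] + "\t" + t[i]); // id <TAB> phone_number
			}
		}
		return rows;
	}

	public static void main(String[] args) {
		String[] sample = {
			"123,phone1,phone2", "123,phone3", "124,phone1,phone2",
			"125,phone1,phone2", "125,phone3", "126,phone1", "126,phone2,phone3"
		};
		expand(sample).forEach(System.out::println);
	}
}
```

For the seven sample lines this yields eleven rows, e.g. id 123 appears with phone1, phone2, and phone3.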
