Sometimes, it is necessary to extract the next row in the query function to perform computation between rows.
Hive 0.11.x version has included a Windowing and Analytics module with Java map-reduce UDFs. However, there is another possible approach to replicate the lead function in Hive using the custom reduce script.
The approach is simple in the way that we could force the entire mapped data to forced into a single reducer for each key, value pair using the following hive script.
use package; ADD FILE ./reducer.py; drop table lead_table; set mapred.reduce.tasks=1; create table lead_table as select transform (a.column1, a.column2) using 'python reducer.py' as a_column1, a_column2, lead_column3 from (select column1, column2 from source_table cluster by column1 ) a;
There are two tricks here:
- We skip the map phase altogether by using SELECT TRANSFORM and not MAP USING
- cluster by column1 – enforces the key-value pair to go into one single reducer.
Caveat: More than one key may end up in the same reducer. So, the below program has to be modified to account for switch in the keys.
Below is the reducer python script.
#!/usr/bin/python import sys import ast from datetime import datetime count = 0 for line_out in sys.stdin: str1 =  str =  line_split = line_out.strip().split('\t') if count == 0: # prev_time = datetime.strptime(line_split.split("."),'%Y-%m-%d %H:%M:%S') # print time.strftime("%b %d %Y %H:%M:%S", time.gmtime(prev_time)) # Ignore the first line count +=1 # retain the first timestamp prev_time = line_split elif count >= 1: # print the key str.append(line_split) str.append('\t') # Print the start time str.append(prev_time) str.append('\t') # Print the end time str.append(line_split) print "".join(str) # retain the current timestamp as the start timestamp prev_time = line_split # print the last line str =  str.append(line_split) str.append('\t') str.append(line_split) str.append('\t') str.append('') print "".join(str)