Detailed explanation of Hive functions with practical examples

1. Hive built-in functions

1.1 Numerical functions

1. Rounding function: round

Syntax: round(double a)
Return value: BIGINT
Description: returns the BIGINT value of a rounded to the nearest integer

hive> select round(3.1415926) from tableName;
3
hive> select round(3.5) from tableName;
4
hive> create table new_table as select round(9542.158) from tableName;
2. Specified precision rounding function: round

Syntax: round(double a, int d)
Return value: DOUBLE
Description: returns a double rounded to d decimal places

hive> select round(3.1415926,4) from tableName;
3.1416
3. Rounding down function: floor

Syntax: floor(double a)
Return value: BIGINT
Description: returns the largest integer less than or equal to the double argument

hive> select floor(3.1415926) from tableName;
3
hive> select floor(25) from tableName;
25
4. Round up function: ceil

Syntax: ceil(double a)
Return value: BIGINT
Description: returns the smallest integer greater than or equal to the double argument

hive> select ceil(3.1415926) from tableName;
4
hive> select ceil(46) from tableName;
46
5. Rounding up function: ceiling

Syntax: ceiling(double a)
Return value: BIGINT
Note: same function as ceil

hive> select ceiling(3.1415926) from tableName;
4
hive> select ceiling(46) from tableName;
46
6. Take random number function: rand

Syntax: rand(),rand(int seed)
Return value: double
Description: returns a random number in the range 0 to 1. If a seed is specified, a stable (repeatable) sequence of random numbers is produced

hive> select rand() from tableName;
0.5577432776034763
hive> select rand() from tableName;
0.6638336467363424
hive> select rand(100) from tableName;
0.7220096548596434
hive> select rand(100) from tableName;
0.7220096548596434

1.2 Date functions

1. UNIX timestamp to date function: from_unixtime

Syntax: from_unixtime(bigint unixtime[, string format])
Return value: string
Description: converts a UNIX timestamp (seconds since 1970-01-01 00:00:00 UTC) to a time string in the current time zone

hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208
2. Get the current UNIX timestamp function: unix_timestamp

Syntax: unix_timestamp()
Return value: bigint
Description: gets the UNIX timestamp of the current time zone

hive> select unix_timestamp() from tableName;
1323309615
3. Date to UNIX timestamp function: unix_timestamp

Syntax: unix_timestamp(string date)
Return value: bigint
Description: converts a date in the format "yyyy-MM-dd HH:mm:ss" to a UNIX timestamp. If the conversion fails, 0 is returned.

hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063
4. Date in a given format to UNIX timestamp function: unix_timestamp

Syntax: unix_timestamp(string date, string pattern)
Return value: bigint
Description: converts the date in pattern format to UNIX timestamp. If the conversion fails, 0 is returned.

hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063
5. Datetime to date function: to_date

Syntax: to_date(string datetime)
Return value: string
Description: returns the date part of a datetime field.

hive> select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08
6. Date to year function: year

Syntax: year(string date)
Return value: int
Description: returns the year in the date.

hive> select year('2011-12-08 10:03:01') from tableName;
2011
hive> select year('2012-12-08') from tableName;
2012
7. Date to month function: month

Syntax: month (string date)
Return value: int
Description: returns the month in the date.

hive> select month('2011-12-08 10:03:01') from tableName;
12
hive> select month('2011-08-08') from tableName;
8
8. Date to day function: day

Syntax: day (string date)
Return value: int
Description: returns the day of the month in the date.

hive> select day('2011-12-08 10:03:01') from tableName;
8
hive> select day('2011-12-24') from tableName;
24
9. Date to hour function: hour

Syntax: hour (string date)
Return value: int
Description: returns the hour in the date.

hive> select hour('2011-12-08 10:03:01') from tableName;
10
10. Date to minute function: minute

Syntax: minute (string date)
Return value: int
Description: returns the minutes in the date.

hive> select minute('2011-12-08 10:03:01') from tableName;
3

11. Date to second function: second

Syntax: second(string date)
Return value: int
Description: returns the seconds in the date.

hive> select second('2011-12-08 10:03:01') from tableName;
1
12. Date to week function: weekofyear

Syntax: weekofyear (string date)
Return value: int
Description: returns the week number of the year that the date falls in.

hive> select weekofyear('2011-12-08 10:03:01') from tableName;
49
13. Date comparison function: datediff

Syntax: datediff(string enddate, string startdate)
Return value: int
Description: returns the number of days from the end date minus the start date.

hive> select datediff('2012-12-08','2012-05-09') from tableName;
213
14. Date addition function: date_add

Syntax: date_add(string startdate, int days)
Return value: string
Description: returns the date obtained by adding days to the start date.

hive> select date_add('2012-12-08',10) from tableName;
2012-12-18
15. Date subtraction function: date_sub

Syntax: date_sub(string startdate, int days)
Return value: string
Description: returns the date obtained by subtracting days from the start date.

hive> select date_sub('2012-12-08',10) from tableName;
2012-11-28

1.3 Conditional functions

1. If function: if

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Return value: T
Description: when testCondition is TRUE, returns valueTrue; otherwise returns valueFalseOrNull

hive> select if(1=2,100,200) from tableName;
200
hive> select if(1=1,100,200) from tableName;
100
2. Non-null lookup function: COALESCE

Syntax: COALESCE(T v1, T v2, ...)
Return value: T
Description: returns the first non-NULL value among the parameters; if all values are NULL, returns NULL

hive> select COALESCE(null,'100','50') from tableName;
100
3. Condition judgment function: CASE

Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Return value: T
Note: if a is equal to b, then c is returned; if a is equal to d, then e is returned; otherwise, f is returned

hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
hive> Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim
4. Condition judgment function: CASE

Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Return value: T
Note: if a is TRUE, b is returned; if c is TRUE, d is returned; otherwise, e is returned

hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

1.4 String functions

1. String length function: length

Syntax: length(string A)
Return value: int
Description: returns the length of string A

hive> select length('abcedfg') from tableName;
7
2. String inversion function: reverse

Syntax: reverse(string A)
Return value: string
Description: returns the inversion result of string A

hive> select reverse('abcedfg') from tableName;
gfdecba
3. String concatenation function: concat

Syntax: concat(string A, string B, ...)
Return value: string
Description: returns the result of concatenating the input strings; any number of input strings is supported

hive> select concat('abc','def','gh') from tableName;
abcdefgh
4. String concatenation with a specified separator: concat_ws

Syntax: concat_ws(string SEP, string A, string B, ...)
Return value: string
Description: returns the result of concatenating the input strings, with SEP as the separator between them

hive> select concat_ws(',','abc','def','gh') from tableName;
abc,def,gh
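
concat_ws skips NULL arguments rather than returning NULL for the whole result (see also section 1.9.1); a quick sketch:

hive> select concat_ws(',','abc',null,'gh') from tableName;
abc,gh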
5. Substring function: substr, substring

Syntax: substr(string A, int start), substring(string A, int start)
Return value: string
Description: returns the substring of A from position start to the end of the string

hive> select substr('abcde',3) from tableName;
cde
hive> select substring('abcde',3) from tableName;
cde
hive> select substr('abcde',-1) from tableName;   (a negative start counts from the end, as in Oracle)
e
6. Substring function with length: substr, substring

Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
Return value: string
Description: returns the substring of A starting at position start, with length len

hive> select substr('abcde',3,2) from tableName;
cd
hive> select substring('abcde',3,2) from tableName;
cd
hive> select substring('abcde',-2,2) from tableName;
de
7. String to uppercase function: upper, ucase

Syntax: upper(string A) ucase(string A)
Return value: string
Description: returns the uppercase format of string A

hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED
8. String to lowercase function: lower,lcase

Syntax: lower(string A) lcase(string A)
Return value: string
Description: returns the lowercase format of string A

hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed
9. Whitespace trimming function: trim

Syntax: trim(string A)
Return value: string
Description: removes the spaces on both sides of the string

hive> select trim(' abc ') from tableName;
abc
10. Parse url

Syntax:
parse_url(string urlString, string partToExtract [, string keyToExtract])
Return value: string
Description: returns the specified part of the URL. Valid values for partToExtract are: HOST, PATH,
QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from tableName;
www.tableName.com
hive> select parse_url('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') from tableName;
v1
11. Get JSON object function: get_json_object

Syntax: get_json_object(string json_string, string path)
Return value: string
Description: parses the JSON string json_string and returns the content specified by path. Returns NULL if the input JSON string is invalid.

hive> select  get_json_object('{"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy
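
Paths can also index into nested objects and arrays; a quick sketch with a smaller JSON document:

hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}]}}','$.store.fruit[0].weight') from tableName;
8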
12. Repeat string function: repeat

Syntax: repeat(string str, int n)
Return value: string
Description: return str string after n repetitions

hive> select repeat('abc',5) from tableName;
abcabcabcabcabc
13. Split string function: split

Syntax: split(string str, string pat)
Return value: array
Description: splits str around matches of pat and returns the resulting array of strings

hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

1.5 Aggregate functions

1. Number statistics function: count

Syntax: count(*), count(expr), count(DISTINCT expr[, expr])
Return value: Int

Description: count(*) counts the number of retrieved rows, including rows containing NULL values; count(expr) returns the number of non-NULL values of the specified field; count(DISTINCT expr[, expr]) returns the number of distinct non-NULL values of the specified field

hive> select count(*) from tableName;
20
hive> select count(distinct t) from tableName;
10
2. sum statistic function: sum

Syntax: sum(col), sum(DISTINCT col)
Return value: double
Description: sum(col) sums col over the result set; sum(DISTINCT col) sums the distinct values of col over the result set

hive> select sum(t) from tableName;
100
hive> select sum(distinct t) from tableName;
70
3. Average statistical function: avg

Syntax: avg(col), avg(DISTINCT col)
Return value: double
Description: avg(col) computes the average of col over the result set; avg(DISTINCT col) computes the average of the distinct values of col

hive> select avg(t) from tableName;
50
hive> select avg (distinct t) from tableName;
30
4. Minimum statistical function: min

Syntax: min(col)
Return value: double
Description: the minimum value of col field in the statistical result set

hive> select min(t) from tableName;
20
5. Maximum statistical function: max

Syntax: max(col)
Return value: double
Description: the maximum value of col field in the statistics result set

hive> select max(t) from tableName;
120

1.6 Complex type constructors

1. Map type constructor: map

Syntax: map(key1, value1, key2, value2, ...)
Description: builds a map from the given key/value pairs

create table score_map(name string, score map<string,int>)
row format delimited fields terminated by '\t' 
collection items terminated by ',' map keys terminated by ':';

Create the data content as follows and load the data
cd /kkb/install/hivedatas/
vim score_map.txt

zhangsan	mathematics:80,chinese:89,english:95
lisi	chinese:60,mathematics:80,english:99

Load data into hive table
load data local inpath '/kkb/install/hivedatas/score_map.txt' overwrite into table score_map;

map structure data access:

Get all values:
select name,map_values(score) from score_map;

Get all keys:
select name,map_keys(score) from score_map;

Get a value by key:
select name,score["mathematics"] from score_map;

View the number of map elements
select name,size(score) from score_map;
2. Struct type constructor: struct

Syntax: struct(val1, val2, val3, ...)
Note: constructs a struct from the input parameters, similar to a struct in C; internal fields are accessed with dot notation (e.g. info.score). Suppose our data looks like this: the movie ABC was rated by 1254 people, with a score of 7.4

Create the struct table
hive> create table movie_score( name string,  info struct<number:int,score:float> )row format delimited fields terminated by "\t"  collection items terminated by ":"; 

Prepare the data
cd /kkb/install/hivedatas/
vim struct.txt

ABC 1254:7.4  
DEF 256:4.9  
XYZ 456:5.4

Load the data
load data local inpath '/kkb/install/hivedatas/struct.txt' overwrite into table movie_score;

Query the data in hive
hive> select * from movie_score;  
hive> select info.number,info.score from movie_score;  
OK  
1254    7.4  
256     4.9  
456     5.4  
3. Array type construction: array

Syntax: array(val1, val2 )
Description: build array type according to input parameters

hive> create table  person(name string,work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Load data into person table
cd /kkb/install/hivedatas/
vim person.txt

The format of data content is as follows
biansutao   beijing,shanghai,tianjin,hangzhou
linan   changchu,chengdu,wuhan

Loading data
hive > load  data local inpath '/kkb/install/hivedatas/person.txt' overwrite into table person;

Query all data
hive > select * from person;

Query by array subscript
hive > select work_locations[0] from person;

Query all collection data
hive  > select work_locations from person; 

Number of query elements
hive >  select size(work_locations) from person;   

1.7 Collection size functions

1. Map type size function: size(Map<K,V>)

Syntax: size(Map<K,V>)
Return value: int
Description: returns the number of elements in the map

hive> select size(t) from map_table2;
2
2. Array type size function: size(Array<T>)

Syntax: size(Array<T>)
Return value: int
Description: returns the number of elements in the array

hive> select size(t) from arr_table2;
4
3. Type conversion function

Type conversion function: cast
Syntax: cast(expr as <type>)
Return value: <type>
Description: returns expr converted to the specified data type

hive> select cast('1' as bigint) from tableName;
1
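
If the conversion fails, cast returns NULL rather than raising an error:

hive> select cast('abc' as int) from tableName;
NULL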

1.8 explode function

1. Use the explode function to split Map and Array field data in a hive table

lateral view is used together with UDTFs such as split and explode. It can split one row of data into multiple rows, and the split data can then be aggregated. lateral view first calls the UDTF for each row of the original table; the UDTF splits the row into one or more rows; lateral view then combines the results into a virtual table with an alias.
explode can also be used on its own to split a complex array or map column of a hive table into multiple rows

Requirement: data format is as follows
zhangsan    child1,child2,child3,child4 k1:v1,k2:v2
lisi    child5,child6,child7,child8  k3:v3,k4:v4

Fields are separated by \t. All the children need to be split out into a single column:
+----------+--+
| mychild  |
+----------+--+
| child1   |
| child2   |
| child3   |
| child4   |
| child5   |
| child6   |
| child7   |
| child8   |
+----------+--+

Separate the key and value of the map, and the result is as follows

+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+
Step 1: create the hive database

Create the hive database

hive (default)> create database hive_explode;
hive (default)> use hive_explode;
Step 2: create the hive table, and then use explode to split the map and array
create  table hive_explode.t3(name string,
children array<string>,
address Map<string,string>)
row format delimited fields terminated by '\t'  
collection items terminated by ','
map keys terminated by ':' 
stored as textFile;
Step 3: load data

On node03, execute the following command to create the table data file

cd  /kkb/install/hivedatas/

vim maparray
The format of the data content is as follows

zhangsan    child1,child2,child3,child4 k1:v1,k2:v2
lisi    child5,child6,child7,child8 k3:v3,k4:v4

Load data in hive table

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/maparray' into table hive_explode.t3;
Step 4: use explode to separate the data in hive

Separate the data in the array

hive (hive_explode)> SELECT explode(children) AS myChild FROM hive_explode.t3;

Separate the data in the map

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM hive_explode.t3;
2. Using explode to split json strings

Requirements: now there are some data formats as follows:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

The separator between fields is '|'

We need to extract all the monthSales values into the following column (columns to rows)

4900
2090
6987
Step 1: create hive table
hive (hive_explode)> 
create table hive_explode.explode_lateral_view (
area string, 
goods_id string,
sale_info string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS textfile;
Step 2: prepare the data and load it

Prepare the data as follows

cd /kkb/install/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load data into hive table

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/explode_json' overwrite into table hive_explode.explode_lateral_view;
Step 3: use explode to split the Array
hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from hive_explode.explode_lateral_view;
Step 4: use explode to disassemble the Map
hive (hive_explode)> select explode(split(area,',')) as area from hive_explode.explode_lateral_view;
Step 5: disassemble the json field
hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as  sale_info from hive_explode.explode_lateral_view;

Then we want to use get_json_object to extract the values whose key is monthSales:

hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from hive_explode.explode_lateral_view;

This fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions.
A UDTF such as explode cannot be nested inside another function.
Likewise, if you try to select two fields, as in select explode(split(area,',')) as area, goods_id from explode_lateral_view;
it fails with: FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'goods_id'.
When a UDTF is used on its own, only a single expression is supported in the SELECT clause; for anything more, lateral view is required.
3. Using lateral view

Query multiple fields with lateral view

hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;

Here, lateral view explode(split(goods_id,',')) goods acts as a virtual table that is joined with the original table explode_lateral_view as a Cartesian product.

It can also be used multiple times

hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view  LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2;

This is likewise the Cartesian product of the three tables

Finally, we can use the following statement to completely transform a row of json-format data into a two-dimensional table

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,  get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view   LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;

Conclusion:

To work around the restriction that a UDTF may not appear in the select list together with other expressions, lateral view usually appears together with the UDTF.
Multiple lateral views can produce a Cartesian-product-like result.
The OUTER keyword keeps rows for which the UDTF produces no output, emitting NULL instead, to prevent data loss.
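
A minimal sketch of OUTER, reusing the t3 table from above: explode(array()) produces no rows, so without OUTER the query would return nothing, while with OUTER each name is kept with a NULL value.

hive (hive_explode)> SELECT name, mycol FROM hive_explode.t3 LATERAL VIEW OUTER explode(array()) tmp AS mycol;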

1.9 Row/column conversion functions

1.9.1 Rows to columns

1. Description of related functions

CONCAT(string A/col, string B/col, ...): returns the result of concatenating the input strings; any number of input strings is supported;

CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first parameter is the separator for the remaining parameters; the separator can be any string. If the separator is NULL, the return value is also NULL. This function skips any NULL and empty strings after the separator argument. The separator is added between the strings being concatenated;

COLLECT_SET(col): accepts only primitive data types; it deduplicates and aggregates the values of a field, producing an array-type field.

2. Data preparation

Table 6-6 data preparation

name	constellation	blood_type
Sun Wukong	Aries	A
Lao Wang	Sagittarius	A
Song Song	Aries	B
Zhu Bajie	Aries	A
Bing Bing	Sagittarius	A
3. Requirement

Group people with the same constellation and blood type. The expected result is as follows:

Sagittarius,A	Lao Wang|Bing Bing
Aries,A	Sun Wukong|Zhu Bajie
Aries,B	Song Song
4. Create local constellation.txt and import the data

On node03, execute the following command to create the file. Note that the fields are separated by \t

cd /kkb/install/hivedatas
vim constellation.txt
Sun Wukong	Aries	A
Lao Wang	Sagittarius	A
Song Song	Aries	B
Zhu Bajie	Aries	A
Bing Bing	Sagittarius	A
5. Create hive table and import data

Create hive table and load data

hive (hive_explode)> create table person_info(  name string,  constellation string,  blood_type string)  row format delimited fields terminated by "\t";

Loading data

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/constellation.txt' into table person_info;
6. Query the data as required
hive (hive_explode)> select t1.base, concat_ws('|', collect_set(t1.name)) name from    (select name, concat(constellation, "," , blood_type) base from person_info) t1 group by  t1.base;

1.9.2 Columns to rows

1. Function description

EXPLODE(col): split the complex array or map structure in the hive column into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used together with UDTFs such as split and explode; it can split one column of data into multiple rows of data, and the split data can then be aggregated.

2. Data preparation

The data content is as follows; the fields are separated by \t

cd /kkb/install/hivedatas

vim movie.txt
Suspect Tracking	suspense,action,science fiction,plot
Lie to me	suspense,gangster,action,psychology,plot
Wolf Warrior 2	war,action,disaster
3. Requirement

Explode the array data in the movie category column. The expected result is as follows:

Suspect Tracking	suspense
Suspect Tracking	action
Suspect Tracking	science fiction
Suspect Tracking	plot
Lie to me	suspense
Lie to me	gangster
Lie to me	action
Lie to me	psychology
Lie to me	plot
Wolf Warrior 2	war
Wolf Warrior 2	action
Wolf Warrior 2	disaster
4. Create hive table and import data

Create hive table

hive (hive_explode)> create table movie_info(
movie string, 
category array<string>
) 
row format delimited fields terminated by "\t" 
collection items terminated by ",";

Loading data

load data local inpath "/kkb/install/hivedatas/movie.txt" into table movie_info;
5. Query the data as required
hive (hive_explode)>  
select movie, category_name 
from 
movie_info lateral view explode(category) table_tmp as category_name;

1.10 reflect function

The reflect function supports calling Java methods from SQL.

Use max in java.lang.Math to find the maximum of two columns

Create the hive table

hive (hive_explode)>  
create table test_udf(col1 int,col2 int)
row format delimited fields terminated by ',';

Prepare data and load data

cd /kkb/install/hivedatas

vim test_udf

1,2
4,3
6,4
7,5
5,6

Loading data

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/test_udf' overwrite into table test_udf;

Use max in java.lang.Math to find the maximum of two columns

hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;
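
Computed from the sample data above, the expected output would be the row-wise maximum:

2
4
6
7
6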
Different records can invoke different Java built-in methods

Create hive table

hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';

Prepare the data

cd /kkb/install/hivedatas

vim test_udf2

java.lang.Math,min,1,2
java.lang.Math,max,2,3

Loading data

hive (hive_explode)> load data local inpath '/kkb/install/hivedatas/test_udf2' overwrite into table test_udf2;

Execution query

hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;
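
Given the two sample rows, the expected output would be min(1,2) and max(2,3):

1
3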
Judging whether a string is a number

Using the functions in Apache Commons; the commons jars are already on hadoop's classpath, so they can be used directly.

Use as follows:

hive (hive_explode)> select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");

1.11 Analytic (window) functions

1. Introduction to analytic functions

For some complex queries we may need analytic functions, which are mainly used for group top-N, percentage calculations, data slicing, and so on.

2. Introduction of common analysis functions

1. ROW_NUMBER():

Starting from 1, generates the sequence number of each record within its group, e.g. in pv descending order, producing each day's pv rank within the group. ROW_NUMBER() has many uses, such as getting the first-ranked record in a group, or the first referer of a session.

2. RANK():

Generates the rank of a data item within its group; equal values receive the same rank, leaving gaps in the ranking

3. DENSE_RANK():

Generates the rank of a data item within its group; equal values receive the same rank, without gaps in the ranking

4. CUME_DIST:

The number of rows less than or equal to the current value divided by the total number of rows in the group, e.g. the percentage of people whose salary is less than or equal to the current salary

5. PERCENT_RANK:

The RANK value of the current row minus one, divided by the total number of rows in the group minus one

6. NTILE(n):

Divides the ordered rows of a group into n slices and returns the slice number of the current row. If the slices are uneven, the earlier slices get one extra row by default. NTILE does not support ROWS BETWEEN, e.g. NTILE(2) OVER (PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.
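
A minimal sketch of CUME_DIST, PERCENT_RANK and NTILE, which have no example later in this section, run against the cookie_pv table created in the steps below:

SELECT
  cookieid,
  createtime,
  pv,
  NTILE(3)       OVER (PARTITION BY cookieid ORDER BY pv DESC) AS bucket,
  CUME_DIST()    OVER (PARTITION BY cookieid ORDER BY pv)      AS cd,
  PERCENT_RANK() OVER (PARTITION BY cookieid ORDER BY pv)      AS pr
FROM cookie_pv;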

3. Requirement description

The existing data has the following format, with three fields: cookieid, createtime, pv. The requirement is to obtain the top-three pv records for each cookie, i.e. a group top-N query taking the first three values in each group

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
Step 1: create the database table

Create the database table in hive

CREATE EXTERNAL TABLE cookie_pv (
cookieid string,
createtime string, 
pv INT
) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' ;
Step 2: prepare the data and load it

On node03, execute the following commands to create the data and load it into the hive table

cd /kkb/install/hivedatas
vim cookiepv.txt

cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

Load data into hive table

load data local inpath '/kkb/install/hivedatas/cookiepv.txt' overwrite into table cookie_pv;
Step 3: use analytic functions to get the top-three pv records for each cookie
SELECT * FROM (
SELECT 
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
FROM cookie_pv 
) t
WHERE rn1 <= 3;
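
Computed by hand from the sample data, the expected result would be (the relative order of the tied cookie1 rows may vary):

cookie1  2015-04-12  7  1  1  1
cookie1  2015-04-11  5  2  2  2
cookie1  2015-04-15  4  3  3  3
cookie1  2015-04-16  4  3  3  4
cookie2  2015-04-15  9  1  1  1
cookie2  2015-04-16  7  2  2  2
cookie2  2015-04-13  6  3  3  3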

2. Hive custom function

2.1 Basic introduction to user-defined functions

1) Hive comes with some functions, such as max/min, but the number is limited. You can easily extend it by customizing UDF.

2) When the built-in functions provided by Hive cannot meet your business processing needs, you can consider using user-defined functions (UDFs).

3) User-defined functions fall into three categories:

(1) UDF (User-Defined Function)

One in, one out

(2) UDAF (User-Defined Aggregation Function)

Aggregate function; many in, one out

Similar to: count/max/min

(3) UDTF (User-Defined Table-Generating Function)

One in, many out

For example, lateral view explode()

4) Official document address

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

5) Programming steps:

(1) inherit org.apache.hadoop.hive.ql.exec.UDF;

(2) implement the evaluate function, which supports overloading;

6) Precautions

(1) UDF must have a return type, which can return null, but the return type cannot be void;

(2) Hadoop writable types such as Text/LongWritable are commonly used in UDFs; plain java types are not recommended;

2.2 User-defined function development

Step 1: create a maven java project and import the jar packages
<repositories>
    <repository>
        <id>cloudera</id>
 <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0-cdh5.14.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.1.0-cdh5.14.2</version>
    </dependency>
</dependencies>
<build>
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.0</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
     <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-shade-plugin</artifactId>
         <version>2.2</version>
         <executions>
             <execution>
                 <phase>package</phase>
                 <goals>
                     <goal>shade</goal>
                 </goals>
                 <configuration>
                     <filters>
                         <filter>
                             <artifact>*:*</artifact>
                             <excludes>
                                 <exclude>META-INF/*.SF</exclude>
                                 <exclude>META-INF/*.DSA</exclude>
                                 <exclude>META-INF/*.RSA</exclude>
                             </excludes>
                         </filter>
                     </filters>
                 </configuration>
             </execution>
         </executions>
     </plugin>
</plugins>
</build>
Step 2: develop a java class that inherits UDF and overloads the evaluate method
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyUDF extends UDF {
    public Text evaluate(final Text s) {
        if (null == s) {
            return null;
        }
        // return the uppercase form of the input
        return new Text(s.toString().toUpperCase());
    }
}
Step 3: package our project and upload it to hive's lib directory

Use maven package to build the jar, then upload it to the node03 server under the path /kkb/install/hive-1.1.0-cdh5.14.2/lib

Step 4: add our jar package

Rename our jar package

cd /kkb/install/hive-1.1.0-cdh5.14.2/lib
mv original-day_hive_udf-1.0-SNAPSHOT.jar udf.jar

Add our jar package in hive's client

0: jdbc:hive2://node03:10000> add jar /kkb/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;
Step 5: create a function associated with our custom class
0: jdbc:hive2://node03:10000> create temporary function touppercase as 'com.kkb.udf.MyUDF';
Step 6: use custom functions
0: jdbc:hive2://node03:10000> select touppercase('abc');

How to create a permanent function in hive

A temporary function must be re-added every time the hive client starts and disappears when the client exits. Instead, we can create a permanent function that persists.

Create permanent function

1. Specify the database, and create our function under that database
0: jdbc:hive2://node03:10000> use myhive;

2. Use add jar to add our jar package to hive
0: jdbc:hive2://node03:10000> add jar /kkb/install/hive-1.1.0-cdh5.14.2/lib/udf.jar;

3. View all of the added jar packages
0: jdbc:hive2://node03:10000> list jars;

4. Create a permanent function associated with our class
0: jdbc:hive2://node03:10000> create function myuppercase as 'com.kkb.udf.MyUDF';

5. View our permanent function
0: jdbc:hive2://node03:10000> show functions like 'my*';

6. Use the permanent function
0: jdbc:hive2://node03:10000> select myhive.myuppercase('helloworld');

7. Delete the permanent function
0: jdbc:hive2://node03:10000> drop function myhive.myuppercase;

8. View the functions again
0: jdbc:hive2://node03:10000> show functions like 'my*';
