hive开窗开窗函数进阶

成长之路

2020-06-25

关注关注

概述

开窗函数的理解参见: 理解hive中的开窗函数

over()中除了可以使用partition by选择分组字段外, 还有以下函数

order by 排序
指定聚合行的范围, 配合order by使用
- current row：当前行
- n PRECEDING：往前 n 行数据
- n FOLLOWING：往后 n 行数据
- UNBOUNDED PRECEDING 表示从前面的起点
- UNBOUNDED FOLLOWING 表示到后面的终点
生成新的列
- LAG(col,n)：往后第 n 行数据
- LEAD(col,n)：往前第 n 行数据
- NTILE(n)：把有序分区中的行分发到指定数据的组中，各个组有编号，编号从 1 开始，对于每一行， NTILE 返回此行所属的组的编号。注意： n 必须为 int 类型。
生成新的列: Rank
- RANK()排序相同时会重复，总数不会变
- DENSE_RANK() 排序相同时会重复，总数会减少
- ROW_NUMBER() 会根据顺序计算

在over()中使用排序

order by

排序

未排序

hive (default)> select name,orderdate,cost, sum(cost) over(partition by name) as sample from business;

name    orderdate    cost    sample
jack    2017-01-01    10    176
jack    2017-02-03    23    176
jack    2017-01-05    46    176
jack    2017-04-06    42    176
jack    2017-01-08    55    176
mart    2017-04-13    94    299
mart    2017-04-08    62    299
mart    2017-04-09    68    299
mart    2017-04-11    75    299

排序

hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate) as sample3 from business;

name    orderdate    cost    sample3
jack    2017-01-01    10    10
jack    2017-01-05    46    56
jack    2017-01-08    55    111
jack    2017-02-03    23    134
jack    2017-04-06    42    176
mart    2017-04-08    62    62
mart    2017-04-09    68    130
mart    2017-04-11    75    205
mart    2017-04-13    94    299

指定开窗聚合的行

`rows between UNBOUNDED PRECEDING and CURRENT ROW`

从起行到当前行的聚合, 而不是对行组内的所有行进行聚合, 配合order by使用

hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample from business;

name    orderdate    cost    sample
jack    2017-01-01    10    10
jack    2017-01-05    46    56
jack    2017-01-08    55    111
jack    2017-02-03    23    134
jack    2017-04-06    42    176
mart    2017-04-08    62    62
mart    2017-04-09    68    130
mart    2017-04-11    75    205
mart    2017-04-13    94    299
neil    2017-05-10    12    12
neil    2017-06-12    80    92
tony    2017-01-02    15    15
tony    2017-01-04    29    44
tony    2017-01-07    50    94

`rows between 1 PRECEDING and current row`

将上一行与本行聚合, 而不是将行组内所有行聚合

hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample  from business;

jack    2017-01-01    10    10
jack    2017-01-05    46    56
jack    2017-01-08    55    101
jack    2017-02-03    23    78
jack    2017-04-06    42    65
mart    2017-04-08    62    62
mart    2017-04-09    68    130
mart    2017-04-11    75    143
mart    2017-04-13    94    169
neil    2017-05-10    12    12
neil    2017-06-12    80    92
tony    2017-01-02    15    15
tony    2017-01-04    29    44
tony    2017-01-07    50    79

rows between 1 PRECEDING AND 1 FOLLOWING

当前行和前一行及后面一行聚合

rows between current row and UNBOUNDED FOLLOWING

当前行及后面所有行

rows between current row and UNBOUNDED FOLLOWING

当前行及后面所有行

生成新的列: LAG, LEAD, NTILE

`LAG(col, n)`

第三个参数用于代替"NLL"

select
    name,orderdate,cost,
    lag(orderdate,1,‘1900-01-01‘) over(partition by name order by orderdate ) as time1,
    lag(orderdate,2) over (partition by name order by orderdate) as time2
from
    business;

结果: time1将orderdate列往后移一行, time2往后移两行

name    orderdate    cost    time1    time2
jack    2017-01-01    10    1900-01-01    NULL
jack    2017-01-05    46    2017-01-01    NULL
jack    2017-01-08    55    2017-01-05    2017-01-01
jack    2017-02-03    23    2017-01-08    2017-01-05
jack    2017-04-06    42    2017-02-03    2017-01-08
mart    2017-04-08    62    1900-01-01    NULL
mart    2017-04-09    68    2017-04-08    NULL
mart    2017-04-11    75    2017-04-09    2017-04-08
mart    2017-04-13    94    2017-04-11    2017-04-09
neil    2017-05-10    12    1900-01-01    NULL
neil    2017-06-12    80    2017-05-10    NULL
tony    2017-01-02    15    1900-01-01    NULL
tony    2017-01-04    29    2017-01-02    NULL
tony    2017-01-07    50    2017-01-04    2017-01-02

`LEAD(col, n)`

同时, 只是往前移指定的行数

`NTILE(n)`

需要配合order by使用

select * from (
    select
        name,orderdate,cost,
        ntile(5) over() sorted
    from
        business
) t;

结果: ntitle(5), 增加一列(组标签), 用于分组, 这里为5组

t.name    t.orderdate    t.cost    t.sorted
jack    2017-01-01    10    1
tony    2017-01-02    15    1
jack    2017-02-03    23    1
tony    2017-01-04    29    2
jack    2017-01-05    46    2
jack    2017-04-06    42    2
tony    2017-01-07    50    3
jack    2017-01-08    55    3
mart    2017-04-08    62    3
mart    2017-04-09    68    4
neil    2017-05-10    12    4
mart    2017-04-11    75    4
neil    2017-06-12    80    5
mart    2017-04-13    94    5

如果分组是行数数以n有余数, 则从上到下每组增加一行, 直到所以行都有组标签.

生成新的列: `Rank`

RANK() 排序相同时会重复，总数不会变
DENSE_RANK() 排序相同时会重复，总数会减少
ROW_NUMBER() 会根据顺序计算

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
name subject score rp drp rmp
sk 数学 95 1 1 1
zs 数学 86 2 2 2
ls 数学 85 3 3 3
ww 数学 56 4 4 4

zs 英语 84 1 1 1
ww 英语 84 1 1 2
ls 英语 78 3 2 3
sk 英语 68 4 3 4

ww 语文 94 1 1 1
sk 语文 87 2 2 2
ls 语文 65 3 3 3
zs 语文 64 4 4 4

hive