Hive之窗口函数

普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。
因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每行都返回一个值。
简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。 开窗函数一般分为两类,聚合开窗函数和排序开窗函数。

OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
LAG(col,n):往前第n行数据
LEAD(col,n):往后第n行数据
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。

测试数据:

name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

需求:
(1)查询在2017年4月份购买过的顾客及总人数
(2)查询顾客的购买明细及月购买总额
(3)上述的场景,要将cost按照日期进行累加
(4)查询顾客上次的购买时间
(5)查询前20%时间的订单信息

建表、导数:

create table business(name string, orderdate string, cost int)
row format delimited fields terminated by ‘,‘;

load data local inpath "/opt/data/business.txt" into table business;

(1)查询在2017年4月份购买过的顾客及总人数

select
name, count(*) as cnt
from business where date_format(orderdate,‘yyyy-MM‘)=‘2017-04‘
group by name;

Hive之窗口函数

(2)查询顾客的购买明细及月购买总额

select
name,
orderdate,
cost,
sum(cost) over (partition by month(orderdate)) as month_sum
from business order by name,orderdate;

Hive之窗口函数

(3)上述的场景,要将cost按照日期进行累加

select
name,orderdate,cost,
sum(cost) over (partition by name order by orderdate rows between unbounded preceding and current row) as accumulate_cost,
--按name分组,按orderdate排序,求从组内首行到当前行之和

sum(cost) over(partition by name order by orderdate) as accumulate_cost2,
--效果和上个一样

sum(cost) over () as total_cost,
--所有cost求和

sum(cost) over (partition by name) as total_cost_by_person,
--每个name组内之和

sum(cost) over (partition by name order by orderdate rows between 1 preceding and current row) as double_cost,
--组内按orderdate排序,前一行和当前行之和

sum(cost) over (partition by name order by orderdate rows between 1 preceding and
1 following) as triple_cost,
--组内按orderdate排序,前一行、当前行、下一行,三行之和

sum(cost) over (partition by name order by orderdate rows between current row and
unbounded following) as total_next_cost
--组内按orderdate排序,当前行至组内尾行之和
from business;

Hive之窗口函数

(4)查询顾客上次的购买时间

select
name,
orderdate,
lag(orderdate,1) over(partition by name order by orderdate) as time2
from business;

Hive之窗口函数

(5)查询前20%时间的订单信息,

ntile(n)用于将分组数据按照顺序切分成n片,返回当前记录所在的切片值,经常用来取前30% 带有百分之多少比例的记录什么的。

select name,orderdate,cost,
ntile(5) over (order by orderdate) sorted
from business;

Hive之窗口函数

 查询20%即sorted=1

select t.* from (
select
  name,
  orderdate,
  cost,
  ntile(5) over (order by orderdate) sorted
from business
) t
where t.sorted=1;

Hive之窗口函数

 扩展:
求前40%时间和后60%时间内的平均cost

--<1>按照orderdate升序拆分成5分

select
name,
orderdate,
cost,
ntile(5) over (order by orderdate) as prt
from business;

Hive之窗口函数

 --<2>前2份prt标识为1,是前40%的时间;后3份prt标识为2,是后60%的时间

select 
t.name,
t.orderdate,
t.cost,
case when prt in (1,2) then 1 else 2 end as new_prt
from(
    select 
    name,
    orderdate,
    cost,
    ntile(5) over (order by orderdate) as prt
    from business
    ) t;

Hive之窗口函数

--<3>--取每个new_prt的平均值

select 
new_prt,
max(case when new_prt=1 then "first_40%_avg_cost" when new_prt=2 then "last_60%_avg_cost" end) as prt_cost_name,
cast(avg(cost)as decimal(10,2)) as avg_cost
from(
    select 
    t.name,
    t.orderdate,
    t.cost,
    case when prt in (1,2) then 1 else 2 end as new_prt
    from(
        select 
        name,
        orderdate,
        cost,
        ntile(5) over (order by orderdate) as prt
        from business
         ) t
    ) q
group by q.new_prt;

Hive之窗口函数