Hive: sort by vs. order by

select a.* from pokes a sort by a.foo desc;

http://blog.sina.com.cn/s/blog_6ff05a2c0101eaxf.html

Hive offers not only order by but also sort by. Both perform sorting, but they differ in important ways.

We will reuse the example from the earlier order by post.

Test case

hive> select * from test09;
OK
100	tom
200	mary
300	kate
400	tim
Time taken: 0.061 seconds

hive> select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201105020924_0068, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0068
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0068
2011-05-03 05:39:21,389 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:39:23,410 Stage-1 map = 50%, reduce = 0%
2011-05-03 05:39:25,430 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:39:30,470 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:39:32,493 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0068
OK
100	tom
300	kate
200	mary
400	tim
Time taken: 17.783 seconds

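At first glance the result resembles that of order by, but sort by is not affected by the hive.mapred.mode parameter: it runs whichever mode hive.mapred.mode is set to (in strict mode, by contrast, an order by without a limit clause is rejected).

The line "Number of reduce tasks not specified. Defaulting to jobconf value of: 2" above shows that 2 reducers were started for this query.

What sort by actually guarantees is that the file produced by each reducer is sorted; as the result above shows, the overall order is not guaranteed. A single merge pass over the already-sorted files is then enough to obtain a global order.

That is much better than order by, which does all of the sorting in a single reducer.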
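To make the per-reducer guarantee concrete, here is a minimal Python sketch (not Hive code) that mimics the run above. The assignment of rows to the two reducers is hard-coded to match the observed output; that assignment is an assumption for illustration only and says nothing about how Hive actually picks a reducer for a row.

```python
# Rows of test09 as (id, name). Which reducer receives which row is
# hard-coded (hypothetical) to mirror the two groups seen in the output.
reducer_input = {
    0: [(300, "kate"), (100, "tom")],
    1: [(400, "tim"), (200, "mary")],
}

# sort by: every reducer sorts only the rows it received.
reducer_output = {r: sorted(rows) for r, rows in reducer_input.items()}

# The client simply sees the reducer outputs one after another.
combined = [row for r in sorted(reducer_output) for row in reducer_output[r]]

ids = [row[0] for row in combined]
print(ids)                 # ids in the order the client sees them
print(ids == sorted(ids))  # is the combined result globally sorted?
```

Each reducer's list is ordered, yet the concatenation is not, which is the same pattern as the 100, 300, 200, 400 sequence in the query output.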

Writing the query result to files makes this much easier to see.

hive> insert overwrite local directory '/home/hjl/sunwg/qqq' select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201105020924_0069, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0069
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0069
2011-05-03 05:41:27,913 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:41:30,939 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:41:37,993 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:41:41,023 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0069
Copying data to local directory /home/hjl/sunwg/qqq
Copying data to local directory /home/hjl/sunwg/qqq
4 Rows loaded to /home/hjl/sunwg/qqq
OK
Time taken: 18.496 seconds

[hjl@sunwg src]$ ll /home/hjl/sunwg/qqq
total 8
-rwxrwxrwx 1 hjl hjl 17 May  3 05:41 attempt_201105020924_0069_r_000000_0
-rwxrwxrwx 1 hjl hjl 17 May  3 05:41 attempt_201105020924_0069_r_000001_0

Two files were produced, one per reducer. Let's look at the contents of each.

[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000000_0
100tom
300kate

[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000001_0
200mary
400tim

As you can see, the rows inside each file are in sorted order.
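Because each file is already sorted, a global order can be recovered with one merge pass rather than a full re-sort. The following is a rough Python sketch of that merge, not something Hive runs for you. The two lists stand in for the lines of the reducer files, and the "\x01" (Ctrl-A) field separator is an assumption based on Hive's default delimiter, which cat does not display.

```python
import heapq

SEP = "\x01"  # assumed field delimiter (Hive's default Ctrl-A)


def parse(line):
    """Split one output line into (id, name), with id as an integer."""
    id_text, name = line.rstrip("\n").split(SEP)
    return int(id_text), name


# Stand-ins for the lines of the two reducer files, each sorted by id.
file_0 = ["100\x01tom\n", "300\x01kate\n"]
file_1 = ["200\x01mary\n", "400\x01tim\n"]

# heapq.merge lazily merges inputs that are already sorted, so no
# row is re-sorted; only the heads of the inputs are compared.
merged = list(heapq.merge(map(parse, file_0), map(parse, file_1)))
print(merged)
```

With real files, the lists could be replaced by open file objects, since heapq.merge accepts any sorted iterables.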

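Both order by and sort by can sort data, but which one to use depends on the situation. If the data volume is modest, order by is fine; if the data set is very large, sort by is the better choice.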

Reposted from http://www.oratea.net/?p=624
