Hive取非Group by字段数据的方法 – 推酷

遇到这么一个需求,输入数据为一个ID对应多个name,要求输出数据为ID是唯一的,name随便取一个就可以。

执行以下hive ql语句:

<span class="operator"><span class="keyword">SELECT</span> </p><p>  sid,</p><p>  class_id </p><p><span class="keyword">FROM</span></p><p>  table2 </p><p><span class="keyword">GROUP</span> <span class="keyword">BY</span> sid ;</span>

SELECT

  

sid

,

  

class_id

FROM

  

table2

GROUP

BY

sid

;

会报错:

<span class="constant">FAILED</span><span class="symbol">:</span> <span class="constant">Error</span> <span class="keyword">in</span> semantic <span class="symbol">analysis:</span> <span class="constant">Line</span> <span class="number">1</span><span class="symbol">:</span><span class="number">18</span> <span class="constant">Expression</span> <span class="keyword">not</span> <span class="keyword">in</span> <span class="constant">GROUP</span> <span class="constant">BY</span> key <span class="string">'class_id'</span>

FAILED

:

Error

in

semantic

analysis

:

Line

1

:

18

Expression

not

in

GROUP

BY

key

‘class_id’

查了一下,HIVE有这么一个函数:

collect_<span class="operator"><span class="keyword">set</span>(col)</p><p>返回类型:array</p><p>解释:返回一个去重后的对象集合</span>

collect_set

(

col

)

返回类型: array

解释:返回一个去重后的对象集合

将上述的QL语句改一下:

<span class="operator"><span class="keyword">select</span> sid,collect_set(class_id) <span class="keyword">from</span> table2 <span class="keyword">group</span> <span class="keyword">by</span> sid;</span>

select

sid

,

collect_set

(

class_id

)

from

table2

group

by

sid

;

结果是这样的:

<span class="number">1</span> [<span class="number">11</span>,<span class="number">12</span>,<span class="number">13</span>]</p><p><span class="number">2</span> [<span class="number">11</span>,<span class="number">14</span>]</p><p><span class="number">3</span> [<span class="number">12</span>,<span class="number">15</span>]</p><p><span class="number">4</span> [<span class="number">12</span>,<span class="number">13</span>]</p><p><span class="number">5</span> [<span class="number">16</span>,<span class="number">14</span>]</p><p><span class="number">7</span> [<span class="number">13</span>,<span class="number">15</span>]

1

[

11

,

12

,

13

]

2

[

11

,

14

]

3

[

12

,

15

]

4

[

12

,

13

]

5

[

16

,

14

]

7

[

13

,

15

]

这个时候,我们就可以针对第二列做一些计数、求和操作,分别对应到Hive的聚合函数count、sum。

对应到本文的目的,直接从数组获取第一个元素就达到目的了,这样做:

<span class="operator"><span class="keyword">select</span> sid,collect_set(class_id)[<span class="number">0</span>] <span class="keyword">from</span> table2 <span class="keyword">group</span> <span class="keyword">by</span> sid;</span>

select

sid

,

collect_set

(

class_id

)

[

0

]

from

table2

group

by

sid

;

结果如下:

1 11</p><p>2 11</p><p>3 12</p><p>4 12</p><p>5 16</p><p>7 13

1

11

2

11

3

12

4

12

5

16

7

13

总结:

  1. Hive不允许直接访问非group by字段;
  2. 对于非group by字段,可以用Hive的collect_set函数收集这些字段,返回一个数组;
  3. 使用数字下标,可以直接访问数组中的元素;

参考文章:http://wangjunle23.blog.163.com/blog/static/117838171201310222309391/

本文地址:http://www.crazyant.net/1600.html

来源URL:http://www.tuicool.com/articles/q6JNf2