hive实战之使用csdn600万数据统计csdn前50个常用密码(top50)

下载数据文件,并分析数据

zdg # 12344321 # zdg@csdn.net
LaoZheng # 670203313747 # chengming_zheng@163.com
fstao # 730413 # fstao@tom.com
huwolf # 2535263 # hujiye@263.net

从上面的数据我们可以猜出,依次为用户名、密码、电子邮件。其中使用 # 分隔数据

这里使用正则表达式来匹配*

(\S+)\s+#\s+(\S+)\s+#\s+(\S+)

在hive中创建数据库和表

CREATE DATABASE csdn;

USE csdn;

CREATE TABLE csdn (
username STRING,
password STRING,
email STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\S+)\\s+#\\s+(\\S+)\\s+#\\s+(\\S+)",
"output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE;

注:

上面在hive中创建了csdn数据库,在csdn库中创建了csdn表.

使用正则表达式指定数据分隔符,从源码中我们发现该正则表达式是以字符串的形式传入,所以这里进行了转义

将数据与hive中的表关联

本地数据

LOAD DATA LOCAL INPATH 'www.csdn.net.sql' INTO TABLE csdn;

数据已经在hdfs上存在

LOAD DATA INPATH 'www.csdn.net.sql' INTO TABLE csdn;

添加hive-contrib-0.12.0.jar到hive中

若Hive查询时报Hive ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe解决方法

vi /opt/hive-0.12.0/conf/hive-site.xml

<property>
<name>hive.aux.jars.path</name>
<value>file:///opt/hive-0.12.0/lib/hive-contrib-0.12.0.jar</value>
<description>This JAR file available to all users for all jobs</description>
</property>

执行add jar 命令 将hive-contrib.jar 加入再执行hive语句即可

hive> add jar /opt/hive-0.12.0/lib/hive-contrib-0.12.0.jar;
Added /opt/hive-0.12.0/lib/hive-contrib-0.12.0.jar to class path
Added resource: /opt/hive-0.12.0/lib/hive-contrib-0.12.0.jar

一些简单的测试sql

SELECT username,password,email FROM csdn LIMIT 10; --列出10条记录

SELECT count(*) FROM csdn; --统计总记录数

SELECT password,count(password) FROM csdn GROUP BY password; --按密码分组查询,查询密码,及该密码使用的次数

统计前50个重复密码

SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50;

执行后结果:

123456789	235032
12345678 212751
11111111 76346
dearbook 45899
00000000 34952
123123123 19986
1234567890 17791
88888888 15033
111111111 6995
147258369 5965
987654321 5553
aaaaaaaa 5459
1111111111 5145
66666666 5025
a123456789 4435
11223344 4096
1qaz2wsx 3667
xiazhili 3649
789456123 3610
password 3497
87654321 3281
qqqqqqqq 3277
000000000 3175
qwertyuiop 3143
qq123456 3094
iloveyou 3077
31415926 3061
12344321 2985
0000000000 2886
asdfghjkl 2826
1q2w3e4r 2797
123456abc 2580
0123456789 2578
123654789 2573
12121212 2540
qazwsxedc 2515
abcd1234 2396
12341234 2380
110110110 2348
asdasdasd 2298
22222222 2243
123321123 2166
abc123456 2160
123456 2141
a12345678 2138
123456123 2113
a1234567 2106
1234qwer 2100
qwertyui 1989
123456789a 1986

前面为密码,后面为该密码使用次数

hive执行结果输出

导出到本地文件系统

INSERT OVERWRITE LOCAL DIRECTORY '/var/tmp/top50pwd' 
SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50;

或者导出的时候指定导出文件分隔符

INSERT OVERWRITE LOCAL DIRECTORY '/var/tmp/top50pwd' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50;

导出到hdfs文件系统

INSERT OVERWRITE DIRECTORY '/user/scott/top50pwd'
SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50;

注意:导出到hdfs文件中不能自定义导出文件分隔符,切保证/user/scott已经在hdfs文件系统中存在

进入hive shell的时候直接跟参数执行

我们可以用hive的-e和-f参数来导出数据。其中-e表示后面直接跟带双引号的sql语句;而-f是跟一个文件,文件的内容为一个sql语句。

hive -S -e "USE csdn;SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50" > /var/tmp/top50pwd.txt

top50pwd.sql

USE csdn;
SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50;
hive -S -f top50pwd.sql > /var/tmp/top50pwd.txt

导出结果到Hive的另一个表中

将select的结果放到另一个表格中(首先要用create table创建该表)

CREATE TABLE top50pwd (
password STRING,
times BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE top50pwd
SELECT password,count(password) as c FROM csdn GROUP BY password ORDER BY c DESC LIMIT 50;