caigen, cnblogs (博客园)
Solutions to Problems
Chinese Word Segmentation and Word Frequency Counting on Hadoop in Practice
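First, some recommended background: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's (小虾) write-up on ranking the popularity of character names in wuxia novels is great fun, so here I take it as a model and try the same thing myself.
The differences from that post are:
0) It uses Hadoop Streaming; here I use the MapReduce framework.
1) A different Chinese word segmentation method: here I use IKAnalyzer, whose home page is at http://code.google.com/p/ik-analyzer/.
2) The material here is The Legend of the Condor Heroes (《射雕英雄传》). Ha, something has to change.
0) Take the WordCount source code and modify its Map so that the Map uses IKAnalyzer's segmentation.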
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Text stores UTF-8. toString() decodes only the valid bytes, whereas
      // getBytes() can expose stale bytes past getLength(), and an
      // InputStreamReader without a charset uses the platform default.
      Reader read = new StringReader(value.toString());
      // The second argument (true) turns on IK's smart segmentation mode.
      IKSegmenter iks = new IKSegmenter(read, true);
      Lexeme t;
      // Emit (word, 1) for every lexeme the segmenter finds in this line.
      while ((t = iks.next()) != null) {
        word.set(t.getLexemeText());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      // Sum all partial counts for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(ChineseWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Summation is associative, so the reducer can also serve as the combiner.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
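A note on feeding each line to the segmenter: Hadoop's `Text` holds UTF-8 bytes, and the array returned by `getBytes()` is only valid up to `getLength()`, because the framework may reuse the same buffer across records. The stdlib-only sketch below illustrates that pitfall with a plain byte array standing in for `Text`. The class and method names (`TextBufferDemo`, `reuseBuffer`, and so on) are made up for illustration and are not part of Hadoop or IKAnalyzer.

```java
import java.nio.charset.StandardCharsets;

public class TextBufferDemo {
    // Simulates a reused buffer: a longer record is written first, then a
    // shorter record overwrites only the front, leaving stale bytes behind.
    // Assumes the shorter record needs no more bytes than the longer one.
    static byte[] reuseBuffer(String longer, String shorter) {
        byte[] buf = longer.getBytes(StandardCharsets.UTF_8);
        byte[] fresh = shorter.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(fresh, 0, buf, 0, fresh.length);
        return buf;
    }

    // Decodes the whole backing array, stale tail included.
    static String decodeWhole(byte[] buf) {
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Decodes only the first validLength bytes, i.e. the current record.
    static String decodeValid(byte[] buf, int validLength) {
        return new String(buf, 0, validLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buf = reuseBuffer("郭靖笑道", "黄蓉");
        int validLength = "黄蓉".getBytes(StandardCharsets.UTF_8).length;
        // The stale tail of the previous record leaks into the result.
        System.out.println(decodeWhole(buf));
        // Only the current record is decoded.
        System.out.println(decodeValid(buf, validLength));
        // Decoding UTF-8 bytes with a different charset garbles the text.
        System.out.println(new String(buf, 0, validLength, StandardCharsets.ISO_8859_1));
    }
}
```

In the mapper, either `value.toString()` or the range `(value.getBytes(), 0, value.getLength())` combined with an explicit UTF-8 charset should avoid both problems.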
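1) So, that's done, and it runs OK in the local plugin-simulated environment. Package it (together with the segmentation jar) and send it to the cluster.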
hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0
...mapping reducing...
hadoop fs -ls ./out0
hadoop fs -get out0/part-r-00000 words.txt
2) Post-processing the data:
2.1) Sorting the data
head words.txt
tail words.txt
sort -k2 words.txt >0.txt
head 0.txt
tail 0.txt
sort -k2r words.txt>0.txt
head 0.txt
tail 0.txt
sort -k2rn words.txt>0.txt
head -n 50 0.txt
2.2) Extracting the targets
awk '{if(length($1)>=2) print $0}' 0.txt >1.txt
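For readers who prefer to stay in Java, steps 2.1 and 2.2 can be sketched as below: parse each `word<TAB>count` line, keep words of at least two characters, and sort by count in descending order. This is only an illustrative equivalent, not what was run for this post. The class name `WordCountPostProcess` and its methods are hypothetical, and the single-character row in the sample is made up. Word length is counted in Unicode code points, which sidesteps the fact that awk's `length` may count bytes rather than characters, depending on the awk implementation and locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class WordCountPostProcess {
    // One "word<TAB>count" record from the job output.
    static final class Entry {
        final String word;
        final int count;
        Entry(String word, int count) { this.word = word; this.count = count; }
    }

    // Keeps words with at least minChars characters (Unicode code points,
    // not bytes) and returns them sorted by count, highest first.
    static List<Entry> process(List<String> lines, int minChars) {
        List<Entry> kept = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) continue; // skip blank or malformed lines
            String word = parts[0];
            if (word.codePointCount(0, word.length()) >= minChars) {
                kept.add(new Entry(word, Integer.parseInt(parts[1])));
            }
        }
        kept.sort(Comparator.comparingInt((Entry e) -> e.count).reversed());
        return kept;
    }

    public static void main(String[] args) {
        // The first row is a made-up single-character word to show the filter.
        List<String> sample = Arrays.asList(
            "的\t99999", "洪七公\t1225", "郭靖\t6427", "黄蓉\t4621");
        int rank = 1;
        for (Entry e : process(sample, 2)) {
            System.out.println(rank++ + " " + e.word + "\t" + e.count);
        }
    }
}
```

The filter runs before the sort here, whereas the shell version sorts first; the resulting order is the same either way.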
2.3) Presenting the results
head 1.txt -n 50 | sed = | sed 'N;s/\n//'
1郭靖 6427
2黄蓉 4621
3欧阳 1660
4甚么 1430
5说道 1287
6洪七公 1225
7笑道 1214
8自己 1193
9一个 1160
10师父 1080
11黄药师 1059
12心中 1046
13两人 1016
14武功 950
15咱们 925
16一声 912
17只见 827
18他们 782
19心想 780
20周伯通 771
21功夫 758
22不知 755
23欧阳克 752
24听得 741
25丘处机 732
26当下 668
27爹爹 664
28只是 657
29知道 654
30这时 639
31之中 621
32梅超风 586
33身子 552
34都是 540
35不是 534
36如此 531
37柯镇恶 528
38到了 523
39不敢 522
40裘千仞 521
41杨康 520
42你们 509
43这一 495
44却是 478
45众人 476
46二人 475
47铁木真 469
48怎么 464
49左手 452
50地下 448
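Many of the words that are not character names are interesting too, for example: 5 说道 (said), 7 笑道 (said with a laugh), 12 心中 (in one's heart), 17 只见 (only to see), 22 不知 (not knowing), 30 这时 (at this moment), 49 左手 (left hand).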
Source URL: http://www.cnblogs.com/jiejue/archive/2012/12/16/2820788.html