Chinese Word Segmentation and Word Frequency Counting on Hadoop – caigen – 博客园


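First, some recommended background reading: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's piece on ranking how often character names appear in wuxia novels is a lot of fun, so here I follow that example and try it out myself.

This post differs from it in the following ways:

  0) That post uses Hadoop Streaming; this one uses the MapReduce framework directly.

  1) A different Chinese word segmentation method: IKAnalyzer is used here. Its home page is http://code.google.com/p/ik-analyzer/.

  2) The text analyzed here is 《射雕英雄传》 (The Legend of the Condor Heroes). Ha, something had to change.

 

0) Start from the WordCount source code and modify its Map so that it uses IKAnalyzer's segmentation.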

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Decode the line with Text.toString() (UTF-8). Text.getBytes() returns
      // the backing buffer, which is only valid up to getLength(), so reading
      // it in full can pick up stale bytes from a previous record.
      Reader read = new StringReader(value.toString());
      // The second argument (true) turns on IK's "smart" segmentation mode.
      IKSegmenter iks = new IKSegmenter(read, true);
      Lexeme t;
      // Emit (word, 1) for every lexeme the segmenter produces.
      while ((t = iks.next()) != null) {
        word.set(t.getLexemeText());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      // Sum all the partial counts for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(ChineseWordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Summing is associative, so the reducer also serves as the combiner.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
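To make the data flow of the job easier to see, here is a minimal plain-Java sketch that mimics the map and reduce steps in memory, with no Hadoop or IKAnalyzer dependency. Everything in it is an assumption for illustration: the `Segmenter` interface is a hypothetical stand-in for the real segmenter, `whitespaceSegmenter()` only splits on whitespace, and the sample tokens are romanized placeholders rather than real segmenter output.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalWordCountSketch {

    /** Hypothetical stand-in for a real segmenter such as IKSegmenter. */
    interface Segmenter {
        List<String> segment(String line);
    }

    /** Toy segmenter: assumes the line is already split by whitespace. */
    static Segmenter whitespaceSegmenter() {
        return line -> {
            List<String> tokens = new ArrayList<>();
            for (String t : line.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            return tokens;
        };
    }

    /** Map step: emit one (word, 1) pair per token in the line. */
    static List<Map.Entry<String, Integer>> map(String line, Segmenter seg) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : seg.segment(line)) {
            pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    /** Reduce step: sum the values that share the same key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    /** Run the map step over every line, then reduce all emitted pairs. */
    static Map<String, Integer> count(List<String> lines, Segmenter seg) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line, seg));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("guojing xiaodao", "huangrong xiaodao");
        // prints {guojing=1, huangrong=1, xiaodao=2}
        System.out.println(count(lines, whitespaceSegmenter()));
    }
}
```

On a real cluster the pairs are grouped by key across machines between the two steps; the sketch simply collects them in one list.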

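1) So, that part is done, and it runs fine in the local plugin-simulated environment. Package it (together with the segmenter jar) and throw it onto the cluster.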

```shell
hadoop fs -put chinese_in.txt chinese_in.txt
hadoop jar WordCount.jar chinese_in.txt out0
# ...mapping reducing...
hadoop fs -ls ./out0
# The result file lives inside the job's output directory.
hadoop fs -get out0/part-r-00000 words.txt
```

2) Post-processing the data:

2.1) Sorting the data

```shell
head words.txt
tail words.txt
# First attempt: sort on the count column as text
sort -k2 words.txt >0.txt
head 0.txt
tail 0.txt
# Second attempt: the same, reversed
sort -k2r words.txt >0.txt
head 0.txt
tail 0.txt
# Final version: numeric and descending
sort -k2rn words.txt >0.txt
head -n 50 0.txt
```
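The three `sort` attempts differ in how the count column is compared: without `n` the counts are compared as text, so "950" lands after "6427". The sketch below reproduces that difference in plain Java. The class and method names are made up for illustration, the sample rows use romanized placeholder words, and the text comparison is only a rough model of `sort`, whose exact ordering also depends on the locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortByCountSketch {

    /** Second whitespace-separated field of a "word<TAB>count" line. */
    static String countField(String line) {
        return line.trim().split("\\s+")[1];
    }

    /** Roughly what `sort -k2` does: compare the count column as text. */
    static List<String> sortLexicographic(List<String> lines) {
        List<String> out = new ArrayList<>(lines);
        out.sort(Comparator.comparing(SortByCountSketch::countField));
        return out;
    }

    /** Roughly what `sort -k2rn` does: compare counts as numbers, largest first. */
    static List<String> sortNumericDescending(List<String> lines) {
        List<String> out = new ArrayList<>(lines);
        out.sort(Comparator
                .comparingInt((String l) -> Integer.parseInt(countField(l)))
                .reversed());
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("wugong\t950", "guojing\t6427", "shifu\t1080");
        // Text order: 1080, 6427, 950
        System.out.println(sortLexicographic(rows));
        // Numeric descending order: 6427, 1080, 950
        System.out.println(sortNumericDescending(rows));
    }
}
```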

2.2) Extracting the targets

awk '{if(length($1)>=2) print $0}' 0.txt >1.txt
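The filter keeps only words of at least two characters. One thing to watch: whether awk's `length` counts characters or bytes depends on the awk implementation and the locale, and a single Chinese character takes three bytes in UTF-8, so a byte-counting awk would let one-character words through. The following is a minimal Java sketch of the character-counting variant; the class and method names are hypothetical, and the sample strings are written as Unicode escapes only to keep the source ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenLengthFilterSketch {

    /** Length in Unicode characters (code points), not bytes. */
    static int charLength(String token) {
        return token.codePointCount(0, token.length());
    }

    /** Length in UTF-8 bytes, which is what a byte-counting awk would see. */
    static int utf8ByteLength(String token) {
        return token.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Keep "word<TAB>count" lines whose word has at least two characters. */
    static List<String> keepMultiChar(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String word = line.trim().split("\\s+")[0];
            if (charLength(word) >= 2) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // U+7684 is a single-character word; it is 1 character but 3 bytes.
        String single = "\u7684";
        System.out.println(charLength(single) + " char, " + utf8ByteLength(single) + " bytes");
        // Only the two-character word survives the filter.
        List<String> rows = Arrays.asList("\u7684\t9000", "\u90ED\u9756\t6427");
        System.out.println(keepMultiChar(rows).size());
    }
}
```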

2.3) Presenting the results

head -n 50 1.txt | sed = | sed 'N;s/\n//'

```
1郭靖   6427
2黄蓉   4621
3欧阳   1660
4甚么   1430
5说道   1287
6洪七公 1225
7笑道   1214
8自己   1193
9一个   1160
10师父  1080
11黄药师        1059
12心中  1046
13两人  1016
14武功  950
15咱们  925
16一声  912
17只见  827
18他们  782
19心想  780
20周伯通        771
21功夫  758
22不知  755
23欧阳克        752
24听得  741
25丘处机        732
26当下  668
27爹爹  664
28只是  657
29知道  654
30这时  639
31之中  621
32梅超风        586
33身子  552
34都是  540
35不是  534
36如此  531
37柯镇恶        528
38到了  523
39不敢  522
40裘千仞        521
41杨康  520
42你们  509
43这一  495
44却是  478
45众人  476
46二人  475
47铁木真        469
48怎么  464
49左手  452
50地下  448
```
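The rank numbers in the listing come from the two `sed` calls: `sed =` prints each line's number on a line of its own, and `sed 'N;s/\n//'` then joins that number with the line that follows, with no separator in between. A minimal Java sketch of the same idea, with a hypothetical class name and romanized placeholder rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NumberLinesSketch {

    /** Prefix each line with its 1-based position, with no separator. */
    static List<String> numberLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            out.add((i + 1) + lines.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("guojing\t6427", "huangrong\t4621");
        for (String row : numberLines(rows)) {
            System.out.println(row);
        }
    }
}
```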

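Many of the words that are not names are interesting too, for example: 5 说道 (said), 7 笑道 (said with a laugh), 12 心中 (in one's heart), 17 只见 (only to see), 22 不知 (not knowing), 30 这时 (at this moment), 49 左手 (left hand).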

Source URL: http://www.cnblogs.com/jiejue/archive/2012/12/16/2820788.html