Anyone learning Hadoop runs into WordCount, but it is always the simplest example: counting English word frequencies with whitespace as the delimiter. Compared with Chinese, counting English words is easy. Chinese involves semantics and ambiguous word segmentation, so Chinese word frequencies are hard to count well; even today there is no Chinese word-frequency tool that fully meets everyone's expectations. Still, there are usable tools, such as IK Analyzer, so let's give it a try.
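To get a feel for what IK Analyzer's segmentation produces before wiring it into MapReduce, here is a minimal standalone sketch (not from the original post). It assumes the IK Analyzer jar is on the classpath; the sample sentence and the class name IKDemo are only illustrative. The API calls (IKSegmenter, Lexeme) are the same ones used in the MapReduce code below.

import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Minimal sketch: run IK Analyzer on one sentence and print each token.
public class IKDemo {
    public static void main(String[] args) throws IOException {
        String text = "中文分词是统计中文词频的第一步";
        // true = smart mode (coarser tokens); false = fine-grained segmentation
        IKSegmenter seg = new IKSegmenter(new StringReader(text), true);
        Lexeme lex;
        while ((lex = seg.next()) != null) {
            System.out.println(lex.getLexemeText());
        }
    }
}

Each printed line is one lexeme; in the MapReduce job these lexemes become the map output keys.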
Thanks first to the blog post that pointed the way: http://www.cnblogs.com/jiejue/archive/2012/12/16/2820788.html
1. Experiment environment
hadoop 1.2.1
java 1.7
node: only one (single-node setup)
2. Data preparation
The input is the complete novel 《凡人修仙传》, about 20 MB of text; a personal favorite.
3. Experiment process
1) Modify the WordCount code, mainly to apply IK Analyzer's Chinese word segmentation. IK Analyzer is an open-source tool; see http://code.google.com/p/ik-analyzer/
<span class="keyword">import</span> java.io.IOException;</p><p><span class="keyword">import</span> java.io.InputStream;</p><p><span class="keyword">import</span> java.io.InputStreamReader;</p><p><span class="keyword">import</span> java.io.Reader;</p><p><span class="keyword">import</span> java.io.ByteArrayInputStream;</p><p><span class="keyword">import</span> org.wltea.analyzer.core.IKSegmenter;</p><p><span class="keyword">import</span> org.wltea.analyzer.core.Lexeme;</p><p><span class="keyword">import</span> org.apache.hadoop.conf.Configuration;</p><p><span class="keyword">import</span> org.apache.hadoop.fs.Path;</p><p><span class="keyword">import</span> org.apache.hadoop.io.IntWritable;</p><p><span class="keyword">import</span> org.apache.hadoop.io.Text;</p><p><span class="keyword">import</span> org.apache.hadoop.mapreduce.Job;</p><p><span class="keyword">import</span> org.apache.hadoop.mapreduce.Mapper;</p><p><span class="keyword">import</span> org.apache.hadoop.mapreduce.Reducer;</p><p><span class="keyword">import</span> org.apache.hadoop.mapreduce.lib.input.FileInputFormat;</p><p><span class="keyword">import</span> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;</p><p><span class="keyword">import</span> org.apache.hadoop.util.GenericOptionsParser;</p><p><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">ChineseWordCount</span> {</span></p><p> </p><p> <span class="keyword">public</span> <span class="keyword">static</span> <span class="class"><span class="keyword">class</span> <span class="title">TokenizerMapper</span> </p><p> <span class="keyword">extends</span> <span class="title">Mapper</span><<span class="title">Object</span>, <span class="title">Text</span>, <span class="title">Text</span>, <span class="title">IntWritable</span>>{</span></p><p> </p><p> <span class="keyword">private</span> <span class="keyword">final</span> <span class="keyword">static</span> IntWritable one = <span class="keyword">new</span> IntWritable(<span class="number">1</span>);</p><p> <span class="keyword">private</span> Text word = <span class="keyword">new</span> Text();</p><p> </p><p> <span class="keyword">public</span> <span class="keyword">void</span> map(Object key, Text value, Context context</p><p> ) <span class="keyword">throws</span> IOException, InterruptedException {</p><p> </p><p> <span class="keyword">byte</span>[] bt = value.getBytes();</p><p> InputStream ip = <span class="keyword">new</span> ByteArrayInputStream(bt);</p><p> Reader read = <span class="keyword">new</span> InputStreamReader(ip);</p><p> IKSegmenter iks = <span class="keyword">new</span> IKSegmenter(read,<span class="keyword">true</span>);</p><p> Lexeme t;</p><p> <span class="keyword">while</span> ((t = iks.next()) != <span class="keyword">null</span>)</p><p> {</p><p> word.set(t.getLexemeText());</p><p> context.write(word, one);</p><p> }</p><p> }</p><p> }</p><p> </p><p> <span class="keyword">public</span> <span class="keyword">static</span> <span class="class"><span class="keyword">class</span> <span class="title">IntSumReducer</span> </p><p> <span class="keyword">extends</span> <span class="title">Reducer</span><<span class="title">Text</span>,<span class="title">IntWritable</span>,<span class="title">Text</span>,<span class="title">IntWritable</span>> {</span></p><p> <span class="keyword">private</span> IntWritable result = <span class="keyword">new</span> IntWritable();</p><p> <span class="keyword">public</span> <span class="keyword">void</span> reduce(Text 
key, Iterable<IntWritable> values, </p><p> Context context</p><p> ) <span class="keyword">throws</span> IOException, InterruptedException {</p><p> <span class="keyword">int</span> sum = <span class="number">0</span>;</p><p> <span class="keyword">for</span> (IntWritable val : values) {</p><p> sum += val.get();</p><p> }</p><p> result.set(sum);</p><p> context.write(key, result);</p><p> }</p><p> }</p><p> <span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">void</span> main(String[] args) <span class="keyword">throws</span> Exception {</p><p> Configuration conf = <span class="keyword">new</span> Configuration();</p><p> String[] otherArgs = <span class="keyword">new</span> GenericOptionsParser(conf, args).getRemainingArgs();</p><p> <span class="keyword">if</span> (otherArgs.length != <span class="number">2</span>) {</p><p> System.err.println(<span class="string">"Usage: wordcount <in> <out>"</span>);</p><p> System.exit(<span class="number">2</span>);</p><p> }</p><p> Job job = <span class="keyword">new</span> Job(conf, <span class="string">"word count"</span>);</p><p> job.setJarByClass(ChineseWordCount.class);</p><p> job.setMapperClass(TokenizerMapper.class);</p><p> job.setCombinerClass(IntSumReducer.class);</p><p> job.setReducerClass(IntSumReducer.class);</p><p> job.setOutputKeyClass(Text.class);</p><p> job.setOutputValueClass(IntWritable.class);</p><p> FileInputFormat.addInputPath(job, <span class="keyword">new</span> Path(otherArgs[<span class="number">0</span>]));</p><p> FileOutputFormat.setOutputPath(job, <span class="keyword">new</span> Path(otherArgs[<span class="number">1</span>]));</p><p> System.exit(job.waitForCompletion(<span class="keyword">true</span>) ? <span class="number">0</span> : <span class="number">1</span>);</p><p> }</p><p>}
2) To make it easier to watch the job's progress, package the code and run it as a jar. Note that the IK Analyzer jar has to be included as well (for example, by placing it in a lib/ directory inside the job jar). I uploaded my packaged jar, the tool jar, and the test text to a share: http://pan.baidu.com/s/1jGwVSEy
First upload the test file to the input directory on HDFS: hadoop dfs -copyFromLocal part-all.txt input
Then start the job: hadoop jar chinesewordcount.jar input output
Wait for the job to finish; no screenshots here.
3) Post-process the data. The generated output is not sorted by count, so a series of processing steps is still needed.
# Inspect the raw output; words.txt is the job output pulled back from HDFS
# (for example with hadoop dfs -getmerge output words.txt)
head words.txt
tail words.txt

# Sort by the count column (field 2): plain, reverse, then reverse numeric
sort -k2 words.txt > 0.txt
head 0.txt
tail 0.txt
sort -k2r words.txt > 0.txt
head 0.txt
tail 0.txt
sort -k2rn words.txt > 0.txt
head -n 50 0.txt

# Target extraction: keep only entries whose word field has length >= 2
awk '{if(length($1)>=2) print $0}' 0.txt > 1.txt

# Final display: number the top 200 lines
head 1.txt -n 200 | sed = | sed 'N;s/\n//'
4) Results
The output still contains many single characters, which are not very useful, so the final result may still need some manual cleanup. I put the final result on a share; take a look if you are interested: http://pan.baidu.com/s/1hqn66MC
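One possible reason the single characters survive the awk step: depending on the awk implementation and locale, length($1) may count bytes rather than characters, and a single Chinese character is 3 bytes in UTF-8, so it passes the length($1)>=2 filter. An alternative (not from the original post, just a sketch) is to drop single-character lexemes inside the Mapper, where Java's String.length() counts characters. FilteringTokenizerMapper below is a hypothetical variant of the TokenizerMapper above; only the loop body changes.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Hypothetical variant of TokenizerMapper that drops single-character lexemes
// in the map phase instead of filtering with awk afterwards.
public class FilteringTokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Reader read = new InputStreamReader(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()), "UTF-8");
        IKSegmenter iks = new IKSegmenter(read, true);
        Lexeme t;
        while ((t = iks.next()) != null) {
            String lexText = t.getLexemeText();
            // String.length() counts characters, so one Chinese character is length 1
            if (lexText.length() >= 2) {
                word.set(lexText);
                context.write(word, one);
            }
        }
    }
}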
4. Summary
Chinese word segmentation really is complicated; all I can say is keep working at it.
Discussion and learning together are welcome. If you repost, please credit http://hanlaiming.freetzi.com/?p=273