Mahout
12.1 Introduction
Mahout lets developers build intelligent applications more quickly.
12.2 Installation
12.2.1 Requirements
12.2.2 Configuration
    tar -zxvf mahout-distribution-0.7.tar.gz -C /usr/local/cloud/src/
    cd /usr/local/cloud/
    ln -s -f /usr/local/cloud/src/mahout-distribution-0.7 mahout
12.3 Testing
12.3.1 Get the test data
    wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
12.3.2 Upload to the Hadoop cluster
    hadoop fs -mkdir testdata
    hadoop fs -put synthetic_control.data testdata
12.3.3 Test the algorithms
    cd /usr/local/cloud/mahout/
    # canopy
    hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.canopy.Job
    # kmeans
    hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
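12.4 Recommendation
12.4.1 Collaborative filtering
- Introduction to Taste
Taste is Apache Mahout's efficient implementation of collaborative filtering: a scalable, high-performance recommendation engine written in Java. It implements the basic user-based and item-based recommendation algorithms and also exposes extension interfaces, so users can define and implement their own algorithms. Taste is not limited to Java applications: it can run as a component of an internal server and expose its recommendation logic to the outside over HTTP and Web Services. Its design aims to meet enterprise requirements for recommendation-engine performance, flexibility, and scalability.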
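- How Taste works
- Interface design
- DataModel
DataModel is the abstract interface to user preference information. A concrete implementation can extract preferences from any kind of data source. Taste provides MySQLJDBCDataModel for reading data from MySQL over JDBC, and FileDataModel for file-based data sources.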
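- UserSimilarity and ItemSimilarity
UserSimilarity defines the similarity between two users. It is the core of a collaborative-filtering recommendation engine and is used to compute a user's "neighbors", that is, the users whose tastes are similar to the current user's. ItemSimilarity is analogous and defines the similarity between two items.
- UserNeighborhood
UserNeighborhood is used by user-based recommenders, which produce recommendations by finding "neighbor users" whose preferences resemble the current user's. It defines how the neighbor users are determined; implementations are generally built on the values computed by a UserSimilarity.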
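- Recommender
Recommender is the abstract interface of the recommendation engine and the core component of Taste. Given a DataModel, it computes recommendations for different users. In practice the implementations GenericUserBasedRecommender and GenericItemBasedRecommender are used most often; they implement a user-similarity-based engine and an item-similarity-based engine respectively.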
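The interfaces above chain together as similarity, then neighborhood, then recommendation. The following plain-Java sketch mirrors that flow without depending on Mahout; it is not Taste's implementation. The class `UserBasedSketch`, its method names, and the sample ratings are hypothetical, and it assumes Pearson correlation as the similarity and a similarity-weighted average as the rating estimate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for UserSimilarity, UserNeighborhood and Recommender.
class UserBasedSketch {

    // Builds one user's preferences as itemId -> rating.
    static Map<Integer, Double> ratings(int[] items, double[] values) {
        Map<Integer, Double> m = new LinkedHashMap<>();
        for (int i = 0; i < items.length; i++) {
            m.put(items[i], values[i]);
        }
        return m;
    }

    // Made-up data: userId -> (itemId -> rating).
    static Map<Integer, Map<Integer, Double>> sampleData() {
        Map<Integer, Map<Integer, Double>> prefs = new LinkedHashMap<>();
        prefs.put(1, ratings(new int[] {1, 2, 3}, new double[] {5, 3, 4}));
        prefs.put(2, ratings(new int[] {1, 2, 3, 4}, new double[] {4, 2, 3, 5}));
        prefs.put(3, ratings(new int[] {1, 2, 3, 4}, new double[] {1, 5, 2, 1}));
        return prefs;
    }

    // Similarity: Pearson correlation over the items both users rated; 0 if undefined.
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        int n = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            sumA += x; sumB += y; sumAA += x * x; sumBB += y * y; sumAB += x * y;
            n++;
        }
        if (n < 2) {
            return 0.0;
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? 0.0 : cov / denom;
    }

    // Neighborhood: the n users most similar to the given user.
    static List<Integer> nearestN(int n, int user, Map<Integer, Map<Integer, Double>> prefs) {
        final Map<Integer, Double> mine = prefs.get(user);
        List<Integer> others = new ArrayList<>(prefs.keySet());
        others.remove(Integer.valueOf(user));
        others.sort(Comparator.comparingDouble((Integer u) -> -pearson(mine, prefs.get(u))));
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    // Recommendation: similarity-weighted average of the neighbors' ratings for one item.
    static double estimate(int user, int item, List<Integer> neighbours,
                           Map<Integer, Map<Integer, Double>> prefs) {
        double num = 0, den = 0;
        for (int u : neighbours) {
            Double r = prefs.get(u).get(item);
            double s = pearson(prefs.get(user), prefs.get(u));
            if (r == null || s <= 0) {
                continue; // skip neighbors without a rating or without positive similarity
            }
            num += s * r;
            den += s;
        }
        return den == 0 ? Double.NaN : num / den;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> prefs = sampleData();
        List<Integer> neighbours = nearestN(1, 1, prefs);
        System.out.println("neighbours of user 1: " + neighbours);
        System.out.println("estimated rating of item 4: " + estimate(1, 4, neighbours, prefs));
    }
}
```

In this made-up data user 2 rates every shared item exactly one point lower than user 1, so the two correlate perfectly and user 2 becomes the single neighbor whose rating drives the estimate for item 4.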
- Taste demo
- Download the test data
http://www.grouplens.org/node/73
- Copy it to the target directory
    cp ml-1m.zip /usr/local/cloud/mahout/
    cd /usr/local/cloud/mahout/
    # The archive unpacks into the ml-1m/ directory
    unzip ml-1m.zip
    # Movie information file; format: MovieID::MovieName::MovieTags
    cp ml-1m/movies.dat integration/src/main/resources/org/apache/mahout/cf/taste/example/grouplens/
    # Ratings file; format: UserID::MovieID::Rating::Timestamp
    cp ml-1m/ratings.dat integration/src/main/resources/org/apache/mahout/cf/taste/example/grouplens/
    mvn install -DskipTests
- Edit the pom file to add a dependency on mahout-examples
    <dependency>
        <groupId>${project.groupId}</groupId>
        <artifactId>mahout-examples</artifactId>
        <version>0.7</version>
    </dependency>
- Test with Jetty
    cd integration
    mvn jetty:run
Open the following address to see the result: http://hadooptest:8080/mahout-integration/RecommenderServlet?userID=1
- Test from the command line
    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /home/hadoop/cloud/ml-1m/ratings.dat"
- Taste example
    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // The statements below belong inside a method that declares the checked
    // exceptions thrown by these calls (IOException, TasteException).

    // 1. Choose the data source
    // The data format is UserID,MovieID,Rating
    // Use the file-based data model
    DataModel model = new FileDataModel(new File("/Users/matrix/Documents/plan/test/ratings.txt"));
    // 2. Choose the similarity algorithm
    // PearsonCorrelationSimilarity implements UserSimilarity and computes user
    // similarity from the Pearson correlation coefficient.
    // Other implementations include:
    // EuclideanDistanceSimilarity: similarity based on Euclidean distance
    // TanimotoCoefficientSimilarity: similarity based on the Tanimoto coefficient
    // UncenteredCosineSimilarity: cosine similarity
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Optional
    similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
    // 3. Choose the neighbor users
    // NearestNUserNeighborhood implements UserNeighborhood; here it selects the three most similar users.
    // Neighbors can be chosen either as "a fixed number N of nearest neighbors per user"
    // or as "all users whose similarity falls within a given threshold".
    // NearestNUserNeighborhood is the fixed-number implementation;
    // ThresholdUserNeighborhood is the threshold-based one.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(3, similarity, model);
    // 4. Build the recommendation engine
    // GenericUserBasedRecommender implements Recommender and recommends based on user similarity.
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    Recommender cachingRecommender = new CachingRecommender(recommender);
    // Up to 10 recommendations for user 1234
    List<RecommendedItem> recommendations = cachingRecommender.recommend(1234, 10);
    // Print the recommendations
    for (RecommendedItem item : recommendations) {
        System.out.println(item.getItemID() + "\t" + item.getValue());
    }
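The example above reads comma-separated `UserID,MovieID,Rating` lines, while the MovieLens `ratings.dat` file uses `UserID::MovieID::Rating::Timestamp`. One way to bridge the two is a small converter, sketched below with the Java standard library only. The class name `RatingsToCsv`, the input encoding, and the sample line in the comment are assumptions for illustration; both file paths come from the command line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: rewrites "::"-delimited rating lines as CSV, dropping the timestamp.
class RatingsToCsv {

    // Illustrative input "1::1193::5::978300760" becomes "1,1193,5".
    // Returns null when the line has fewer than three fields.
    static String toCsvLine(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            return null;
        }
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    // Converts a whole file and returns the number of lines written.
    static int convert(Path in, Path out) throws IOException {
        int written = 0;
        // ISO-8859-1 is an assumption; it accepts any byte sequence, and the
        // fields copied here are numeric.
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String csv = toCsvLine(line);
                if (csv == null) {
                    continue; // skip blank or malformed lines
                }
                writer.write(csv);
                writer.newLine();
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RatingsToCsv <ratings.dat> <ratings.csv>");
            return;
        }
        int n = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(n + " ratings written");
    }
}
```

The timestamp is dropped because the three-column format named in the example does not include one.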
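12.4.2 Cluster analysis
- Framework design
To group objects, Mahout's clustering algorithms represent each object as a simple data model, the vector, and then form groups by computing the similarity between vectors.
- Data model
Mahout provides several implementations of Vector.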
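- DenseVector
Implemented as an array of floating-point numbers that stores every dimension of the vector; suitable for dense vectors.
- RandomAccessSparseVector
Implemented as a hash map with integer keys and floating-point values; it stores only the non-zero entries of the vector and supports random access.
- SequentialAccessSparseVector
Implemented as two parallel arrays, one of integers and one of floating-point numbers; it also stores only the non-zero entries, but supports sequential access only.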
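The three storage layouts described above can be pictured with plain Java collections. The sketch below is not Mahout's code: the class `VectorLayoutSketch` and its methods are hypothetical, and it assumes the sequential layout keeps its indices sorted in ascending order, which is what makes a single merge-style pass possible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of dense, hash-based sparse and sequential sparse storage.
class VectorLayoutSketch {

    // Dense layout: one array slot per dimension, zeros included.
    static double denseGet(double[] values, int index) {
        return values[index];
    }

    // Random-access sparse layout: index -> value, missing entries read as 0.
    static double hashGet(Map<Integer, Double> values, int index) {
        Double v = values.get(index);
        return v == null ? 0.0 : v;
    }

    // Sequential sparse layout: parallel arrays of (sorted) indices and values.
    // A dot product walks both vectors once, like the merge step of merge sort.
    static double sequentialDot(int[] idxA, double[] valA, int[] idxB, double[] valB) {
        double sum = 0;
        int i = 0, j = 0;
        while (i < idxA.length && j < idxB.length) {
            if (idxA[i] == idxB[j]) {
                sum += valA[i] * valB[j];
                i++;
                j++;
            } else if (idxA[i] < idxB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The same 8-dimensional vector in all three layouts.
        double[] dense = {1, 0, 0, 2, 0, 0, 0, 3};
        Map<Integer, Double> hashed = new HashMap<>();
        hashed.put(0, 1.0);
        hashed.put(3, 2.0);
        hashed.put(7, 3.0);
        int[] idx = {0, 3, 7};
        double[] val = {1, 2, 3};

        System.out.println(denseGet(dense, 3) + " " + hashGet(hashed, 3) + " " + hashGet(hashed, 4));
        System.out.println(sequentialDot(idx, val, new int[] {3, 5, 7}, new double[] {4, 5, 6}));
    }
}
```

The trade-off shown is the usual one: the hash layout answers single lookups cheaply, while the sequential layout favors operations that scan every stored entry in order.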
- Data modeling
To model data as vectors, Mahout provides several ways of vectorizing it.
- Simple integer or floating-point data
Such data is already described as a vector, so it can be stored as one directly.
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // A set of two-dimensional points
    public static final double[][] points = { { 1, 1 }, { 2, 1 }, { 1, 2 },
        { 2, 2 }, { 3, 3 }, { 8, 8 }, { 9, 8 }, { 8, 9 }, { 9, 9 }, { 5, 5 },
        { 5, 6 }, { 6, 6 }};

    // Converts each raw point into a Mahout Vector
    public static List<Vector> getPointVectors(double[][] raw) {
        List<Vector> points = new ArrayList<Vector>();
        for (int i = 0; i < raw.length; i++) {
            double[] fr = raw[i];
            // A RandomAccessSparseVector is used here
            Vector vec = new RandomAccessSparseVector(fr.length);
            // Copy the data into the new Vector
            vec.assign(fr);
            points.add(vec);
        }
        return points;
    }
- Enumerated data
Such data describes an object but takes values from a limited range. For example, the color of an apple may be red, yellow, or green, so numbers can stand for the colors when modeling the data.
| red = 1, yellow = 2, green = 3
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;

    // Builds a list of vectors describing apples
    public static List<Vector> generateAppleData() {
        List<Vector> apples = new ArrayList<Vector>();
        // A NamedVector wraps one of the Vector types above and gives it a readable name.
        // Judging from the names, the last component tracks size (1 = small, 2 = medium,
        // 3 = large), while the middle component (510-650) does not use the 1/2/3 color
        // codes listed above.
        NamedVector apple = new NamedVector(new DenseVector(new double[] {0.11, 510, 1}), "Small round green apple");
        apples.add(apple);
        apple = new NamedVector(new DenseVector(new double[] {0.2, 650, 3}), "Large oval red apple");
        apples.add(apple);
        apple = new NamedVector(new DenseVector(new double[] {0.09, 630, 1}), "Small elongated red apple");
        apples.add(apple);
        apple = new NamedVector(new DenseVector(new double[] {0.18, 520, 2}), "Medium oval green apple");
        apples.add(apple);
        return apples;
    }
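- Text data
The vector space model is the most widely used model in information retrieval. It represents a text as a vector in which each dimension holds the weight of one word that occurs in the text.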
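As a minimal sketch of the vector space model, the code below assigns each distinct word a dimension and uses the raw term count as its weight. This is only one possible weighting (TF-IDF is a common alternative) and does not reproduce Mahout's text vectorization. The class `TermVectorSketch`, the tokenization rule, and the sample sentences are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration: documents as term-frequency vectors.
class TermVectorSketch {

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Gives every distinct word its own dimension, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<String> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String term : tokenize(doc)) {
                if (!dict.containsKey(term)) {
                    dict.put(term, dict.size());
                }
            }
        }
        return dict;
    }

    // Weight of a dimension = number of times its word occurs in the document.
    static double[] termFrequencies(String doc, Map<String, Integer> dict) {
        double[] vector = new double[dict.size()];
        for (String term : tokenize(doc)) {
            Integer dim = dict.get(term);
            if (dim != null) {
                vector[dim] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the red apple", "the green apple is green");
        Map<String, Integer> dict = buildDictionary(docs);
        System.out.println(dict);
        for (String doc : docs) {
            System.out.println(Arrays.toString(termFrequencies(doc, dict)));
        }
    }
}
```

Because most documents use only a small part of the vocabulary, vectors built this way are mostly zeros, which is why sparse vector implementations suit text data.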
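- Common algorithms
- K-means clustering
- Principle
Given a data set of N objects, build K partitions of the data, each partition being one cluster, with K <= N. Two requirements must hold: (1) every partition contains at least one object; (2) every object belongs to exactly one partition.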
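- Process
First create an initial partitioning: choose K objects at random, each of which initially represents the center of one partition, then assign every other object to the partition whose center is nearest. Next, relocate iteratively, trying to improve the partitioning by moving objects between partitions. Relocation means recomputing the center of a partition whenever an object joins or leaves it. The process repeats until the membership of every partition stops changing.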
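- Strengths and weaknesses
K-means works well when the resulting clusters are dense and clearly separated from one another. Its complexity is O(NKt), where t is the number of iterations.
However, it requires the user to supply K in advance, and K is usually chosen from experience or from repeated experiments. K-means is also sensitive to outliers: a small number of them can strongly distort the mean values.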
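The assign-then-recompute loop described above can be sketched in plain Java as follows. This is an illustration, not Mahout's implementation: the class `KMeansSketch` and its methods are hypothetical, and for reproducibility it takes the first K points as the initial centers instead of choosing them at random. The point set is the two-dimensional one used elsewhere in this section.

```java
import java.util.Arrays;

// Hypothetical illustration of the K-means loop: assign, recompute centers, repeat.
class KMeansSketch {

    static final double[][] POINTS = { {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3}, {8, 8},
        {9, 8}, {8, 9}, {9, 9}, {5, 5}, {5, 6}, {6, 6} };

    // Squared Euclidean distance (the square root is not needed for comparisons).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Runs K-means; `centres` holds the initial centers and is updated in place.
    // Returns, for each point, the index of the cluster it ends up in.
    static int[] cluster(double[][] points, double[][] centres, int maxIter) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 1: assign every point to its nearest center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centres.length; c++) {
                    if (squaredDistance(points[p], centres[c])
                            < squaredDistance(points[p], centres[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // memberships are stable, so the centers would not move
            }
            // Step 2: move each center to the mean of its members.
            for (int c = 0; c < centres.length; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[p][d];
                        }
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) {
                        centres[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Assumption: the first two points serve as initial centers (K = 2).
        double[][] centres = { POINTS[0].clone(), POINTS[1].clone() };
        int[] assignment = cluster(POINTS, centres, 10);
        System.out.println("assignment: " + Arrays.toString(assignment));
        System.out.println("centres:    " + Arrays.deepToString(centres));
    }
}
```

Each pass costs one distance computation per point and center, which is where the O(NKt) bound comes from. Because every center is a plain mean, a single far-away point added to the data would pull its cluster's center toward it, which is the outlier sensitivity noted above.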
- Examples
- In-memory, single-machine application (version 0.5)
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.clustering.kmeans.Cluster;
    import org.apache.mahout.clustering.kmeans.KMeansClusterer;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.Vector;

    /**
     * In-memory K-means clustering
     */
    public static void kMeansClusterInMemoryKMeans(){
        // Number of clusters
        int k = 2;
        // Maximum number of K-means iterations
        int maxIter = 3;
        // Distance threshold used as the convergence criterion
        double distanceThreshold = 0.01;
        // Distance measure; Euclidean distance is used here
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        // Build the vectors from the two-dimensional point set
        List<Vector> pointVectors = getPointVectors(points);
        // Randomly pick k vectors as the initial cluster centers
        // (chooseRandomPoints is a helper that is not shown in this section)
        List<Vector> randomPoints = chooseRandomPoints(pointVectors, k);
        // Build the clusters from the chosen centers
        List<Cluster> clusters = new ArrayList<Cluster>();
        int clusterId = 0;
        for(Vector v : randomPoints){
            clusters.add(new Cluster(v, clusterId ++, measure));
        }
        // Run K-means by calling KMeansClusterer.clusterPoints
        List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(pointVectors, clusters, measure, maxIter, distanceThreshold);
        // Print the clusters of the last iteration
        for(Cluster cluster : finalClusters.get(finalClusters.size() -1)) {
            System.out.println("Cluster id: " + cluster.getId() + " center: " + cluster.getCenter().asFormatString());
            System.out.println("\tPoints: " + cluster.getNumPoints());
        }
    }
- Hadoop-based cluster application (version 0.5)
Note: first add the following dependencies to the Maven project.
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.0.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-core</artifactId>
        <version>0.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-utils</artifactId>
        <version>0.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.mahout</groupId>
        <artifactId>mahout-math</artifactId>
        <version>0.5</version>
    </dependency>
Next, some configuration is needed before running on the cluster:
    # Set the classpath in $HADOOP_HOME/conf/hadoop-env.sh
    export MAHOUT_HOME=/usr/local/cloud/mahout
    for f in $MAHOUT_HOME/lib/*.jar; do
        HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$f;
    done
    for f in $MAHOUT_HOME/*.jar; do
        HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$f;
    done
You can then test the following code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.RandomUtils;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.utils.clustering.ClusterDumper;

/**
 * Hadoop-based implementation of K-means clustering.
 * @throws Exception
 */
public static void kMeansClusterUsingMapReduce() throws Exception {
    Configuration conf = new Configuration();
    // Choose a distance measure; Euclidean distance is used here.
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    // The Hadoop-based implementation identifies its data source
    // through input and output file paths.
    Path testpoints = new Path("testpoints");
    Path output = new Path("output");
    // Remove any data left under the input and output paths.
    HadoopUtil.delete(conf, testpoints);
    HadoopUtil.delete(conf, output);
    RandomUtils.useTestSeed();
    // Generate the point set under the input path. Unlike the in-memory
    // approach, every vector has to be written to a file.
    // writePointsToFile is a helper that is not shown in this listing.
    writePointsToFile(testpoints);
    // Number of clusters to produce; 2 in this example.
    int k = 2;
    // Maximum number of K-means iterations.
    int maxIter = 3;
    // Convergence threshold on how far a center may still move.
    double distanceThreshold = 0.01;
    // Randomly pick k points as the initial cluster centers.
    Path clusters = RandomSeedGenerator.buildRandom(conf, testpoints, new Path(output, "clusters-0"), k, measure);
    // Call KMeansDriver.run to execute K-means.
    // The two trailing flags are runClustering and runSequential.
    KMeansDriver.run(testpoints, clusters, output, measure, distanceThreshold, maxIter, true, true);
    // Print the result with ClusterDumper.printClusters. The path below reads
    // clusters-(maxIter - 1), so it assumes the run used every iteration.
    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-" + (maxIter - 1)), new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(null);
}
```
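To make the roles of `k`, `maxIter` and `distanceThreshold` in the listing above more concrete, here is a minimal plain-Java sketch of the K-means loop. It is not Mahout code and does not reproduce Mahout's internals: the class `KMeansSketch`, its method names, the sample points and the fixed initial centers are all assumptions made for illustration. The number of clusters is simply the number of initial centers passed in.

```java
import java.util.Arrays;

/** Plain-Java illustration of the K-means loop; hypothetical, not part of Mahout. */
final class KMeansSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to the given point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs at most maxIter iterations and stops early once no center
     * moves farther than convergenceDelta. Returns the final centers.
     */
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              int maxIter, double convergenceDelta) {
        int k = initialCenters.length;
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: add each point to the sum of its nearest center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (distance(updated, centers[c]) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] seeds = {{1, 1}, {8, 8}}; // k = 2, fixed seeds for repeatability
        System.out.println(Arrays.deepToString(cluster(points, seeds, 3, 0.01)));
    }
}
```

The sketch uses fixed seeds instead of random ones so that repeated runs give the same centers; the Mahout listing gets a similar effect for testing through `RandomUtils.useTestSeed()`.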
Hadoop-based cluster application (version 0.7):
```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
// RandomPointsUtil and ClusterHelper are helper classes that are not shown
// in this section; import them from wherever they live in your project.

public static void kMeansClusterUsingMapReduce() throws IOException, InterruptedException,
        ClassNotFoundException {
    Configuration conf = new Configuration();
    // Choose a distance measure; Euclidean distance is used here.
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    File testData = new File("input");
    if (!testData.exists()) {
        testData.mkdir();
    }
    // The Hadoop-based implementation identifies its data source
    // through input and output file paths.
    Path samples = new Path("input/file1");
    // Generate the point set under the input path; every vector
    // has to be written to a file.
    List<Vector> sampleData = new ArrayList<Vector>();
    RandomPointsUtil.generateSamples(sampleData, 400, 1, 1, 3);
    RandomPointsUtil.generateSamples(sampleData, 300, 1, 0, 0.5);
    RandomPointsUtil.generateSamples(sampleData, 300, 0, 2, 0.1);
    ClusterHelper.writePointsToFile(sampleData, conf, samples);
    // Output path; clear anything left from a previous run.
    Path output = new Path("output");
    HadoopUtil.delete(conf, output);
    // Number of clusters to produce; 3 in this example.
    int k = 3;
    // Maximum number of K-means iterations.
    int maxIter = 10;
    // Convergence threshold on how far a center may still move.
    double distanceThreshold = 0.01;
    // Randomly pick k points as the initial cluster centers.
    Path clustersIn = new Path(output, "random-seeds");
    RandomSeedGenerator.buildRandom(conf, samples, clustersIn, k, measure);
    // Call KMeansDriver.run to execute K-means. The trailing arguments are
    // runClustering, clusterClassificationThreshold and runSequential.
    KMeansDriver.run(samples, clustersIn, output, measure, distanceThreshold, maxIter, true, 0.0, true);
    // Print the clusters from the last iteration.
    List<List<Cluster>> clusters = ClusterHelper.readClusters(conf, output);
    for (Cluster cluster : clusters.get(clusters.size() - 1)) {
        System.out.println("Cluster id: " + cluster.getId() + " center: " + cluster.getCenter().asFormatString());
    }
}
```
The output is:

```
Cluster id: 997 center: {1:3.6810451340150467,0:3.8594229542914538}
Cluster id: 998 center: {1:2.068611196044424,0:-0.5471173292759096}
Cluster id: 999 center: {1:-0.6392433868275759,0:1.2972649625289365}
```
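Once the centers are known, a new point can be assigned to the cluster whose center is nearest under the same Euclidean distance. The sketch below hard-codes the three centers printed above, reading each entry as an `index:value` pair (so `{1:a,0:b}` is taken to mean the point `(b, a)`). The class `NearestCenter` and its methods are hypothetical names for illustration and are not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Assigns a 2-D point to the nearest of the printed centers; hypothetical helper. */
final class NearestCenter {

    /** Centers copied from the output above, keyed by cluster id, as {x0, x1}. */
    static Map<Integer, double[]> printedCenters() {
        Map<Integer, double[]> centers = new LinkedHashMap<>();
        centers.put(997, new double[]{3.8594229542914538, 3.6810451340150467});
        centers.put(998, new double[]{-0.5471173292759096, 2.068611196044424});
        centers.put(999, new double[]{1.2972649625289365, -0.6392433868275759});
        return centers;
    }

    /** Squared Euclidean distance; enough for comparing which center is closer. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the id of the cluster whose center is closest to the point. */
    static int assign(double[] point, Map<Integer, double[]> centers) {
        int bestId = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, double[]> entry : centers.entrySet()) {
            double d = squaredDistance(point, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                bestId = entry.getKey();
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> centers = printedCenters();
        System.out.println("(0.0, 2.0) -> cluster " + assign(new double[]{0.0, 2.0}, centers));
        System.out.println("(4.0, 4.0) -> cluster " + assign(new double[]{4.0, 4.0}, centers));
    }
}
```

Comparing squared distances avoids a square root per center and gives the same ordering as the Euclidean distance itself.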
12.4.3 Classification Analysis¶
Source URL: http://hadoop.readthedocs.org/en/latest/Hadoop-Mahout.html