Spark on YARN Cluster Installation and Deployment

I recently needed a Spark cluster for my graduation project, so I documented the deployment process. Spark officially provides three cluster deployment options: Standalone, Mesos, and YARN. Standalone is the most convenient; this article focuses on deployment with YARN.

Software environment:

Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64)
Hadoop: 2.6.0
Spark: 1.3.0

0. Preliminaries

Everything in this walkthrough is run as a non-root user, so some commands need sudo; if you are running as root, omit sudo. It is best to keep all downloaded software under your home directory, for example in ~/workspace, which is convenient and avoids unnecessary permission problems.
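For example (a minimal sketch, assuming the spark user used throughout this guide):

mkdir -p ~/workspace    # create a working directory for all downloads
cd ~/workspace          # download and extract everything here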

1. Environment Preparation

Modify the Hostnames

We will build a cluster of 1 master and 2 slaves. First modify the hostname with vi /etc/hostname: on the master change it to master, on one slave change it to slave1, and likewise for the other.
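For example, on the master (a sketch for Ubuntu 14.04; the edited /etc/hostname takes effect after a reboot, or immediately via the hostname command):

sudo vi /etc/hostname    # replace the file's contents with: master
sudo hostname master     # apply the new name without rebooting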

Configure hosts

Modify the hosts file on every machine:

<span class="line"><span class="title">vi</span> /etc/hosts</span><br><span class="line"></span><br><span class="line"><span class="number">10.1.1.107</span>      master</span><br><span class="line"><span class="number">10.1.1.108</span>      slave1</span><br><span class="line"><span class="number">10.1.1.109</span>      slave2</span><br>

After configuring, ping the hostnames to check that the mapping has taken effect:

<span class="line"><span class="built_in">ping</span> slave1</span><br><span class="line"><span class="built_in">ping</span> slave2</span><br>

Passwordless SSH Login

Install the OpenSSH server:

<span class="line">sudo apt-get <span class="operator"><span class="keyword">install</span> openssh-<span class="keyword">server</span></span></span><br>

Generate a private/public key pair on every machine:

<span class="line">ssh-keygen -t rsa   <span class="preprocessor">#一路回车</span></span><br>

To let the machines access each other, send each machine's id_rsa.pub to the master node; the public keys can be transferred with scp. For example, on slave1:

<span class="line">scp ~/.ssh/id_rsa<span class="class">.pub</span> spark@master:~/.ssh/id_rsa<span class="class">.pub</span><span class="class">.slave1</span></span><br>

On the master, append all the public keys to the authentication file authorized_keys:

<span class="line">cat ~<span class="regexp">/.ssh/id</span>_rsa.pub* <span class="prompt">&gt;&gt; </span>~<span class="regexp">/.ssh/authorized</span>_keys</span><br>

Distribute the authorized_keys file to every slave:

<span class="line">scp ~<span class="regexp">/.ssh/authorized</span>_keys spark<span class="variable">@master</span><span class="symbol">:~/</span>.ssh/</span><br>

Verify passwordless SSH on every machine:

<span class="line"><span class="title">ssh</span> master</span><br><span class="line">ssh slave1</span><br><span class="line">ssh slave2</span><br>

If the login test fails, you may need to fix the permissions of the authorized_keys file (the permissions matter: insecure permissions will make SSH refuse to use the RSA key):

<span class="line">chmod <span class="number">600</span> ~<span class="regexp">/.ssh/authorized</span>_keys</span><br>

Install Java

Download the latest Java from the official site; Spark officially states that any Java version from 6 up works. I downloaded jdk-7u75-linux-x64.gz. Extract it directly under ~/workspace:

<span class="line">tar -zxvf jdk-<span class="number">7u</span>75-linux-x64.gz</span><br>

Edit the environment variables with sudo vi /etc/profile and add the following, replacing the home path with your own:

<span class="line"><span class="built_in">export</span> WORK_SPACE=/home/spark/workspace/</span><br><span class="line"><span class="built_in">export</span> JAVA_HOME=<span class="variable">$WORK_SPACE</span>/jdk1.<span class="number">7.0</span>_75</span><br><span class="line"><span class="built_in">export</span> JRE_HOME=/home/spark/work/jdk1.<span class="number">7.0</span>_75/jre</span><br><span class="line"><span class="built_in">export</span> PATH=<span class="variable">$JAVA_HOME</span>/bin:<span class="variable">$JAVA_HOME</span>/jre/bin:<span class="variable">$PATH</span></span><br><span class="line"><span class="built_in">export</span> CLASSPATH=<span class="variable">$CLASSPATH</span>:.:<span class="variable">$JAVA_HOME</span>/lib:<span class="variable">$JAVA_HOME</span>/jre/lib</span><br>

Then load the environment variables and verify that Java installed successfully:

<span class="line">$ source /etc/profile   #生效环境变量</span><br><span class="line">$ java -version         #如果打印出如下版本信息,则说明安装成功</span><br><span class="line">java version <span class="string">"1.7.0_75"</span></span><br><span class="line"><span class="function"><span class="title">Java</span><span class="params">(TM)</span></span> SE Runtime Environment (build <span class="number">1.7</span>.<span class="number">0</span>_75-b13)</span><br><span class="line">Java <span class="function"><span class="title">HotSpot</span><span class="params">(TM)</span></span> <span class="number">64</span>-Bit Server VM (build <span class="number">24.75</span>-b04, mixed mode)</span><br>

Install Scala

Spark officially requires Scala 2.10.x, so take care not to download the wrong version; I downloaded 2.10.4 from the official download page (downloading Scala from inside China's great firewall is glacially slow).

Again, extract it under ~/workspace:

<span class="line"><span class="title">tar</span> -zxvf scala-<span class="number">2</span>.<span class="number">10</span>.<span class="number">4</span>.tgz</span><br>

Edit the environment variables again with sudo vi /etc/profile and add the following:

<span class="line"><span class="built_in">export</span> SCALA_HOME=<span class="variable">$WORK_SPACE</span>/scala-<span class="number">2.10</span>.<span class="number">4</span></span><br><span class="line"><span class="built_in">export</span> PATH=<span class="variable">$PATH</span>:<span class="variable">$SCALA_HOME</span>/bin</span><br>

Load the environment variables the same way and verify that Scala installed successfully:

<span class="line">$ source /etc/profile   <span class="comment">#生效环境变量</span></span><br><span class="line">$ scala -<span class="property">version</span>        <span class="comment">#如果打印出如下版本信息,则说明安装成功</span></span><br><span class="line">Scala code runner <span class="property">version</span> <span class="number">2.10</span>.4 <span class="comment">-- Copyright 2002-2013, LAMP/EPFL</span></span><br>

Install and Configure Hadoop YARN

Download and Extract

Download the Hadoop 2.6.0 release from the official site (our university also provides a mirror).

Again, extract it under ~/workspace:

<span class="line">tar -zxvf hadoop-<span class="number">2.6</span>.<span class="number">0</span><span class="class">.tar</span><span class="class">.gz</span></span><br>

Configure Hadoop

cd ~/workspace/hadoop-2.6.0/etc/hadoop to enter the Hadoop configuration directory. The following 7 files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

  1. Configure JAVA_HOME in hadoop-env.sh:

    # The java implementation to use.
    export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75

  2. Configure JAVA_HOME in yarn-env.sh:

    # some Java parameters
    export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75

  3. List the IPs or hostnames of the slave nodes in slaves:

    slave1
    slave2

  4. Modify core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000/</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/home/spark/workspace/hadoop-2.6.0/tmp</value>
        </property>
    </configuration>

  5. Modify hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>master:9001</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/data</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
    </configuration>

  6. Modify mapred-site.xml:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

  7. Modify yarn-site.xml:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8035</value>
        </property>
        <property>
            <name>yarn.resourcemanager.admin.address</name>
            <value>master:8033</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>master:8088</value>
        </property>
    </configuration>

Distribute the configured hadoop-2.6.0 folder to all the slaves:

<span class="line">scp -r ~<span class="regexp">/workspace/hadoop</span>-<span class="number">2.6</span>.<span class="number">0</span> spark<span class="variable">@slave1</span><span class="symbol">:~/workspace/</span></span><br>

Start Hadoop

Run the following on the master to start Hadoop:

<span class="line">cd ~/workspace/hadoop-<span class="number">2.6</span>.0     <span class="comment">#进入hadoop目录</span></span><br><span class="line">bin/hadoop namenode -<span class="built_in">format</span>     <span class="comment">#格式化namenode</span></span><br><span class="line">sbin/<span class="built_in">start</span>-dfs.sh               <span class="comment">#启动dfs </span></span><br><span class="line">sbin/<span class="built_in">start</span>-yarn.sh              <span class="comment">#启动yarn</span></span><br>

Verify That Hadoop Installed Successfully

Use the jps command to check whether each node's processes started correctly. On the master you should see the following processes:

<span class="line"><span class="variable">$ </span>jps  <span class="comment">#run on master</span></span><br><span class="line"><span class="number">3407</span> <span class="constant">SecondaryNameNode</span></span><br><span class="line"><span class="number">3218</span> <span class="constant">NameNode</span></span><br><span class="line"><span class="number">3552</span> <span class="constant">ResourceManager</span></span><br><span class="line"><span class="number">3910</span> <span class="constant">Jps</span></span><br>

On each slave you should see the following processes:

<span class="line"><span class="variable">$ </span>jps   <span class="comment">#run on slaves</span></span><br><span class="line"><span class="number">2072</span> <span class="constant">NodeManager</span></span><br><span class="line"><span class="number">2213</span> <span class="constant">Jps</span></span><br><span class="line"><span class="number">1962</span> <span class="constant">DataNode</span></span><br>

Alternatively, open http://master:8088 in a browser; the Hadoop management UI should come up and show the slave1 and slave2 nodes.

Install Spark

Download and Extract

Download the latest Spark from the official download page. I downloaded spark-1.3.0-bin-hadoop2.4.tgz.

Extract it under ~/workspace:

<span class="line"><span class="title">tar</span> -zxvf spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>-bin-hadoop2.<span class="number">4</span>.tgz</span><br><span class="line">mv spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>-bin-hadoop2.<span class="number">4</span> spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>    <span class="comment">#原来的文件名太长了,修改下</span></span><br>

Configure Spark

<span class="line">cd ~/workspace/spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>/conf    <span class="comment">#进入spark配置目录</span></span><br><span class="line">cp spark-env.sh.<span class="keyword">template</span> spark-env.sh   <span class="comment">#从配置模板复制</span></span><br><span class="line">vi spark-env.sh     <span class="comment">#添加配置内容</span></span><br>

Append the following to the end of spark-env.sh (this is my configuration; adjust it to your own setup):

<span class="line">export <span class="constant">SCALA_HOME</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/scala</span>-<span class="number">2.10</span>.<span class="number">4</span></span><br><span class="line">export <span class="constant">JAVA_HOME</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/jdk</span>1.<span class="number">7.0_75</span></span><br><span class="line">export <span class="constant">HADOOP_HOME</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/hadoop</span>-<span class="number">2.6</span>.<span class="number">0</span></span><br><span class="line">export <span class="constant">HADOOP_CONF_DIR</span>=<span class="variable">$HADOOP_HOME</span>/etc/hadoop</span><br><span class="line"><span class="constant">SPARK_MASTER_IP</span>=master</span><br><span class="line"><span class="constant">SPARK_LOCAL_DIRS</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/spark</span>-<span class="number">1.3</span>.<span class="number">0</span></span><br><span class="line"><span class="constant">SPARK_DRIVER_MEMORY</span>=<span class="number">1</span>G</span><br>

Note: when setting the number of CPU cores and the amount of memory for the Worker processes, keep the machine's actual hardware in mind; if the configuration exceeds what the Worker node actually has, the Worker process will fail to start. See the sketch below.
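For example, each Worker's resources can be capped explicitly with the standard spark-env.sh variables (the values below are illustrative; pick numbers within your nodes' hardware):

SPARK_WORKER_CORES=2      # CPU cores each Worker may use; must not exceed the node's cores
SPARK_WORKER_MEMORY=2g    # total memory each Worker may hand out to executors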

vi slaves and list the slave hostnames in the slaves file:

<span class="line"><span class="title">slave1</span></span><br><span class="line">slave2</span><br>

Distribute the configured spark-1.3.0 folder to all the slaves (one scp per slave, or reuse the loop shown earlier):

<span class="line">scp -r ~<span class="regexp">/workspace/spark</span>-<span class="number">1.3</span>.<span class="number">0</span> spark<span class="variable">@slave1</span><span class="symbol">:~/workspace/</span></span><br>

Start Spark

<span class="line">sbin/<span class="operator"><span class="keyword">start</span>-<span class="keyword">all</span>.sh</span></span><br>

Verify That Spark Installed Successfully

Check with jps; on the master you should see the following processes:

<span class="line"><span class="variable">$ </span>jps</span><br><span class="line"><span class="number">7949</span> <span class="constant">Jps</span></span><br><span class="line"><span class="number">7328</span> <span class="constant">SecondaryNameNode</span></span><br><span class="line"><span class="number">7805</span> <span class="constant">Master</span></span><br><span class="line"><span class="number">7137</span> <span class="constant">NameNode</span></span><br><span class="line"><span class="number">7475</span> <span class="constant">ResourceManager</span></span><br>

On each slave you should see the following processes:

<span class="line"><span class="variable">$jps</span></span><br><span class="line"><span class="number">3132</span> DataNode</span><br><span class="line"><span class="number">3759</span> Worker</span><br><span class="line"><span class="number">3858</span> Jps</span><br><span class="line"><span class="number">3231</span> NodeManager</span><br>

Open Spark's web management page: http://master:8080

Run the Examples

<span class="line">#本地模式两线程运行</span><br><span class="line">./bin/run-example SparkPi 10 --master local[2]</span><br><span class="line"></span><br><span class="line">#Spark Standalone 集群模式运行</span><br><span class="line">./bin/spark-submit \</span><br><span class="line">  -<span class="ruby">-<span class="class"><span class="keyword">class</span> <span class="title">org</span>.<span class="title">apache</span>.<span class="title">spark</span>.<span class="title">examples</span>.<span class="title">SparkPi</span> \</span></span><br><span class="line"></span>  -<span class="ruby">-master <span class="symbol">spark:</span>/<span class="regexp">/master:7077 \</span><br><span class="line"></span></span>  lib/spark-examples-1.3.0-hadoop2.4.0.jar \</span><br><span class="line">  100</span><br><span class="line"></span><br><span class="line">#Spark on YARN 集群上 yarn-cluster 模式运行</span><br><span class="line">./bin/spark-submit \</span><br><span class="line">    -<span class="ruby">-<span class="class"><span class="keyword">class</span> <span class="title">org</span>.<span class="title">apache</span>.<span class="title">spark</span>.<span class="title">examples</span>.<span class="title">SparkPi</span> \</span></span><br><span class="line"></span>    -<span class="ruby">-master yarn-cluster \  <span class="comment"># can also be `yarn-client`</span></span><br><span class="line"></span>    lib/spark-examples*.jar \</span><br><span class="line">    10</span><br>

Note that Spark on YARN supports two run modes, yarn-cluster and yarn-client; see this blog post for the specifics of the difference. Broadly speaking, yarn-cluster suits production, while yarn-client suits interactive use and debugging, that is, when you want to see the application's output quickly.
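For instance, the SparkPi example above can be submitted in yarn-client mode so that its result is printed straight back to the submitting terminal (same example jar as before):

# yarn-client mode: the driver runs locally, so SparkPi's output shows up in this terminal
./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    lib/spark-examples*.jar \
    10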

Source URL: http://wuchong.me/blog/2015/04/04/spark-on-yarn-cluster-deploy/