I recently needed a Spark cluster for my graduation project, so I documented the deployment process. Spark officially offers three cluster deployment options: Standalone, Mesos, and YARN. Standalone is the most convenient, but this post mainly covers deploying on YARN.
Software environment:
Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64)
Hadoop: 2.6.0
Spark: 1.3.0
0. Before You Start
Everything in this walkthrough runs as a non-root user, so some commands need sudo; if you are running as root, simply omit the sudo. I recommend keeping all downloaded software under your home directory, e.g. in ~/workspace. It is convenient and avoids unnecessary permission trouble.
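For example, the layout assumed throughout this post (the spark user and the ~/workspace path are just this walkthrough's conventions):

```bash
mkdir -p ~/workspace    # every package below is downloaded and unpacked here
```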
1. Environment Preparation
Modify the hostnames
We will build a cluster with one master and two slaves. First edit the hostname with vi /etc/hostname: set it to master on the master machine, slave1 on one slave, and slave2 on the other.
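A minimal sketch of that step on the master node (assuming Ubuntu 14.04, where /etc/hostname is applied at boot and the hostname command applies it immediately; repeat with slave1/slave2 on the other nodes):

```bash
sudo sh -c 'echo master > /etc/hostname'    # persists across reboots
sudo hostname master                        # takes effect right away
```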
Configure hosts
Edit the hosts file on every machine:
<span class="line"><span class="title">vi</span> /etc/hosts</span><br><span class="line"></span><br><span class="line"><span class="number">10.1.1.107</span> master</span><br><span class="line"><span class="number">10.1.1.108</span> slave1</span><br><span class="line"><span class="number">10.1.1.109</span> slave2</span><br> |
After configuring, ping the hostnames to check that the mapping works:
<span class="line"><span class="built_in">ping</span> slave1</span><br><span class="line"><span class="built_in">ping</span> slave2</span><br> |
Passwordless SSH Login
Install the OpenSSH server:
<span class="line">sudo apt-get <span class="operator"><span class="keyword">install</span> openssh-<span class="keyword">server</span></span></span><br> |
Generate a public/private key pair on every machine:
<span class="line">ssh-keygen -t rsa <span class="preprocessor">#一路回车</span></span><br> |
For the machines to reach each other without passwords, send each machine's id_rsa.pub to the master node; scp works well for transferring the public keys.
<span class="line">scp ~/.ssh/id_rsa<span class="class">.pub</span> spark@master:~/.ssh/id_rsa<span class="class">.pub</span><span class="class">.slave1</span></span><br> |
On master, append all the public keys to the authentication file authorized_keys:
<span class="line">cat ~<span class="regexp">/.ssh/id</span>_rsa.pub* <span class="prompt">>> </span>~<span class="regexp">/.ssh/authorized</span>_keys</span><br> |
Distribute the authorized_keys file to every slave:
<span class="line">scp ~<span class="regexp">/.ssh/authorized</span>_keys spark<span class="variable">@master</span><span class="symbol">:~/</span>.ssh/</span><br> |
Verify passwordless SSH on every machine:
<span class="line"><span class="title">ssh</span> master</span><br><span class="line">ssh slave1</span><br><span class="line">ssh slave2</span><br> |
If the login test fails, you may need to fix the permissions of the authorized_keys file (these permissions matter: if the file is set insecurely, SSH will refuse to use the RSA key):
<span class="line">chmod <span class="number">600</span> ~<span class="regexp">/.ssh/authorized</span>_keys</span><br> |
Install Java
Download the latest Java from the official site; Spark officially requires only Java 6 or later. I downloaded jdk-7u75-linux-x64.gz.
Extract it directly in the ~/workspace directory:
<span class="line">tar -zxvf jdk-<span class="number">7u</span>75-linux-x64.gz</span><br> |
Edit the environment variables with sudo vi /etc/profile and add the following, replacing the home path with your own:
<span class="line"><span class="built_in">export</span> WORK_SPACE=/home/spark/workspace/</span><br><span class="line"><span class="built_in">export</span> JAVA_HOME=<span class="variable">$WORK_SPACE</span>/jdk1.<span class="number">7.0</span>_75</span><br><span class="line"><span class="built_in">export</span> JRE_HOME=/home/spark/work/jdk1.<span class="number">7.0</span>_75/jre</span><br><span class="line"><span class="built_in">export</span> PATH=<span class="variable">$JAVA_HOME</span>/bin:<span class="variable">$JAVA_HOME</span>/jre/bin:<span class="variable">$PATH</span></span><br><span class="line"><span class="built_in">export</span> CLASSPATH=<span class="variable">$CLASSPATH</span>:.:<span class="variable">$JAVA_HOME</span>/lib:<span class="variable">$JAVA_HOME</span>/jre/lib</span><br> |
Then reload the environment variables and verify that Java was installed successfully:
<span class="line">$ source /etc/profile #生效环境变量</span><br><span class="line">$ java -version #如果打印出如下版本信息,则说明安装成功</span><br><span class="line">java version <span class="string">"1.7.0_75"</span></span><br><span class="line"><span class="function"><span class="title">Java</span><span class="params">(TM)</span></span> SE Runtime Environment (build <span class="number">1.7</span>.<span class="number">0</span>_75-b13)</span><br><span class="line">Java <span class="function"><span class="title">HotSpot</span><span class="params">(TM)</span></span> <span class="number">64</span>-Bit Server VM (build <span class="number">24.75</span>-b04, mixed mode)</span><br> |
Install Scala
Spark officially requires Scala 2.10.x, so take care to download the right version; I used 2.10.4 from the official download page (downloading Scala from behind the Great Firewall is glacially slow).
Again, extract it in ~/workspace:
<span class="line"><span class="title">tar</span> -zxvf scala-<span class="number">2</span>.<span class="number">10</span>.<span class="number">4</span>.tgz</span><br> |
Edit the environment variables once more with sudo vi /etc/profile and add:
<span class="line"><span class="built_in">export</span> SCALA_HOME=<span class="variable">$WORK_SPACE</span>/scala-<span class="number">2.10</span>.<span class="number">4</span></span><br><span class="line"><span class="built_in">export</span> PATH=<span class="variable">$PATH</span>:<span class="variable">$SCALA_HOME</span>/bin</span><br> |
Reload the environment variables the same way and verify that Scala was installed successfully:
<span class="line">$ source /etc/profile <span class="comment">#生效环境变量</span></span><br><span class="line">$ scala -<span class="property">version</span> <span class="comment">#如果打印出如下版本信息,则说明安装成功</span></span><br><span class="line">Scala code runner <span class="property">version</span> <span class="number">2.10</span>.4 <span class="comment">-- Copyright 2002-2013, LAMP/EPFL</span></span><br> |
Install and Configure Hadoop YARN
Download and extract
Download Hadoop 2.6.0 from the official site (I used my university's mirror, which is much faster).
Again, extract it in ~/workspace:
<span class="line">tar -zxvf hadoop-<span class="number">2.6</span>.<span class="number">0</span><span class="class">.tar</span><span class="class">.gz</span></span><br> |
Configure Hadoop
Enter the Hadoop configuration directory with cd ~/workspace/hadoop-2.6.0/etc/hadoop. The following seven files need to be configured: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
- In hadoop-env.sh, configure JAVA_HOME:

```bash
# The java implementation to use.
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
```
- In yarn-env.sh, configure JAVA_HOME:

```bash
# some Java parameters
export JAVA_HOME=/home/spark/workspace/jdk1.7.0_75
```
- In slaves, list the IPs or hostnames of the slave nodes:

```
slave1
slave2
```
- Modify core-site.xml:

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000/</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/spark/workspace/hadoop-2.6.0/tmp</value>
    </property>
</configuration>
```
- Modify hdfs-site.xml:

```xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:9001</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/workspace/hadoop-2.6.0/dfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
```
- Modify mapred-site.xml (in Hadoop 2.6 this file may need to be created from mapred-site.xml.template first):

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
- Modify yarn-site.xml:

```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
```
Distribute the configured hadoop-2.6.0 folder to all the slaves:
<span class="line">scp -r ~<span class="regexp">/workspace/hadoop</span>-<span class="number">2.6</span>.<span class="number">0</span> spark<span class="variable">@slave1</span><span class="symbol">:~/workspace/</span></span><br> |
Start Hadoop
Run the following on master to bring Hadoop up:
<span class="line">cd ~/workspace/hadoop-<span class="number">2.6</span>.0 <span class="comment">#进入hadoop目录</span></span><br><span class="line">bin/hadoop namenode -<span class="built_in">format</span> <span class="comment">#格式化namenode</span></span><br><span class="line">sbin/<span class="built_in">start</span>-dfs.sh <span class="comment">#启动dfs </span></span><br><span class="line">sbin/<span class="built_in">start</span>-yarn.sh <span class="comment">#启动yarn</span></span><br> |
Verify that Hadoop was installed successfully
Use the jps command to check whether the processes on each node started properly. On master there should be the following processes:
<span class="line"><span class="variable">$ </span>jps <span class="comment">#run on master</span></span><br><span class="line"><span class="number">3407</span> <span class="constant">SecondaryNameNode</span></span><br><span class="line"><span class="number">3218</span> <span class="constant">NameNode</span></span><br><span class="line"><span class="number">3552</span> <span class="constant">ResourceManager</span></span><br><span class="line"><span class="number">3910</span> <span class="constant">Jps</span></span><br> |
On each slave there should be the following processes:
<span class="line"><span class="variable">$ </span>jps <span class="comment">#run on slaves</span></span><br><span class="line"><span class="number">2072</span> <span class="constant">NodeManager</span></span><br><span class="line"><span class="number">2213</span> <span class="constant">Jps</span></span><br><span class="line"><span class="number">1962</span> <span class="constant">DataNode</span></span><br> |
Alternatively, open http://master:8088 in a browser; the Hadoop management UI should come up, with the slave1 and slave2 nodes visible.
Install Spark
Download and extract
Go to the official download page and download the latest Spark. I downloaded spark-1.3.0-bin-hadoop2.4.tgz.
Extract it in the ~/workspace directory:
<span class="line"><span class="title">tar</span> -zxvf spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>-bin-hadoop2.<span class="number">4</span>.tgz</span><br><span class="line">mv spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>-bin-hadoop2.<span class="number">4</span> spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span> <span class="comment">#原来的文件名太长了,修改下</span></span><br> |
Configure Spark
<span class="line">cd ~/workspace/spark-<span class="number">1</span>.<span class="number">3</span>.<span class="number">0</span>/conf <span class="comment">#进入spark配置目录</span></span><br><span class="line">cp spark-env.sh.<span class="keyword">template</span> spark-env.sh <span class="comment">#从配置模板复制</span></span><br><span class="line">vi spark-env.sh <span class="comment">#添加配置内容</span></span><br> |
Append the following at the end of spark-env.sh (this is my configuration; adapt it to your setup):
<span class="line">export <span class="constant">SCALA_HOME</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/scala</span>-<span class="number">2.10</span>.<span class="number">4</span></span><br><span class="line">export <span class="constant">JAVA_HOME</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/jdk</span>1.<span class="number">7.0_75</span></span><br><span class="line">export <span class="constant">HADOOP_HOME</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/hadoop</span>-<span class="number">2.6</span>.<span class="number">0</span></span><br><span class="line">export <span class="constant">HADOOP_CONF_DIR</span>=<span class="variable">$HADOOP_HOME</span>/etc/hadoop</span><br><span class="line"><span class="constant">SPARK_MASTER_IP</span>=master</span><br><span class="line"><span class="constant">SPARK_LOCAL_DIRS</span>=<span class="regexp">/home/spark</span><span class="regexp">/workspace/spark</span>-<span class="number">1.3</span>.<span class="number">0</span></span><br><span class="line"><span class="constant">SPARK_DRIVER_MEMORY</span>=<span class="number">1</span>G</span><br> |
Note: when setting the number of CPU cores and the amount of memory for the Worker processes, respect the machine's actual hardware; if the configuration exceeds what the Worker node really has, the Worker process will fail to start.
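For instance, on slaves with 4 cores and 8 GB of RAM, conservative values like the following could be added to spark-env.sh (illustrative numbers, not part of my configuration above):

```bash
SPARK_WORKER_CORES=2      # cores per Worker; keep at or below the physical core count
SPARK_WORKER_MEMORY=4g    # memory per Worker; leave headroom for the OS and Hadoop daemons
```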
Then vi slaves and list the slave hostnames in the slaves file:
<span class="line"><span class="title">slave1</span></span><br><span class="line">slave2</span><br> |
Distribute the configured spark-1.3.0 folder to all the slaves, just as with Hadoop:
<span class="line">scp -r ~<span class="regexp">/workspace/spark</span>-<span class="number">1.3</span>.<span class="number">0</span> spark<span class="variable">@slave1</span><span class="symbol">:~/workspace/</span></span><br> |
Start Spark
<span class="line">sbin/<span class="operator"><span class="keyword">start</span>-<span class="keyword">all</span>.sh</span></span><br> |
Verify that Spark was installed successfully
Check with jps; on master there should now be the following processes:
<span class="line"><span class="variable">$ </span>jps</span><br><span class="line"><span class="number">7949</span> <span class="constant">Jps</span></span><br><span class="line"><span class="number">7328</span> <span class="constant">SecondaryNameNode</span></span><br><span class="line"><span class="number">7805</span> <span class="constant">Master</span></span><br><span class="line"><span class="number">7137</span> <span class="constant">NameNode</span></span><br><span class="line"><span class="number">7475</span> <span class="constant">ResourceManager</span></span><br> |
And on each slave:
<span class="line"><span class="variable">$jps</span></span><br><span class="line"><span class="number">3132</span> DataNode</span><br><span class="line"><span class="number">3759</span> Worker</span><br><span class="line"><span class="number">3858</span> Jps</span><br><span class="line"><span class="number">3231</span> NodeManager</span><br> |
Open Spark's web management page: http://master:8080
Run the Examples
<span class="line">#本地模式两线程运行</span><br><span class="line">./bin/run-example SparkPi 10 --master local[2]</span><br><span class="line"></span><br><span class="line">#Spark Standalone 集群模式运行</span><br><span class="line">./bin/spark-submit \</span><br><span class="line"> -<span class="ruby">-<span class="class"><span class="keyword">class</span> <span class="title">org</span>.<span class="title">apache</span>.<span class="title">spark</span>.<span class="title">examples</span>.<span class="title">SparkPi</span> \</span></span><br><span class="line"></span> -<span class="ruby">-master <span class="symbol">spark:</span>/<span class="regexp">/master:7077 \</span><br><span class="line"></span></span> lib/spark-examples-1.3.0-hadoop2.4.0.jar \</span><br><span class="line"> 100</span><br><span class="line"></span><br><span class="line">#Spark on YARN 集群上 yarn-cluster 模式运行</span><br><span class="line">./bin/spark-submit \</span><br><span class="line"> -<span class="ruby">-<span class="class"><span class="keyword">class</span> <span class="title">org</span>.<span class="title">apache</span>.<span class="title">spark</span>.<span class="title">examples</span>.<span class="title">SparkPi</span> \</span></span><br><span class="line"></span> -<span class="ruby">-master yarn-cluster \ <span class="comment"># can also be `yarn-client`</span></span><br><span class="line"></span> lib/spark-examples*.jar \</span><br><span class="line"> 10</span><br> |
Note that Spark on YARN supports two run modes, yarn-cluster and yarn-client; see this post for the exact differences. Broadly speaking, yarn-cluster suits production, while yarn-client suits interactive use and debugging, i.e. when you want to see the application's output quickly.
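For instance, to rerun SparkPi in yarn-client mode so the result prints straight back to your terminal (same example jar as above):

```bash
./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    lib/spark-examples*.jar \
    10
```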
Source: http://wuchong.me/blog/2015/04/04/spark-on-yarn-cluster-deploy/