Spark Cluster
Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing.
This document covers installing and testing a Spark environment.
Spark installation and configuration
2.1 Download Spark
Use the build that is pre-built for a user-provided Hadoop ("without hadoop"). Download URL: https://mirror.bit.edu.cn/apache/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop.tgz
cd /home/meizhangzheng
wget https://mirror.bit.edu.cn/apache/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop.tgz
2.2 Install Spark: extract it into the data directory
tar -zxvf spark-2.4.5-bin-without-hadoop.tgz -C /home/meizhangzheng/data/
2.3 Configure the Spark environment
Configuration files involved:
${SPARK_HOME}/conf/spark-env.sh
${SPARK_HOME}/conf/slaves
${SPARK_HOME}/conf/spark-defaults.conf
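Each of these is produced by copying the corresponding .template file shipped with Spark, as the subsections below do one at a time. As a one-shot sketch (assuming the install path used in 2.2):
cd /home/meizhangzheng/data/spark-2.4.5-bin-without-hadoop/conf
for f in spark-env.sh slaves spark-defaults.conf; do
    cp "$f.template" "$f"   # keep each template as a pristine reference
done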
2.3.1 spark-env.sh configuration
(1) cd /home/meizhangzheng/data/spark-2.4.5-bin-without-hadoop/conf
(2) cp spark-env.sh.template spark-env.sh
(3) Set the following variables
JAVA_HOME=/home/meizhangzheng/jdk1.8.0_231
SCALA_HOME=/home/meizhangzheng/data/scala-2.13.0
SPARK_MASTER_IP=192.168.1.10
HADOOP_CONF_DIR=/home/meizhangzheng/data/hadoop-2.10.0/etc/hadoop
######### data store dir: where shuffle and RDD data are written (intermediate data)
SPARK_LOCAL_DIRS=/home/meizhangzheng/data/spark_data
######### worker process dir: working directory of the worker processes, containing worker logs and temporary storage; default: ${SPARK_HOME}/work
SPARK_WORKER_DIR=/home/meizhangzheng/data/spark_data/spark_works
-- Special configuration (required by the "without hadoop" build; see the start-up error in 2.4)
HADOOP_HOME=/home/meizhangzheng/data/hadoop-2.10.0
export SPARK_DIST_CLASSPATH=$(/home/meizhangzheng/data/hadoop-2.10.0/bin/hadoop classpath)
(4) Create the spark_works directory
mkdir -p /home/meizhangzheng/data/spark_data/spark_works
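Put together, the resulting spark-env.sh fragment looks roughly like this (a consolidated sketch of the settings above; spark-env.sh is sourced by the launch scripts, and exporting the variables makes them visible to the daemons they spawn):
# spark-env.sh -- consolidated from the settings above
export JAVA_HOME=/home/meizhangzheng/jdk1.8.0_231
export SCALA_HOME=/home/meizhangzheng/data/scala-2.13.0
export SPARK_MASTER_IP=192.168.1.10
export HADOOP_CONF_DIR=/home/meizhangzheng/data/hadoop-2.10.0/etc/hadoop
export SPARK_LOCAL_DIRS=/home/meizhangzheng/data/spark_data               # shuffle/RDD spill dir
export SPARK_WORKER_DIR=/home/meizhangzheng/data/spark_data/spark_works   # worker logs and tmp space
export HADOOP_HOME=/home/meizhangzheng/data/hadoop-2.10.0
# Required by the "without hadoop" build: put Hadoop's jars on Spark's classpath
export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath)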
2.3.2 slaves configuration
(1) cd /home/meizhangzheng/data/spark-2.4.5-bin-without-hadoop/conf
(2) cp slaves.template slaves
(3) Replace localhost with:
master
slave1
slave2
Spark worker processes will be started on these three hosts; a quick reachability check follows.
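start-slaves.sh logs in to each host listed in conf/slaves over SSH to launch a Worker there, so each name must resolve and allow passwordless login. A minimal check:
# Each host in conf/slaves must be reachable without a password prompt:
for h in master slave1 slave2; do
    ssh "$h" hostname
done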
2.3.3 spark-defaults.conf configuration
(1) cd /home/meizhangzheng/data/spark-2.4.5-bin-without-hadoop/conf
(2) cp spark-defaults.conf.template spark-defaults.conf
(3) Add the following configuration:
spark.master spark://master:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled true
spark.eventLog.dir file:///data/spark_data/history/event-log
spark.history.fs.logDirectory file:///data/spark_data/history/spark-events
spark.eventLog.compress true
[2020-07-16]
Note: file:///data/spark_data/history/event-log lives under the /data directory at the filesystem root (not ~/data), as the listing below shows:
[meizhangzheng@master /]$ ll
total 24
lrwxrwxrwx. 1 root root 7 Dec 14 2019 bin -> usr/bin
dr-xr-xr-x. 5 root root 4096 Dec 14 2019 boot
drwxr-xr-x 3 meizhangzheng meizhangzheng 24 Jul 16 06:54 data
drwxr-xr-x 20 root root 3300 Jul 14 06:56 dev
drwxr-xr-x. 145 root root 8192 Jul 16 06:45 etc
drwxr-xr-x. 3 root root 27 Dec 31 2019 home
lrwxrwxrwx. 1 root root 7 Dec 14 2019 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Dec 14 2019 lib64 -> usr/lib64
drwxr-xr-x. 2 root root 6 Apr 11 2018 media
drwxr-xr-x. 2 root root 6 Apr 11 2018 mnt
drwxr-xr-x. 3 root root 16 Dec 14 2019 opt
dr-xr-xr-x 254 root root 0 Jul 14 06:56 proc
dr-xr-x---. 7 root root 267 Apr 30 08:09 root
drwxr-xr-x 41 root root 1220 Jul 16 06:47 run
lrwxrwxrwx. 1 root root 8 Dec 14 2019 sbin -> usr/sbin
drwxr-xr-x. 2 root root 6 Apr 11 2018 srv
dr-xr-xr-x 13 root root 0 Jul 16 07:43 sys
drwxrwxrwt. 22 root root 4096 Jul 16 07:53 tmp
drwxr-xr-x. 13 root root 155 Dec 14 2019 usr
drwxr-xr-x. 21 root root 4096 Dec 14 2019 var
[meizhangzheng@master /]$
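Note that spark.eventLog.dir (where applications write their event logs) and spark.history.fs.logDirectory (where the history server reads them) point at two different directories above; for completed applications to show up in the history server, both typically need to point at the same location. Both directories must also exist before anything starts (this is the cause of the FileNotFoundException in 2.4 below). A minimal sketch, run on the master:
# Create both history directories, then start the history server
# (its web UI listens on port 18080 by default):
mkdir -p /data/spark_data/history/event-log /data/spark_data/history/spark-events
$SPARK_HOME/sbin/start-history-server.sh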
2.3.4 Configure the Spark environment variables
(1) cd /home/meizhangzheng
(2) vim .bash_profile (no sudo needed for your own dotfile)
####### 2020.04.27 add spark path
export SPARK_HOME=/home/meizhangzheng/data/spark-2.4.5-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
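Apply the change in the current shell and sanity-check that the Spark binaries are on the PATH (spark-submit prints a version banner):
source ~/.bash_profile
spark-submit --version   # should report Spark 2.4.5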
2.3.5 Distribute Spark to the slave1 and slave2 nodes
(1) scp -r data/spark-2.4.5-bin-without-hadoop/ meizhangzheng@slave1:~/data/
(2) scp -r data/spark-2.4.5-bin-without-hadoop/ meizhangzheng@slave2:~/data/
(3) scp .bash_profile meizhangzheng@slave1:~
(4) scp .bash_profile meizhangzheng@slave2:~
(5) Apply the .bash_profile changes on slave1 and slave2:
ssh meizhangzheng@slave1
source .bash_profile
ssh meizhangzheng@slave2
source .bash_profile
2.4 Start-up and verification
(1) Start the services
start-master.sh
start-slaves.sh
(2) Check the cluster web UI
http://192.168.1.10:8080
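Besides the web UI, jps confirms the daemons: the master node should show a Master process, and every host listed in conf/slaves a Worker:
# Run on each node; jps prints "<pid> Master" / "<pid> Worker" for the daemons.
jps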
(3) Troubleshooting: spark-2.4.5-bin-without-hadoop fails to start
The start-up log shows "failed to launch: nice -n 0 ..." with a jar-not-found exception.
Reference: the installation notes for spark-2.4.3-bin-without-hadoop.
Fix: add the following to spark-env.sh (already shown in 2.3.1; without it, the error above occurs):
HADOOP_HOME=/home/meizhangzheng/data/hadoop-2.10.0
export SPARK_DIST_CLASSPATH=$(/home/meizhangzheng/data/hadoop-2.10.0/bin/hadoop classpath)
(4) A second error appears when a SparkContext initializes:
ERROR spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/data/spark_data/history/event-log does not exist
Fix: create both directories:
/data/spark_data/history/event-log and /data/spark_data/history/spark-events
[meizhangzheng@slave2 ~]$ mkdir -p /data/spark_data/history/event-log /data/spark_data/history/spark-events
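Since a driver may run on any node and writes its event log to a local file:// path, the directories need to exist on all three hosts. A one-pass sketch, assuming the passwordless SSH already used for scp above:
for h in master slave1 slave2; do
    ssh "$h" mkdir -p /data/spark_data/history/event-log /data/spark_data/history/spark-events
done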
Note: as on the master node, this means the directory under the filesystem root:
[meizhangzheng@slave1 /]$ pwd
/
[meizhangzheng@slave1 /]$
total 24
drwxr-xr-x 3 meizhangzheng meizhangzheng 24 Jul 16 07:40 data
(remaining root-directory entries identical to the master listing above)
(Not used) The following HDFS commands were tried but not needed, since the event-log directories are local file:// paths rather than HDFS:
[meizhangzheng@master tmp]$ hadoop fs -touchz /data/spark_data/history/event-log
[meizhangzheng@master tmp]$ hadoop fs -touchz /data/spark_data/history/spark-events
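Finally, an end-to-end smoke test against the running cluster using the bundled SparkPi example (the examples jar name below assumes the Scala 2.11 build of Spark 2.4.5; check $SPARK_HOME/examples/jars for the exact file name):
spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://master:7077 \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.5.jar 10
# A line like "Pi is roughly 3.14..." near the end of the output means the
# cluster executed the job successfully.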