Hadoop + Spark Cluster Deployment
This design uses four nodes: 1 master + 3 slaves.
The JDK and Hadoop installation and configuration steps below are all performed as the ordinary user hadoop, not as root.
Cluster Environment Preparation
Configure the hostname and /etc/hosts on every machine (the hostnamectl example below sets the name for hadoop2; use the appropriate name on each node):
hostnamectl set-hostname hadoop2
echo "# Hadoop
192.100.3.254 hadoop0
192.100.3.253 hadoop1
192.100.3.252 hadoop2
192.100.3.251 hadoop3" >> /etc/hosts;
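A quick, optional sanity check that every hostname resolves from the current node:
# each host should answer once
for h in hadoop0 hadoop1 hadoop2 hadoop3; do
    ping -c 1 "$h" >/dev/null && echo "$h OK" || echo "$h FAILED"
done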
Create the hadoop user
useradd -m -s /bin/bash hadoop
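Depending on your login policy, you may also want to set a password for the account and (optionally) grant it sudo; a minimal sketch, assuming a Debian/Ubuntu-style sudo group:
# set a login password for the hadoop user (run as root)
passwd hadoop
# optional: allow the hadoop user to run sudo (use the "wheel" group on RHEL/CentOS)
usermod -aG sudo hadoop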
All of the following steps are performed as the hadoop user. The owner of directories such as /opt and the JDK directory (/usr/lib/java/ in this guide) must first be changed to hadoop.
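For example, run as root (paths assumed to match this guide's layout):
chown -R hadoop:hadoop /opt
chown -R hadoop:hadoop /usr/lib/java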
su - hadoop
#master
ssh-keygen -t rsa -C "hadoop0" -P ""
#slave1
ssh-keygen -t rsa -C "hadoop1" -P ""
#slave2
ssh-keygen -t rsa -C "hadoop2" -P ""
#slave3
ssh-keygen -t rsa -C "hadoop3" -P ""
On each node, configure passwordless login to all nodes:
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop0
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop3
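A quick loop to confirm that passwordless login actually works from this node (BatchMode makes ssh fail instead of prompting for a password):
for h in hadoop0 hadoop1 hadoop2 hadoop3; do
    ssh -o BatchMode=yes "$h" hostname
done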
JDK Installation and Configuration
Install the JDK manually under /usr/lib/java.
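For reference, the manual install might look like the following; the tarball filename is an assumption, so adjust it to the archive you actually downloaded:
# run as root
mkdir -p /usr/lib/java
tar -xzf jdk-8u331-linux-x64.tar.gz -C /usr/lib/java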
# switch to the hadoop user
su - hadoop
ln -sf /usr/lib/java/jdk1.8.0_331/ /usr/lib/java/jdk
Configure the JDK environment variables
vi /etc/profile.d/java.sh
#JDK environment
export JAVA_HOME=/usr/lib/java/jdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Reload the profile and verify the Java version:
source /etc/profile
java -version
Hadoop Deployment and Configuration
Unpack the Hadoop package into /opt, change its owner, and create a symlink.
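The extraction step might look like this (the tarball name is assumed to be the stock 3.2.3 release archive):
tar -xzf hadoop-3.2.3.tar.gz -C /opt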
chown -R hadoop:hadoop /opt/hadoop-3.2.3
ln -sf /opt/hadoop-3.2.3 /opt/hadoop
Configure the log path in /opt/hadoop-3.2.3/etc/hadoop/log4j.properties, then create the log and HDFS directories:
mkdir /opt/hadoop/logs
mkdir -p /opt/hadoop/hdfs/name
mkdir -p /opt/hadoop/hdfs/data
Configure the Hadoop environment variables:
nano /etc/profile.d/hadoop.sh
# Hadoop environment
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
source /etc/profile
Hadoop configuration files
The configuration files all live under /opt/hadoop/etc/hadoop/: hadoop-env.sh, core-site.xml, hdfs-site.xml (the contents of workers, mapred-site.xml, yarn-site.xml, etc. are omitted here).
hadoop-env.sh: set the JDK environment variable (remote daemon start-up needs ${JAVA_HOME} set explicitly here):
export JAVA_HOME=/usr/lib/java/jdk
core-site.xml
<configuration>
    <!-- HDFS NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop0:9000</value>
    </property>
    <!-- base directory for Hadoop's temporary files -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
    </property>
</configuration>
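Note that hadoop.tmp.dir points at /opt/hadoop/tmp, which the mkdir commands above do not create; Hadoop normally creates it on first use, but it can also be created up front:
mkdir -p /opt/hadoop/tmp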
hdfs-site.xml
<configuration>
    <!-- number of HDFS block replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- where the NameNode stores the HDFS namespace metadata -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/hdfs/name</value>
    </property>
    <!-- physical storage location of data blocks on each DataNode -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/hdfs/data</value>
    </property>
    <!-- disable HDFS permission checking -->
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
Hadoop Verification
First, format the NameNode.
Format it only once. If something goes wrong and you need to reformat, perform the cleanup steps below first (all HDFS data will be lost!):
hdfs namenode -format
If start-up fails with "Name or service not knownstname" (the mangled message is typically caused by stray characters or line endings in the workers file):
- Delete the workers file and recreate it (see the sketch below)
- Check the directory paths configured in core-site.xml and hdfs-site.xml, delete the corresponding HDFS directories, and recreate them
- Reformat HDFS
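When recreating the workers file, writing it with printf avoids the stray carriage returns or blank lines that usually trigger this error; the node list below assumes DataNodes run only on the three slaves:
printf 'hadoop1\nhadoop2\nhadoop3\n' > /opt/hadoop/etc/hadoop/workers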
Test starting and stopping the services:
mapred --daemon start historyserver
mapred --daemon stop historyserver
start-yarn.sh && start-dfs.sh
stop-dfs.sh && stop-yarn.sh
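Once the start scripts have run, a quick way to confirm the daemons are healthy:
# on the master: expect NameNode and ResourceManager (plus SecondaryNameNode if it runs here)
jps
# cluster-wide view of the DataNodes that have registered
hdfs dfsadmin -report
# should succeed even if the list is empty
yarn application -list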
Spark on YARN Deployment and Configuration
vi /opt/spark/conf/spark-env.sh
export JAVA_HOME=/usr/lib/java/jdk
# Hadoop configuration directory
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# YARN configuration directory
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Spark installation directory
export SPARK_HOME=/opt/spark
# Spark executables directory
export PATH=${SPARK_HOME}/bin:$PATH
export SPARK_MASTER_HOST=hadoop0
Copy the Spark directory to the other nodes:
scp -r /opt/spark hadoop1:/opt/
scp -r /opt/spark hadoop2:/opt/
scp -r /opt/spark hadoop3:/opt/
- Configure the Spark environment variables:
#Spark environment
export SPARK_HOME=/opt/spark
export PATH=${SPARK_HOME}/bin:$PATH
- After configuring the worker list, start the Spark daemons:
hadoop@hadoop0:/opt/spark/sbin$ ./start-all.sh
hadoop0: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop0.out
hadoop2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop2.out
hadoop1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop1.out
hadoop3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop3.out
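A common smoke test for the Spark-on-YARN setup is the bundled SparkPi example (the jar path uses a glob because the exact Scala-version suffix varies by build):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  ${SPARK_HOME}/examples/jars/spark-examples_*.jar 100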
Hive Configuration
- Resolve the guava version mismatch between Hive and Hadoop by copying Hadoop's guava jar into Hive's lib directory (the older guava jar bundled with Hive typically needs to be removed as well):
cp hadoop/share/hadoop/common/lib/guava-*.jar hive/lib/
- Put the JDBC driver mssql-jdbc-7.4.1.jre8.jar into hive/lib
- Configure hive-site.xml and hive-env.sh (a minimal hive-site.xml sketch appears at the end of this section)
- Initialize the metastore schema:
schematool -initSchema -dbType mssql --verbose
If this fails with: Failed to initialize pool: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. unable to find valid certification path to requested target
Fix: edit jre/lib/security/java.security under the JDK installation and remove 3DES_EDE_CBC from the jdk.tls.disabledAlgorithms list.
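For the hive-site.xml mentioned above, a minimal sketch of the SQL Server metastore connection could look like the following; the host, database name, and credentials are placeholders, not values taken from this deployment:
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:sqlserver://mssql-host:1433;databaseName=hive_meta</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>CHANGE_ME</value>
    </property>
</configuration>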
Security-related configuration
Default port changes since Hadoop 3.0
NameNode ports:
- HTTPS service port: 50470 → 9871
- NameNode web UI port: 50070 → 9870 (configured in hdfs-site.xml)
- NameNode RPC port, used to fetch file system metadata: 8020 → 9820 (configured in core-site.xml)
Secondary NameNode ports:
- Secondary NameNode HTTPS port: 50091 → 9869
- Secondary NameNode web UI port: 50090 → 9868
DataNode ports:
- DataNode IPC server address and port: 50020 → 9867 (configured in hdfs-site.xml)
- DataNode data-transfer port: 50010 → 9866 (configured in hdfs-site.xml)
- HTTPS service port: 50475 → 9865
- DataNode HTTP server port: 50075 → 9864 (configured in hdfs-site.xml)
Other ports that need to be opened:
- JobTracker interaction port: 8021
- Hive service ports: 10000-10002
- Spark web UI port: 4040 (spark.ui.port in spark-defaults.conf)
- Spark master registration port: 7077
- Spark master web UI: 8080; Spark worker web UI: 8081
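If firewalld is in use (adapt for ufw or plain iptables as appropriate), opening these ports might look like:
# run on the relevant nodes; adjust per role
firewall-cmd --permanent --add-port=9870/tcp --add-port=9820/tcp                        # NameNode web UI / RPC
firewall-cmd --permanent --add-port=9864/tcp --add-port=9866/tcp --add-port=9867/tcp    # DataNode
firewall-cmd --permanent --add-port=10000-10002/tcp                                     # Hive
firewall-cmd --permanent --add-port=4040/tcp --add-port=7077/tcp --add-port=8080-8081/tcp   # Spark
firewall-cmd --reload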