This deployment uses four nodes: 1 master + 3 slaves.
All of the JDK and Hadoop installation and configuration steps below are performed as the regular user hadoop, not as root.

Cluster environment preparation

Configure the hostname and hosts file on every machine

hostnamectl set-hostname hadoop2

echo "# Hadoop
192.100.3.254  hadoop0
192.100.3.253  hadoop1
192.100.3.252  hadoop2
192.100.3.251  hadoop3" >> /etc/hosts;

Create the hadoop user

useradd -m -s /bin/bash hadoop

All of the following is executed as the hadoop user.

The owner of directories such as /opt and /usr/lib/java must first be changed to hadoop.
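
A minimal sketch of that ownership change, run as root (the exact directories depend on where the JDK and Hadoop are unpacked):

chown -R hadoop:hadoop /opt /usr/lib/java
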
su - hadoop
#master
ssh-keygen -t rsa -C "hadoop0" -P ""
#slave1
ssh-keygen -t rsa -C "hadoop1" -P ""
#slave2
ssh-keygen -t rsa -C "hadoop2" -P ""
#slave3
ssh-keygen -t rsa -C "hadoop3" -P ""

Configure passwordless SSH login on every node

ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop0
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop3
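
To confirm passwordless login works, a quick check from any node (hostnames as configured above):

ssh hadoop1 hostname   # should print "hadoop1" without prompting for a password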

JDK installation and configuration

Install the JDK manually into /usr/lib/java
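
A sketch of the manual install, assuming the Oracle JDK 8u331 tarball (the archive name may differ; run as root):

mkdir -p /usr/lib/java
tar -xzf jdk-8u331-linux-x64.tar.gz -C /usr/lib/java/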

#switch to the hadoop user
su - hadoop
ln -sf /usr/lib/java/jdk1.8.0_331/ /usr/lib/java/jdk

Configure the JDK environment variables

vi /etc/profile.d/java.sh

#JDK environment
export JAVA_HOME=/usr/lib/java/jdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Apply the profile and verify the Java version

source /etc/profile
java -version

Hadoop deployment and configuration

Unpack the Hadoop package into /opt, change its owner, and create a symlink:
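
Unpacking might look like this (tarball name assumed):

tar -xzf hadoop-3.2.3.tar.gz -C /opt/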

chown -R hadoop:hadoop /opt/hadoop-3.2.3
ln -sf /opt/hadoop-3.2.3 /opt/hadoop

Configure the log path

/opt/hadoop-3.2.3/etc/hadoop/log4j.properties

mkdir /opt/hadoop/logs
mkdir -p /opt/hadoop/hdfs/name
mkdir -p /opt/hadoop/hdfs/data
nano /etc/profile.d/hadoop.sh

Configure the Hadoop environment variables

# Hadoop environment
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

source /etc/profile

Hadoop configuration files

All configuration files live under /opt/hadoop/etc/hadoop/:
hadoop-env.sh, core-site.xml, hdfs-site.xml (the contents of workers, mapred-site.xml, yarn-site.xml, etc. are omitted here; a workers sketch follows below)
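
For this 1 master + 3 slave layout, /opt/hadoop/etc/hadoop/workers would typically just list the DataNode hosts, one per line (whether hadoop0 also runs a DataNode is a design choice):

hadoop1
hadoop2
hadoop3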

In hadoop-env.sh, set the JDK environment variable explicitly (daemons started remotely over SSH cannot see the login shell's ${JAVA_HOME}):

export JAVA_HOME=/usr/lib/java/jdk


core-site.xml

<configuration>

  <!-- HDFS (NameNode) address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop0:9000</value>
  </property>

  <!-- location of Hadoop's runtime temporary files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
  </property>

</configuration>
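
core-site.xml points hadoop.tmp.dir at /opt/hadoop/tmp, which was not created in the earlier mkdir steps; if it is missing, create it on each node:

mkdir -p /opt/hadoop/tmp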



hdfs-site.xml

<configuration>
  <!-- number of HDFS data block replicas -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- directory where the NameNode stores the HDFS namespace metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/hdfs/name</value>
  </property>
  <!-- physical location where the DataNode stores data blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

Hadoop verification

First, format the NameNode

Format only once. If problems occur and you need to reformat, first follow the clean-up steps below (all data will be lost!!!)

hdfs namenode -format
If startup fails with an error like "Name or service not knownstname" (often caused by Windows-style line endings in the workers file):
  1. Delete the workers file and recreate it
  2. Check the paths in core-site.xml and hdfs-site.xml, delete the corresponding HDFS directories and recreate them
  3. Reformat HDFS (see the sketch after this list)
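
A sketch of that clean-up, assuming the directory layout configured above (all HDFS data is destroyed):

# strip possible Windows line endings from workers, then wipe and re-create the HDFS dirs
sed -i 's/\r$//' /opt/hadoop/etc/hadoop/workers
rm -rf /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data /opt/hadoop/tmp
mkdir -p /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data /opt/hadoop/tmp
hdfs namenode -format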

Start and stop the services to test

mapred --daemon start historyserver
mapred --daemon stop historyserver
start-yarn.sh && start-dfs.sh
stop-dfs.sh && stop-yarn.sh
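
After starting the services, jps gives a quick sanity check of which daemons came up:

# on the master (hadoop0): NameNode, SecondaryNameNode, ResourceManager,
# plus JobHistoryServer if started; on each slave: DataNode, NodeManager
jps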

Spark on YARN deployment and configuration

vi /opt/spark/conf/spark-env.sh

export JAVA_HOME=/usr/lib/java/jdk
# Hadoop configuration directory
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# YARN configuration directory
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Spark home directory
export SPARK_HOME=/opt/spark
# Spark executables directory
export PATH=${SPARK_HOME}/bin:$PATH

export SPARK_MASTER_HOST=hadoop0

Copy Spark to the other nodes

scp -r /opt/spark hadoop1:/opt/
scp -r /opt/spark hadoop2:/opt/
scp -r /opt/spark hadoop3:/opt/
  • Configure the Spark environment variables
#Spark environment
export SPARK_HOME=/opt/spark
export PATH=${SPARK_HOME}/bin:$PATH
  • Configure the workers file, then start the Spark nodes
hadoop@hadoop0:/opt/spark/sbin$ ./start-all.sh
hadoop0: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop0.out
hadoop2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop2.out
hadoop1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop1.out
hadoop3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop3.out
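
To verify Spark on YARN, the bundled SparkPi example can be submitted to the cluster (the jar path below assumes the stock Spark distribution layout under /opt/spark):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100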

Hive configuration

  • Resolve the guava version mismatch between Hive and Hadoop by copying Hadoop's guava jar into Hive (the older guava jar shipped under hive/lib typically has to be removed first)
cp hadoop/share/hadoop/common/lib/guava-*.jar hive/lib/
  • Put the JDBC driver mssql-jdbc-7.4.1.jre8.jar into hive/lib
  • Configure hive-site.xml and hive-env.sh (see the hive-site.xml sketch after this section)
  • Initialize the metastore schema
schematool -initSchema -dbType mssql --verbose
Failed to initialize pool: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. unable to find valid certification path to requested target

Fix: edit jre\lib\security\java.security and remove 3DES_EDE_CBC from jdk.tls.disabledAlgorithms.
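
The hive-site.xml itself is not shown above; a minimal metastore-connection sketch for SQL Server might look like the following (host, database name, user and password are placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:sqlserver://MSSQL_HOST:1433;databaseName=hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive_user</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>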


Security-related configuration

Default port changes since Hadoop 3.0

NameNode ports:

HTTPS service port: 50470 → 9871
NameNode web UI port: 50070 → 9870 (hdfs-site.xml)
NameNode RPC port, used to fetch filesystem metadata: 8020 → 9820 (core-site.xml)

Secondary NameNode ports:

Secondary NameNode HTTPS port: 50091 → 9869
Secondary NameNode web UI port: 50090 → 9868

DataNode ports:

DataNode IPC server port: 50020 → 9867 (hdfs-site.xml)
DataNode data-transfer port: 50010 → 9866 (hdfs-site.xml)
HTTPS service port: 50475 → 9865
DataNode HTTP server port: 50075 → 9864 (hdfs-site.xml)

Other ports that need to be opened

JobTracker interaction port: 8021
Hive service ports: 10000-10002
Spark web UI port 4040: spark.ui.port in spark-defaults.conf
Spark master registration port: 7077
Spark Master UI: 8080; Spark Worker UI: 8081
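
If firewalld is in use (an assumption; adapt to whatever firewall the hosts actually run), opening the ports listed above on every node might look like this sketch:

# sketch, assuming firewalld: open the HDFS, job-tracker, Hive and Spark ports listed above
firewall-cmd --permanent --add-port={9820,9864,9865,9866,9867,9868,9869,9870,9871}/tcp
firewall-cmd --permanent --add-port=8021/tcp
firewall-cmd --permanent --add-port={10000-10002,4040,7077,8080,8081}/tcp
firewall-cmd --reload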