SPARK September 28, 2018

Spark Optimization


1 HistoryServer Configuration and Usage

Official documentation: http://spark.apache.org/docs/latest/monitoring.html

1.1 Enabling the HistoryServer

To enable Spark's HistoryServer, edit $SPARK_HOME/conf/spark-defaults.conf and set spark.eventLog.enabled to true.

First, copy spark-defaults.conf.template to spark-defaults.conf:

cp spark-defaults.conf.template spark-defaults.conf

Then add the following settings:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://localhost:9000/directory

This enables Spark event logging and specifies the directory the event logs are written to.

1.2 Setting SPARK_HISTORY_OPTS

After enabling event logging, set the relevant SPARK_HISTORY_OPTS parameters, such as the UI port (spark.history.ui.port) and the log directory (spark.history.fs.logDirectory).

Edit $SPARK_HOME/conf/spark-env.sh and add:

SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://localhost:9000/directory"
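
The UI port can also be set in the same variable (18080 is the default port; the sketch below simply reuses the HDFS log directory configured above):

SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://localhost:9000/directory"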

1.3 Starting the HistoryServer

Start it from $SPARK_HOME/sbin/:

sh start-history-server.sh

2 Serialization

Official documentation: http://spark.apache.org/docs/latest/tuning.html#data-serialization
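
The tuning guide recommends Kryo over the default Java serializer for JVM-side data. A minimal PySpark sketch of switching the serializer (the app name and buffer size are illustrative assumptions, not values from this post):

from pyspark import SparkConf, SparkContext

# Switch the JVM-side serializer to Kryo; buffer.max only needs raising
# when very large objects are serialized.
conf = (SparkConf()
        .setAppName("kryo-demo")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.max", "128m"))
sc = SparkContext(conf=conf)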

3 Memory Management

Official documentation: http://spark.apache.org/docs/latest/tuning.html#memory-tuning

Spark memory is used for two purposes: execution (computation) and storage (caching).
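
Under the unified memory manager these two regions are controlled by spark.memory.fraction and spark.memory.storageFraction. A small sketch of setting them explicitly (the values shown are the documented defaults, used here only for illustration):

from pyspark import SparkConf, SparkContext

# spark.memory.fraction:        share of the JVM heap (minus 300 MB reserved)
#                               used for execution + storage
# spark.memory.storageFraction: share of that region protected from eviction
#                               for cached blocks
conf = (SparkConf()
        .setAppName("memory-demo")
        .set("spark.memory.fraction", "0.6")
        .set("spark.memory.storageFraction", "0.5"))
sc = SparkContext(conf=conf)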

4 Broadcast Variables

Official documentation: http://spark.apache.org/docs/latest/tuning.html#broadcasting-large-variables

>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>

>>> broadcastVar.value
[1, 2, 3]
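
A brief usage sketch to go with the shell session above: tasks read the broadcast value instead of shipping the list with every closure (the RDD here is a made-up example):

>>> rdd = sc.parallelize([0, 1, 2])
>>> rdd.map(lambda i: broadcastVar.value[i]).collect()
[1, 2, 3]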

5 Data Locality

Official documentation: http://spark.apache.org/docs/latest/tuning.html#data-locality

The principle is to move the computation to the data rather than the data to the computation.
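
On the scheduling side the main knob is spark.locality.wait, which controls how long Spark waits for a data-local slot before falling back to a less local one. A sketch, assuming the default 3s wait is too long for your jobs (the 1s value is only an example):

from pyspark import SparkConf, SparkContext

# Give up on a preferred (node-local) slot after 1 second instead of the
# default 3 seconds, trading some locality for lower scheduling latency.
conf = (SparkConf()
        .setAppName("locality-demo")
        .set("spark.locality.wait", "1s"))
sc = SparkContext(conf=conf)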

6 Project Performance Optimization

  • Parallelism: spark.sql.shuffle.partitions
  • Partition column type inference: spark.sql.sources.partitionColumnTypeInference.enabled
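
Both options from the list above can be set when building the SparkSession. A minimal sketch (the app name and values are illustrative; 200 is Spark's default for spark.sql.shuffle.partitions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("project-tuning-demo")
         # Number of partitions used for shuffles in joins and aggregations.
         .config("spark.sql.shuffle.partitions", "200")
         # Disable automatic type inference for partition columns,
         # so partition values are read as strings.
         .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
         .getOrCreate())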