Spark Chapter 7: Spark Optimization

0 Web Interfaces

Every SparkContext launches a web UI, by default on port 4040 (if several SparkContexts run on the same host, they bind to successive ports: 4041, 4042, and so on), that displays useful information about the application. This includes:

A list of scheduler stages and tasks

A summary of RDD sizes and memory usage

Environmental information.

Information about the running executors

History server configuration, under spark_home/conf

In spark-defaults.conf, set spark.eventLog.enabled and spark.eventLog.dir.

In spark_home/conf/spark-env.sh, set:

SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=<HDFS path>"

Start the history server:

./start -

Tip: start HDFS

Run the start script from the sbin directory under the Hadoop home:

./start-dfs.sh

fs -ls /

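1 Serialization

What serialization is for

Sending data over the network and reducing memory usage.

How to choose

Java serialization is relatively slow.

Kryo serialization is faster and more compact, but it is not the default mechanism, it does not support every serializable type, and the classes you use must be registered in code.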

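import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster(...).setAppName(...)
// Register the custom classes that Kryo will serialize.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)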

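Open question: I do not fully understand what serialization is, why classes have to be registered, or which classes to register.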
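
As a way to make the registration question concrete, here is a toy sketch in plain Python. It is not Kryo's actual wire format; it only contrasts writing a full class name in front of every record with writing a small numeric ID that both sides agreed on ahead of time. The class names, the two-byte ID, and both helper functions are assumptions invented for this illustration.

```python
import struct

# Hypothetical registry: class name -> small integer ID, fixed before any data is written.
REGISTRY = {"com.example.MyClass1": 10, "com.example.MyClass2": 11}


def encode_unregistered(class_name: str, value: int) -> bytes:
    """Toy format: the full class name travels with every record."""
    name = class_name.encode("utf-8")
    # 2-byte name length, the name itself, then an 8-byte integer payload.
    return struct.pack(">H", len(name)) + name + struct.pack(">q", value)


def encode_registered(class_name: str, value: int) -> bytes:
    """Toy format: only the registered 2-byte ID travels with the record."""
    return struct.pack(">H", REGISTRY[class_name]) + struct.pack(">q", value)


if __name__ == "__main__":
    long_form = encode_unregistered("com.example.MyClass1", 42)
    short_form = encode_registered("com.example.MyClass1", 42)
    print(len(long_form), len(short_form))
```

The Spark tuning guide makes the same point about Kryo: unregistered custom classes still work, but the full class name has to be stored with each object, which is wasteful. The classes worth registering are therefore the custom ones that end up in RDDs or shuffles.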

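2 Memory Management

Spark computes efficiently largely because of how well it manages memory.

Tuning directions: (1) the amount of memory your objects use, (2) the cost of accessing those objects, (3) the overhead of garbage collection.

Memory usage falls into two categories that share one unified region of size (JVM heap space - 300 MB) * 0.6:

Execution: shuffles, joins, sorts, and aggregations (0.5 of the region by default)

Storage: caching and propagating internal data (0.5 of the region by default)

A threshold can be set to control how the region is split between the two.

Memory parameters: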

spark.memory.fraction

spark.memory.storageFraction

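Checking RDD memory usage:

Use the web UI (the Storage page lists cached RDDs) or a method such as SizeEstimator's estimate.

To do: ask how the method is used.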

[Key interview topic] Memory Management Overview

Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.

This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.

spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.
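
The arithmetic behind M and R can be sketched as follows. This is only a back-of-the-envelope calculator based on the formulas quoted above; the function name, the dictionary keys, and the 4096 MB heap in the usage example are assumptions chosen for illustration.

```python
RESERVED_MB = 300  # fixed amount subtracted from the heap, as described above


def unified_memory_mb(heap_mb: float, fraction: float = 0.6,
                      storage_fraction: float = 0.5) -> dict:
    """Estimate the unified region M, the protected storage region R,
    and the remainder left for user data structures, all in MB."""
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the 300 MB reservation")
    m = usable * fraction        # spark.memory.fraction
    r = m * storage_fraction     # spark.memory.storageFraction
    return {"M": m, "R": r, "user": usable - m}


if __name__ == "__main__":
    for name, size in unified_memory_mb(4096).items():
        print(f"{name}: {size:.1f} MB")
```

With the defaults, roughly 40% of the usable heap stays outside M, and execution can always reclaim everything in M except the R portion that holds cached blocks.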

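3 Broadcast Variables

A broadcast variable keeps a read-only value cached on each physical machine, so the tasks running there share it instead of each receiving its own copy.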

broadcastVar = sc.broadcast([1, 2, 3])

broadcastVar.value

Benefit: Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster.

Rule of thumb: in general, tasks larger than about 20 KB are probably worth optimizing.
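
To get a feel for the 20 KB rule of thumb, the sketch below compares a task payload that carries a lookup table with one that carries only a small handle to it. It uses pickle as a stand-in for Spark's task serialization, so the byte counts are illustrative only; the payload shapes, the "broadcast-0" handle, and the helper names are all assumptions.

```python
import pickle

TASK_SIZE_THRESHOLD_BYTES = 20 * 1024  # the "about 20 KB" rule of thumb


def serialized_size(payload) -> int:
    """Size in bytes of the payload once pickled (stand-in for task serialization)."""
    return len(pickle.dumps(payload))


def worth_optimizing(payload) -> bool:
    """True when the serialized payload exceeds the rule-of-thumb threshold."""
    return serialized_size(payload) > TASK_SIZE_THRESHOLD_BYTES


if __name__ == "__main__":
    lookup = {i: str(i) for i in range(10_000)}
    task_with_data = ("partition-0", lookup)            # table shipped inside the task
    task_with_handle = ("partition-0", "broadcast-0")   # only a reference is shipped
    print(serialized_size(task_with_data), worth_optimizing(task_with_data))
    print(serialized_size(task_with_handle), worth_optimizing(task_with_handle))
```

In a real job the handle would be the object returned by sc.broadcast, and tasks would read the table through its value attribute.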

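4 Data Locality

It is usually faster to ship the serialized code to the data than to move the data to the code, because code is much smaller than data.

Locality levels, from closest to farthest: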

PROCESS_LOCAL: data is in the same JVM as the running code. This is the best locality possible.

NODE_LOCAL: data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes.

NO_PREF: data is accessed equally quickly from anywhere and has no locality preference.

RACK_LOCAL: data is on the same rack of servers. Data is on a different server on the same rack, so it needs to be sent over the network, typically through a single switch.

ANY: data is elsewhere on the network and not in the same rack.

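Parameter settings

spark.locality.wait (default 3s): how long Spark waits for a free slot at the current locality level before falling back to the next, less local level.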
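
The fallback behaviour can be pictured with a deliberately simplified model. It is not Spark's scheduler code: it assumes the same wait applies at every level, that the timer never resets, and it leaves out NO_PREF because tasks without a preference have nothing to wait for. The function and constant names are hypothetical.

```python
# Levels a waiting task falls back through, from most to least local.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(waited_s: float, wait_per_level_s: float = 3.0) -> str:
    """Return the least local level a task may run at after waiting waited_s seconds."""
    if waited_s < 0 or wait_per_level_s < 0:
        raise ValueError("times must be non-negative")
    if wait_per_level_s == 0:
        return LEVELS[-1]  # no waiting configured: any level is acceptable at once
    # Each full wait period that elapses unlocks one more level.
    steps = int(waited_s // wait_per_level_s)
    return LEVELS[min(steps, len(LEVELS) - 1)]


if __name__ == "__main__":
    for seconds in (0, 2.9, 3, 7, 10):
        print(seconds, allowed_level(seconds))
```

Raising the wait favours locality at the cost of idle executors; lowering it starts tasks sooner but may pull more data over the network.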
