Spark Streaming Source Code Walkthrough: State Management with updateStateByKey and mapWithState

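Starting with this lesson, we give a brief introduction to state management in Spark Streaming.

Spark Streaming divides work into jobs by BatchDuration, but business requirements often call for processing data over a different, longer period (for example the past 24 hours or the past week), such as ranking sales over the last 24 hours or tracking the latest sales volume for the current year. This means a new result has to be computed from the previous result plus the data that arrived in the new period.

Both updateStateByKey and mapWithState operate on key-value (K, V) data, whereas the RDD class itself (and likewise DStream) defines no operations for key-value data, so an implicit conversion is needed. The implicit conversion lives in the DStream companion object.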

object DStream {

  // `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to
  // `import StreamingContext._` to enable it. Now we move it here to make the compiler find
  // it automatically. However, we still keep the old function in StreamingContext for backward
  // compatibility and forward to the following function directly.
  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }

  ...

}

This creates a PairDStreamFunctions object. The PairDStreamFunctions class provides updateStateByKey, mapWithState and related operations.

1. updateStateByKey Demystified

First, look at updateStateByKey:

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}

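Every updateStateByKey overload returns a DStream.

The state is updated by the updateFunc function. Of its parameters, Seq[V] holds the values that arrived for a key in the current batch, and Option[S] is the state produced for that key by the previous computation. The result of the current computation is again an Option[S].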
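To make the shape of updateFunc concrete, here is a minimal sketch of two update functions with the (Seq[V], Option[S]) => Option[S] signature shown above. The object name and both function names are made up for illustration, and the sketch is plain Scala with no Spark dependency.

```scala
object UpdateFuncSketch {
  // Running total per key: add the values of this batch to the previous state.
  def updateCount(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(previous.getOrElse(0) + newValues.sum)

  // Variant that expires a key: returning None eliminates the state entry,
  // as described in the scaladoc of updateStateByKey.
  def updateOrExpire(newValues: Seq[Int], previous: Option[Int]): Option[Int] =
    if (newValues.isEmpty) None
    else Some(previous.getOrElse(0) + newValues.sum)
}
```

A function of this shape is what would be passed as updateFunc; the second variant shows how a key can be dropped once it stops receiving data.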

The computation needs a Partitioner. Because hashing is efficient and involves no sorting, the default Partitioner is a HashPartitioner.

PairDStreamFunctions.defaultPartitioner:

private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
  new HashPartitioner(numPartitions)
}
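As a rough illustration of why hash partitioning is cheap, the sketch below models how a key can be mapped to a partition index with a non-negative modulo of its hashCode and no sorting. It is a simplified stand-in written for this article, not the HashPartitioner class itself; the names HashPartitionSketch, nonNegativeMod and partitionFor are assumptions of this example.

```scala
object HashPartitionSketch {
  // hashCode may be negative, so fold the remainder back into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Map a key to a partition index; null keys go to partition 0 in this sketch.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)
}
```

The same key always lands in the same partition, which is what lets the state for a key and the new values for that key meet on the same node.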

Now look at the updateStateByKey overload that actually returns a StateDStream:

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. Note, that this function may generate a different
 *                   tuple with a different key than the input key. Therefore keys may be removed
 *                   or added in this way. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}

Take a look at StateDStream:

class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD : Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  super.persist(StorageLevel.MEMORY_ONLY_SER)

  ...

Its RDDs are persisted in memory, in serialized form (MEMORY_ONLY_SER).

Every DStream subclass, StateDStream included, has to override the compute method.

StateDStream.compute:

override def compute(validTime: Time): Option[RDD[(K, S)]] = {

  // Try to get the previous state RDD
  getOrCompute(validTime - slideDuration) match {

    case Some(prevStateRDD) => {    // If previous state RDD exists

      // Try to get the parent RDD
      parent.getOrCompute(validTime) match {
        case Some(parentRDD) => {   // If parent RDD exists, then compute as usual
          computeUsingPreviousRDD(parentRDD, prevStateRDD)
        }
        case None => {    // If parent RDD does not exist

          // Re-apply the update function to the old state RDD
          val updateFuncLocal = updateFunc
          val finalFunc = (iterator: Iterator[(K, S)]) => {
            val i = iterator.map(t => (t._1, Seq[V](), Option(t._2)))
            updateFuncLocal(i)
          }
          val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
          Some(stateRDD)
        }
      }
    }

    case None => {    // If previous session RDD does not exist (first input data)

      // Try to get the parent RDD
      parent.getOrCompute(validTime) match {
        case Some(parentRDD) => {   // If parent RDD exists, then compute as usual
          initialRDD match {
            case None => {
              // Define the function for the mapPartition operation on grouped RDD;
              // first map the grouped tuple to tuples of required type,
              // and then apply the update function
              val updateFuncLocal = updateFunc
              val finalFunc = (iterator : Iterator[(K, Iterable[V])]) => {
                updateFuncLocal (iterator.map (tuple => (tuple._1, tuple._2.toSeq, None)))
              }

              val groupedRDD = parentRDD.groupByKey (partitioner)
              val sessionRDD = groupedRDD.mapPartitions (finalFunc, preservePartitioning)
              // logDebug("Generating state RDD for time " + validTime + " (first)")
              Some (sessionRDD)
            }
            case Some (initialStateRDD) => {
              computeUsingPreviousRDD(parentRDD, initialStateRDD)
            }
          }
        }
        case None => { // If parent RDD does not exist, then nothing to do!
          // logDebug("Not generating state RDD (no previous state, no parent)")
          None
        }
      }
    }
  }
}
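The four branches of compute can be summarized with a small model that uses plain Scala collections in place of RDDs. This is only a sketch of the control flow under simplifying assumptions: the initialRDD branch is left out, the per-key (Seq[V], Option[S]) => Option[S] form of the update function is used, and all names (StateComputeSketch, compute, sumUpdate) are invented for the example.

```scala
object StateComputeSketch {
  type UpdateFunc[V, S] = (Seq[V], Option[S]) => Option[S]

  // Example update function: running sum per key.
  val sumUpdate: UpdateFunc[Int, Int] =
    (values, prev) => Some(prev.getOrElse(0) + values.sum)

  def compute[K, V, S](
      prevState: Option[Map[K, S]],   // stands in for the previous state RDD
      batch: Option[Seq[(K, V)]],     // stands in for the parent RDD of this batch
      update: UpdateFunc[V, S]): Option[Map[K, S]] = {

    def grouped(data: Seq[(K, V)]): Map[K, Seq[V]] =
      data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

    (prevState, batch) match {
      case (Some(prev), Some(data)) =>
        // Previous state and new data: combine both per key.
        val g = grouped(data)
        val entries = for {
          k <- (prev.keySet ++ g.keySet).toSeq
          s <- update(g.getOrElse(k, Seq.empty), prev.get(k))
        } yield k -> s
        Some(entries.toMap)
      case (Some(prev), None) =>
        // No new data: re-apply the update function with an empty Seq.
        val entries = for {
          (k, old) <- prev.toSeq
          s <- update(Seq.empty, Some(old))
        } yield k -> s
        Some(entries.toMap)
      case (None, Some(data)) =>
        // First batch: no previous state, so the state argument is None.
        val entries = for {
          (k, vs) <- grouped(data).toSeq
          s <- update(vs, None)
        } yield k -> s
        Some(entries.toMap)
      case (None, None) =>
        None // nothing to do
    }
  }
}
```

Note that even the "no new data" branch calls the update function once for every key already in the state.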

compute relies on the computeUsingPreviousRDD method, so let us look at it.

StateDStream.computeUsingPreviousRDD:

private [this] def computeUsingPreviousRDD (
    parentRDD : RDD[(K, V)], prevStateRDD : RDD[(K, S)]) = {
  // Define the function for the mapPartition operation on cogrouped RDD;
  // first map the cogrouped tuple to tuples of required type,
  // and then apply the update function
  val updateFuncLocal = updateFunc
  val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
    val i = iterator.map(t => {
      val itr = t._2._2.iterator
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      (t._1, t._2._1.toSeq, headOption)
    })
    updateFuncLocal(i)
  }
  val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
  val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
  Some(stateRDD)
}

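Because cogroup scans all of the data, both the new batch and the entire previous state, and then groups it by key, performance suffers. The cost of each batch grows with the total number of keys held in the state rather than with the size of the batch, so as time passes and the state grows the computation gets slower and slower.

For this reason updateStateByKey is not recommended for large data volumes or for complex computations.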
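The cost pattern described above can be illustrated with a small model that counts how many times the update logic runs. It uses plain Scala maps instead of RDDs and a fixed "running sum" update, so it only demonstrates the full-scan behaviour, not Spark's actual execution; CogroupCostSketch and updateAll are names invented for this sketch.

```scala
object CogroupCostSketch {
  // Returns (new state, number of per-key update calls).
  def updateAll(
      state: Map[String, Int],
      batch: Seq[(String, Int)]): (Map[String, Int], Int) = {
    var calls = 0
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Like cogroup, visit the union of all keys in the state and in the batch.
    val newState = (state.keySet ++ grouped.keySet).toSeq.map { k =>
      calls += 1
      k -> (state.getOrElse(k, 0) + grouped.getOrElse(k, Seq.empty).sum)
    }.toMap
    (newState, calls)
  }
}
```

With 1000 keys in the state and a batch that touches a single key, the update logic still runs 1000 times in this model.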

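2. mapWithState Demystified

Although some users report good results with mapWithState, the source code still marks it as experimental.

There is more than one mapWithState method. Look at the first one.

PairDStreamFunctions.mapWithState: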

/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using [[StateSpec]] class. The state data is accessible in
 * as a parameter of type [[State]] in the mapping function.
 *
 * Example of using `mapWithState`:
 * {{{
 *    // A mapping function that maintains an integer state and return a String
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 * }}}
 *
 * @param spec          Specification of this transformation
 * @tparam StateType    Class type of the state data
 * @tparam MappedType   Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}

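The comment gives a usage example for mapWithState. A mappingFunction has to be defined first. Among its parameters, state (of type State) holds the historical data and behaves like an in-memory table; key identifies which entry of that state is being operated on; value is the new value that arrived for the key.

The StateSpec parameter wraps the mapping function together with the configuration of the transformation (for example partitioner, timeout and initial state data).

mapWithState returns a MapWithStateDStream.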
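A mapping function of the shape described in the scaladoc might look like the sketch below. To stay runnable without Spark, it uses a hypothetical SimpleState class that models only the four calls named in the comment (exists, get, update, remove); it is not Spark's State class, and the running-sum logic is just an example.

```scala
object MappingFunctionSketch {
  // Hypothetical stand-in for State[S]; only exists/get/update/remove are modelled.
  final class SimpleState[S](private var value: Option[S]) {
    def exists(): Boolean = value.isDefined
    def get(): S = value.get
    def update(s: S): Unit = { value = Some(s) }
    def remove(): Unit = { value = None }
    def current: Option[S] = value // helper for inspecting the sketch
  }

  // Maintains an integer running sum per key and returns a String summary.
  def mappingFunction(
      key: String,
      value: Option[Int],
      state: SimpleState[Int]): Option[String] = {
    val previous = if (state.exists()) state.get() else 0
    val sum = previous + value.getOrElse(0)
    state.update(sum)
    Some(s"$key -> $sum")
  }
}
```

The function sees one key at a time, reads the old state for that key, writes the new state back, and returns whatever should appear in the output stream.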

Now look at the State class. Its comment contains an example for reference.

/**
 * :: Experimental ::
 * Abstract class for getting and updating the state in mapping function used in the `mapWithState`
 * operation of a [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]] (Scala)
 * or a [[org.apache.spark.streaming.api.java.JavaPairDStream JavaPairDStream]] (Java).
 *
 * Scala example of using `State`:
 * {{{
 *    // A mapping function that maintains an integer state and returns a String
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Check if state exists
 *      if (state.exists) {
 *        val existingState = state.get  // Get the existing state
 *        val shouldRemove = ...         // Decide whether to remove the state
 *        if (shouldRemove) {
 *          state.remove()     // Remove the state
 *        } else {
 *          val newState = ...
 *          state.update(newState)    // Set the new state
 *        }
 *      } else {
 *        val initialState = ...
 *        state.update(initialState)  // Set the initial state
 *      }
 *      ... // return something
 *    }
 * }}}
...

sealed abstract class State[S] {

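State declares methods such as exists, get, update, remove and isTimingOut, which have to be implemented in a subclass.

There is also an internal implementation class for State, StateImpl: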

/** Internal implementation of the [[State]] interface */
private[streaming] class StateImpl[S] extends State[S] {

  private var state: S = null.asInstanceOf[S]
  private var defined: Boolean = false
  private var timingOut: Boolean = false
  private var updated: Boolean = false
  private var removed: Boolean = false

  // ========= Public API =========
  override def exists(): Boolean = {
    defined
  }

  ...

StateImpl holds a few status variables and overrides the methods declared in State.
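The role of flags such as updated and removed becomes clearer with a small model of a reusable wrapper: one object is pointed at the stored state of each key in turn, and the flags record what the mapping function did to it. The class below is a simplified sketch written for this article (ReusableState and its wrap method are assumptions of the example, and timeout handling is omitted); it is not the StateImpl source.

```scala
object WrappedStateSketch {
  final class ReusableState[S] {
    private var state: Option[S] = None
    private var updated = false
    private var removed = false

    // Point the wrapper at the stored state of the next key and reset the flags.
    def wrap(stored: Option[S]): Unit = {
      state = stored
      updated = false
      removed = false
    }

    def exists: Boolean = state.isDefined
    def get: S = state.get
    def update(s: S): Unit = { state = Some(s); updated = true }
    def remove(): Unit = { state = None; removed = true }

    // Read by the caller afterwards to decide whether to write back or delete.
    def isUpdated: Boolean = updated
    def isRemoved: Boolean = removed
  }
}
```

Reusing one wrapper avoids allocating a new object per record; the caller inspects the flags after each call to decide whether the state map needs to change.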

Go back and look again at mapWithState in PairDStreamFunctions.

@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}

First look at StateSpecImpl. StateSpecImpl is a case class defined alongside the StateSpec class.

/** Internal implementation of [[org.apache.spark.streaming.StateSpec]] interface. */
private[streaming]
case class StateSpecImpl[K, V, S, T](
    function: (Time, K, Option[V], State[S]) => Option[T]) extends StateSpec[K, V, S, T] {

Its parameter is a function.

A code fragment from StateSpecImpl:

  require(function != null)

  @volatile private var partitioner: Partitioner = null
  @volatile private var initialStateRDD: RDD[(K, S)] = null
  @volatile private var timeoutInterval: Duration = null

  ...

  // ================= Private Methods =================

  private[streaming] def getFunction(): (Time, K, Option[V], State[S]) => Option[T] = function

  private[streaming] def getInitialStateRDD(): Option[RDD[(K, S)]] = Option(initialStateRDD)

  private[streaming] def getPartitioner(): Option[Partitioner] = Option(partitioner)

  private[streaming] def getTimeoutInterval(): Option[Duration] = Option(timeoutInterval)

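It has a few private variables and accessor methods for them, wrapped in Option so that an unset value shows up as None. Note in particular the accessor that returns the function.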
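The pattern in this fragment (nullable private fields, exposed through Option-returning getters) can be sketched in isolation. The Spec class below is hypothetical and much smaller than StateSpecImpl: it keeps one function and one optional timeout, and uses java.time.Duration purely so that the example runs without Spark.

```scala
import java.time.Duration

object SpecSketch {
  final class Spec[K, V](val function: (K, V) => V) {
    require(function != null)

    // Unset until the caller configures it, mirroring the nullable fields above.
    @volatile private var timeoutInterval: Duration = null

    // Builder-style setter that returns this so calls can be chained.
    def timeout(interval: Duration): this.type = {
      timeoutInterval = interval
      this
    }

    // Option(null) is None, so an unset value surfaces as None.
    def getTimeoutInterval: Option[Duration] = Option(timeoutInterval)
  }
}
```

Code that consumes the spec can then use map or getOrElse on each setting instead of checking for null.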

Next, look at MapWithStateDStreamImpl, a subclass of MapWithStateDStream:

/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }

It creates an instance of InternalMapWithStateDStream, another DStream subclass.

The InternalMapWithStateDStream class:

private[streaming]
class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])
  extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {

  persist(StorageLevel.MEMORY_ONLY)

InternalMapWithStateDStream.compute:

/** Method that generates a RDD for the given time */
override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
  // Get the previous state or create a new empty state RDD
  val prevStateRDD = getOrCompute(validTime - slideDuration) match {
    case Some(rdd) =>
      if (rdd.partitioner != Some(partitioner)) {
        // If the RDD is not partitioned the right way, let us repartition it using the
        // partition index as the key. This is to ensure that state RDD is always partitioned
        // before creating another state RDD using it
        MapWithStateRDD.createFromRDD[K, V, S, E](
          rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
      } else {
        rdd
      }
    case None =>
      MapWithStateRDD.createFromPairRDD[K, V, S, E](
        spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
        partitioner,
        validTime
      )
  }

  // Compute the new state RDD with previous state RDD and partitioned data RDD
  // Even if there is no data RDD, use an empty one to create a new state RDD
  val dataRDD = parent.getOrCompute(validTime).getOrElse {
    context.sparkContext.emptyRDD[(K, V)]
  }
  val partitionedDataRDD = dataRDD.partitionBy(partitioner)
  val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
    (validTime - interval).milliseconds
  }
  Some(new MapWithStateRDD(
    prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
}

This creates an instance of MapWithStateRDD, a subclass of RDD.

MapWithStateRDD:

private[streaming] class MapWithStateRDD[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    private var prevStateRDD: RDD[MapWithStateRDDRecord[K, S, E]],
    private var partitionedDataRDD: RDD[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long]
  ) extends RDD[MapWithStateRDDRecord[K, S, E]](
    partitionedDataRDD.sparkContext,
    List(
      new OneToOneDependency[MapWithStateRDDRecord[K, S, E]](prevStateRDD),
      new OneToOneDependency(partitionedDataRDD))
  ) {

Each partition of this RDD is represented by a single record of type MapWithStateRDDRecord.

MapWithStateRDD.compute:

override def compute(
    partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {

  val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
  val prevStateRDDIterator = prevStateRDD.iterator(
    stateRDDPartition.previousSessionRDDPartition, context)
  val dataIterator = partitionedDataRDD.iterator(
    stateRDDPartition.partitionedDataRDDPartition, context)

  val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
  val newRecord = MapWithStateRDDRecord.updateRecordWithData(
    prevRecord,
    dataIterator,
    mappingFunction,
    batchTime,
    timeoutThresholdTime,
    removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
  )
  Iterator(newRecord)
}

MapWithStateRDDRecord has a companion object:

private[streaming] object MapWithStateRDDRecord {
  def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
    dataIterator: Iterator[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long],
    removeTimedoutData: Boolean
  ): MapWithStateRDDRecord[K, S, E] = {
    // Create a new state map by cloning the previous one (if it exists) or by creating an empty one
    val newStateMap = prevRecord.map { _.stateMap.copy() }. getOrElse { new EmptyStateMap[K, S]() }

    val mappedData = new ArrayBuffer[E]
    val wrappedState = new StateImpl[S]()

    // Call the mapping function on each record in the data iterator, and accordingly
    // update the states touched, and collect the data returned by the mapping function

The code that follows is the core reason why mapWithState performs well: the mapping function is called only for the keys that appear in the current batch, not for every key in the state.

    dataIterator.foreach { case (key, value) =>
      wrappedState.wrap(newStateMap.get(key))
      val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
      if (wrappedState.isRemoved) {
        newStateMap.remove(key)
      } else if (wrappedState.isUpdated
          || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
        newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
      }
      mappedData ++= returned
    }

    // Get the timed out state records, call the mapping function on each and collect the
    // data returned
    if (removeTimedoutData && timeoutThresholdTime.isDefined) {
      newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
        wrappedState.wrapTimingOutState(state)
        val returned = mappingFunction(batchTime, key, None, wrappedState)
        mappedData ++= returned
        newStateMap.remove(key)
      }
    }

    MapWithStateRDDRecord(newStateMap, mappedData)
  }

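This design relies on the immutability of the RDD while also exploiting mutability inside each record (the state map held by a partition's MapWithStateRDDRecord), and that combination is what makes the processing efficient.

So an immutable RDD can still be used to process data that changes.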
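The per-record update loop can be modelled with ordinary Scala collections to show that only the keys present in the batch are visited. This is a sketch under simplifying assumptions: the state is a plain mutable map that is fully copied (the source calls stateMap.copy(), whose cost is not modelled here), timeouts are ignored, the mapping logic is a fixed running sum, and RecordUpdateSketch and updateRecord are names invented for the example.

```scala
import scala.collection.mutable

object RecordUpdateSketch {
  // Returns (new state, mapped output, number of mapping calls).
  def updateRecord(
      prevState: Map[String, Int],
      batch: Seq[(String, Int)]): (Map[String, Int], Seq[String], Int) = {
    // Start from a copy of the previous state; the previous record stays untouched.
    val newState = mutable.Map.empty[String, Int]
    newState ++= prevState
    val mapped = mutable.ArrayBuffer.empty[String]
    var calls = 0

    // Visit only the records of the current batch.
    batch.foreach { case (key, value) =>
      calls += 1
      val sum = newState.getOrElse(key, 0) + value
      newState.put(key, sum)
      mapped += s"$key=$sum"
    }
    (newState.toMap, mapped.toList, calls)
  }
}
```

With 1000 keys in the state and two records for one key in the batch, the mapping logic runs twice in this model, whereas a cogroup-style update would visit all 1000 keys.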

Note:

Source of this material: DT_大数据梦工厂 (Spark distribution customization)


Last edited: 2017.12.03 06:07:10
