ceph peering

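This series of articles records some of the difficult points you may run into when reading and trying to understand the Ceph code, so it may jump around quite a bit. If anything is described incorrectly, or if you have any questions, discussion is welcome.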

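Start with the official documentation:
docs.ceph.com/docs/master/dev/peering/
It covers essentially all the key points. In particular, remember the Golden Rule. Also keep in mind that a PG cannot serve I/O requests while it is peering, and of course cannot do recovery either. With these concepts in place, reading the code alongside the peering state machine makes the overall picture reasonably clear.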

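Note that there is a clear boundary between peering and recovery. The result of peering is that everyone reaches agreement, but recovery has not started yet. The boundary between the two is activate: activate syncs the result agreed on during peering (pg log, last_update, and so on) to the replicas in actingbackfill, and then starts recovery.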

The key points recorded here are very important for understanding peering.

interval

current interval  or  past interval
        a sequence of OSD map epochs during which the acting set and up set for particular PG do not change

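The above is the description from the official documentation. If you treat the osdmap epoch as time, an interval describes a span of time during which neither the up set nor the acting set of the PG changed.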
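To make the definition concrete, here is a minimal sketch that groups a sequence of per-epoch mappings into intervals. The types `PgMapping` and `Interval` and the function `split_into_intervals` are hypothetical names invented for this illustration, not Ceph's own structures, and the comparison only looks at the up and acting sets. The real check in Ceph considers more conditions than this.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using epoch_t = std::uint32_t;

// Hypothetical, simplified view of what one OSDMap epoch says about one PG.
struct PgMapping {
  epoch_t epoch;
  std::vector<int> up;
  std::vector<int> acting;
};

// Hypothetical interval record: epochs [first, last] share the same up/acting.
struct Interval {
  epoch_t first;
  epoch_t last;
  std::vector<int> up;
  std::vector<int> acting;
};

// Group consecutive epochs whose up set and acting set are both unchanged.
// Assumes `maps` is sorted by epoch.
std::vector<Interval> split_into_intervals(const std::vector<PgMapping>& maps) {
  std::vector<Interval> out;
  for (const auto& m : maps) {
    if (!out.empty() && out.back().up == m.up && out.back().acting == m.acting) {
      out.back().last = m.epoch;                          // interval continues
    } else {
      out.push_back({m.epoch, m.epoch, m.up, m.acting});  // new interval starts
    }
  }
  return out;
}
```

For example, mappings at epochs 10 and 11 with up = acting = {0,1,2}, epochs 12 and 13 with up = acting = {0,1}, and epoch 14 with up = {0,1,3} and acting = {0,1} yield three intervals: [10,11], [12,13] and [14,14]. The last one shows that a change in the up set alone is enough to start a new interval.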

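So are intervals strictly necessary? Were they there from the beginning?
No. Intervals were designed to optimize the computation of prior_set. They were introduced in this commit: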

SHA-1: 1c64db014c681f2746a43faceb4775ed897a3269

* osd: remember past intervals instead of recalculating each time

This _vastly_ improves the speed of build_prior (and thus activate_map).
There is no need to recalculate this information each time as it is fully
dependent on _old_ OSDMaps, not current cluster state.

The PG's past_intervals was also introduced by this patch. Later the interval structure was separated from PG and moved into osd_types.h.

last_epoch_started

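This is another very important concept, abbreviated les. This epoch is the milestone at which peering has truly completed. The official description of the concept: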
···
last epoch start
the last epoch at which all nodes in the acting set for a particular placement group agreed on an authoritative history. At this point, peering is deemed to have been successful.
···
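The term here should read "last epoch started". For a more detailed explanation, see:
last_epoch_started
The following passage from it can serve as part of the rationale for the key function find_best_info: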

info.history.last_epoch_started records a lower bound on the most recent interval in which the pg as a whole     
went active and accepted writes. On a particular osd, it is also an upper bound on the activation epoch of intervals in which 
writes in the local pg log occurred (we update it before accepting writes). Because all committed writes are committed by all 
acting set osds, any non-divergent writes ensure that history.last_epoch_started was recorded by all acting set members in the 
interval. Once peering has queried one osd from each interval back to some seen history.last_epoch_started, it follows that no 
interval after the max history.last_epoch_started can have reported writes as committed (since we record it before recording 
client writes in an interval). Thus, the minimum last_update across all infos with info.last_epoch_started >= 
MAX(history.last_epoch_started) must be an upper bound on writes reported as committed to the client.

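What les is used for:
1. As a checkpoint in the peering history, it reduces the osdmap epochs that build_prior has to examine (including the past intervals it has to examine).
2. As the epoch at which the PG went Active and started serving requests, it excludes some peers when choosing the auth log shard. During peering the auth must be chosen from the peers with the largest les, that is, the last witness of the history; anything else is not authoritative.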
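The role of les in choosing the auth log shard can be sketched roughly as follows. This is a simplified illustration based on the quoted passage, not the actual find_best_info: the type `PeerInfo`, the function `pick_auth_log_shard`, and the flat `version_t` stand-in for a log version are all assumptions made for this example. It also assumes a replicated pool, where a newer last_update is preferred, and ignores tie-breakers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using epoch_t = std::uint32_t;
using version_t = std::uint64_t;  // hypothetical stand-in for a pg log version

// Hypothetical, simplified per-peer info gathered during peering.
struct PeerInfo {
  int osd;
  epoch_t last_epoch_started;
  version_t last_update;
  bool backfill_incomplete;  // true while the peer has not finished backfill
};

// Returns the osd id of the chosen auth log shard, or -1 if none qualifies.
int pick_auth_log_shard(const std::vector<PeerInfo>& infos) {
  // 1. Largest les seen on any peer that is not mid-backfill.
  epoch_t max_les = 0;
  for (const auto& i : infos)
    if (!i.backfill_incomplete)
      max_les = std::max(max_les, i.last_epoch_started);

  // 2. The minimum last_update among peers with les >= max_les is an upper
  //    bound on writes that were reported to the client as committed.
  version_t min_acceptable = std::numeric_limits<version_t>::max();
  for (const auto& i : infos)
    if (i.last_epoch_started >= max_les)
      min_acceptable = std::min(min_acceptable, i.last_update);

  // 3. Among qualifying peers, prefer the newest last_update.
  int best = -1;
  version_t best_update = 0;
  for (const auto& i : infos) {
    if (i.last_update < min_acceptable) continue;   // log is too short
    if (i.last_epoch_started < max_les) continue;   // les is too old
    if (i.backfill_incomplete) continue;            // backfill peers excluded
    if (best == -1 || i.last_update > best_update) {
      best = i.osd;
      best_update = i.last_update;
    }
  }
  return best;
}
```

Note how a peer with a very large last_update but an older les is rejected: its extra log entries come from an interval that the rest of the PG never acknowledged.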

maybe_went_rw

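It indicates whether an interval may have seen rw operations, which makes it possible to filter out some intervals. This is explained in the discussion of up_thru in the peering documentation mentioned above.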

So how does it differ from les?
les cuts off the intervals at the tail (the oldest ones), while maybe_went_rw can filter out intervals between les and the current epoch.

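Note the word "maybe". up_thru is recorded in the osdmap and is state of the whole OSD, so it cannot reflect the state of a single PG. Even if maybe_went_rw is true for an interval, that does not prove the interval completed peering (peering only counts as complete once activate succeeds).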
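The two filters can be pictured together with a small sketch. The type `PastInterval` and the function `osds_to_probe` are hypothetical names for this illustration only. The real prior set computation also has to deal with OSDs that are down or lost, which is left out here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using epoch_t = std::uint32_t;

// Hypothetical, simplified past interval record.
struct PastInterval {
  epoch_t first;
  epoch_t last;
  std::vector<int> acting;
  bool maybe_went_rw;
};

// Walk the intervals from newest to oldest and collect the OSDs whose
// history might matter. Assumes `past` is ordered oldest to newest.
std::set<int> osds_to_probe(const std::vector<PastInterval>& past, epoch_t les) {
  std::set<int> probe;
  for (auto it = past.rbegin(); it != past.rend(); ++it) {
    if (it->last < les) break;          // les cuts off all older intervals
    if (it->acting.empty()) continue;   // nobody could have served I/O
    if (!it->maybe_went_rw) continue;   // filtered between les and now
    probe.insert(it->acting.begin(), it->acting.end());
  }
  return probe;
}
```

With intervals [10,19] acting {0,1,2}, [20,24] acting {0,1}, [25,29] acting {3} with maybe_went_rw false, and [30,34] acting {0,4}, and les = 22, the result is {0, 1, 4}: osd.3 is skipped because its interval never went rw, and osd.2 is skipped because its interval ended before les.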

Let's look at how maybe_went_rw is set:

    if (num_acting &&
        i.primary != -1 &&
        num_acting >= old_pg_pool.min_size &&
        (*could_have_gone_active)(old_acting_shards)) {
      if (out)
        *out << "generate_past_intervals " << i
             << ": not rw,"
             << " up_thru " << lastmap->get_up_thru(i.primary)
             << " up_from " << lastmap->get_up_from(i.primary)
             << " last_epoch_clean " << last_epoch_clean
             << std::endl;
      if (lastmap->get_up_thru(i.primary) >= i.first &&
          lastmap->get_up_from(i.primary) <= i.first) {
        i.maybe_went_rw = true;
        if (out)
          *out << "generate_past_intervals " << i
               << " : primary up " << lastmap->get_up_from(i.primary)
               << "-" << lastmap->get_up_thru(i.primary)
               << " includes interval"
               << std::endl;
      } else if (last_epoch_clean >= i.first &&
                 last_epoch_clean <= i.last) {
        // If the last_epoch_clean is included in this interval, then
        // the pg must have been rw (for recovery to have completed).
        // This is important because we won't know the _real_
        // first_epoch because we stop at last_epoch_clean, and we
        // don't want the oldest interval to randomly have
        // maybe_went_rw false depending on the relative up_thru vs
        // last_epoch_clean timing.
        i.maybe_went_rw = true;
        if (out)
          *out << "generate_past_intervals " << i
               << " : includes last_epoch_clean " << last_epoch_clean
               << " and presumed to have been rw"
               << std::endl;
      } else {
        i.maybe_went_rw = false;
        if (out)
          *out << "generate_past_intervals " << i
               << " : primary up " << lastmap->get_up_from(i.primary)
               << "-" << lastmap->get_up_thru(i.primary)
               << " does not include interval"
               << std::endl;
      }
    } else {
      i.maybe_went_rw = false;
      if (out)
        *out << "generate_past_intervals " << i << " : acting set is too small" << std::endl;
    }

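Setting it from up_thru is easy to understand: once up_thru has been updated successfully the PG proceeds to activate, and once activate succeeds the PG can serve I/O requests, so the interval quite possibly went rw. But what is the point of also setting it from last_epoch_clean? Searching the commit history shows that this was added by a commit dated 2011.10.23: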

SHA-1: 12b3b2d5af01be253980875b386b892b57f951bc

* osd: fix generate_past_intervals maybe_went_rw on oldest interval

We stop working backwards when we hit last_epoch_clean, which means for the
oldest interval first_epoch may not be the _real_ first_epoch.  (We can't
continue working backward because we may have thrown out those maps
entirely.)

However, if the last_epoch_clean epoch is contained within that interval,
we know that the OSD did in fact go rw because it had to have completed
recovery (and thus peering) to set last_clean_epoch in the first place.

This fixes cases where two different nodes have slightly different
past intervals, generate different prior probe sets as a result, and
flip/flop on the acting set choice.  (It may have eventually resolved when
the wrongly excluded node's notify races and arrives in time to be
considered, but that's still clearly no good.)

This does leave the start epoch for that oldest interval incorrect.  That
doesn't currently matter except that it's confusing, but I'm not sure how
to mark it properly, or if it's worth the effort.

Signed-off-by: Sage Weil <sage@newdream.net>

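This says that the computation of past_intervals stops once it reaches last_epoch_clean. Why stop there? Because the OSD may already have discarded the osdmaps older than last_epoch_clean. So the interval that contains last_epoch_clean may be truncated (incorrect): its starting point becomes last_epoch_clean, and the up_thru epoch may fall either before or after last_epoch_clean. None of that matters, though. Since last_epoch_clean occurred within that interval, the interval clearly went rw, so maybe_went_rw is set to true for it as well.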
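Stripped of the debug output, the decision can be restated as a small pure function. This is a sketch for illustration only: the struct `IntervalFacts` and its field names are hypothetical, and the `could_have_gone_active` predicate from the original code is deliberately omitted (treated as always true).

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;

// Hypothetical bundle of the facts the decision depends on.
struct IntervalFacts {
  epoch_t first;             // first epoch of the interval
  epoch_t last;              // last epoch of the interval
  int num_acting;            // number of OSDs in the acting set
  int primary;               // -1 if the interval had no primary
  int pool_min_size;         // min_size of the pool during the interval
  epoch_t primary_up_from;   // up_from of the primary in the last map
  epoch_t primary_up_thru;   // up_thru of the primary in the last map
};

bool maybe_went_rw(const IntervalFacts& i, epoch_t last_epoch_clean) {
  // Too few acting OSDs, or no primary: the PG could not have gone active.
  if (i.num_acting == 0 || i.primary == -1 || i.num_acting < i.pool_min_size)
    return false;
  // The primary was up at the start of the interval and its up_thru reached
  // into the interval, so activation may have happened.
  if (i.primary_up_thru >= i.first && i.primary_up_from <= i.first)
    return true;
  // last_epoch_clean falls inside the interval: recovery finished here, so
  // the PG must have been rw even if first is not the real first epoch.
  if (last_epoch_clean >= i.first && last_epoch_clean <= i.last)
    return true;
  return false;
}
```

The third branch only matters for the oldest interval, where `first` may have been clamped to last_epoch_clean and the up_thru comparison can no longer be trusted.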

pg_info_t

pg info is the PG's metadata. Of its fields, last_update, last_backfill, log_tail, and the les mentioned above are all related to peering. last_complete is mainly related to recovery.

last_update

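last_update can be viewed as the head pointer of the pg log; it points at the log head.

Note that last_update only says that the corresponding pg log entry exists. The data that the log entry describes is not necessarily present.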

last_backfill

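last_backfill, like last_complete, is mainly used for recovery. It records the object position that full recovery (backfill) has reached.

For peering, a backfill peer is an exception. In effect, that peer has fallen too far behind and needs special treatment.

During peering, a peer that has not finished backfill cannot serve as the auth log shard.

Why?
Because the pg log of a backfill peer is hollow: the sub ops that the primary sends to a backfill peer may not carry any data. If a backfill peer were used as the auth log shard, some modifications described by the pg log might not be recorded on any replica of the PG. So the I/O mentioned in the Golden Rule is recorded by every member of the acting set, but that does not include backfill peers. Backfill is the exception.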

log_tail

log_tail is fairly easy to understand: it is mainly used to decide whether incremental (log-based) recovery is possible.
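As a rough illustration of that decision, the sketch below compares a peer's last_update with the log_tail of the authoritative log. The names `PeerLogInfo` and `can_recover_from_log` and the flat `version_t` are assumptions made for this example. The comparison is a simplification of what the real acting set calculation does.

```cpp
#include <cassert>
#include <cstdint>

using version_t = std::uint64_t;  // hypothetical stand-in for a pg log version

// Hypothetical, simplified view of a peer for the recovery decision.
struct PeerLogInfo {
  version_t last_update;     // head of the peer's pg log
  bool backfill_incomplete;  // true while the peer has not finished backfill
};

// A peer can catch up from the log only if its own log head still overlaps
// the authoritative log, i.e. it is not older than the auth log_tail.
// Otherwise the missing entries are gone and a full backfill is needed.
bool can_recover_from_log(const PeerLogInfo& peer, version_t auth_log_tail) {
  return !peer.backfill_incomplete && peer.last_update >= auth_log_tail;
}
```

For instance, if the authoritative log covers versions from 100 onward, a peer with last_update = 120 can be brought up to date by replaying log entries, while a peer with last_update = 80 has a gap that the log can no longer describe.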

pg_history_t

This struct has three classes of members.
The first class describes important historical events of the PG.

  epoch_t epoch_created;       // epoch in which PG was created
  epoch_t last_epoch_started;  // lower bound on last epoch started (anywhere, not necessarily locally)
  epoch_t last_epoch_clean;    // lower bound on last epoch the PG was completely clean.
  epoch_t last_epoch_split;    // as parent

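These historical events are shared by the different replicas of the PG, so they are combined when histories are merged.

The second class belongs to each replica individually and describes historical points of that particular replica.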

  epoch_t same_up_since;       // same up set since
  epoch_t same_interval_since;   // same acting AND up set since
  epoch_t same_primary_since;  // same primary at least back through this epoch.

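Of these, same_interval_since is used the most. As the variable name suggests, it is the starting point of the most recent interval that this replica has gone through.

The third class is scrub related and has no direct connection to the peering process.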
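Because the shared events in the first class are lower bounds, merging two histories can be pictured as keeping the larger value of each field, while the per-replica fields are left alone. The sketch below is an assumption-laden illustration of that idea: `History` and `merge_history` are hypothetical names, only a subset of fields is modeled, and the actual merge in Ceph may treat individual fields differently.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;

// Hypothetical, trimmed-down history with one per-replica field for contrast.
struct History {
  epoch_t epoch_created;
  epoch_t last_epoch_started;
  epoch_t last_epoch_clean;
  epoch_t last_epoch_split;
  epoch_t same_interval_since;  // per replica, never merged
};

// Merge the shared events from `other` into `mine`.
// Returns true if anything in `mine` changed.
bool merge_history(History& mine, const History& other) {
  bool modified = false;
  // A larger lower bound is a tighter one, so keep the maximum.
  auto bump = [&modified](epoch_t& a, epoch_t b) {
    if (a < b) {
      a = b;
      modified = true;
    }
  };
  bump(mine.epoch_created, other.epoch_created);
  bump(mine.last_epoch_started, other.last_epoch_started);
  bump(mine.last_epoch_clean, other.last_epoch_clean);
  bump(mine.last_epoch_split, other.last_epoch_split);
  return modified;
}
```

Merging a local history with les = 20 and last_epoch_clean = 18 against a peer's history with les = 22 and last_epoch_clean = 15 leaves the local copy with les = 22 and last_epoch_clean = 18, and same_interval_since untouched.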

Last edited: 2017.12.05 06:57:10
