Spanner is Google’s highly available global SQL database [CDE+12]. It manages replicated data at great scale, both in terms of size of data and volume of transactions. It assigns globally consistent real-time timestamps to every datum written to it, and clients can do globally consistent reads across the entire database without locking.
The CAP theorem [Bre12] says that you can only have two of the three desirable properties of:
• C: Consistency, which we can think of as serializability for this discussion;
• A: 100% availability, for both reads and updates;
• P: tolerance to network partitions.
This leads to three kinds of systems: CA, CP and AP, based on what letter you leave out. Note that you are not entitled to 2 of 3, and many systems have zero or one of the properties.
For distributed systems over a “wide area”, it is generally viewed that partitions are inevitable, although not necessarily common [BK14]. Once you believe that partitions are inevitable, any distributed system must be prepared to forfeit either consistency (AP) or availability (CP), which is not a choice anyone wants to make. In fact, the original point of the CAP theorem was to get designers to take this tradeoff seriously. But there are two important caveats: first, you only need forfeit something during an actual partition, and even then there are many mitigations (see the “12 years” paper [Bre12]). Second, the actual theorem is about 100% availability, while the interesting discussion here is about the tradeoffs involved for realistic high availability.
(Translator's notes: first, when no partition is in progress, C and A can both be provided; the tradeoff between them only has to be made while a partition is actually occurring. Second, “realistic high availability” here means availability counted in some number of 9s.)
Spanner claims to be consistent and available
Despite being a global distributed system, Spanner claims to be consistent and highly available, which implies there are no partitions and thus many are skeptical. Does this mean that Spanner is a CA system as defined by CAP? The short answer is “no” technically, but “yes” in effect and its users can and do assume CA.
The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system. We explore the impact of partitions below.
Given that Spanner always provides consistency, the real question for a claim of CA is whether or not Spanner’s serious users assume its availability. If its actual availability is so high that users can ignore outages, then Spanner can justify an “effectively CA” claim. This does not imply 100% availability (and Spanner does not and will not provide it), but rather something like 5 or more “9s” (1 failure in 10^5 or less). In turn, the real litmus test is whether or not users (that want their own service to be highly available) write the code to handle outage exceptions: if they haven’t written that code, then they are assuming high availability. Based on a large number of internal users of Spanner, we know that they assume Spanner is highly available.
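To make the "5 or more 9s" figure concrete, the following is a minimal sketch that converts an availability fraction into a count of nines and a yearly downtime budget. The function names and the 365-day year are assumptions chosen for illustration; they do not come from the paper.

```python
import math

# Assumption: a non-leap year is used to size the downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 3600


def nines(availability: float) -> float:
    """Return the number of 'nines' for an availability in [0, 1)."""
    return -math.log10(1.0 - availability)


def downtime_per_year_s(availability: float) -> float:
    """Return the seconds per year during which the service may be down."""
    return (1.0 - availability) * SECONDS_PER_YEAR


# "1 failure in 10^5" expressed as an availability fraction.
five_nines = 1 - 1e-5
print(f"nines: {nines(five_nines):.2f}")
print(f"downtime budget: {downtime_per_year_s(five_nines):.1f} s/year")
```

Under these assumptions, five nines leaves a budget of roughly 315 seconds, a little over five minutes, of unavailability per year.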
A second refinement is that there are many other sources of outages, some of which take out the users in addition to Spanner (“fate sharing”). We actually care about the differential availability, in which the user is up (and making a request) to notice that Spanner is down. This number is strictly higher (more available) than Spanner’s actual availability — that is, you have to hear the tree fall to count it as a problem.
(Translator's note: this passage is hard to translate. The translator's reading is that Spanner's availability should not be equated with the availability of the system built on it: Spanner's own availability is higher than that of your system, and higher than the figure it publicly claims.)
A third issue is whether or not outages are due to partitions. If the primary causes of Spanner outages are not partitions, then CA is in some sense more accurate. For example, any database cannot provide availability if all of its replicas are offline, which has nothing to do with partitions. Such a multi-replica outage should be very rare, but if partitions are significantly more rare, then you can effectively ignore partitions as a factor in availability. For Spanner, this means that when there is an availability outage, it is not in practice due to a partition, but rather some other set of multiple faults (as no single fault will forfeit availability).
(Translator's note: this seems to say that a single server failure does not affect the availability of the whole cluster, because other replicas can still serve; availability drops only when many faults occur together, such as a large number of servers going down.)
Availability data
Before we get to Spanner, it is worth taking a look at the evolution of Chubby, another wide-area system that provides both consistency and availability. The original Chubby paper [Bur06] mentioned nine outages of 30 seconds or more in 700 days, and six of those were network related (as discussed in [BK14]). This corresponds to an availability worse than 5 9s (at best), to a more realistic 4 9s if we assume an average of 10 minutes per outage, and potentially even 3 9s at hours per outage.
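The 4 9s and 3 9s estimates for the original Chubby figures follow from simple arithmetic on nine outages in 700 days. The sketch below reproduces them; the helper names are hypothetical, and the one-hour case is an assumed stand-in for "hours per outage".

```python
import math


def availability(outages: int, avg_outage_s: float, window_days: float) -> float:
    """Fraction of the window during which the service was up."""
    window_s = window_days * 24 * 3600
    return 1.0 - (outages * avg_outage_s) / window_s


def nines(a: float) -> float:
    """Number of 'nines' for an availability in [0, 1)."""
    return -math.log10(1.0 - a)


# Nine outages in 700 days, as reported for the original Chubby deployment.
ten_minutes = availability(outages=9, avg_outage_s=10 * 60, window_days=700)
one_hour = availability(outages=9, avg_outage_s=3600, window_days=700)

print(f"10 min per outage: {nines(ten_minutes):.2f} nines")
print(f"1 hour per outage: {nines(one_hour):.2f} nines")
```

With an average of 10 minutes per outage the result is just over 4 nines, and with an hour per outage it falls to a little over 3 nines, in line with the estimates in the text.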
For locking and consistent read/write operations, modern geographically distributed Chubby cells provide an average availability of 99.99958% (for 30s+ outages) due to various network, architectural and operational improvements. Starting in 2009, due to “excess” availability, Chubby’s Site Reliability Engineers (SREs) started forcing periodic outages to ensure we continue to understand dependencies and the impact of Chubby failures.
(Translator's note: in short, Chubby fails so rarely that the SREs have to create outages by hand so that users stay familiar with the impact of a Chubby failure.)
Internally, Spanner provides a similar level of reliability to Chubby; that is, better than 5 9s. The Cloud version has the same foundation, but adds some new pieces, so it may be a little lower in practice for a while.
The pie chart above reveals the causes of Spanner incidents internally. An incident is an unexpected event, but not all incidents are outages; some can be masked easily. The chart is weighted by frequency not by impact. The bulk of the incidents (User) are due to user errors such as overload or misconfiguration and mostly affect that user, whereas the remaining categories could affect all users in an area. Cluster incidents reflect non-network problems in the underlying infrastructure, including problems with servers and power. Spanner automatically works around these incidents by using other replicas; however, SRE involvement is sometimes required to fix a broken replica. Operator incidents are accidents induced by SREs, such as a misconfiguration. Bug implies a software error that caused some problem; these can lead to large or small outages. The two biggest outages were both due to software bugs that affected all replicas of a particular database at the same time. Other is a grab bag of various problems, most of which occurred only once.
The Network category, under 8%, is where partitions and networking configuration problems appear. There were no events in which a large set of clusters were partitioned from another large set of clusters. Nor was a Spanner quorum ever on the minority side of a partition. We did see individual data centers or regions get cut off from the rest of the network. We also had some misconfigurations that under-provisioned bandwidth temporarily, and we saw some temporary periods of bad latency related to hardware failures. We saw one issue in which one direction of traffic failed, causing a weird partition that had to be resolved by bringing down some nodes. So far, no large outages were due to networking incidents.
Summarizing, to claim “effectively CA” a system must be in this state of relative probabilities: 1) At a minimum it must have very high availability in practice (so that users can ignore exceptions), and 2) as this is about partitions it should also have a low fraction of those outages due to partitions. Spanner meets both.
(Translator's note: the “relative probabilities” phrasing is hard to interpret and was translated literally. The translator reads “exceptions” here as abnormal operating conditions rather than programming-language exceptions.)
It’s the network
Many assume that Spanner somehow gets around CAP via its use of TrueTime, which is a service that enables the use of globally synchronized clocks. Although remarkable, TrueTime does not significantly help achieve CA; its actual value is covered below. To the extent there is anything special, it is really Google’s wide-area network, plus many years of operational improvements, that greatly limit partitions in practice, and thus enable high availability.
First, Google runs its own private global network. Spanner is not running over the public Internet; in fact, every Spanner packet flows only over Google-controlled routers and links (excluding any edge links to remote clients). Furthermore, each data center typically has at least three independent fibers connecting it to the private global network, thus ensuring path diversity for every pair of data centers. Similarly, there is redundancy of equipment and paths within a datacenter. Thus normally catastrophic events, such as cut fiber lines, do not lead to partitions or to outages.
The real risk for a partition is thus not a cut path, but rather some kind of broad config or software upgrade that breaks multiple paths simultaneously. This is a real risk and something that Google continues to work to prevent and mitigate. The general strategy is to limit the impact (or “blast radius”) of any particular update, so that when we inevitably push a bad change, it only takes out some paths or some replicas. We then fix those before attempting any other changes.
(Translator's note: this is presumably similar to a staged rollout with gradual traffic shifting, where an update is opened up step by step so that the risk stays under control.)
Although the network can greatly reduce partitions, it cannot improve the speed of light. Consistent operations that span a wide area have a significant minimum round trip time, which can be tens of milliseconds or more across continents. (A distance of 1000 miles is about 5 million feet, so at ½ foot per nanosecond, the minimum would be 10 ms.) Google defines “regions” to have a 2ms round-trip time, so that regional offerings provide a balance between latency and disaster tolerance. Spanner mitigates latency via extensive pipelining of transactions, but that does not help single-transaction latency. For reads, latency is typically low, due to global timestamps and the ability to use a local replica (covered below).
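The speed-of-light estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is hypothetical, and it uses the exact 5,280 feet per mile rather than the rounded 5 million feet in the text.

```python
FEET_PER_MILE = 5280


def one_way_delay_ms(distance_miles: float, feet_per_ns: float = 0.5) -> float:
    """Minimum one-way propagation delay at the given signal speed."""
    feet = distance_miles * FEET_PER_MILE
    nanoseconds = feet / feet_per_ns
    return nanoseconds / 1e6  # 1 ms = 1e6 ns


delay = one_way_delay_ms(1000)
print(f"1000 miles one way: {delay:.2f} ms")
print(f"1000 miles round trip: {2 * delay:.2f} ms")
```

At ½ foot per nanosecond, 1000 miles takes about 10.6 ms to traverse once, which matches the text's rounded 10 ms; a full round trip over the same distance doubles that figure.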
A model with weaker consistency could have lower update latency. However, without the long round trip it would also have a window of lower durability, since a disaster could take out the local site and delete all copies before the data is replicated to another region.
What happens during a Partition
To understand partitions, we need to know a little bit more about how Spanner works. As with most ACID databases, Spanner uses two-phase commit (2PC) and strict two-phase locking to ensure isolation and strong consistency. 2PC has been called the “anti-availability” protocol [Hel16] because all members must be up for it to work. Spanner mitigates this by having each member be a Paxos group, thus ensuring each 2PC “member” is highly available even if some of its Paxos participants are down. Data is divided into groups that form the basic unit of placement and replication.
(Translator's note: a striking idea, treating an entire Paxos group as one logical 2PC member.)
As mentioned above, in general Spanner chooses C over A when a partition occurs. In practice, this is due to a few specific choices:
• Use of Paxos groups to achieve consensus on an update; if the leader cannot maintain a quorum due to a partition, updates are stalled and the system is not available (by the CAP definition). Eventually a new leader may emerge, but that also requires a majority.
• Use of two-phase commit for cross-group transactions also means that a partition of the members can prevent commits.
(Translator's note: in the first case, where updates stall because the leader cannot maintain a quorum, the system is behaving as CP.)
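The majority requirement in the first bullet can be illustrated with a small sketch. It only models counting: replicas and partition sides are plain sets of hypothetical names, and nothing here reflects Spanner's actual Paxos implementation.

```python
from typing import List, Optional, Set


def majority(group_size: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return group_size // 2 + 1


def side_with_quorum(replicas: Set[str], sides: List[Set[str]]) -> Optional[Set[str]]:
    """Return the partition side that can still reach a majority, if any."""
    needed = majority(len(replicas))
    for side in sides:
        if len(replicas & side) >= needed:
            return side
    return None


group = {"a", "b", "c", "d", "e"}

# A two-way split: the three-replica side keeps a quorum and can elect a leader.
print(side_with_quorum(group, [{"a", "b", "c"}, {"d", "e"}]))

# A three-way split: no side holds a majority, so updates stall everywhere.
print(side_with_quorum(group, [{"a", "b"}, {"c", "d"}, {"e"}]))
```

Because at most one side can hold a majority, two leaders can never both make progress, which is why this choice favors C over A.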
The most likely outcome of a partition in practice is that one side has a quorum and will continue on just fine, perhaps after electing some new leaders. Thus the service continues to be available, but users on the minority side have no access. But this is a case where differential availability matters: those users are likely to have other significant problems, such as no connectivity, and are probably also down. This means that multi-region services built on top of Spanner tend to work relatively well even during a partition. It is possible, but less likely, that some groups will not be available at all.
Transactions in Spanner will work as long as all of the touched groups have a quorum-elected leader and are on one side of the partition. This means that some transactions work perfectly and some will time out, but they are always consistent. An implementation property of Spanner is that any reads that return are consistent, even if the transaction later aborts (for any reason, including time outs).
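The condition for a transaction to work during a partition can be sketched as a predicate over the groups it touches. This is a simplification under stated assumptions: each group is modeled as a set of hypothetical site names, and "has a quorum-elected leader on this side" is approximated as "a majority of the group's replicas are on this side".

```python
from typing import Dict, Set


def group_has_quorum(replica_sites: Set[str], side: Set[str]) -> bool:
    """True if a majority of the group's replicas are reachable on this side."""
    reachable = len(replica_sites & side)
    return reachable > len(replica_sites) // 2


def transaction_can_proceed(touched: Dict[str, Set[str]], side: Set[str]) -> bool:
    """A transaction works only if every touched group has a quorum on one side."""
    return all(group_has_quorum(sites, side) for sites in touched.values())


groups = {
    "g1": {"us", "eu", "asia"},
    "g2": {"us", "eu", "sa"},
}

print(transaction_can_proceed(groups, {"us", "eu"}))    # both groups have 2 of 3
print(transaction_can_proceed(groups, {"asia", "sa"}))  # each group has only 1 of 3
```

A transaction that fails this check would time out rather than commit inconsistent data, which is the behavior the paragraph describes.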
In addition to normal transactions, Spanner supports snapshot reads, which are read at a particular time in the past. Spanner maintains multiple versions over time, each with a timestamp, and thus can precisely answer snapshot reads with the correct version. In particular, each replica knows the time for which it is caught up (for sure), and any replica can unilaterally answer a read before that time (unless it is way too old and has been garbage collected). Similarly, it is easy to read (asynchronously) at the same time across many groups. Snapshot reads do not need locks at all. In fact, read-only transactions are implemented as a snapshot read at the current time (at any up-to-date replica).
(Translator's note: my understanding is that each stored version uses its timestamp as its version number, so a snapshot read at a given time returns the newest version at or before that time.)
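The idea of keeping multiple timestamped versions and answering a read "as of" a past time can be sketched with a sorted list and binary search. The class below is a hypothetical illustration of multi-version storage for one cell, not Spanner's storage format.

```python
import bisect
from typing import Any, List, Optional


class VersionedCell:
    """Keeps every written value together with its commit timestamp."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []  # kept sorted in ascending order
        self._values: List[Any] = []

    def write(self, ts: int, value: Any) -> None:
        """Record a new version without discarding older ones."""
        i = bisect.bisect_right(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def read_at(self, ts: int) -> Optional[Any]:
        """Return the newest version whose timestamp is at or before ts."""
        i = bisect.bisect_right(self._timestamps, ts)
        return self._values[i - 1] if i > 0 else None


cell = VersionedCell()
cell.write(10, "a")
cell.write(20, "b")
print(cell.read_at(15))  # version written at 10
print(cell.read_at(25))  # version written at 20
print(cell.read_at(5))   # nothing existed yet
```

Since a read never modifies or blocks on the newest version, it needs no locks, which is the property the paragraph highlights.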
Snapshot reads are thus a little more robust to partitions. In particular, a snapshot read will work if:
- There is at least one replica for each group on the initiating side of the partition, and
- The timestamp is in the past for those replicas.
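The two conditions for a snapshot read to succeed can be written as a single check. In this sketch, each replica is annotated with the time up to which it is caught up; the name `caught_up_to` and the data layout are assumptions made for illustration.

```python
from typing import Dict, Set


def snapshot_read_ok(
    read_ts: int,
    groups: Dict[str, Dict[str, int]],  # group -> {replica: caught_up_to}
    reachable: Set[str],                # replicas on the initiating side
) -> bool:
    """True if every group has a reachable replica caught up to read_ts."""
    for replicas in groups.values():
        if not any(
            name in reachable and caught_up_to >= read_ts
            for name, caught_up_to in replicas.items()
        ):
            return False
    return True


groups = {
    "g1": {"r1": 100, "r2": 90},
    "g2": {"r3": 95},
}

print(snapshot_read_ok(90, groups, {"r2", "r3"}))  # both groups can answer
print(snapshot_read_ok(93, groups, {"r2", "r3"}))  # r2 is only caught up to 90
print(snapshot_read_ok(50, groups, {"r1"}))        # g2 has no reachable replica
```

Note that no quorum is needed here: a single sufficiently caught-up replica per group is enough, which is why snapshot reads tolerate partitions better than updates.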
The latter might not be true if the leader is stalled due to a partition, and that could last as long as the partition lasts, since it might not be possible to elect a new leader on this side of the partition. During a partition, it is likely that reads at timestamps prior to the start of the partition will succeed on both sides of the partition, as any reachable replica that has the data suffices.
What about TrueTime?
In general, synchronized clocks can be used to avoid communication in a distributed system. Barbara Liskov provides a fine overview with many examples [Lis91]. For our purposes, TrueTime is a global synchronized clock with bounded non-zero error: it returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. Thus, if two intervals do not overlap, then we know calls were definitely ordered in real time. If the intervals overlap, we do not know the actual order.
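The interval rule can be sketched directly: two calls are ordered only when their intervals do not overlap. The `TTInterval` type and the string results are hypothetical names for illustration and are not the TrueTime API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    """A time interval guaranteed to contain the true time of the call."""
    earliest: float
    latest: float


def definitely_before(a: TTInterval, b: TTInterval) -> bool:
    """True only if a's whole interval ends before b's begins."""
    return a.latest < b.earliest


def order(a: TTInterval, b: TTInterval) -> str:
    """Classify two calls as ordered one way, the other way, or unknown."""
    if definitely_before(a, b):
        return "a_before_b"
    if definitely_before(b, a):
        return "b_before_a"
    return "unknown"  # overlapping intervals: real-time order cannot be inferred


print(order(TTInterval(1.0, 2.0), TTInterval(3.0, 4.0)))
print(order(TTInterval(1.0, 3.5), TTInterval(3.0, 4.0)))
```

The "unknown" case is the reason a system that wants a definite order has to wait until the intervals no longer overlap.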
One subtle thing about Spanner is that it gets serializability from locks, but it gets external consistency (similar to linearizability) from TrueTime. Spanner’s external consistency invariant is that for any two transactions, T1 and T2 (even if on opposite sides of the globe): if T2 starts to commit after T1 finishes committing, then the timestamp for T2 is greater than the timestamp for T1.
Quoting from Liskov [Lis91, section 7]:
Synchronized clocks can be used to reduce the probability of having a violation of external consistency. Essentially the primary holds leases, but the object in question is the entire replica group. Each message sent by a backup to the primary gives the primary a lease. The primary can do a read operation unilaterally if it holds unexpired leases from a sub-majority of backups. …
The invariant in this system is: whenever a primary performs a read it holds valid leases from a sub-majority of backups. This invariant will not be preserved if clocks get out of synch.
(Translator's note: the synchronized-clock argument in this quotation is fairly involved. Here a sub-majority means one fewer than a majority, that is, sub-majority = majority - 1.)
Spanner’s use of TrueTime as the clock ensures the invariant holds. In particular, during a commit, the leader may have to wait until it is sure the commit time is in the past (based on the error bounds). This “commit wait” is not a long wait in practice and it is done in parallel with (internal) transaction communication. In general, external consistency requires monotonically increasing timestamps, and “waiting out the uncertainty” is a common pattern.
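Commit wait can be sketched with a simulated clock. Everything here is hypothetical: `FakeTrueTime`, its fixed error bound `epsilon`, and the polling step are stand-ins, since the real TrueTime service is internal to Google. The sketch picks the commit timestamp at the upper bound of the current interval and then waits until the lower bound has passed it.

```python
from typing import Tuple


class FakeTrueTime:
    """Simulated clock with a fixed error bound (an assumption of this sketch)."""

    def __init__(self, epsilon: float) -> None:
        self.true_now = 0.0
        self.epsilon = epsilon

    def now(self) -> Tuple[float, float]:
        """Return (earliest, latest) bounds around the simulated true time."""
        return (self.true_now - self.epsilon, self.true_now + self.epsilon)

    def advance(self, dt: float) -> None:
        """Move simulated time forward instead of sleeping."""
        self.true_now += dt


def commit_with_wait(tt: FakeTrueTime, step: float = 0.001) -> Tuple[float, float]:
    """Choose a commit timestamp, then wait until it is surely in the past."""
    _, commit_ts = tt.now()  # upper bound of the current interval
    waited = 0.0
    while tt.now()[0] <= commit_ts:  # earliest possible time has not passed it yet
        tt.advance(step)
        waited += step
    return commit_ts, waited


tt = FakeTrueTime(epsilon=0.004)
commit_ts, waited = commit_with_wait(tt)
print(f"commit_ts={commit_ts:.3f}s waited={waited:.3f}s")
```

In this model the wait comes out at roughly twice the error bound, which shows why a small clock uncertainty keeps commit wait short.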
Spanner aims to elect leaders for an extended time, typically 10 seconds, by using renewable leases for elected leaders. As discussed by Liskov, every time a quorum agrees on a decision the lease is extended, as the participants just verified that the leadership is effective. When a leader fails there are two options: 1) you can wait for the lease to expire and then elect a new leader, or 2) you can restart the old leader, which might be faster. For some failures, we can send out a “last gasp” UDP packet to release the lease, which is an optimization to speed up expiration. As unplanned failures are rare in a Google data center, the long lease makes sense. The lease also ensures monotonicity of time across leaders, and enables group participants to serve reads within the lease time even without a leader.
However, the real value of TrueTime is in what it enables in terms of consistent snapshots. Stepping back a bit, there is a long history of multi-version concurrency-control systems (MVCC) [Ree78] that separately keep old versions and thus allow reading past versions regardless of the current transactional activity. This is a remarkably useful and underrated property: in particular, in Spanner snapshots are consistent (for their time) and thus whatever invariants hold for your system, they will also hold for the snapshot. This is true even when you don’t know what the invariants are! Essentially, snapshots are taken in between consecutive transactions and reflect everything up to the time of the snapshot, but nothing more. Without transactionally consistent snapshots, it is difficult to restart from a past time, as the contents may reflect a partially applied transaction that violates some invariant or integrity constraint. It is the lack of consistency that sometimes makes restoring from backups hard to do; in particular, this shows up as some corruption that needs to be fixed by hand.
For example, consider using MapReduce to perform a large analytics query over a database. On Bigtable, which also stores past versions, the notion of time is “jagged” across the data shards, which makes the results unpredictable and sometimes inconsistent (especially for the very recent past). On Spanner, the same MapReduce can pick a precise timestamp and get repeatable and consistent results.
(Translator's note: why not describe “repeatable” as idempotent?)
TrueTime also makes it possible to take snapshots across multiple independent systems, as long as they use (monotonically increasing) TrueTime timestamps for commit, agree on a snapshot time, and store multiple versions over time (typically in a log). This is not limited to Spanner: you can make your own transactional system and then ensure snapshots that are consistent across both systems (or even k systems). In general, you need a 2PC (while holding locks) across these systems to agree on the snapshot time and confirm success, but the systems need not agree on anything else, and can be wildly different.
You can also use timestamps as tokens passed through a workflow. For example, if you make an update to a system, you can pass the time of that update to the next stage of the workflow, so that it can tell if its system reflects time after that event. In the case of a partition, this may not be true, in which case the next stage should actually wait if it wants consistency (or proceed if it wants availability). Without the time token, it is hard to know that you need to wait. This isn’t the only way to solve this problem, but it does so in a graceful robust way that also ensures eventual consistency. This is particularly useful when the different stages share no code and have different administrators — both can agree on time with no communication.
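The timestamp-as-token pattern can be sketched as a decision made by the next workflow stage. The function name, its parameters, and the returned labels are assumptions for illustration; the text does not prescribe an interface.

```python
def next_stage_action(
    token_ts: float,
    applied_through_ts: float,
    prefer_consistency: bool = True,
) -> str:
    """Decide what a downstream stage should do with an incoming time token.

    token_ts: time of the upstream update, passed along the workflow.
    applied_through_ts: time up to which the local system reflects updates.
    """
    if applied_through_ts >= token_ts:
        return "proceed"  # local state already includes the upstream update
    # Local state is behind, for example because of a partition.
    return "wait" if prefer_consistency else "proceed_stale"


print(next_stage_action(token_ts=100.0, applied_through_ts=120.0))
print(next_stage_action(token_ts=100.0, applied_through_ts=80.0))
print(next_stage_action(token_ts=100.0, applied_through_ts=80.0, prefer_consistency=False))
```

The comparison needs nothing but the two timestamps, so stages that share no code and have different administrators can still make the same decision.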
Snapshots are about the past, but you can also agree on the future. A feature of Spanner is that you can agree on the time in the future for a schema change. This allows you to stage the changes for the new schema so that you are able to serve both versions. Once you are ready, you can pick a time to switch to the new schema atomically at all replicas. (You can also pick the time before you stage, but then you might not be ready by the target time.) In theory at least, you can also do future operations, such as a scheduled delete or a change in visibility.
TrueTime itself could be hindered by a partition. The underlying source of time is a combination of GPS receivers and atomic clocks, both of which can maintain accurate time with minuscule drift by themselves. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
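The way a node's uncertainty grows while it is cut off from the time masters can be sketched as a linear bound. The base error and the drift rate used below are illustrative assumptions, not values stated in this text.

```python
def uncertainty_s(base_epsilon_s: float, drift_rate: float, seconds_since_sync: float) -> float:
    """Error bound after a period without contact with a time master.

    drift_rate is the assumed worst-case local clock drift, in seconds of
    error per second of elapsed time.
    """
    return base_epsilon_s + drift_rate * seconds_since_sync


# Assumed values for illustration only.
BASE_EPSILON_S = 0.001  # 1 ms right after a successful sync
DRIFT_RATE = 200e-6     # 200 microseconds of drift per second

for elapsed in (0, 30, 300):
    eps = uncertainty_s(BASE_EPSILON_S, DRIFT_RATE, elapsed)
    print(f"{elapsed:>4} s without sync -> epsilon = {eps * 1000:.1f} ms")
```

Operations that wait out the uncertainty take longer as the bound grows, but they still complete, which matches the behavior described above.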
Conclusion
Spanner reasonably claims to be an “effectively CA” system despite operating over a wide area, as it is always consistent and achieves greater than 5 9s availability. As with Chubby, this combination is possible in practice if you control the whole network, which is rare over the wide area. Even then, it requires significant redundancy of network paths, architectural planning to manage correlated failures, and very careful operations, especially for upgrades. Even then outages will occur, in which case Spanner chooses consistency over availability.
Spanner uses two-phase commit to achieve serializability, but it uses TrueTime for external consistency, consistent reads without locking, and consistent snapshots.
Acknowledgements:
Thanks in particular to Spanner and TrueTime experts Andrew Fikes, Wilson Hsieh, and Peter Hochschild. Additional thanks to Brian Cooper, Kurt Rosenfeld, Chris Taylor, Susan Shepard, Sunil Mushran, Steve Middlekauff, Cliff Frey, Cian Cullinan, Robert Kubis, Deepti Srivastava, Sean Quinlan, Mike Burrows, and Sebastian Kanthak.
Original paper:
https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45855.pdf