# Fossil and the CAP Theorem

[The CAP theorem][cap] is a fundamental, mathematically proven result
about distributed systems.  A software system can no more get around it
than a physical system can get past *c*, the [speed of light][sol]
constant.

Fossil is a distributed system, so it can be useful to think about it in
terms of the CAP theorem. We won’t discuss the theorem itself or how you
reason using its results here. For that, we recommend [this article][tut].

[cap]: https://en.wikipedia.org/wiki/CAP_theorem
[sol]: https://en.wikipedia.org/wiki/Speed_of_light
[tut]: https://www.ibm.com/cloud/learn/cap-theorem


<a id="ap"></a>
## Fossil Is an AP-Mode System

As with all common [DVCSes][dvcs], Fossil is an AP-mode system, meaning
that your local clone isn’t necessarily consistent with all other clones
(C), but the system is always available for use (A) and
partition-tolerant (P). This is what allows you to turn off Fossil’s
autosync mode, go off-network, and continue working with Fossil, even
though only a single node (your local repo clone) is accessible at the
time.

You may consider that going back online restores “C”, because upon sync,
you’re now consistent with the repo you cloned from. But, if another
user has gone offline in the meantime, and they’ve made commits to their
disconnected repo, *you* aren’t consistent with *them.* Besides which,
if another user commits to the central repo, that doesn’t push the
change down to you automatically: even if all users of a Fossil system
are online at the same instant, and they’re all using autosync, Fossil
doesn’t guarantee consistency across the network.
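The scenario above can be sketched with a toy model. This is not Fossil’s actual sync protocol: the `Clone` class and its methods are hypothetical, each repository is reduced to a set of commit IDs, and sync is assumed to be a simple two-way set union.

```python
class Clone:
    """Toy stand-in for a repository clone: just a set of commit IDs."""

    def __init__(self, name, commits=()):
        self.name = name
        self.commits = set(commits)  # copy, so clones never share state

    def commit(self, commit_id):
        # AP mode: a local commit always succeeds, even with no network.
        self.commits.add(commit_id)

    def sync_with(self, other):
        # Assumption for illustration: sync is a two-way union.
        merged = self.commits | other.commits
        self.commits = set(merged)
        other.commits = set(merged)


central = Clone("central", {"c1"})
mine = Clone("mine", central.commits)
theirs = Clone("theirs", central.commits)

mine.commit("c2")        # I commit while offline
theirs.commit("c3")      # so does another user
mine.sync_with(central)  # I come back online and sync

consistent_with_central = mine.commits == central.commits
consistent_with_other_user = mine.commits == theirs.commits
```

After the sync, `consistent_with_central` is true but `consistent_with_other_user` is false: syncing with the repo you cloned from says nothing about a clone that is still disconnected.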

There’s no getting around the CAP theorem!

[dvcs]: https://en.wikipedia.org/wiki/Distributed_version_control


<a id="ca"></a>
## CA-Mode Fossil

What would it mean to redesign Fossil to be CA-mode?

It means we get a system that is always consistent (C) and available (A)
as long as there are no partitions (P).

That’s basically [CVS] and [Subversion][svn]: you can only continue
working with the repository itself as long as your connection to the
central repo server functions.

It’s rather trivial to talk about single-point-of-failure systems like
CVS or Subversion as CA-mode. Another common example used this way is a
classical RDBMS, but aren’t we here to talk about distributed systems?
What’s a good example of a *distributed* CA-mode system?

A better example is [Kafka], which in its default configuration assumes
it is being run on a corporate LAN in a single data center, so network
partitions are exceedingly rare. It therefore sacrifices partition
tolerance to get the advantages of CA-mode operation. In its particular
application of this mode, a message isn’t “committed” until all running
brokers have a copy of it, at which point the message becomes visible to
the client(s). In that way, all clients always see the same message
store as long as all of the Kafka servers are up and communicating.

How would that work in Fossil terms?

If there is only one central server and I clone it on my local laptop,
then CA mode means I can only commit if the remote Fossil is available,
so in that sense, it devolves to the old CVS model.

What if there are three clones? Perhaps there is a central server *A*,
the clone *B* on my laptop, and the clone *C* on your laptop. Doesn’t CA
mode now mean that, even after I push my commit from *B* to the central
repo *A*, it doesn’t count as committed until you, my coworker, *also*
pull a copy of that commit down to your laptop *C*, validating the
commit through the network?

That’s one way to design the system, but another way would be to scope
the system to only talk about proper *servers*, not about the clients.
In that model, a CA-mode Fossil alternative might require 2+ servers to
be running for proper replication. When I make a commit, unless all of
the configured servers are online, I can’t commit. This is basically CVS
with replication, but without any useful amount of failover.
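As a minimal sketch of that server-only rule, assuming a hypothetical helper named `ca_commit_allowed` and made-up server names, a commit is accepted only when every configured server is reachable:

```python
def ca_commit_allowed(configured_servers, reachable_servers):
    """CA-mode rule from the text: every configured server must be reachable."""
    return set(configured_servers) <= set(reachable_servers)


servers = ["server-1", "server-2"]  # hypothetical names

all_up = ca_commit_allowed(servers, ["server-1", "server-2"])
one_down = ca_commit_allowed(servers, ["server-1"])
```

Here `all_up` is true and `one_down` is false. Losing any single server blocks commits, which is why replication alone buys no useful failover in this design.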

[CVS]:   https://en.wikipedia.org/wiki/Concurrent_Versions_System
[Kafka]: https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
[svn]:   https://en.wikipedia.org/wiki/Apache_Subversion


<a id="cp"></a>
## CP-Mode Fossil

What if we modify our CA-mode system above with “warm spares”?  We can
say that commits must go to all of the spares as well as the active
servers, but a loss of one active server requires that one warm spare
come into active state, and all of the clients learn that the spare is
now considered “active.” At this point, you have a CP-mode system, not a
CA-mode system, because it’s now partition-tolerant (P) but it becomes
unavailable when there aren’t enough active servers or warm
spares to promote to active status.

CP is your classical [BFT]-style distributed consensus system, where the
system is available only if the client can contact a *majority* of the
servers. This is a formalization of the warm spare concept above: with
*N* server nodes, you need at least ⌊*N* / 2⌋ + 1 of them to be online
for a commit to succeed.
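The majority formula is easy to check in code. In this sketch the function names `quorum` and `can_commit` are hypothetical, chosen only for illustration:

```python
def quorum(n_servers):
    """Smallest majority of n_servers: floor(N / 2) + 1."""
    return n_servers // 2 + 1


def can_commit(n_servers, n_reachable):
    """A commit succeeds only if a majority of servers is reachable."""
    return n_reachable >= quorum(n_servers)


majority_of_five = quorum(5)
three_of_five = can_commit(5, 3)
two_of_five = can_commit(5, 2)
```

With five servers the majority is 3, so a client that can reach three of them may commit, while one that can reach only two may not.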

Many distributed database systems run in CP mode because consistency (C)
and partition-tolerance (P) are a useful combination. What you lose is
always-available (A) operation: with a suitably bad partition, the
system goes down for users on the small side of that partition.

An optional CP mode for Fossil would be attractive in some ways since in
some sense Fossil is a distributed DBMS, but in practical terms, it
would mean Fossil is no longer a [DVCS] in the most useful sense: one
where you can keep working while your client is disconnected from the
remote Fossil it cloned from.

A fraught question is whether the non-server Fossil clones count as
“nodes” in this sense.

If they do count, and there are only two systems, the central server
and the clone on my laptop, then it stands to reason from the formula
above that I can only commit if the central server is available. In that
scheme, a CP-mode Fossil is basically like CVS.

But what happens if my company hires a coworker to help me with the
project, and this person makes their own clone of the central repo? The
equation says I still need 2 nodes to be available for a commit, so if
my new coworker goes off-network, that doesn’t affect whether I can make
commits. Likewise, if I go off-network, my coworker can make commits to
the central server.

But what happens if the central server goes down? The equation says we
still have 2 nodes, so we should be able to commit, right? Sure, but
only if my laptop can communicate directly with my coworker’s laptop! If
it can’t, that’s also a network partition, so *N=1* on both sides in
that case. The implication is that for a true CP-mode Fossil, we’d need
some kind of peer-to-peer networking layer so that our laptops can
accept commits from each other, so that when the central server comes
back online, one of us can send the results up to it to get it caught up.
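To make the three-node case concrete, here is a small sketch. The helper `sides_with_quorum` and the node names are hypothetical, and each side of a partition is modeled as a set of nodes that can still talk to one another:

```python
def sides_with_quorum(total_nodes, partition_sides):
    """Return the sides of a partition that hold a majority of all nodes."""
    needed = total_nodes // 2 + 1
    return [side for side in partition_sides if len(side) >= needed]


TOTAL = 3  # central server plus two laptops

# Central server down, and the laptops cannot reach each other.
isolated = sides_with_quorum(TOTAL, [{"my-laptop"}, {"coworker-laptop"}])

# Central server down, but a peer-to-peer link joins the laptops.
peered = sides_with_quorum(TOTAL, [{"my-laptop", "coworker-laptop"}])
```

`isolated` is empty, so nobody can commit, while `peered` contains the two-laptop side, which holds 2 of the 3 nodes and may therefore accept commits.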

But doesn’t that then mean there is no security? How does [Fossil’s RBAC
system][caps] work if peer-to-peer commits are allowed?

You can instead reconceptualize the system as “node” meaning only server
nodes, so that client-only systems don’t count. This allows you to have
an RBAC system again.

With just one central server, ⌊1/2⌋ + 1 = 1, so you get CVS-like behavior:
if the server’s up, you can commit.

If you set up 2 servers for redundancy, both must be up for commits to
be allowed, since otherwise you could end up with half the commits going
to the server on one side of a network partition, half going to the
other, and no way to arbitrate between the two once the partition is
lifted.

(Today’s AP-mode Fossil does allow commits on both sides of such a
partition, but the necessary cost is “C”, consistency! Once again, you
can’t get around the CAP theorem.)

Three servers is more sensible: any client that can see at least 2 of
them can commit.

Will there ever be a CP-mode Fossil? This author doubts it, but as I’ve
shown, it would be useful in contexts where you’d rather have a
guarantee of consistency than availability.

[BFT]:    https://en.wikipedia.org/wiki/Byzantine_fault
[caps]:   ./caps/