<title>Thoughts On The Design Of The Fossil DVCS</title>
<h1 align="center">Thoughts On The Design Of The Fossil DVCS</h1>

Two questions (or criticisms) that arise frequently regarding Fossil
can be summarized as follows:

  1.  Why is Fossil based on SQLite instead of a distributed NoSQL database?

  2.  Why is Fossil written in C instead of a modern high-level language?

Neither question can be answered directly because both rest on
false assumptions.  We claim that Fossil is not based on SQLite
at all and that Fossil is not based on a distributed NoSQL database
because Fossil is a distributed NoSQL database.  And Fossil does use
a modern high-level language for its implementation, namely SQL.

<h2>Fossil Is A NoSQL Database</h2>

We begin with the first question:  Fossil is not based on a distributed
NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database.
Fossil is <u>not</u> based on SQLite.
The current implementation of Fossil uses
SQLite as a local store for the content of the distributed database and as
a cache for meta-information about the distributed database that is precomputed
for quick and easy presentation.  But the use of SQLite in this role is an
implementation detail and is not fundamental to the design.  Some future
version of Fossil might do away with SQLite and substitute a pile-of-files or
a key/value database in place of SQLite.
(Actually, that is very unlikely
to happen since SQLite works amazingly well in its current role, but the point
is that omitting SQLite from Fossil is a theoretical possibility.)

The underlying database that Fossil implements has nothing to do with
SQLite, or SQL, or even relational database theory.  The underlying
database is very simple:  it is an unordered collection of "artifacts".
An artifact is a sequence of bytes - a "file" in the usual manner of thinking.
Many artifacts are simply the content of source files that have
been checked into the Fossil repository.  Call these "content artifacts".
Other artifacts, known as
"control artifacts", contain ASCII text in a particular format that
defines relationships between other artifacts, such as which
content artifacts go together to form a particular version of the
project.  Each artifact is named by its SHA1 or SHA3-256 hash and is
thus immutable.
Artifacts can be added to the database but not removed (if we ignore
the exceptional case of [./shunning.wiki | shunning]).  Repositories
synchronize by computing the union of their artifact sets.  SQL and
relational theory play no role in any of this.

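The bag-of-artifacts model described above can be sketched in a few lines
of Python.  This is only an illustration under stated assumptions: the
<tt>ArtifactBag</tt> class and the <tt>artifact_id</tt> and <tt>sync</tt>
functions are hypothetical names invented for this sketch, the store is
held in memory, and none of it reflects how Fossil is actually coded.

```python
import hashlib


def artifact_id(content: bytes) -> str:
    """Name an artifact by the SHA3-256 hash of its bytes."""
    return hashlib.sha3_256(content).hexdigest()


class ArtifactBag:
    """Hypothetical in-memory repository: an unordered, add-only set of artifacts."""

    def __init__(self):
        self._artifacts = {}  # maps hash name -> raw bytes

    def add(self, content: bytes) -> str:
        # Adding the same bytes twice is harmless: the name is the hash,
        # so an existing entry is already identical.
        name = artifact_id(content)
        self._artifacts.setdefault(name, content)
        return name

    def names(self) -> set:
        return set(self._artifacts)

    def get(self, name: str) -> bytes:
        return self._artifacts[name]


def sync(a: ArtifactBag, b: ArtifactBag) -> None:
    """Bring both bags to the union of their artifact sets."""
    for name in b.names() - a.names():
        a.add(b.get(name))
    for name in a.names() - b.names():
        b.add(a.get(name))
```

Because artifacts are immutable and named by their hash, the sketch needs
no conflict resolution: two bags either both hold an artifact or one of them
is missing it, and synchronization only ever adds.
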
SQL enters the picture only in the implementation details.  The current
implementation of Fossil stores each artifact as a BLOB in an SQLite
database.
The current implementation also parses each control artifact as it
arrives and stores the information discovered from that parse in various
other SQLite tables to facilitate rapid generation of reports such as
timelines, file histories, file lists, branch lists, and so forth.  Note
that all of this additional information is derived from the artifacts.
The artifacts are canonical.  The relational tables serve only as a cache.
Everything in the relational tables can be recomputed
from the artifacts, and in fact that is exactly what happens when one runs
the "fossil rebuild" command on a repository.

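The relationship between canonical artifacts and the derived tables can be
sketched as follows.  Everything here is an assumption made for
illustration: the table names <tt>artifact</tt> and <tt>checkin_cache</tt>,
the helper functions, and the toy two-line control-artifact format (a "C"
line for the comment and a "D" line for the date) are hypothetical and much
simpler than Fossil's real schema and file format.

```python
import hashlib
import sqlite3


def open_repo():
    """Create a toy repository: one canonical table and one derived cache."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE artifact(name TEXT PRIMARY KEY, content BLOB NOT NULL)")
    db.execute("CREATE TABLE checkin_cache(name TEXT PRIMARY KEY, date TEXT, comment TEXT)")
    return db


def parse_control(content):
    """Return (date, comment) for a toy control artifact, else None."""
    try:
        text = content.decode("ascii")
    except UnicodeDecodeError:
        return None
    cards = dict(line.split(" ", 1) for line in text.splitlines() if " " in line)
    if "D" in cards and "C" in cards:
        return cards["D"], cards["C"]
    return None


def index_artifact(db, name, content):
    """Store whatever the parse discovers in the derived table."""
    parsed = parse_control(content)
    if parsed:
        db.execute("INSERT OR REPLACE INTO checkin_cache VALUES(?, ?, ?)", (name, *parsed))


def add_artifact(db, content):
    """Store the artifact as a BLOB, then parse it as it arrives."""
    name = hashlib.sha3_256(content).hexdigest()
    db.execute("INSERT OR IGNORE INTO artifact VALUES(?, ?)", (name, content))
    index_artifact(db, name, content)
    return name


def rebuild(db):
    """Discard the cache and recompute it from the canonical artifacts."""
    db.execute("DELETE FROM checkin_cache")
    for name, content in db.execute("SELECT name, content FROM artifact").fetchall():
        index_artifact(db, name, content)
```

In this sketch, deleting every row of <tt>checkin_cache</tt> loses nothing:
calling <tt>rebuild(db)</tt> restores the same rows from the
<tt>artifact</tt> table alone.
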
So really, Fossil works with two separate databases.  There is the
bag-of-artifacts database which is non-relational and distributed (like
a NoSQL database) and there is the local relational database.  The
bag-of-artifacts database has a fixed format and is what defines a Fossil
repository.  Fossil will never modify the file format of the bag-of-artifacts
database in an incompatible way because to do so would be to make something
that is no longer "Fossil".  The local relational database, on the other hand,
is a cache that contains information derived from the bag-of-artifacts.
The schema of the local relational database changes from time to time as
the Fossil implementation is enhanced, and the content is recomputed from
the unchanging bag of artifacts.  The local relational database is an
implementation detail which currently happens to use SQLite.

Another way to think of the relational tables in a Fossil repository is
as an index for the artifacts.  Without the relational tables,
generating a report like a timeline would require scanning every artifact -
the equivalent of a full table scan.  The relational tables hold pointers
to the relevant artifacts in presorted order so that generating a timeline
is much more efficient.  So like an index in a relational database, the
relational tables in a Fossil repository do not add any new information;
they merely make the information in the artifacts faster and easier to
look up.

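The index analogy can be made concrete with a small sketch.  The artifact
names, the toy "C"/"D" control-artifact format, the
<tt>timeline_cache</tt> table, and all function names below are
assumptions chosen for illustration, not Fossil's real schema.  Both
functions produce the same timeline; the first must parse every artifact,
while the second reads rows that are already indexed by date.

```python
import sqlite3

# Toy artifacts as (name, content).  Real names would be hashes.
ARTIFACTS = [
    ("a1", b"C first check-in\nD 2024-01-01\n"),
    ("f1", b"plain file content\n"),
    ("a2", b"C second check-in\nD 2024-02-01\n"),
]


def parse(content):
    """Return (date, comment) for a toy control artifact, else None."""
    lines = content.decode("ascii").splitlines()
    cards = dict(line.split(" ", 1) for line in lines if " " in line)
    if "D" in cards and "C" in cards:
        return cards["D"], cards["C"]
    return None


def timeline_by_scan(artifacts, limit):
    """No cache: parse every artifact, then sort (a 'full table scan')."""
    rows = []
    for name, content in artifacts:
        parsed = parse(content)
        if parsed:
            rows.append((parsed[0], name, parsed[1]))
    return sorted(rows, reverse=True)[:limit]


def build_cache(artifacts):
    """Derive a date-indexed table from the artifacts, once."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE timeline_cache(date TEXT, name TEXT, comment TEXT)")
    db.execute("CREATE INDEX timeline_cache_date ON timeline_cache(date)")
    for name, content in artifacts:
        parsed = parse(content)
        if parsed:
            db.execute("INSERT INTO timeline_cache VALUES(?, ?, ?)",
                       (parsed[0], name, parsed[1]))
    return db


def timeline_from_cache(db, limit):
    """With the cache: one query over presorted rows."""
    return db.execute(
        "SELECT date, name, comment FROM timeline_cache "
        "ORDER BY date DESC LIMIT ?", (limit,)).fetchall()
```

The cached table contains nothing that the scan could not recover; it only
changes how much work each report costs.
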
Fossil is not "based" on SQLite.  Fossil simply exploits SQLite as
a powerful tool to make the implementation easier.
And Fossil doesn't use a distributed
NoSQL database because Fossil is a distributed NoSQL database.  That answers
the first question.

<h2>SQL Is A High-Level Scripting Language</h2>

The second concern states that Fossil does not use a high-level scripting
language.  But that is not true.  Fossil uses SQL (as implemented by SQLite)
as its scripting language.

This misunderstanding likely arises because people fail
to appreciate that SQL is a programming language.  People are taught that SQL
is a "query language" as if that were somehow different from a
"programming language".  But they really are two different flavors of the
same thing.  I find that people do better with SQL if they think of
SQL as a programming language and of each SQL statement
as a separate program.  SQL is a peculiar programming language
in that one uses SQL to specify <i>what</i> to compute whereas in
most other programming languages one specifies <i>how</i>
to carry out the computation.
This difference means that SQL
is an extraordinarily high-level programming language, but it is still
just a programming language.

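The difference between <i>what</i> and <i>how</i> can be seen in a small
side-by-side sketch.  The <tt>checkin</tt> table, its columns, and the
sample rows are hypothetical and exist only for this illustration.  Both
functions count check-ins per author; one spells out the algorithm, the
other states the desired result and leaves the algorithm to SQLite.

```python
import sqlite3

# Hypothetical sample data: (author, date) pairs.
CHECKINS = [
    ("alice", "2024-01-01"),
    ("bob", "2024-01-02"),
    ("alice", "2024-01-03"),
]


def counts_procedural(rows):
    """The 'how': explicit iteration, bookkeeping, and sorting."""
    counts = {}
    for author, _date in rows:
        counts[author] = counts.get(author, 0) + 1
    return sorted(counts.items())


def counts_sql(rows):
    """The 'what': describe the result and let the engine plan the work."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE checkin(author TEXT, date TEXT)")
    db.executemany("INSERT INTO checkin VALUES(?, ?)", rows)
    return db.execute(
        "SELECT author, count(*) FROM checkin "
        "GROUP BY author ORDER BY author").fetchall()
```

The single <tt>SELECT</tt> statement is a complete program in the sense
described above: it has inputs, an output, and a meaning, but no loops.
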
For certain types of problems, SQL has a huge advantage over other
programming languages because it is so high level and because it allows
programmers to focus more on the <i>what</i> and less on the <i>how</i>
of a computation.  In other words,
programmers tend to think about problems at a much higher level when
using SQL; this can result in better applications.
SQL is also very dense.
In practice, this means that a few
lines of SQL can often replace hundreds or thousands of lines of
procedural code, with a corresponding decrease in programming effort
and opportunities to introduce bugs.
Fossil happens to be one of those problems for which SQL is well suited.

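As one illustration of that density, consider finding every ancestor of a
check-in in a history that contains branches and merges.  The sketch below
assumes a hypothetical <tt>parent_link(parent, child)</tt> table and
made-up check-in names; it is not Fossil's actual schema or query.  The
graph traversal, including the removal of duplicates where branches
rejoin, is a single recursive SQL statement.

```python
import sqlite3


def make_history():
    """Build a toy history: c1 -> c2 -> c3 -> c4, with branch b1 merged into c4."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE parent_link(parent TEXT, child TEXT)")
    db.executemany("INSERT INTO parent_link VALUES(?, ?)", [
        ("c1", "c2"),
        ("c2", "c3"),
        ("c2", "b1"),   # branch point
        ("c3", "c4"),
        ("b1", "c4"),   # merge
    ])
    return db


def ancestors(db, checkin):
    """Return all ancestors of a check-in, sorted by name."""
    sql = """
        WITH RECURSIVE ancestor(name) AS (
            SELECT parent FROM parent_link WHERE child = ?
            UNION
            SELECT parent_link.parent
              FROM parent_link JOIN ancestor
                ON parent_link.child = ancestor.name
        )
        SELECT name FROM ancestor ORDER BY name
    """
    return [row[0] for row in db.execute(sql, (checkin,))]
```

For the toy history above, <tt>ancestors(make_history(), "c4")</tt> returns
<tt>['b1', 'c1', 'c2', 'c3']</tt>.  A procedural version would need its own
work queue and visited set to reach the same answer.
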
Much of the "heavy lifting" within the Fossil implementation is carried
out using SQL statements.  It is true that these SQL statements are glued
together with C code, but it turns out that C works surprisingly well in
that role.  Several early prototypes of Fossil were written in a scripting
language (Tcl).  We normally find that Tcl programs are shorter than the
equivalent C code by a factor of 10 or more.  But in the case of Fossil,
the use of Tcl was actually making the code longer and more difficult to
understand.
And so in the final design, we switched from Tcl to C in order to make
the code easier to implement and debug.

Without the advantages of having SQLite built in, the design might well
have followed a different path.  Most reports generated by Fossil involve
a complex set of queries against the relational tables of the repository
database.  These queries are normally implemented in only a few dozen
lines of SQL code.  But if those queries had been implemented procedurally
using a key/value or pile-of-files database, it
may well have been the case that a high-level scripting language such as
Tcl, Python, or Ruby would have worked out better than C.
