<title>Thoughts On The Design Of The Fossil DVCS</title>
<h1 align="center">Thoughts On The Design Of The Fossil DVCS</h1>

Two questions (or criticisms) that arise frequently regarding Fossil
can be summarized as follows:

  1.  Why is Fossil based on SQLite instead of a distributed NoSQL database?

  2.  Why is Fossil written in C instead of a modern high-level language?

Neither question can be answered directly because they are both
based on false assumptions. We claim that Fossil is not based on SQLite
at all and that Fossil is not based on a distributed NoSQL database
because Fossil is a distributed NoSQL database. And, Fossil does use
a modern high-level language for its implementation, namely SQL.

<h2>Fossil Is A NoSQL Database</h2>

We begin with the first question: Fossil is not based on a distributed
NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database.
Fossil is <u>not</u> based on SQLite.
The current implementation of Fossil uses
SQLite as a local store for the content of the distributed database and as
a cache for meta-information about the distributed database that is precomputed
for quick and easy presentation. But the use of SQLite in this role is an
implementation detail and is not fundamental to the design. Some future
version of Fossil might do away with SQLite and substitute a pile-of-files or
a key/value database in place of SQLite.
(Actually, that is very unlikely
to happen since SQLite works amazingly well in its current role, but the point
is that omitting SQLite from Fossil is a theoretical possibility.)

The underlying database that Fossil implements has nothing to do with
SQLite, or SQL, or even relational database theory. The underlying
database is very simple: it is an unordered collection of "artifacts".
An artifact is a list of bytes - a "file" in the usual manner of thinking.
Many artifacts are simply the content of source files that have
been checked into the Fossil repository. Call these "content artifacts".
Other artifacts, known as
"control artifacts", contain ASCII text in a particular format that
defines relationships between other artifacts, such as which
content artifacts go together to form a particular version of the
project. Each artifact is named by its SHA1 or SHA3-256 hash and is
thus immutable.
Artifacts can be added to the database but not removed (if we ignore
the exceptional case of [./shunning.wiki | shunning]). Repositories
synchronize by computing the union of their artifact sets. SQL and
relational theory play no role in any of this.

SQL enters the picture only in the implementation details. The current
implementation of Fossil stores each artifact as a BLOB in an SQLite
database.
The current implementation also parses each control artifact as it
arrives and stores the information discovered from that parse in various
other SQLite tables to facilitate rapid generation of reports such as
timelines, file histories, file lists, branch lists, and so forth. Note
that all of this additional information is derived from the artifacts.
The artifacts are canonical. The relational tables serve only as a cache.
Everything in the relational tables can be recomputed
from the artifacts, and in fact that is exactly what happens when one runs
the "fossil rebuild" command on a repository.

So really, Fossil works with two separate databases. There is the
bag-of-artifacts database which is non-relational and distributed (like
a NoSQL database) and there is the local relational database. The
bag-of-artifacts database has a fixed format and is what defines a Fossil
repository. Fossil will never modify the file format of the bag-of-artifacts
database in an incompatible way because to do so would be to make something
that is no longer "Fossil".
The local relational database, on the other hand,
is a cache that contains information derived from the bag-of-artifacts.
The schema of the local relational database changes from time to time as
the Fossil implementation is enhanced, and the content is recomputed from
the unchanging bag of artifacts. The local relational database is an
implementation detail which currently happens to use SQLite.

Another way to think of the relational tables in a Fossil repository is
as an index for the artifacts. Without the relational tables,
to generate a report like a timeline would require scanning every artifact -
the equivalent of a full table scan. The relational tables hold pointers
to the relevant artifacts in presorted order so that generating a timeline
is much more efficient. So like an index in a relational database, the
relational tables in a Fossil repository do not add any new information;
they merely make the information in the artifacts faster and easier to
look up.

Fossil is not "based" on SQLite. Fossil simply exploits SQLite as
a powerful tool to make the implementation easier.
And Fossil doesn't use a distributed
NoSQL database because Fossil is a distributed NoSQL database. That answers
the first question.

<h2>SQL Is A High-Level Scripting Language</h2>

The second concern states that Fossil does not use a high-level scripting
language. But that is not true. Fossil uses SQL (as implemented by SQLite)
as its scripting language.

This misunderstanding likely arises because people fail
to appreciate that SQL is a programming language. People are taught that SQL
is a "query language" as if that were somehow different from a
"programming language". But they are really two flavors of the
same thing. I find that people do better with SQL if they think of
SQL as a programming language and of each SQL statement
as a separate program.
SQL is a peculiar programming language
in that one uses SQL to specify <i>what</i> to compute whereas in
most other programming languages one specifies <i>how</i>
to carry out the computation.
This difference means that SQL
is an extraordinarily high-level programming language, but it is still
just a programming language.

For certain types of problems, SQL has a huge advantage over other
programming languages because it is so high level and because it allows
programmers to focus more on the <i>what</i> and less on the <i>how</i>
of a computation. In other words,
programmers tend to think about problems at a much higher level when
using SQL; this can result in better applications.
SQL is also very dense.
In practice, a few
lines of SQL can often replace hundreds or thousands of lines of
procedural code, with a corresponding decrease in programming effort
and opportunities to introduce bugs.
Fossil happens to be one of those problems for which SQL is well suited.

Much of the "heavy lifting" within the Fossil implementation is carried
out using SQL statements. It is true that these SQL statements are glued
together with C code, but it turns out that C works surprisingly well in
that role. Several early prototypes of Fossil were written in a scripting
language (Tcl). We normally find that Tcl programs are shorter than the
equivalent C code by a factor of 10 or more. But in the case of Fossil,
the use of Tcl was actually making the code longer and more difficult to
understand.
And so in the final design, we switched from Tcl to C in order to make
the code easier to implement and debug.

Without the advantages of having SQLite built in, the design might well
have followed a different path. Most reports generated by Fossil involve
a complex set of queries against the relational tables of the repository
database.
These queries are normally implemented in only a few dozen
lines of SQL code. But if those queries had been implemented procedurally
using a key/value or pile-of-files database, it
might well have been the case that a high-level scripting language such as
Tcl, Python, or Ruby would have worked out better than C.
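
The two-database picture described above can be made concrete with a small
sketch. It is written in Python only to keep it short and is not how
Fossil itself is implemented. Everything named here is an assumption made for
illustration: the "checkin <time> <comment>" text format stands in for a real
control artifact, and the demo_checkin table and the helper functions are
hypothetical and are not part of Fossil's schema or code. The sketch names each
artifact by its SHA3-256 hash, synchronizes two bags by taking their union,
recomputes a relational cache purely from the artifacts, and then produces a
timeline with a single SQL statement that says what is wanted rather than how
to find it.

```python
import hashlib
import sqlite3


def add_artifact(bag, content):
    """Store bytes in the bag under the SHA3-256 hash of those bytes."""
    name = hashlib.sha3_256(content).hexdigest()
    bag[name] = content  # same content always yields the same name
    return name


def sync(a, b):
    """Synchronize two bags by giving each one the union of both."""
    merged = {**a, **b}
    a.update(merged)
    b.update(merged)


def rebuild(bag):
    """Recompute the relational cache from the artifacts alone."""
    db = sqlite3.connect(":memory:")
    # Hypothetical cache table; Fossil's real schema is different.
    db.execute(
        "CREATE TABLE demo_checkin("
        "name TEXT PRIMARY KEY, mtime INTEGER, comment TEXT)"
    )
    for name, content in bag.items():
        text = content.decode("ascii", errors="replace")
        # Toy control-artifact format (an assumption of this sketch):
        #   checkin <unix-time> <comment>
        if text.startswith("checkin "):
            _, mtime, comment = text.split(" ", 2)
            db.execute(
                "INSERT INTO demo_checkin VALUES(?, ?, ?)",
                (name, int(mtime), comment),
            )
    db.commit()
    return db


def timeline(db, limit=10):
    """The 'what', stated in SQL: the most recent check-ins, newest first."""
    return db.execute(
        "SELECT mtime, comment FROM demo_checkin "
        "ORDER BY mtime DESC LIMIT ?",
        (limit,),
    ).fetchall()


# Two repositories that each hold artifacts the other lacks.
alice, bob = {}, {}
add_artifact(alice, b"int main(void){return 0;}\n")    # content artifact
add_artifact(alice, b"checkin 100 initial import")      # toy control artifact
add_artifact(bob, b"checkin 200 fix typo in README")    # toy control artifact

sync(alice, bob)
print(timeline(rebuild(alice)))
# [(200, 'fix typo in README'), (100, 'initial import')]
```

Discarding the SQLite connection loses nothing, because rebuild() can recreate
the same table from either bag at any time. That is the sense in which the
relational tables are only a cache. A version of timeline() without SQL would
have to scan, filter, and sort the artifacts by hand.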