# libCODY: COmpiler DYnamism<sup><a href="#1">1</a></sup>

Copyright (C) 2020 Nathan Sidwell, nathan@acm.org

libCODY is an implementation of a communication protocol between
compilers and build systems.

**WARNING:** This is preliminary software.

In addition to supporting C++ modules, this may also support LTO
requirements and could also deal with generated #include files
and feed the compiler with prepruned include paths and whatnot. (The
system calls involved in include searches can be quite expensive on
some build infrastructures.)

* Client and Server objects
* Direct connection for in-process use
* Testing with Joust (that means nothing to you, doesn't it!)


## Problem Being Solved

The origin is in C++20 modules:
```
import foo;
```

At that import, the compiler needs<sup><a href="#2">2</a></sup> to
load up the compiled serialization of module `foo`. Where is that
file? Does it even exist? Unless the build system already knows the
dependency graph, this might be a completely unknown module. Now, the
build system knows how to build things, but it might not have complete
information about the dependencies. The ultimate source of
dependencies is the source code being compiled, and specifying the
same thing in multiple places is a recipe for build skew.

Hence, a protocol by which a compiler can query a build system. This
was originally described in
<a href="https://wg21.link/p1184r1">p1184r1:A Module Mapper</a>, along
with a proof-of-concept hack in GNUmake, described in
<a href="https://wg21.link/p1602">p1602:Make Me A Module</a>. The current
implementation has evolved and an update to p1184 will be forthcoming.

## Packet Encoding

The protocol is turn-based. The compiler sends a block of one or more
requests to the builder, then waits for a block of responses to all of
those requests.
If the builder needs to compile something to satisfy
a request, there may be some time before the response. A builder may
service multiple compilers concurrently, each as a separate connection.

When multiple requests are in a block, the responses are also in a
block, and in corresponding order. The responses must not be
commenced eagerly -- they must wait until the incoming block has ended
(as mentioned above, it is turn-based). To do otherwise risks
deadlock, as there is no requirement for a sending end of the
communication to listen for incoming responses (or new requests) until
it has completed sending its current block.

Every request has a response.

Requests and responses are user-readable text. The protocol is not
intended as a transmission medium for large binary objects (such as
compiled modules). It is presumed the builder and the compiler share a
file system, for that kind of thing.<sup><a href="#3">3</a></sup>

Message characters are encoded in UTF8.

Messages are a sequence of octets ending with a NEWLINE (0xa). The lines
consist of a sequence of words, separated by WHITESPACE (0x20 or 0x9).
Words themselves do not contain WHITESPACE. Lines consisting solely
of WHITESPACE (or empty) are ignored.

To encode a block of multiple messages, non-final messages end with a
single word of SEMICOLON (0x3b), immediately before the NEWLINE. Thus
a serial connection can determine whether a block is complete without
decoding the messages.

Words containing only characters in the set [-+_/%.A-Za-z0-9] need not
be quoted. Words containing characters outside that set should be
quoted. A zero-length word may be achieved with `''`.

Quoted words begin and end with APOSTROPHE (0x27).
Within the quoted
word, BACKSLASH (0x5c) is used as an escape mechanism, with the
following meanings:

* \\n - NEWLINE (0xa)
* \\t - TAB (0x9)
* \\' - APOSTROPHE (')
* \\\\ - BACKSLASH (\\)

Characters in the range [0x00, 0x20) and 0x7f are encoded as a
BACKSLASH followed by one or two lowercase hex characters. Octets in
the range [0x80,0xff) are UTF8 encodings of unicode characters outside
the traditional ASCII set and passed as such.

Decoding should be more relaxed. Unquoted words containing characters
in the range [0x20,0xff] other than BACKSLASH or APOSTROPHE should be
accepted. In a quoted sequence, `\` followed by one or two lowercase
hex characters decodes to that octet. Further, words can be
constructed from a mixture of abutted quoted and unquoted sequences.
For instance `FOO' 'bar` would decode to the word `FOO bar`.

Notice that the block continuation marker of `;` is not a valid
encoding of the word `;`, which would be `';'`.

It is recommended that words are separated by single SPACE characters.

## Messages

The message descriptions use `$metavariable` examples.

The request messages are specific to a particular action. The response
messages are more generic, describing their value types, but not their
meaning. Message consumers need to know the request to decode the
response. Notice the `Packet::GetRequest()` method records in response
packets what the request being responded to was. Do not confuse this
with the `Packet::GetCode ()` method.

### Responses

The simplest response is a single:

`OK`

This indicates the request was successful.


An error response is:

`ERROR $message`

The message is a human-readable string. It indicates failure of the request.

Pathnames are encoded with:

`PATHNAME $pathname`

Boolean responses use:

`BOOL `(`TRUE`|`FALSE`)

### Handshake Request

The first message is a handshake:

`HELLO $version $compiler $ident`

The `$version` is a numeric value, currently `1`. `$compiler` identifies
the compiler -- builders may need to keep compiled modules from
different compilers separate. `$ident` is an identifier the builder
might use to identify the compilation it is communicating with.

Responses are:

`HELLO $version $builder [$flags]`

A successful handshake. The communication is now connected and other
messages may be exchanged. An ERROR response indicates an unsuccessful
handshake. The communication remains unconnected.

There is nothing restricting a handshake to its own message block. Of
course, if the handshake fails, subsequent non-handshake messages in
the block will fail (producing error responses).

The `$flags` word, if present, allows a server to control what requests
might be given. See below.

### C++ Module Requests

A set of requests is specific to C++ modules:

#### Flags

Several requests and one response have an optional `$flags` word.
These are the `Cody::Flags` value pertaining to that request. If
omitted the value 0 is implied. The following flags are available:

* `0`, `None`: No flags.

* `1<<0`, `NameOnly`: The request is for the name only, and not the
  CMI contents.

The `NameOnly` flag may be provided in a handshake response, and
indicates that the server is interested in requests only for their
implied dependency information. It may be provided on a request to
indicate that only the CMI name is required, not its contents (for
instance, when preprocessing). Note that a compiler may still make
`NameOnly` requests even if the server did not ask for such.
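
To make the Packet Encoding rules concrete, here is a minimal sketch of how a request block could be formatted by hand. The helper names (`EncodeWord`, `EncodeMessage`, `EncodeBlock`) are hypothetical and are not part of libcody's API. The sketch always emits two hex digits for control characters, which is one of the permitted choices and avoids ambiguity when a hex digit follows the escape.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; not libcody's API.

// True if the octet may appear in an unquoted word: [-+_/%.A-Za-z0-9].
static bool IsPlain (unsigned char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
    || (c >= '0' && c <= '9')
    || c == '-' || c == '+' || c == '_' || c == '/' || c == '%' || c == '.';
}

// Encode one word, quoting it only when the rules require that.
std::string EncodeWord (std::string const &word)
{
  bool plain = !word.empty ();  // A zero-length word must be written as ''.
  for (unsigned char c : word)
    if (!IsPlain (c))
      plain = false;
  if (plain)
    return word;

  static char const hex[] = "0123456789abcdef";
  std::string out = "'";
  for (unsigned char c : word)
    switch (c)
      {
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;
      case '\'': out += "\\'"; break;
      case '\\': out += "\\\\"; break;
      default:
        if (c < 0x20 || c == 0x7f)
          {
            // Other control characters: BACKSLASH plus two lowercase hex digits.
            out += '\\';
            out += hex[c >> 4];
            out += hex[c & 0xf];
          }
        else
          out += char (c);  // Printable ASCII and UTF8 octets pass through.
      }
  out += '\'';
  return out;
}

// Join the encoded words of one message with single SPACE characters.
std::string EncodeMessage (std::vector<std::string> const &words)
{
  std::string line;
  for (auto const &w : words)
    {
      if (!line.empty ())
        line += ' ';
      line += EncodeWord (w);
    }
  return line;
}

// Emit a block: every non-final message ends with the word ';'.
std::string EncodeBlock (std::vector<std::vector<std::string>> const &messages)
{
  std::string block;
  for (std::size_t ix = 0; ix != messages.size (); ix++)
    {
      block += EncodeMessage (messages[ix]);
      if (ix + 1 != messages.size ())
        block += " ;";
      block += '\n';
    }
  return block;
}
```

For example, a block holding the messages `{"HELLO", "1", "GCC", "my file.cc"}` and `{"MODULE-IMPORT", "foo", "1"}` encodes as the two lines `HELLO 1 GCC 'my file.cc' ;` and `MODULE-IMPORT foo 1`. The trailing `1` stands for `NameOnly` on the assumption that `$flags` is written as a decimal integer, which the text above does not state; the compiler name and ident are made-up values.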

#### Repository

All relative CMI file names are relative to a repository. (There are
usually no absolute CMI files.) The repository may be determined
with:

`MODULE-REPO`

A PATHNAME response is expected. The `$pathname` may be an empty
word, which is equivalent to `.`. When the response is a relative
pathname, it must be relative to the current working directory of the
client (which might be a process on a different host from the
server). You may set the repository to `/`, if you wish to use paths
relative to the root directory.

#### Exporting

A compilation of a module interface, partition or header unit can
inform the builder with:

`MODULE-EXPORT $module [$flags]`

This will result in a PATHNAME response naming the Compiled Module
Interface pathname to write.

The `MODULE-EXPORT` request does not indicate the module has been
successfully compiled. At most one `MODULE-EXPORT` is to be made, and
as the connection is for a single compilation, the builder may infer
dependency relationships between the module being generated and import
requests made.

Named module names and header unit names are distinguished by making
the latter unambiguously look like file names. Firstly, they must be
fully resolved according to the compiler's usual include path. If
that results in an absolute file name (beginning with `/`, or
certain other OS-specific sequences), all is well. Otherwise a
relative file name must be prefixed by `./` to be distinguished from a
similarly named named module. This prefixing must occur, even if the
header-unit's name contains characters that cannot appear in a named
module's name.

It is expected that absolute header-unit names convert to relative CMI
names, to keep all CMIs within the CMI repository.
This means that
steps must be taken to distinguish the CMIs for `/here` from `./here`,
and this can be achieved by replacing the leading `./` directory with
`,/`, which is visually similar but does not have the self-reference
semantics of dot. Likewise, header-unit names containing `..`
directories can be remapped to `,,`. (When symlinks are involved
`bob/dob/..` might not be `bob`, of course.) C++ header-unit
semantics are such that there is no need to resolve multiple ways of
spelling a particular header-unit to a unique CMI file.

Successful compilation of an interface is indicated with a subsequent:

`MODULE-COMPILED $module [$flags]`

request. This indicates the CMI file has been written to disk, so
that any other compilations waiting on it may proceed. Depending on
compiler implementation, the CMI may be written before the compilation
completes. A single OK response is expected.

Compilation failure can be inferred by lack of a `MODULE-COMPILED`
request. It is presumed the builder can determine this, as it is also
responsible for launching and reaping the compiler invocations
themselves.

#### Importing

Importation, including that of header-units, uses:

`MODULE-IMPORT $module [$flags]`

A PATHNAME response names the CMI file to be read. Should the builder
have to invoke a compilation to produce the CMI, the response should
be delayed until that occurs. If such a compilation fails, an error
response should be provided to the requestor -- which will then
presumably fail in some manner.

#### Include Translation

Include translation can be determined with:

`INCLUDE-TRANSLATE $header [$flags]`

The header name, `$header`, is the fully resolved header name, in the
above-mentioned unambiguous filename form.
The response will either
be a BOOL response indicating textual inclusion, or a PATHNAME
response naming the CMI for such translation. The BOOL value is TRUE
if the header is known to be a textual header, and FALSE if nothing is
known about it -- the latter might cause diagnostics about incomplete
knowledge.

### GCC LTO Messages

This set of requests is used for GCC LTO jobserver integration with GNU Make.

## Building libCody

Libcody is written in C++11. (It's intended for compilers, so
there'd be a bootstrapping problem if it used the latest and greatest.)

### Using configure and make

It supports the usual `configure`, `make`, `make check` & `make install`
sequence. It does not support building in the source directory --
that just didn't drop out, and it's not how I build things (because,
again, for compilers). Excitingly it uses my own `joust` test
harness, so you'll need to build and install that somewhere, if you
want the comfort of testing.

The following configure options are available, in addition to the usual set:

* `--enable-checking` Compile with assert-like checking. Defaults to on.

* `--with-tooldir=DIR` Prepend `DIR` to `PATH` when building (`DIR`
  need not already include the trailing `/bin`, and the right things
  happen). Use this if you need to point to non-standard tools that
  you usually don't have in your path. This path is also used when
  the configure script searches for programs.

* `--with-toolinc=DIR`, `--with-toollib=DIR`, include path and library
  path variants of `--with-tooldir`. If these are siblings of the
  tool bin directory, they'll be found automatically.

* `--with-compiler=NAME` Specify a particular compiler to use.
  Usually what configure finds is sufficiently usable.

* `--with-bugurl=URL` Override the bugreporting URL.
  Do this if
  you're providing libcody as part of a package that /you/ are
  supporting.

* `--enable-maintainer-mode` Specify that rules to rebuild things like
  `configure` (with `autoconf`) should be enabled. When not enabled,
  you'll get a message if these appear out of date, but that can
  happen naturally after an update or clone as `git`, in common with
  other VCs, doesn't preserve the relative ordering of file
  modifications. You can use `make MAINTAINER=touch` to shut make up,
  if this occurs (or manually execute the `autoconf` and related
  commands).

When building, you can override the default optimization flags with
`CXXFLAGS=$flags`. I often build a debuggable library with
`make CXXFLAGS=-g3`.

The `Makefile` will also parallelize according to the number of CPUs,
unless you specify explicitly with a `-j` option. This is a little
clunky, as it's not possible to figure out inside the makefile whether
the user provided `-j`. (Or at least I've not figured out how.)

### Using cmake and make

#### In the clang/LLVM project

The primary motivation for a cmake implementation is to allow building
libcody "in tree" in clang/LLVM. In that case, a checkout of libcody
can be placed (or symbolically linked) into clang/tools. This will
configure and build the library along with other LLVM dependencies.

*NOTE* This is not treated as an installable entity (it is present only
for use by the project).

*NOTE* The testing targets would not be appropriate in this configuration;
it is expected that lit-based testing of the required functionality will be
done by the code using the library.

#### Stand-alone

For use on platforms that don't support configure & make effectively, it
is possible to use the cmake & make process in stand-alone mode (similar
to the configure & make process above).

An example use.
```
cmake -DCMAKE_INSTALL_PREFIX=/path/to/installation -DCMAKE_CXX_COMPILER=clang++ /path/to/libcody/source
make
make install
```
Supported flags (additions to the usual cmake ones).

* `-DCODY_CHECKING=ON,OFF`: Compile with assert-like checking. (defaults ON)

* `-DCODY_WITHEXCEPTIONS=ON,OFF`: Compile with C++ exceptions and RTTI enabled.
  (defaults OFF, to be compatible with GCC and LLVM).

*TODO*: At present there is no support for `ctest` integration (this should be
feasible, provided that `joust` is installed and can be discovered by `cmake`).

## API

The library defines entities in the `::Cody` namespace.

There are 4 user-visible classes:

* `Packet`: Responses to requests are `Packets`. These have a code,
  indicating the response kind, and a payload.

* `Client`: The compiler-end of a connection. Requests may be made
  and responses are returned.

* `Server`: The builder-end of a connection. Requests may be waited
  for, and responses made. Builders that serve multiple concurrent
  connections and spawn compilations to resolve dependencies may need
  to derive from this class to provide response queuing.

* `Resolver`: The processing engine of the builder side. User code is
  expected to derive from this class and provide virtual function
  overriders to affect the semantics of the resolver.

In addition there are a number of helpers to set up connections.

Logically the Client and the Server communicate via a sequential
channel. The channel may be provided by:

* two pipes, with different file descriptors for reading and writing
  at each end.

* a socket, which will use the same file descriptor for reading and
  writing. The socket can be created in a number of ways, including
  Unix domain and IPv6 TCP, for which helpers are provided.

* a direct, in-process, connection, using buffer swapping.

The communication channel is presumed reliable.

Refer to the (currently very sparse) doxygen-generated documentation
for details of the API.

## Examples

To create an in-process resolver, use the following boilerplate:

```
// Public inheritance, so the Server can be handed a Cody::Resolver *.
class MyResolver : public Cody::Resolver { ... stuff here ... };

Cody::Client *MakeClient (char const *maybe_ident)
{
  auto *r = new MyResolver (...);
  auto *s = new Cody::Server (r);
  auto *c = new Cody::Client (s);

  // Perform the handshake before any other request.
  auto t = c->ConnectRequest ("ME", maybe_ident);
  if (t.GetCode () == Cody::Client::TC_CONNECT)
    ;// Yay!
  else if (t.GetCode () == Cody::Client::TC_ERROR)
    report_error (t.GetString ());

  return c;
}
```

For a remotely connecting client:
```
Cody::Client *MakeClient (char const *name, int port, char const *maybe_ident)
{
  char const *err = nullptr;
  // A negative fd means failure; err is passed so the helper can describe it.
  int fd = Cody::OpenInet6 (&err, name, port);
  if (fd < 0)
    { ... error... return nullptr;}

  auto *c = new Cody::Client (fd);

  // Perform the handshake before any other request.
  auto t = c->ConnectRequest ("ME", maybe_ident);
  if (t.GetCode () == Cody::Client::TC_CONNECT)
    ;// Yay!
  else if (t.GetCode () == Cody::Client::TC_ERROR)
    report_error (t.GetString ());

  return c;
}
```

## Future Directions

* Current Directory. There is no mechanism to check the builder and
  the compiler have the same working directory. Perhaps that should
  be addressed.

* Include path canonicalization and/or header file lookup. This can be
  expensive, particularly with many `-I` options, due to the system
  calls. Perhaps using a common resource would be cheaper?

* Generated header file lookup/construction. This is essentially the
  same problem as importing a module, and build systems are crap at
  dealing with this.

* Link-time compilations. Another place the compiler would like to
  ask the build system to do things.

* C++20 API entrypoints -- `std::string_view` would be nice

* Exception-safety audit. Exceptions are not used, but memory
  exhaustion could happen. And perhaps the user's resolver code employs
  exceptions?

<a name="1">1</a>: Or a small town in Wyoming

<a name="2">2</a>: This describes one common implementation technique.
The std itself doesn't require such serializations, but the ability to
create them is kind of the point. Also, 'compiler' is used where we
mean any consumer of a module, and 'build system' where we mean any
producer of a module.

<a name="3">3</a>: Even when the builder is managing a distributed set
of compilations, the builder must have a mechanism to get source files
to, and object files from, the compilations. That scheme can also
transfer the CMI files.