# Data flow

Miller data flow consists of records produced by a record-reader in `input/`,
passed through one or more mappers in `mapping/`, and written by a
record-writer in `output/`, all controlled by logic in `stream/`. Argument
parsing for initial stream setup is in `cli/`.

# Container names

The user-visible concept of *stream record* (or *srec*) is implemented in the
`lrec_t` (*linked-record type*) data structure. The user-visible concept of
*out-of-stream variables* is implemented using the `mlhmmv_t` (multi-level
hashmap of mlrvals) structure. Source-code comments and names within the code
refer to `srec`/`lrec` and `oosvar`/`mlhmmv` depending on the context.

While those two data structures hold user-visible data, others are used only
in Miller's implementation: `slls` and `sllv` are singly-linked lists of
string and void-star respectively; `lhmss` is a linked hashmap from string
to string; `lhmsi` is a linked hashmap from string to int; and so on.

# Memory management

Miller is streaming and as near stateless as possible. For most Miller
functions, you can ingest a 20GB file with 4GB RAM, no problem. For example,
`mlr cat` of a DKVP file retains no data in memory from one line to another;
`mlr cat` of a CSV file retains only the field names from the header line. The
`stats1` and `stats2` commands retain only aggregation state (e.g. the count
and sum over the specified fields needed to compute their mean). The `mlr
tac` and `mlr sort` commands, by their nature, need to consume and retain all
input records before emitting any output records.

Miller classes are in general modular, following a constructor/destructor model
with minimal dependencies between classes. As a general rule, void-star
payloads (`sllv`, `lhmslv`) must be freed by the callee (which has access to
the data type) whereas non-void-star payloads (`slls`, `hss`) are freed by the
container class.
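
The ownership rule above can be illustrated with a minimal sketch. The types and function names here (`str_list_*`, `void_list_*`) are hypothetical stand-ins invented for illustration; they are not Miller's actual `slls`/`sllv` definitions. The string list owns and frees its strings, while the void-star list frees only its own nodes and leaves each payload to code that knows the payload's type.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified containers: NOT Miller's slls/sllv. */

typedef struct str_node {
    char *value;              /* owned by the list */
    struct str_node *next;
} str_node_t;

typedef struct void_node {
    void *payload;            /* NOT owned by the list */
    struct void_node *next;
} void_node_t;

/* Prepend a private copy of s; the list owns the copy. */
str_node_t *str_list_push(str_node_t *head, const char *s) {
    str_node_t *node = malloc(sizeof *node);
    assert(node != NULL);
    node->value = malloc(strlen(s) + 1);
    assert(node->value != NULL);
    strcpy(node->value, s);
    node->next = head;
    return node;
}

/* Free every node and every string. Returns the number of nodes freed. */
int str_list_free(str_node_t *head) {
    int count = 0;
    while (head != NULL) {
        str_node_t *next = head->next;
        free(head->value);    /* the container frees its string payloads */
        free(head);
        head = next;
        count++;
    }
    return count;
}

/* Prepend a payload pointer; the list does not copy or own it. */
void_node_t *void_list_push(void_node_t *head, void *payload) {
    void_node_t *node = malloc(sizeof *node);
    assert(node != NULL);
    node->payload = payload;
    node->next = head;
    return node;
}

/* Free the nodes only. Payloads are left for code that knows their type.
 * Returns the number of nodes freed. */
int void_list_free(void_node_t *head) {
    int count = 0;
    while (head != NULL) {
        void_node_t *next = head->next;
        free(head);           /* payload deliberately left untouched */
        head = next;
        count++;
    }
    return count;
}
```

In this sketch, a payload pushed onto the void-star list is still valid after `void_list_free` returns, so the code that allocated it (or otherwise knows its type) remains responsible for releasing it.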

One complication is the free-flags in `lrec` and `slls`: the idea is that an
entire line is mallocked and presented by the record reader; then individual
fields are split out and populated into linked lists or records. To reduce the
amount of strduping there, free-flags are used to track which fields should be
freed by the destructor and which are freed elsewhere.

The `header_keeper` object is an elaboration on this theme: suppose there is a
CSV file with header line `a,b,c` and data lines `1,2,3`, then `4,5,6`, then
`7,8,9`. Then the keys `a`, `b`, and `c` are shared between all three records;
they are retained in a single `header_keeper` object.

A bigger complication to the otherwise modular nature of Miller is its
*baton-passing memory-management model*. Namely, one class may be responsible
for freeing memory allocated by another class.

For example, using `mlr cat`: The record-reader produces records and returns
pointers to them. The record-mapper is just a pass-through; it returns the
record-pointers it receives. The record-writer formats the records to stdout
and does not return them, so it is responsible for freeing them.

The same holds for `mlr cut -x` and any other mappers which modify record
objects without creating new ones. By contrast, `stats1` et al. produce their
own records; they free what they do not pass on.

# Null-lrec conventions

Record-readers return a null lrec-pointer to signify end of input stream.

Each mapper takes an lrec-pointer as input and returns a linked list of
lrec-pointers.

Null-lrec is input to mappers to signify end of stream: e.g. `sort` or `tac`
should use this as a signal to deliver the sorted/reversed list of rows.

When a mapper has no output before end of stream (e.g. `sort` or `tac` while
accumulating inputs) it returns a null lrec-pointer which is treated as
synonymous with returning an empty list.
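
As a rough illustration of these conventions, here is a minimal sketch of a `tac`-like mapper. The names `rec_t`, `rec_node_t`, `tac_state_t`, and `tac_process` are hypothetical and chosen for this example; Miller's real mappers operate on `lrec_t` and `sllv_t` and have different signatures. The sketch returns a null pointer while accumulating, and on a null input it hands back the retained records in reverse order, followed by a terminal node holding a null record pointer to mark end of stream.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins: NOT Miller's lrec_t or sllv_t. */
typedef struct rec {
    int id;
} rec_t;

typedef struct rec_node {
    rec_t *rec;               /* NULL in the terminal end-of-stream node */
    struct rec_node *next;
} rec_node_t;

typedef struct tac_state {
    rec_node_t *head;         /* retained records, most recent first */
} tac_state_t;

/* Process one input record; a NULL input signals end of stream.
 * While accumulating, returns NULL, which the caller treats as an
 * empty list. At end of stream, returns the retained records in
 * reverse input order, terminated by a node whose rec is NULL.
 * The caller takes over the returned list nodes. */
rec_node_t *tac_process(tac_state_t *state, rec_t *in) {
    if (in != NULL) {
        /* Prepending reverses the input order as records arrive. */
        rec_node_t *node = malloc(sizeof *node);
        assert(node != NULL);
        node->rec = in;
        node->next = state->head;
        state->head = node;
        return NULL;
    }

    /* End of stream: build the terminal node holding a null record. */
    rec_node_t *end = malloc(sizeof *end);
    assert(end != NULL);
    end->rec = NULL;
    end->next = NULL;

    rec_node_t *out = state->head;
    state->head = NULL;       /* the baton passes to the caller */
    if (out == NULL) {
        return end;
    }
    rec_node_t *tail = out;
    while (tail->next != NULL) {
        tail = tail->next;
    }
    tail->next = end;
    return out;
}
```

Feeding records with ids 1 and 2 and then a null pointer yields NULL twice, then a list holding record 2, record 1, and the null-record terminator.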

At end of stream, a mapper returns a linked list of records ending in a null
lrec-pointer.

A null lrec-pointer at end of stream is passed to lrec writers so that they may
produce final output (e.g. pretty-print which produces no output until end of
stream).

# Performance optimizations

The initial implementation of Miller used `lhmss`
(insertion-ordered string-to-string hash map) for record objects.
Keys and values were strduped out of file-input lines. Each of the following
produced from 5 to 30 percent performance gains:

* The `lrec` object is a hashless map suited to low access-to-creation ratio.
See detailed comments in
https://github.com/johnkerl/miller/blob/master/c/containers/lrec.h.
* Free-flags as discussed above removed additional occurrences of string copies.
* Using `mmap` to read files gets rid of double passes on record parsing
(one to find end of line, and another to separate fields) as well as most use
of `malloc`. Note however that standard input cannot be mmapped, so both
record-reader options are retained.

# Source-code indexing

Please see https://sourcegraph.com/github.com/johnkerl/miller