| Name | Date | Size | #Lines | LOC |
| --- | --- | --- | --- | --- |
| .. | 03-May-2022 | - | | |
| auxents/ | 24-Mar-2021 | - | 1,143 | 947 |
| cli/ | 24-Mar-2021 | - | 3,530 | 2,892 |
| containers/ | 24-Mar-2021 | - | 10,789 | 7,659 |
| dsl/ | 24-Mar-2021 | - | 15,509 | 11,646 |
| experimental/ | 24-Mar-2021 | - | 1,592 | 1,264 |
| input/ | 24-Mar-2021 | - | 5,697 | 4,487 |
| lib/ | 24-Mar-2021 | - | 9,207 | 6,676 |
| mapping/ | 24-Mar-2021 | - | 14,236 | 11,601 |
| msys2/ | 24-Mar-2021 | - | 43 | 36 |
| output/ | 24-Mar-2021 | - | 2,325 | 1,927 |
| parsing/ | 24-Mar-2021 | - | 12,564 | 10,008 |
| reg_test/ | 24-Mar-2021 | - | 75,442 | 66,242 |
| stream/ | 24-Mar-2021 | - | 877 | 699 |
| tools/ | 24-Mar-2021 | - | 99 | 74 |
| u/ | 24-Mar-2021 | - | 69 | 65 |
| unit_test/ | 24-Mar-2021 | - | 6,371 | 5,189 |
| .gitignore | 24-Mar-2021 | 119 | 16 | 15 |
| .vimrc | 24-Mar-2021 | 110 | 6 | 5 |
| Makefile.am | 24-Mar-2021 | 1.9 KiB | 73 | 27 |
| Makefile.in | 24-Mar-2021 | 28 KiB | 859 | 721 |
| Makefile.no-autoconfig | 24-Mar-2021 | 14.4 KiB | 544 | 433 |
| Makefile.windows | 24-Mar-2021 | 13.4 KiB | 500 | 391 |
| README.md | 24-Mar-2021 | 4.8 KiB | 101 | 77 |
| asanmk | 24-Mar-2021 | 87 | 5 | 2 |
| camake | 24-Mar-2021 | 360 | 12 | 6 |
| cmake | 24-Mar-2021 | 341 | 13 | 6 |
| draft-release-notes.md | 24-Mar-2021 | 2.1 KiB | 27 | 14 |
| mlrmain.c | 24-Mar-2021 | 998 | 39 | 26 |
| mlrvers.h | 24-Mar-2021 | 166 | 6 | 4 |
| oo | 24-Mar-2021 | 1.8 KiB | 78 | 61 |
| pre-travis.sh | 24-Mar-2021 | 73 | 7 | 5 |
| pushl | 24-Mar-2021 | 99 | 6 | 3 |
| regdiff | 24-Mar-2021 | 89 | 3 | 1 |
| stdlib.mlr | 24-Mar-2021 | 493 | 38 | 33 |
| vgrun | 24-Mar-2021 | 243 | 11 | 7 |
| winpatch.diff | 24-Mar-2021 | 2.3 KiB | 84 | 77 |
| winrun.bat | 24-Mar-2021 | 36 | 2 | 1 |

README.md

# Data flow

In Miller's data flow, records are produced by a record-reader in `input/`,
passed through one or more mappers in `mapping/`, and written by a
record-writer in `output/`, all controlled by logic in `stream/`. Argument
parsing for initial stream setup is in `cli/`.
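The reader, mapper chain, and writer stages described above can be sketched as a single loop. This is only an illustration under stated assumptions: `reader_t`, `mapper_fn`, `reader_next`, `map_upcase`, `writer_put`, and `run_stream` are hypothetical names, records are plain strings, and none of this is Miller's actual API.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the reader and mapper roles; not Miller types. */
typedef struct { const char **lines; size_t n, pos; } reader_t;
typedef void (*mapper_fn)(char *rec);   /* a mapper that edits a record in place */

/* Reader: copy the next input line into buf; return 0 at end of input. */
int reader_next(reader_t *r, char *buf, size_t cap) {
    if (r->pos >= r->n) return 0;
    snprintf(buf, cap, "%s", r->lines[r->pos++]);
    return 1;
}

/* Example mapper: upper-case every character of the record. */
void map_upcase(char *rec) {
    for (; *rec != '\0'; rec++) *rec = (char)toupper((unsigned char)*rec);
}

/* Writer: append the record and a newline to the output buffer. */
void writer_put(char *out, size_t cap, const char *rec) {
    size_t len = strlen(out);
    if (len < cap) snprintf(out + len, cap - len, "%s\n", rec);
}

/* Stream logic: reader -> mapper chain -> writer, one record at a time.
 * Returns the number of records processed. */
size_t run_stream(reader_t *r, mapper_fn *chain, size_t nmappers,
                  char *out, size_t cap) {
    char rec[256];
    size_t count = 0;
    assert(cap > 0);
    out[0] = '\0';
    while (reader_next(r, rec, sizeof rec)) {
        for (size_t i = 0; i < nmappers; i++) chain[i](rec);
        writer_put(out, cap, rec);
        count++;
    }
    return count;
}
```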

# Container names

The user-visible concept of *stream record* (or *srec*) is implemented in the
`lrec_t` (*linked-record type*) data structure. The user-visible concept of
*out-of-stream variables* is implemented using the `mlhmmv_t` (multi-level
hashmap of mlrvals) structure. Source-code comments and names within the code
refer to `srec`/`lrec` and `oosvar`/`mlhmmv` depending on the context.

While those two data structures hold user-visible data, others are used only
in Miller's implementation: `slls` and `sllv` are singly-linked lists of
string and void-star respectively; `lhmss` is a linked hashmap from string to
string; `lhmsi` is a linked hashmap from string to int; and so on.
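To make the idea of a linked record concrete, here is a minimal sketch of key-value pairs kept in insertion order with lookup by linear scan. The names `rec_t`, `field_t`, `rec_put`, and `rec_get` are assumptions chosen for illustration; the real `lrec_t` is defined in `containers/` and differs in detail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified linked record: fields in insertion order,
 * found by linear scan, with no hash table. */
typedef struct field {
    char *key;
    char *value;
    struct field *next;
} field_t;

typedef struct {
    field_t *head;
    field_t *tail;
    int field_count;
} rec_t;

/* Copy a string onto the heap. */
char *copy_string(const char *s) {
    char *p = malloc(strlen(s) + 1);
    assert(p != NULL);
    return strcpy(p, s);
}

rec_t *rec_alloc(void) {
    rec_t *r = calloc(1, sizeof *r);
    assert(r != NULL);
    return r;
}

/* Return the value stored under key, or NULL if the key is absent. */
const char *rec_get(const rec_t *r, const char *key) {
    for (const field_t *f = r->head; f != NULL; f = f->next)
        if (strcmp(f->key, key) == 0) return f->value;
    return NULL;
}

/* Overwrite the value in place if key exists, else append a new field. */
void rec_put(rec_t *r, const char *key, const char *value) {
    for (field_t *f = r->head; f != NULL; f = f->next) {
        if (strcmp(f->key, key) == 0) {
            free(f->value);
            f->value = copy_string(value);
            return;
        }
    }
    field_t *f = malloc(sizeof *f);
    assert(f != NULL);
    f->key = copy_string(key);
    f->value = copy_string(value);
    f->next = NULL;
    if (r->tail != NULL) r->tail->next = f; else r->head = f;
    r->tail = f;
    r->field_count++;
}

/* Free every field, then the record itself. */
void rec_free(rec_t *r) {
    field_t *f = r->head;
    while (f != NULL) {
        field_t *next = f->next;
        free(f->key);
        free(f->value);
        free(f);
        f = next;
    }
    free(r);
}
```

Linear scan is a reasonable trade when a record has few fields and is created far more often than it is searched.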

# Memory management

Miller is streaming and as near stateless as possible. For most Miller
functions, you can ingest a 20GB file with 4GB RAM, no problem. For example,
`mlr cat` of a DKVP file retains no data in memory from one line to another;
`mlr cat` of a CSV file retains only the field names from the header line. The
`stats1` and `stats2` commands retain only aggregation state (e.g. the count
and sum over specified fields needed to compute their mean). By contrast, the
`mlr tac` and `mlr sort` commands necessarily consume and retain all input
records before emitting any output records.
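As a small illustration of retaining only aggregation state, a running mean needs just a count and a sum, however many records pass through. The type and function names below are hypothetical and are not taken from the `stats1` source.

```c
#include <assert.h>

/* Hypothetical aggregation state for a mean: nothing else is retained. */
typedef struct {
    unsigned long long count;
    double sum;
} mean_state_t;

/* Fold one value into the state; the value itself is not stored. */
void mean_ingest(mean_state_t *s, double x) {
    s->count++;
    s->sum += x;
}

/* Compute the mean at end of stream; requires at least one value. */
double mean_emit(const mean_state_t *s) {
    assert(s->count > 0);
    return s->sum / (double)s->count;
}
```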

Miller classes are in general modular, following a constructor/destructor model
with minimal dependencies between classes. As a general rule, void-star
payloads (`sllv`, `lhmslv`) must be freed by the code using the container
(which has access to the data type), whereas non-void-star payloads (`slls`,
`hss`) are freed by the container class.

One complication is the free-flags in `lrec` and `slls`. The idea is that an
entire line is mallocked and presented by the record reader; then individual
fields are split out and populated into linked lists or records. To reduce the
amount of strduping there, free-flags track which fields should be freed by the
destructor and which are freed elsewhere.
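The free-flag idea can be sketched as follows. A line is split in place so that keys and values point into the one line buffer, and a flag is set only when a field is later given its own heap copy. All names here (`FREE_KEY`, `FREE_VALUE`, `rec_from_line`, `rec_set_value`) are assumptions for illustration, as is the choice to let the record own the line buffer; the actual `lrec` code may arrange ownership differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FREE_KEY   0x01   /* destructor must free this field's key */
#define FREE_VALUE 0x02   /* destructor must free this field's value */
#define MAX_FIELDS 16

typedef struct {
    char *key;
    char *value;
    unsigned char free_flags;
} field_t;

typedef struct {
    char *line;                 /* one allocation backing the unflagged fields */
    field_t fields[MAX_FIELDS];
    int nfields;
} rec_t;

/* Split "k1=v1,k2=v2" in place: separators become NULs and keys/values point
 * into line, so no per-field copies are made and no free-flags are set.
 * The record takes ownership of the heap-allocated line. */
rec_t *rec_from_line(char *line) {
    rec_t *r = calloc(1, sizeof *r);
    assert(r != NULL);
    r->line = line;
    char *p = line;
    while (p != NULL && *p != '\0' && r->nfields < MAX_FIELDS) {
        char *comma = strchr(p, ',');
        if (comma != NULL) *comma = '\0';
        char *eq = strchr(p, '=');
        if (eq == NULL) eq = p + strlen(p);   /* no '=': empty value */
        else *eq++ = '\0';
        field_t *f = &r->fields[r->nfields++];
        f->key = p;
        f->value = eq;
        f->free_flags = 0;
        p = (comma != NULL) ? comma + 1 : NULL;
    }
    return r;
}

/* Replace a value with a heap copy and flag it for the destructor. */
void rec_set_value(rec_t *r, int i, const char *value) {
    field_t *f = &r->fields[i];
    if (f->free_flags & FREE_VALUE) free(f->value);
    f->value = malloc(strlen(value) + 1);
    assert(f->value != NULL);
    strcpy(f->value, value);
    f->free_flags |= FREE_VALUE;
}

/* Destructor: free only flagged strings, then the line and the record.
 * Returns how many per-field frees were performed. */
int rec_free(rec_t *r) {
    int freed = 0;
    for (int i = 0; i < r->nfields; i++) {
        if (r->fields[i].free_flags & FREE_KEY)   { free(r->fields[i].key); freed++; }
        if (r->fields[i].free_flags & FREE_VALUE) { free(r->fields[i].value); freed++; }
    }
    free(r->line);
    free(r);
    return freed;
}
```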

The `header_keeper` object is an elaboration on this theme. Suppose there is a
CSV file with header line `a,b,c` and data lines `1,2,3`, then `4,5,6`, then
`7,8,9`. The keys `a`, `b`, and `c` are shared between all three records, so
they are retained in a single `header_keeper` object.

A bigger complication to the otherwise modular nature of Miller is its
*baton-passing memory-management model*: one class may be responsible for
freeing memory allocated by another class.

For example, using `mlr cat`: the record-reader produces records and returns
pointers to them. The record-mapper is just a pass-through; it returns the
record-pointers it receives. The record-writer formats the records to stdout
and does not return them, so it is responsible for freeing them.

The same holds for `mlr cut -x` and any other mappers which modify record
objects without creating new ones. By contrast, `stats1` et al. produce their
own records; they free what they do not pass on.
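The baton-passing model can be illustrated with a counter of live records: the reader allocates, a pass-through mapper hands the same pointer on, and whichever stage holds a record last frees it. The names below (`reader_produce`, `mapper_cat`, `mapper_drop_empty`, `writer_consume`, `live_records`) are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char *text; } rec_t;   /* hypothetical record */

int live_records = 0;   /* records allocated but not yet freed */

/* Reader: allocates a record and hands ownership to its caller. */
rec_t *reader_produce(const char *text) {
    rec_t *r = malloc(sizeof *r);
    assert(r != NULL);
    r->text = malloc(strlen(text) + 1);
    assert(r->text != NULL);
    strcpy(r->text, text);
    live_records++;
    return r;
}

void rec_free(rec_t *r) {
    free(r->text);
    free(r);
    live_records--;
}

/* Pass-through mapper: returns the pointer it received, so ownership
 * moves on unchanged. */
rec_t *mapper_cat(rec_t *r) {
    return r;
}

/* Filtering mapper: frees what it does not pass on. */
rec_t *mapper_drop_empty(rec_t *r) {
    if (r->text[0] == '\0') {
        rec_free(r);
        return NULL;
    }
    return r;
}

/* Writer: the last holder of the baton, so it frees after formatting. */
void writer_consume(rec_t *r, FILE *out) {
    fprintf(out, "%s\n", r->text);
    rec_free(r);
}
```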

# Null-lrec conventions

Record-readers return a null lrec-pointer to signify end of input stream.

Each mapper takes an lrec-pointer as input and returns a linked list of
lrec-pointers.

A null lrec-pointer is passed to mappers to signify end of stream: e.g. `sort`
or `tac` should use this as a signal to deliver the sorted/reversed list of
rows.

When a mapper has no output before end of stream (e.g. `sort` or `tac` while
accumulating inputs) it returns a null lrec-pointer, which is treated as
synonymous with returning an empty list.

At end of stream, a mapper returns a linked list of records ending in a null
lrec-pointer.

A null lrec-pointer at end of stream is passed to lrec writers so that they may
produce final output (e.g. pretty-print, which produces no output until end of
stream).
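These conventions can be sketched with a `tac`-like mapper. It returns null while accumulating, and on a null input it returns the retained records in reverse order, followed by a list node holding a null record pointer. This is one reading of "ending in a null lrec-pointer"; `rec_t`, `node_t`, `tac_state_t`, and `tac_process` are assumed names, not Miller's.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int id; } rec_t;   /* hypothetical record */

/* Singly-linked list of record pointers. */
typedef struct node {
    rec_t *rec;
    struct node *next;
} node_t;

typedef struct { node_t *stack; } tac_state_t;

node_t *node_new(rec_t *rec, node_t *next) {
    node_t *n = malloc(sizeof *n);
    assert(n != NULL);
    n->rec = rec;
    n->next = next;
    return n;
}

/* Non-null input: retain the record and return NULL (an empty list).
 * Null input: end of stream, so return the retained records, most recent
 * first, followed by a node whose record pointer is NULL. */
node_t *tac_process(tac_state_t *s, rec_t *in) {
    if (in != NULL) {
        s->stack = node_new(in, s->stack);   /* push: head is the newest */
        return NULL;
    }
    node_t *end = node_new(NULL, NULL);      /* end-of-stream marker */
    if (s->stack == NULL) return end;
    node_t *tail = s->stack;
    while (tail->next != NULL) tail = tail->next;
    tail->next = end;
    node_t *out = s->stack;
    s->stack = NULL;
    return out;
}

/* Free the list nodes only; the records belong to whoever consumes them. */
void list_free(node_t *n) {
    while (n != NULL) {
        node_t *next = n->next;
        free(n);
        n = next;
    }
}
```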

# Performance optimizations

The initial implementation of Miller used `lhmss` (insertion-ordered
string-to-string hash map) for record objects. Keys and values were strduped
out of file-input lines. Each of the following produced from 5 to 30 percent
performance gains:

* The `lrec` object is a hashless map suited to low access-to-creation ratio.
See detailed comments in
https://github.com/johnkerl/miller/blob/master/c/containers/lrec.h.
* Free-flags as discussed above removed additional occurrences of string copies.
* Using `mmap` to read files gets rid of double passes on record parsing
(one to find end of line, and another to separate fields) as well as most use
of `malloc`. Note however that standard input cannot be mmapped, so both
record-reader options are retained.
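The single-pass idea behind the `mmap` item can be sketched with the POSIX `mmap` call: once the whole file is addressable as memory, the same scan that finds each line end also finds the field separators, and no per-line buffer is allocated. The function name `scan_mmap` and the comma-separated format are assumptions for illustration; this is not Miller's reader.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count lines and comma-separated fields in one pass over a memory-mapped
 * file. Returns 0 on success, -1 on failure. */
int scan_mmap(const char *path, long *nlines, long *nfields) {
    assert(nlines != NULL && nfields != NULL);
    *nlines = 0;
    *nfields = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }   /* cannot map zero bytes */

    char *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED) return -1;

    const char *end = base + st.st_size;
    int in_line = 0;
    for (const char *p = base; p < end; p++) {
        if (!in_line) { in_line = 1; (*nfields)++; }   /* first field of a line */
        if (*p == ',') (*nfields)++;
        else if (*p == '\n') { (*nlines)++; in_line = 0; }
    }
    if (in_line) (*nlines)++;   /* final line without a trailing newline */

    munmap(base, (size_t)st.st_size);
    return 0;
}
```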

# Source-code indexing

Please see https://sourcegraph.com/github.com/johnkerl/miller