# Miller file/record input

These are the readers for Miller file formats, in stdio and mmap versions. The
stdio and mmap record parsers are similar but not identical, because the
processing order is inverted: the stdio readers get an entire malloc'ed line
and then split it by separators, while the mmap readers split while
discovering the end of line. The code duplication could largely be removed by
having the mmap readers find end-of-lines first and then split up the lines.
However, that requires two passes through the input strings, and for
performance I want just a single pass.
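
As a rough illustration of the single-pass idea, the sketch below splits fields
in place while it is still looking for the end of the line. It is not the
actual reader code: `sketch_split_line` and its parameters are hypothetical
names, the separators are assumed to be single characters, and the buffer is
assumed to be writable and to end each line with the record separator.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sketch: split one line of [p, end) in place, in one pass.
// Field separators and the record separator are overwritten with '\0'.
// Returns the number of fields stored in `fields` (at most max_fields) and
// sets *pnext to the first byte after the line.
static size_t sketch_split_line(char* p, char* end, char ifs, char irs,
    char** fields, size_t max_fields, char** pnext)
{
    assert(p != NULL && end != NULL && p <= end);
    assert(fields != NULL && pnext != NULL && max_fields > 0);

    size_t nfields = 0;
    fields[nfields++] = p;
    for (; p < end; p++) {
        if (*p == irs) {
            *p = '\0'; // end of line found: terminate the last field
            p++;
            break;
        } else if (*p == ifs) {
            *p = '\0'; // end of field found: terminate it, start the next
            if (nfields < max_fields)
                fields[nfields++] = p + 1;
        }
    }
    *pnext = p;
    return nfields;
}
```

A two-pass variant would first scan for `irs` and then rescan the same bytes
for `ifs`; here each byte is examined once.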

While there are separate record-writers for CSV and pretty-print, there is just
one common record-reader: pretty-print is CSV with a space as the field
separator and `allow_repeat_ifs` set to true.
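
To make the effect of `allow_repeat_ifs` concrete, here is a minimal sketch
that counts fields with and without collapsing runs of the separator. The
function `sketch_count_fields` is a hypothetical helper, not part of the
reader, and it makes no claim about how the real code treats leading or
trailing separators.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sketch: count the fields in a NUL-terminated line. When
// allow_repeat_ifs is nonzero, a run of consecutive separators is treated as
// a single separator, which is what space-aligned pretty-print input needs.
static size_t sketch_count_fields(const char* line, char ifs, int allow_repeat_ifs) {
    assert(line != NULL);
    size_t nfields = 1;
    for (const char* p = line; *p != '\0'; p++) {
        if (*p != ifs)
            continue;
        nfields++;
        if (allow_repeat_ifs) {
            // Skip the rest of this run of separators.
            while (p[1] == ifs)
                p++;
        }
    }
    return nfields;
}
```

With a space as the separator, the line `a   b  c` counts as three fields when
repeats are allowed; without that setting, each extra space starts a new, empty
field.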

The idea of `header_keeper` objects for CSV: each `header_keeper` object retains
the input-line backing and the `slls_t` for a CSV header line, which is used by
one or more CSV data lines. Meanwhile, some mappers (e.g. `sort`, `tac`) retain
input records from the entire data stream, which may include header-schema
changes. This means we need to keep headers intact as long as any lrecs are
pointing to them. One option is reference counting, which I experimented with;
it was messy and error-prone. The approach used here is to keep a hash map from
header schema to `header_keeper` object. The current `pheader_keeper` is a
pointer into one of those. Then, when the reader is freed, all its
header-keepers are freed.
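
The ownership scheme can be sketched roughly as follows. This is an
illustration only: the `sketch_*` types and functions are hypothetical, a
linked list stands in for the hash map, and the keeper holds just a copy of the
header line rather than the line backing plus an `slls_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for a header_keeper: owns one header line.
typedef struct sketch_header_keeper {
    char* line;                        // owned copy of the header line
    struct sketch_header_keeper* next; // chain standing in for the hash map
} sketch_header_keeper_t;

// Hypothetical reader state: all keepers seen so far, plus the current one.
typedef struct {
    sketch_header_keeper_t* keepers;
    sketch_header_keeper_t* current; // points into `keepers`
} sketch_reader_t;

// Makes the keeper for this header line current, creating it on first sight.
// Returns NULL if allocation fails.
static sketch_header_keeper_t* sketch_reader_set_header(sketch_reader_t* reader, const char* line) {
    assert(reader != NULL && line != NULL);
    for (sketch_header_keeper_t* k = reader->keepers; k != NULL; k = k->next) {
        if (strcmp(k->line, line) == 0) {
            reader->current = k; // schema seen before: reuse its keeper
            return k;
        }
    }
    sketch_header_keeper_t* k = malloc(sizeof *k);
    if (k == NULL)
        return NULL;
    k->line = malloc(strlen(line) + 1);
    if (k->line == NULL) {
        free(k);
        return NULL;
    }
    strcpy(k->line, line);
    k->next = reader->keepers;
    reader->keepers = k;
    reader->current = k;
    return k;
}

// Frees every keeper in one sweep, as when the reader itself is freed.
// Returns the number of keepers freed.
static int sketch_reader_free_headers(sketch_reader_t* reader) {
    assert(reader != NULL);
    int nfreed = 0;
    sketch_header_keeper_t* k = reader->keepers;
    while (k != NULL) {
        sketch_header_keeper_t* next = k->next;
        free(k->line);
        free(k);
        k = next;
        nfreed++;
    }
    reader->keepers = NULL;
    reader->current = NULL;
    return nfreed;
}
```

Because keepers are released only when the reader is, records retained by
mappers such as `sort` can keep pointing at an older header after the schema
changes, without any per-record reference counts.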

There is some code duplication involving single-character and multi-character
IRS, IFS, and IPS. While single-character is a special case of multi-character,
keeping separate implementations for the single-character and multi-character
versions is worthwhile for performance. The difference is between `*p == ifs`
and `streqn(p, ifs, ifslen)`: even with function inlining, the latter is more
expensive than the former in the single-character case.
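
The two comparison styles can be put side by side in a small sketch. Both
functions are hypothetical helpers written for this note, and the standard
`strncmp` is used as a stand-in for `streqn`, whose definition is not shown
here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: count fields with a one-byte separator. The test in
// the loop is a single byte comparison, as in `*p == ifs`.
static size_t sketch_count_fields_single_char(const char* line, char ifs) {
    assert(line != NULL);
    size_t nfields = 1;
    for (const char* p = line; *p != '\0'; p++) {
        if (*p == ifs)
            nfields++;
    }
    return nfields;
}

// Hypothetical helper: count fields with a separator of ifslen bytes. The
// test in the loop is a bounded string comparison at every position
// (strncmp here, standing in for streqn).
static size_t sketch_count_fields_multi_char(const char* line, const char* ifs, size_t ifslen) {
    assert(line != NULL && ifs != NULL && ifslen > 0);
    size_t nfields = 1;
    const char* p = line;
    while (*p != '\0') {
        if (strncmp(p, ifs, ifslen) == 0) {
            nfields++;
            p += ifslen; // step over the whole separator
        } else {
            p++;
        }
    }
    return nfields;
}
```

The multi-character version also handles a one-byte separator correctly, so
the single-character version exists purely as a fast path.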

Example timing info for a million-line file is as follows:

```
TIME IN SECONDS 0.945 -- mlr --irs lf --ifs , --ips = check ../data/big.dkvp2
TIME IN SECONDS 1.139 -- mlr --irs crlf --ifs , --ips = check ../data/big.dkvp2
TIME IN SECONDS 1.291 -- mlr --irs lf --ifs /, --ips =: check ../data/big.dkvp2
TIME IN SECONDS 1.443 -- mlr --irs crlf --ifs /, --ips =: check ../data/big.dkvp2
```

That is, even when averaged over multiple runs, special-casing the
single-character-separator code yields performance improvements of 20-30%,
which is worth doing.