| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| Makefile.am | 24-Mar-2021 | 842 B | 36 | 33 |
| Makefile.in | 24-Mar-2021 | 39.8 KiB | 768 | 666 |
| README.md | 24-Mar-2021 | 2.5 KiB | 46 | 38 |
| byte_reader.h | 24-Mar-2021 | 1.1 KiB | 31 | 13 |
| byte_readers.h | 24-Mar-2021 | 289 B | 12 | 8 |
| file_ingestor_stdio.c | 24-Mar-2021 | 2.3 KiB | 71 | 62 |
| file_ingestor_stdio.h | 24-Mar-2021 | 1.2 KiB | 24 | 10 |
| file_reader_stdio.c | 24-Mar-2021 | 1.5 KiB | 54 | 48 |
| file_reader_stdio.h | 24-Mar-2021 | 428 B | 12 | 5 |
| json_parser.c | 24-Mar-2021 | 25.8 KiB | 1,031 | 803 |
| json_parser.h | 24-Mar-2021 | 4.1 KiB | 163 | 86 |
| line_readers.c | 24-Mar-2021 | 5.8 KiB | 237 | 208 |
| line_readers.h | 24-Mar-2021 | 2.3 KiB | 71 | 50 |
| lrec_reader.h | 24-Mar-2021 | 1 KiB | 27 | 20 |
| lrec_reader_gen.c | 24-Mar-2021 | 2.5 KiB | 77 | 60 |
| lrec_reader_in_memory.c | 24-Mar-2021 | 2.5 KiB | 71 | 52 |
| lrec_reader_stdio_csv.c | 24-Mar-2021 | 22.5 KiB | 631 | 498 |
| lrec_reader_stdio_csvlite.c | 24-Mar-2021 | 17.4 KiB | 574 | 472 |
| lrec_reader_stdio_dkvp.c | 24-Mar-2021 | 11.8 KiB | 346 | 270 |
| lrec_reader_stdio_json.c | 24-Mar-2021 | 8.5 KiB | 217 | 132 |
| lrec_reader_stdio_nidx.c | 24-Mar-2021 | 9.8 KiB | 275 | 219 |
| lrec_reader_stdio_xtab.c | 24-Mar-2021 | 6 KiB | 187 | 147 |
| lrec_readers.c | 24-Mar-2021 | 1.9 KiB | 44 | 41 |
| lrec_readers.h | 24-Mar-2021 | 3.3 KiB | 56 | 39 |
| mlr_json_adapter.c | 24-Mar-2021 | 10.3 KiB | 321 | 273 |
| mlr_json_adapter.h | 24-Mar-2021 | 1.5 KiB | 31 | 11 |
| peek_file_reader.c | 24-Mar-2021 | 890 B | 29 | 26 |
| peek_file_reader.h | 24-Mar-2021 | 3 KiB | 95 | 66 |
| stdio_byte_reader.c | 24-Mar-2021 | 2.9 KiB | 99 | 84 |
| string_byte_reader.c | 24-Mar-2021 | 1.9 KiB | 61 | 47 |

README.md

# Miller file/record input

These are readers for Miller file formats, in stdio and mmap versions. The stdio
and mmap record parsers are similar but not identical, because they process
input in opposite orders: the stdio readers get an entire malloced line and
then split it by separators, whereas the mmap readers split while discovering
the end of line. The code duplication could be largely removed by having the
mmap readers find end-of-lines first and then split up the lines. However, that
requires two passes through the input strings, and for performance I want just
a single pass.

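As a rough illustration of the single-pass order (splitting while discovering the end of line), here is a minimal sketch in C. The function name `split_line_single_pass`, its signature, and the in-place NUL-termination are assumptions made for this example, not Miller's actual reader code, and it handles only single-character IFS and IRS.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not Miller's actual code. Splits one line of `buf` in
// place in a single pass: each ifs byte is overwritten with '\0', and the scan
// stops at the first irs byte (or at the terminating NUL), so the end of line
// is discovered during the same scan that finds the fields.
// Returns the number of field pointers stored in `fields`; fields beyond
// `max_fields` are silently dropped. `*pnext` is set to the start of the next
// line, or to NULL when no irs byte was found.
static int split_line_single_pass(char* buf, char ifs, char irs,
    char** fields, int max_fields, char** pnext)
{
    assert(buf != NULL && fields != NULL && pnext != NULL && max_fields > 0);
    int n = 0;
    char* p = buf;
    fields[n++] = p;
    for (; *p != '\0' && *p != irs; p++) {
        if (*p == ifs) {
            *p = '\0';
            if (n < max_fields)
                fields[n++] = p + 1;
        }
    }
    if (*p == irs) {
        *p = '\0';
        *pnext = p + 1;
    } else {
        *pnext = NULL;
    }
    return n;
}
```

A two-pass variant would first scan for the IRS byte and then rescan the same bytes looking for IFS; in this sketch each input byte is examined once.
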
While there are separate record-writers for CSV and pretty-print, there is just
a common record-reader: pretty-print is CSV with field separator being a space,
and `allow_repeat_ifs` set to true.

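To make the role of `allow_repeat_ifs` concrete, the following is a minimal sketch in C of a splitter that can treat a run of separators as one. The function `split_fields` and its signature are hypothetical and written only for this example; Miller's actual CSV reader is structured differently.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not Miller's actual reader. Splits `line` in place on
// the single-character separator `ifs`. When `allow_repeat_ifs` is nonzero, a
// run of consecutive separators counts as one, and leading or trailing runs
// produce no empty fields; this is what lets space-padded pretty-print
// columns be read like CSV.
// Returns the number of field pointers stored (at most `max_fields`).
static int split_fields(char* line, char ifs, int allow_repeat_ifs,
    char** fields, int max_fields)
{
    assert(line != NULL && fields != NULL && max_fields > 0 && ifs != '\0');
    int n = 0;
    char* p = line;
    if (allow_repeat_ifs)
        while (*p == ifs)
            p++; // skip leading padding
    fields[n++] = p;
    while (*p != '\0') {
        if (*p != ifs) {
            p++;
            continue;
        }
        *p++ = '\0'; // terminate the field that just ended
        if (allow_repeat_ifs) {
            while (*p == ifs)
                p++; // collapse the rest of the separator run
            if (*p == '\0')
                break; // trailing padding: no empty last field
        }
        if (n < max_fields)
            fields[n++] = p;
    }
    return n;
}
```

With `allow_repeat_ifs` set, `"a   b  c"` split on a space yields three fields; with it unset, `"a,,c"` split on a comma also yields three fields, the middle one empty.
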
The idea behind `header_keeper` objects for CSV: each `header_keeper` object
retains the input-line backing and the `slls_t` for a CSV header line which is
used by one or more CSV data lines. Meanwhile some mappers (e.g. `sort`, `tac`)
retain input records from the entire data stream, which may include
header-schema changes in the input stream. This means we need to keep headers
intact as long as any lrecs are pointing to them. One option is reference
counting, which I experimented with; it was messy and error-prone. The approach
used here is to keep a hash map from header-schema to `header_keeper` object.
The current `pheader_keeper` is a pointer into one of those. Then when the
reader is freed, all its header-keepers are freed.

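The ownership scheme described above can be sketched in C roughly as follows. Every name in this block (`header_keeper_t`, `reader_state_t`, `reader_set_header`, `reader_free`, the bucket count, the hash function) is a hypothetical stand-in chosen for the example. The sketch also simplifies: it keys on the raw header line and stores only a copy of that line, whereas the real keepers also hold an `slls_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical types, not Miller's actual ones. A header keeper owns the
// backing storage for one distinct header line.
typedef struct header_keeper {
    char* header_line;          // owned copy of the header text
    struct header_keeper* next; // next keeper in the same hash bucket
} header_keeper_t;

#define NUM_BUCKETS 16

typedef struct reader_state {
    header_keeper_t* buckets[NUM_BUCKETS]; // header schema -> keeper
    header_keeper_t* pheader_keeper;       // keeper for the current schema
} reader_state_t;

static unsigned hash_string(const char* s) {
    unsigned h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NUM_BUCKETS;
}

// Called whenever a header line is read. Reuses the keeper if this schema has
// been seen before; otherwise allocates a new one. Either way the keeper
// becomes the current one. Returns NULL on allocation failure.
static header_keeper_t* reader_set_header(reader_state_t* pstate,
    const char* header_line)
{
    assert(pstate != NULL && header_line != NULL);
    unsigned idx = hash_string(header_line);
    for (header_keeper_t* pk = pstate->buckets[idx]; pk != NULL; pk = pk->next) {
        if (strcmp(pk->header_line, header_line) == 0) {
            pstate->pheader_keeper = pk;
            return pk;
        }
    }
    header_keeper_t* pk = malloc(sizeof *pk);
    if (pk == NULL)
        return NULL;
    size_t len = strlen(header_line) + 1;
    pk->header_line = malloc(len);
    if (pk->header_line == NULL) {
        free(pk);
        return NULL;
    }
    memcpy(pk->header_line, header_line, len);
    pk->next = pstate->buckets[idx];
    pstate->buckets[idx] = pk;
    pstate->pheader_keeper = pk;
    return pk;
}

// Headers are released only here, when the reader itself is freed, so any
// record still pointing at a header stays valid until then.
static void reader_free(reader_state_t* pstate) {
    assert(pstate != NULL);
    for (int i = 0; i < NUM_BUCKETS; i++) {
        header_keeper_t* pk = pstate->buckets[i];
        while (pk != NULL) {
            header_keeper_t* pnext = pk->next;
            free(pk->header_line);
            free(pk);
            pk = pnext;
        }
        pstate->buckets[i] = NULL;
    }
    pstate->pheader_keeper = NULL;
}
```

The trade-off relative to reference counting is that no header is released early: a stream with many distinct schemas keeps all of them in memory until the reader is freed, in exchange for much simpler bookkeeping.
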
There is some code duplication involving single-character and multi-character
IRS, IFS, and IPS. While single-character is a special case of multi-character,
keeping separate implementations for single-character and multi-character
versions is worthwhile for performance. The difference is between `*p == ifs`
and `streqn(p, ifs, ifslen)`: even with function inlining, the latter is more
expensive than the former in the single-character case.

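The two comparison styles can be put side by side in a small C sketch. The field-counting functions and `streqn_sketch` below are hypothetical and exist only to contrast the inner loops; in particular, `streqn_sketch` is an assumed stand-in built on `strncmp`, and Miller's own `streqn` may be defined differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Assumed stand-in for the README's streqn: true when the first n bytes of
// a and b match. Miller's actual definition may differ.
static int streqn_sketch(const char* a, const char* b, size_t n) {
    return strncmp(a, b, n) == 0;
}

// Single-character IFS: one byte comparison per input byte.
static int count_fields_single(const char* line, char ifs) {
    int n = 1;
    for (const char* p = line; *p != '\0'; p++)
        if (*p == ifs)
            n++;
    return n;
}

// Multi-character IFS: a length-bounded string comparison at each position,
// skipping over the whole separator on a match.
static int count_fields_multi(const char* line, const char* ifs, size_t ifslen) {
    assert(ifslen > 0);
    int n = 1;
    const char* p = line;
    while (*p != '\0') {
        if (streqn_sketch(p, ifs, ifslen)) {
            n++;
            p += ifslen;
        } else {
            p++;
        }
    }
    return n;
}
```

Both functions give the same count when `ifslen` is 1, which is why the single-character version is purely a performance specialization.
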
Example timing info for a million-line file is as follows:

```
TIME IN SECONDS 0.945 -- mlr --irs lf   --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.139 -- mlr --irs crlf --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.291 -- mlr --irs lf   --ifs /, --ips =: check ../data/big.dkvp2
TIME IN SECONDS 1.443 -- mlr --irs crlf --ifs /, --ips =: check ../data/big.dkvp2
```

That is, even when averaged over multiple runs, special-casing the
single-character-separator code reduces run time by 20-30% (0.945 vs. 1.291
seconds with LF line endings, 1.139 vs. 1.443 seconds with CRLF): this is worth
doing.
