# Miller file/record input

These are the readers for Miller's file formats, in stdio and mmap versions.
The stdio and mmap record parsers are similar but not identical, because they
process input in opposite orders: the stdio readers get an entire mallocked
line and then split it by separators, whereas the mmap readers split while
discovering the end of line. The code duplication could be largely removed by
having the mmap readers find end-of-lines and then split up the lines. However,
that requires two passes through the input strings, and for performance I want
just a single pass.
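
As a rough illustration of the two processing orders, here is a minimal sketch
for a single-character IFS and IRS. The names `split_two_pass`,
`split_one_pass`, and `MAX_FIELDS` are hypothetical and are not Miller's actual
functions; the sketch only shows why one order touches each byte twice and the
other once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 16 /* illustrative limit, not a Miller constant */

/* Hypothetical helper: find the end of line first, then split the line.
 * Every byte of the line is visited twice. Returns the field count and
 * sets *pnext to the byte after the line. */
static int split_two_pass(char* buf, char ifs, char irs, char** fields, char** pnext) {
    assert(buf != NULL && fields != NULL && pnext != NULL);
    /* Pass 1: locate and terminate the line. */
    char* eol = strchr(buf, irs);
    if (eol != NULL) {
        *eol = 0;
        *pnext = eol + 1;
    } else {
        *pnext = buf + strlen(buf);
    }
    /* Pass 2: split the terminated line on the field separator. */
    int n = 0;
    char* start = buf;
    for (char* p = buf; ; p++) {
        if (*p == ifs || *p == 0) {
            int at_end = (*p == 0);
            *p = 0;
            if (n < MAX_FIELDS) fields[n++] = start;
            if (at_end) break;
            start = p + 1;
        }
    }
    return n;
}

/* Hypothetical helper: split on the field separator while looking for the
 * end of line, so every byte of the line is visited once. */
static int split_one_pass(char* buf, char ifs, char irs, char** fields, char** pnext) {
    assert(buf != NULL && fields != NULL && pnext != NULL);
    int n = 0;
    char* start = buf;
    for (char* p = buf; ; p++) {
        char c = *p;
        if (c == ifs) {
            *p = 0;
            if (n < MAX_FIELDS) fields[n++] = start;
            start = p + 1;
        } else if (c == irs || c == 0) {
            *p = 0;
            if (n < MAX_FIELDS) fields[n++] = start;
            *pnext = (c == 0) ? p : p + 1;
            return n;
        }
    }
}
```

Both helpers produce the same fields for the same input; the difference is
only in how many times each byte is examined.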

While there are separate record-writers for CSV and pretty-print, there is just
a common record-reader: pretty-print is CSV with field separator being a space,
and `allow_repeat_ifs` set to true.
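
To make the role of `allow_repeat_ifs` concrete, here is a minimal sketch of an
in-place field splitter. The function `split_fields` and its signature are
assumptions made for illustration, not the reader's actual code; only the flag
name comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split `line` in place on a single-character `ifs`.
 * When allow_repeat_ifs is nonzero, a run of consecutive separators counts
 * as one separator, which is what space-aligned pretty-print input needs.
 * Returns the number of fields stored in `fields`. */
static int split_fields(char* line, char ifs, int allow_repeat_ifs,
                        char** fields, int max_fields) {
    assert(line != NULL && fields != NULL && ifs != 0);
    int n = 0;
    char* p = line;
    if (allow_repeat_ifs) {
        while (*p == ifs) p++; /* skip leading padding */
    }
    char* start = p;
    for (; *p; p++) {
        if (*p != ifs) continue;
        *p = 0;
        if (n < max_fields) fields[n++] = start;
        if (allow_repeat_ifs) {
            while (p[1] == ifs) p++; /* collapse the rest of the run */
        }
        start = p + 1;
    }
    /* Emit the last field, unless it is only trailing padding. */
    if (n < max_fields && !(allow_repeat_ifs && *start == 0)) {
        fields[n++] = start;
    }
    return n;
}
```

With the flag off, `a,,b` splits on `,` into three fields with an empty one in
the middle; with the flag on, `a   b  c` splits on a space into three non-empty
fields.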

The idea behind `header_keeper` objects for CSV: each `header_keeper` object
retains the input-line backing and the `slls_t` for a CSV header line which is
used by one or more CSV data lines. Meanwhile, some mappers (e.g. `sort`, `tac`)
retain input records from the entire data stream, which may include
header-schema changes. This means headers must be kept intact as long as any
lrecs point to them. One option is reference counting, which I experimented
with; it was messy and error-prone. The approach used here is to keep a hash
map from header-schema to `header_keeper` object. The current `pheader_keeper`
is a pointer into one of those. Then when the reader is freed, all its
header-keepers are freed.
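
The ownership scheme can be sketched as follows. This is a simplified stand-in
under stated assumptions: the types `hk_t` and `hk_map_t`, the fixed bucket
count, and the use of the raw header line as the schema key are all
hypothetical and do not reproduce Miller's `header_keeper` or its hash-map
types.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64 /* illustrative size */

/* Hypothetical stand-in for a header keeper: owns the header's backing. */
typedef struct hk {
    char* header_line; /* owned copy of the header line, used as the key */
    struct hk* next;   /* bucket chain */
} hk_t;

/* Hypothetical reader-side state: schema -> keeper, plus the current one. */
typedef struct {
    hk_t* buckets[NBUCKETS];
    hk_t* current;
} hk_map_t;

static unsigned hash_str(const char* s) {
    unsigned h = 5381;
    while (*s) h = h * 33u + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the keeper for this schema, creating it only on first sight.
 * Records would point into the returned keeper, never own it. */
static hk_t* hk_map_get_or_add(hk_map_t* map, const char* header_line) {
    assert(map != NULL && header_line != NULL);
    unsigned i = hash_str(header_line);
    for (hk_t* k = map->buckets[i]; k != NULL; k = k->next) {
        if (strcmp(k->header_line, header_line) == 0) {
            map->current = k;
            return k;
        }
    }
    hk_t* k = malloc(sizeof *k);
    if (k == NULL) return NULL;
    size_t len = strlen(header_line) + 1;
    k->header_line = malloc(len);
    if (k->header_line == NULL) {
        free(k);
        return NULL;
    }
    memcpy(k->header_line, header_line, len);
    k->next = map->buckets[i];
    map->buckets[i] = k;
    map->current = k;
    return k;
}

/* Free every keeper at once, as when the reader itself is freed. */
static void hk_map_free_all(hk_map_t* map) {
    assert(map != NULL);
    for (int i = 0; i < NBUCKETS; i++) {
        hk_t* k = map->buckets[i];
        while (k != NULL) {
            hk_t* next = k->next;
            free(k->header_line);
            free(k);
            k = next;
        }
        map->buckets[i] = NULL;
    }
    map->current = NULL;
}
```

Because a schema that reappears later in the stream maps back to the same
keeper, records retained by a mapper stay valid without any per-record
reference counts; the cost is that no header is released before the reader is.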

There is some code duplication involving single-character and multi-character
IRS, IFS, and IPS. While single-character is a special case of multi-character,
keeping separate implementations for single-character and multi-character
versions is worthwhile for performance. The difference is between `*p == ifs`
and `streqn(p, ifs, ifslen)`: even with function inlining, the latter is more
expensive than the former in the single-character case.
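
The two inner loops can be compared side by side in a minimal sketch. The
functions `count_fields_single` and `count_fields_multi` are hypothetical, and
the standard `strncmp` stands in for `streqn`, whose exact definition is not
shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Single-character separator: one byte comparison per input byte. */
static int count_fields_single(const char* line, char ifs) {
    assert(line != NULL);
    int n = 1;
    for (const char* p = line; *p; p++) {
        if (*p == ifs) n++;
    }
    return n;
}

/* Multi-character separator: a bounded string comparison per input byte.
 * Assumes ifslen == strlen(ifs). strncmp is a stand-in for streqn. */
static int count_fields_multi(const char* line, const char* ifs, size_t ifslen) {
    assert(line != NULL && ifs != NULL && ifslen > 0);
    int n = 1;
    const char* p = line;
    while (*p) {
        if (strncmp(p, ifs, ifslen) == 0) {
            n++;
            p += ifslen; /* step over the whole separator */
        } else {
            p++;
        }
    }
    return n;
}
```

Calling `count_fields_multi(line, ",", 1)` gives the same answer as
`count_fields_single(line, ',')`, but pays for a call-shaped comparison at
every byte, which is the overhead the separate implementations avoid.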

Example timing info for a million-line file is as follows:

```
TIME IN SECONDS 0.945 -- mlr --irs lf   --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.139 -- mlr --irs crlf --ifs ,  --ips =  check ../data/big.dkvp2
TIME IN SECONDS 1.291 -- mlr --irs lf   --ifs /, --ips =: check ../data/big.dkvp2
TIME IN SECONDS 1.443 -- mlr --irs crlf --ifs /, --ips =: check ../data/big.dkvp2
```

That is, even when averaged over multiple runs, special-casing the
single-character-separator code yields performance improvements of 20-30%,
which is worth doing.

Directory listing:

| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| Makefile.am | 24-Mar-2021 | 842 | 36 | 33 |
| Makefile.in | 24-Mar-2021 | 39.8 KiB | 768 | 666 |
| README.md | 24-Mar-2021 | 2.5 KiB | 46 | 38 |
| byte_reader.h | 24-Mar-2021 | 1.1 KiB | 31 | 13 |
| byte_readers.h | 24-Mar-2021 | 289 | 12 | 8 |
| file_ingestor_stdio.c | 24-Mar-2021 | 2.3 KiB | 71 | 62 |
| file_ingestor_stdio.h | 24-Mar-2021 | 1.2 KiB | 24 | 10 |
| file_reader_stdio.c | 24-Mar-2021 | 1.5 KiB | 54 | 48 |
| file_reader_stdio.h | 24-Mar-2021 | 428 | 12 | 5 |
| json_parser.c | 24-Mar-2021 | 25.8 KiB | 1,031 | 803 |
| json_parser.h | 24-Mar-2021 | 4.1 KiB | 163 | 86 |
| line_readers.c | 24-Mar-2021 | 5.8 KiB | 237 | 208 |
| line_readers.h | 24-Mar-2021 | 2.3 KiB | 71 | 50 |
| lrec_reader.h | 24-Mar-2021 | 1 KiB | 27 | 20 |
| lrec_reader_gen.c | 24-Mar-2021 | 2.5 KiB | 77 | 60 |
| lrec_reader_in_memory.c | 24-Mar-2021 | 2.5 KiB | 71 | 52 |
| lrec_reader_stdio_csv.c | 24-Mar-2021 | 22.5 KiB | 631 | 498 |
| lrec_reader_stdio_csvlite.c | 24-Mar-2021 | 17.4 KiB | 574 | 472 |
| lrec_reader_stdio_dkvp.c | 24-Mar-2021 | 11.8 KiB | 346 | 270 |
| lrec_reader_stdio_json.c | 24-Mar-2021 | 8.5 KiB | 217 | 132 |
| lrec_reader_stdio_nidx.c | 24-Mar-2021 | 9.8 KiB | 275 | 219 |
| lrec_reader_stdio_xtab.c | 24-Mar-2021 | 6 KiB | 187 | 147 |
| lrec_readers.c | 24-Mar-2021 | 1.9 KiB | 44 | 41 |
| lrec_readers.h | 24-Mar-2021 | 3.3 KiB | 56 | 39 |
| mlr_json_adapter.c | 24-Mar-2021 | 10.3 KiB | 321 | 273 |
| mlr_json_adapter.h | 24-Mar-2021 | 1.5 KiB | 31 | 11 |
| peek_file_reader.c | 24-Mar-2021 | 890 | 29 | 26 |
| peek_file_reader.h | 24-Mar-2021 | 3 KiB | 95 | 66 |
| stdio_byte_reader.c | 24-Mar-2021 | 2.9 KiB | 99 | 84 |
| string_byte_reader.c | 24-Mar-2021 | 1.9 KiB | 61 | 47 |