ReadMe.md
The include/ subdirectory contains two (LGPL3-licensed) major libraries, while
the immediate directory contains the PLINK 2.0 application built on top of
them. These are carefully written to be valid C99 (from gcc and clang's
perspective, anyway) to simplify FFI development, while still taking advantage
of quite a few C++-specific affordances to improve safety and occasionally
performance. They are currently x86-specific, but there are annotations to
facilitate a possible future port to ARM.

The first library is plink2_text, which provides a pair of classes designed to
replace std::getline(), fgets(), and similar ways of iterating over text lines.
Key properties:
* Instead of copying every line to your buffer, one at a time, these classes
  just return a pointer to the beginning of each line in the underlying binary
  stream, and give you access to a pointer to the end. In exchange, the line
  is invalidated when you iterate to the next one; it's like being forced to
  pass the same string to std::getline(), or the same buffer to fgets(), on
  every call. But whenever that's problematic, you can always copy the line
  before iterating to the next; on all systems I've seen, that *still* exhibits
  better throughput than getline/fgets. And in the many situations where
  there's no need to copy, you get a fundamentally lower-latency abstraction.
* They automatically detect and decompress gzipped and Zstd-compressed
  (https://facebook.github.io/zstd/ ) files, in a manner that works with pipe
  file descriptors.
* The primary TextStream class automatically reads *and decompresses* ahead for
  you. Decompression is even multithreaded by default when the file is
  BGZF-compressed. (And the textFILE class covers the setting where you don't
  want to launch any more threads.)
* They do not support network input as of this writing, but that would not be
  difficult to add. The existing code uses FILE* in a very straightforward
  manner.
* As for text parsing, the ScanadvDouble() utility function in the
  plink2_string component is a very efficient string-to-double converter.
  While it does not support perfect string<->double round-trips (that's what
  C++17 std::from_chars is for; https://abseil.io/ has a working implementation
  while we wait for gcc/clang...), or long-tail features like locale-specific
  decimal separators or hex floats, it has been incredibly useful for speeding
  up the basic job of scanning standard-locale printf("%g")-formatted and
  similar output. (Note that you lose roughly a billion times as much accuracy
  to %g's 6-digit limit as you do to imperfect string->double conversion in
  that setting.)
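
To make the pointer-returning design in the first bullet concrete, here is a
minimal sketch over an in-memory buffer. LineCursor, CopyCurrentLine(), and
CountLineBytes() are hypothetical names invented for this illustration; they
are not the plink2_text API, and the real classes additionally handle file
I/O, decompression, and read-ahead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical illustration only: NOT the plink2_text interface.
// Walks a caller-owned buffer and exposes each line as a [begin, end) pointer
// pair instead of copying it into a separate string.
class LineCursor {
 public:
  LineCursor(const char* buf, std::size_t len)
      : next_(buf), buf_end_(buf + len),
        line_begin_(nullptr), line_end_(nullptr) {}

  // Advances to the next line; returns false once the buffer is exhausted.
  // Pointers obtained for the previous line must be treated as invalid after
  // this call (a real reader would be free to refill or reuse its buffer).
  bool Next() {
    if (next_ == buf_end_) {
      return false;
    }
    line_begin_ = next_;
    const void* nl =
        std::memchr(next_, '\n', static_cast<std::size_t>(buf_end_ - next_));
    if (nl) {
      line_end_ = static_cast<const char*>(nl);
      next_ = line_end_ + 1;
    } else {
      // Final line without a trailing newline.
      line_end_ = buf_end_;
      next_ = buf_end_;
    }
    return true;
  }

  const char* begin() const { return line_begin_; }
  const char* end() const { return line_end_; }  // at '\n' or end of buffer

 private:
  const char* next_;
  const char* buf_end_;
  const char* line_begin_;
  const char* line_end_;
};

// When a line must outlive the next call to Next(), copy it explicitly.
inline std::string CopyCurrentLine(const LineCursor& cur) {
  return std::string(cur.begin(), cur.end());
}

// Counts lines and total line bytes (newlines excluded) without copying.
inline std::size_t CountLineBytes(const char* buf, std::size_t len,
                                  std::size_t* line_ct) {
  assert(line_ct != nullptr);
  LineCursor cur(buf, len);
  std::size_t byte_ct = 0;
  *line_ct = 0;
  while (cur.Next()) {
    byte_ct += static_cast<std::size_t>(cur.end() - cur.begin());
    ++(*line_ct);
  }
  return byte_ct;
}
```

The common path (CountLineBytes) never copies a line; CopyCurrentLine() is the
explicit opt-in for the cases where a line has to survive the next iteration.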

(Coming soon: example text-processing programs using plink2_text.)
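
The format auto-detection listed among the key properties can be illustrated
independently of plink2_text. The sketch below classifies a stream from its
leading bytes using the published gzip, BGZF, and Zstandard magic numbers.
FileKind and ClassifyHeader() are hypothetical names, not library functions,
and this is not necessarily how plink2_text performs the check. Because it
only inspects bytes that have already been read, an approach like this needs
no seek or rewind, which is one way to stay compatible with pipes.

```cpp
#include <cassert>
#include <cstddef>

enum class FileKind { kUncompressed, kGzip, kBgzf, kZstd };

// Hypothetical helper (not part of plink2_text): classifies a stream from the
// first bytes already sitting in a read buffer.
inline FileKind ClassifyHeader(const unsigned char* buf, std::size_t len) {
  // Zstandard frame magic number 0xFD2FB528, stored little-endian.
  // (Skippable frames use a different magic and are not handled here.)
  if (len >= 4 && buf[0] == 0x28 && buf[1] == 0xb5 && buf[2] == 0x2f &&
      buf[3] == 0xfd) {
    return FileKind::kZstd;
  }
  // Every gzip member starts with ID1 = 0x1f, ID2 = 0x8b (RFC 1952).
  if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b) {
    // BGZF is gzip with CM = 8 (deflate), the FEXTRA flag set, and a 'B','C'
    // extra subfield of length 2.  Assumption: the subfield comes first and
    // XLEN is 6, which is what bgzip writes.
    if (len >= 16 && buf[2] == 8 && (buf[3] & 4) && buf[10] == 6 &&
        buf[11] == 0 && buf[12] == 'B' && buf[13] == 'C' && buf[14] == 2 &&
        buf[15] == 0) {
      return FileKind::kBgzf;
    }
    return FileKind::kGzip;
  }
  return FileKind::kUncompressed;
}
```

Distinguishing BGZF from plain gzip matters for the read-ahead bullet above:
BGZF files are made of independently compressed blocks, which is what makes
multithreaded decompression possible.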

The second library is pgenlib. This supports reading and writing of PLINK 2.x
genotype files (".pgen"). A draft specification for this format is under
https://github.com/chrchang/plink-ng/tree/master/pgen_spec ; here are some key
properties:
* A PLINK 1 .bed is a valid .pgen.
* In addition, .pgen can represent multiallelic, phased, and/or dosage
  information. As of this writing, software support for multiallelic dosages
  does not exist yet, but it does for the other attribute pairs
  (multiallelic+phased, phased+dosage).
* **.pgen CANNOT represent genotype probability triplets. It also cannot store
  read depths, per-call quality scores, etc.** While plink2 can *filter* on
  the aforementioned BGEN/VCF fields during import, it cannot re-export or do
  anything else with them. Use other software, such as bcftools
  (https://samtools.github.io/bcftools/bcftools.html ) or qctool2
  (www.well.ox.ac.uk/~gav/qctool_v2/ ) when you must retain any of these
  fields.
* .pgen is compressed, but in a domain-specific manner that supports very fast
  compression and decompression. It is even practical to perform several key
  computations (e.g. allele frequency) directly on the compressed
  representation, and this capability is exposed by the pgenlib library.
* Python/pgenlib.pyx is the Python wrapper (see Python/python_api.txt for
  details), and pgenlibr/ is the R wrapper. These are somewhat incomplete as
  of this writing, but it would not take much effort to fill in key components;
  that work is scheduled for roughly the time of the beta release, but if you
  could really use a specific feature earlier, you have good odds of getting it
  by asking at https://groups.google.com/forum/#!forum/plink2-dev .
  (plink2-dev is also the place to ask other questions about any of this code.)
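
As a rough illustration of computing on the packed representation, the sketch
below counts genotypes for one variant stored in the PLINK 1 .bed layout (four
samples per byte, low-order bit pair first; 00 = homozygous for the first
allele, 01 = missing, 10 = heterozygous, 11 = homozygous for the second
allele) without unpacking individual calls. GenoCounts,
CountPackedGenotypes(), and A2Freq() are hypothetical names, not pgenlib
functions. The sketch assumes the caller has already skipped the .bed header
and that trailing pad bits are zero, and it ignores .pgen's other storage
modes.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct GenoCounts {
  std::uint32_t hom_a1;
  std::uint32_t het;
  std::uint32_t hom_a2;
  std::uint32_t missing;
};

inline std::uint32_t Popcount64(std::uint64_t x) {
  return static_cast<std::uint32_t>(std::bitset<64>(x).count());
}

// Hypothetical helper (not the pgenlib API).  `packed` holds one variant's
// 2-bit genotype codes; pad bits past sample_ct are assumed to be zero.
inline GenoCounts CountPackedGenotypes(const unsigned char* packed,
                                       std::uint32_t sample_ct) {
  const std::uint64_t kMask = 0x5555555555555555ULL;
  const std::size_t byte_ct = (static_cast<std::size_t>(sample_ct) + 3) / 4;
  GenoCounts counts = {0, 0, 0, 0};
  for (std::size_t offset = 0; offset < byte_ct; offset += 8) {
    // Assemble up to 8 bytes (32 genotypes) into one little-endian word.
    std::uint64_t word = 0;
    for (std::size_t i = 0; i < 8 && offset + i < byte_ct; ++i) {
      word |= static_cast<std::uint64_t>(packed[offset + i]) << (8 * i);
    }
    const std::uint64_t lo = word & kMask;         // low bit of each pair
    const std::uint64_t hi = (word >> 1) & kMask;  // high bit of each pair
    counts.hom_a2 += Popcount64(hi & lo);          // code 11
    counts.het += Popcount64(hi & ~lo);            // code 10
    counts.missing += Popcount64(lo & ~hi);        // code 01
  }
  // Every remaining real sample has code 00; zero padding is never counted
  // because this is derived from sample_ct rather than from the bits.
  counts.hom_a1 = sample_ct - counts.hom_a2 - counts.het - counts.missing;
  return counts;
}

// Frequency of the second allele among non-missing calls, or -1.0 if every
// call is missing.
inline double A2Freq(const GenoCounts& c) {
  const std::uint32_t obs_ct = c.hom_a1 + c.het + c.hom_a2;
  if (obs_ct == 0) {
    return -1.0;
  }
  return (2.0 * c.hom_a2 + c.het) / (2.0 * obs_ct);
}
```

Each loop iteration summarizes 32 genotypes with three population counts,
which is the kind of shortcut that a fixed-width packed encoding allows.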

As for the PLINK 2.0 application:
* build_dynamic/ contains a Makefile suitable for Linux and macOS dynamic
  builds. On Linux, if Intel MKL is installed using the instructions at e.g.
  https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo ,
  you can dynamically link to it.
* build_win/ contains a Makefile for producing static Windows builds. This
  requires MinGW[-w64] and zlib; a prebuilt OpenBLAS package from
  https://sourceforge.net/projects/openblas/files/ is also strongly
  recommended.
* GPUs are not exploited, and there are currently no plans to write much
  GPU-specific code before PLINK 2.0's core function set is completed around
  2021. However, a few linear-algebra-heavy workloads may benefit
  significantly from a simple replacement of Intel MKL by cuBLAS + cuSOLVER.
  This can probably be supported earlier; feel free to open a GitHub issue
  about it if it would make a big difference to you.
* The LGPL3-licensed plink2_stats component may be of independent interest. It
  includes a function for computing the 2x2 Fisher's exact test p-value in
  approximately O(sqrt(n)) time--much faster than the O(n) algorithms employed
  by other libraries as of this writing--as well as several log-p-value
  computations (Z-score/chi-square, T-test, F-test) that remain accurate well
  beyond the limits of most other statistical library functions. (No, you
  don't want to take a 10^{-1000000} p-value literally, but it can be useful to
  distinguish it from 10^{-325}, and both of these numbers can naturally arise
  when analyzing biobank-scale data.)
* More documentation is at www.cog-genomics.org/plink/2.0/ .
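
To show why log-p-values matter in the far tail: the smallest positive
IEEE-754 double is about 4.9e-324, so a p-value of 10^{-325} evaluates to
exactly zero when computed directly. The sketch below returns the logarithm
of a one-sided Z-score p-value instead, switching to the textbook asymptotic
expansion of the normal tail for large z. LogUpperTailNormal() and
Log10UpperTailNormal() are hypothetical names, and this is a generic
approach rather than the plink2_stats implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not plink2_stats): natural log of P(Z > z) for a
// standard normal Z.
inline double LogUpperTailNormal(double z) {
  const double kHalfLog2Pi = 0.91893853320467274178;  // 0.5 * ln(2 * pi)
  const double kInvSqrt2 = 0.70710678118654752440;    // 1 / sqrt(2)
  if (z < 10.0) {
    // The p-value itself is comfortably representable in this range.
    return std::log(0.5 * std::erfc(z * kInvSqrt2));
  }
  // Asymptotic expansion for large z:
  //   P(Z > z) ~ phi(z) / z * (1 - 1/z^2 + 3/z^4 - 15/z^6 + 105/z^8)
  // where phi is the standard normal density.  For z >= 10 the first omitted
  // term is below 1e-7 in relative size.
  const double u = 1.0 / (z * z);
  const double series =
      1.0 + u * (-1.0 + u * (3.0 + u * (-15.0 + u * 105.0)));
  return -0.5 * z * z - kHalfLog2Pi - std::log(z) + std::log(series);
}

// Same quantity as a base-10 exponent, e.g. about -349.4 for z = 40.
inline double Log10UpperTailNormal(double z) {
  return LogUpperTailNormal(z) / 2.30258509299404568402;  // ln(10)
}
```

For z = 40, 0.5 * erfc(z / sqrt(2)) underflows to zero, while the log-scale
version still reports an exponent near -349.4. A z of 2146 gives an exponent
just below -1000000, so the two cases from the bullet above stay
distinguishable.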