README.md

Io_lib:  Version 1.14.10
========================

Io_lib is a library of file reading and writing code that provides a
general purpose SAM/BAM/CRAM, trace file (and Experiment File) reading
interface.  Programmatically, {S,B,CR}AM files can be manipulated using
the scram_*() API functions, while DNA Chromatogram ("trace") files can
be read using the read_reading() function.

It has been compiled and tested on a variety of Unix systems, MacOS X
and MS Windows.

The directories below here contain the io_lib code. These support the
following file formats:

	SAM/BAM sequence files
	CRAM sequence files
	SCF trace files
	ABI trace files
	ALF trace files
	ZTR trace files
	SFF trace archives
	SRF trace archives
	Experiment files
	Plain text files

These link together to form a single "libstaden-read" library supporting
all the file formats via a single read_reading (or fread_reading or
mfread_reading) function call, with analogous write_reading functions
too. See the file include/Read.h for the generic 'Read' structure.

See the CHANGES file for a summary of older updates, or the git logs
for the full details.


Version 1.14.10 (26th September 2018)
-------------------------------------

Updates:

* BAM: Libdeflate support (https://github.com/ebiggers/libdeflate).
  This library is significantly faster than zlib, so it is a good
  alternative to the Cloudflare and/or Intel libraries.

  See below for details.

* CRAM *EXPERIMENTAL*: Added custom quality and identifier codecs.
  Also added the ability to use libbsc as a general purpose codec.

  These are NOT OFFICIAL and so are not enabled by default (the
  default remains version 3.0).  However, as a technology demonstration
  only, they are available with scramble -V3.1 or -V4.0 for evaluation
  and to promote discussion on future CRAM formats.  Do not use these
  on production data.

  Implementations of the codecs and CRAM version 4.0 layout are liable
  to change without prior warning.

* CRAM: name sorted files now automatically switch to non-ref mode.

Bug fixes:

* CRAM: Considerable fixes to multi-threading.
  - Using more than 1 slice per container with threading now works.
  - Removal of race conditions when using CRAM_OPT_REQUIRED_FIELDS.
  - Combinations of ref and no-ref mode in adjacent containers.
  - Other misc. threading bugs.

* Corrected end-of-range check in some scenarios.

* CRAM: bug fix to index creation when a slice contains exactly one
  alignment.

* SAM: fixed parsing of illegal sequence characters (e.g. "Z").
  These are now treated as "N" and not "=".

* BAM/SAM: protect against out of bound CIGAR operations.

* CRAM: hardening of rANS codec against malicious input.
  Also fixed a very rare frequency renormalisation case.

* CRAM: fix for range queries used in conjunction with turning off
  sequence retrieval (via CRAM_OPT_REQUIRED_FIELDS).

* Improved test harness for Windows and some header file problems.

* Fixed bgzip on big endian systems. (Debian bugs 876839, 876840)


Technology Demo: CRAM 3.1 and 4.0
=================================

The current official GA4GH CRAM version is 3.0.

For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
version 3.1, with new compression codecs (but otherwise an identical
file layout to 3.0), and 4.0 with a few additional format
modifications, such as 64-bit sizes.

They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.

By default enabling either of these will also enable the new codecs,
bar libbsc (see below for how to compile with this).  These new codecs
are slower, but will not be used at lighter levels of compression.  So
for example "scramble -V4.0 -4 in.bam out.cram" will use only the
codecs available in CRAM 3.0 plus the fast new rANS variants (rANS++).
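
As a rough illustration of how the version and level options combine, the sketch below assembles scramble command lines and prints them rather than running them. It uses only flags that appear in this README (-V, the level digit and -t); the `scramble_cmd` helper is a hypothetical name for this example and is not part of io_lib.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a scramble command line.
# Arguments: CRAM version, level digit (may be empty), threads, input, output.
scramble_cmd() {
    version=$1
    level=$2
    threads=$3
    input=$4
    output=$5
    cmd="scramble -V$version"
    # An empty level leaves scramble at its default compression level.
    [ -n "$level" ] && cmd="$cmd -$level"
    cmd="$cmd -t$threads $input $output"
    printf '%s\n' "$cmd"
}

# Light level: only the CRAM 3.0 codecs plus rANS++ are expected to be used.
scramble_cmd 4.0 4 4 in.bam out.cram
# Default level: the slower new codecs become eligible.
scramble_cmd 3.1 "" 4 in.bam out.cram
```

Printing the command first makes it easy to check the options before committing to a long encode; remember that 3.1 and 4.0 output is for evaluation only.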

Here are some example file sizes and timings with different codecs and
levels on 10 million NovaSeq reads, with 4 threads (-t4).  Decode
timing is checked using "scram_flagstat -b -t4".  Tests were performed
on an Intel i5-4570 processor at 3.2GHz.

    Scramble opts.   Size         Enc(s)   Dec(s)    Codecs used
    -V3.0            224743050    12.9      3.8      (default)
    -V3.0 -7jZ       211734953   105.9      5.4      bzip2, lzma

    -V3.1 -4         226888980    13.2      3.8      rANS++
    -V3.1            187238214    35.8     12.8      tok3,fqz,rANS++
    -V3.1 -7J        180217109    49.2     25.6      tok3,fqz,rANS++,libbsc

    -V4.0 -4         211515487    15.6      3.8      rANS++
    -V4.0            182657527    34.9     13.5      tok3,fqz,rANS++
    -V4.0 -7J        178819704    46.5     19.6      tok3,fqz,rANS++,libbsc
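
To put the sizes above in proportion, the following sketch expresses a file size as a percentage of the default -V3.0 result. The `relative_size` helper is a hypothetical name used only for this illustration; the byte counts are copied from the table.

```shell
#!/bin/sh
# Hypothetical helper: size as a percentage of a baseline, to one decimal place.
relative_size() {
    awk -v base="$1" -v size="$2" 'BEGIN { printf "%.1f\n", 100 * size / base }'
}

baseline=224743050                      # -V3.0 at the default level
relative_size "$baseline" 187238214     # -V3.1 at the default level
relative_size "$baseline" 178819704     # -V4.0 -7J
```

On these figures the default -V3.1 output is about 83% of the -V3.0 size and -V4.0 -7J about 80%, at the cost of the longer encode and decode times listed in the table.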


Building
========

Prerequisites
-------------

You will need a C compiler, a Unix "make" program plus zlib, bzip2 and
lzma libraries and associated development packages (including C header
files).  The appropriate operating system package names and commands
differ per system.  On Debian Linux derived systems use the command
below (or build and install your own copies from source):

    sudo apt-get install make zlib1g-dev libbz2-dev liblzma-dev

On RedHat derived systems the package names differ:

    sudo yum install make zlib-devel bzip2-devel xz-devel
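
If you script your setup, one possible approach is to pick the command by which package manager is present. This is only a sketch: `prereq_install_cmd` is a hypothetical helper, it prints the command instead of running it, and it knows only the two families listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the install command for a distribution family.
prereq_install_cmd() {
    case $1 in
        debian) echo "sudo apt-get install make zlib1g-dev libbz2-dev liblzma-dev" ;;
        redhat) echo "sudo yum install make zlib-devel bzip2-devel xz-devel" ;;
        *)      echo "unknown family: $1" >&2; return 1 ;;
    esac
}

# Guess the family from the package manager found on PATH.
if command -v apt-get >/dev/null 2>&1; then
    prereq_install_cmd debian
elif command -v yum >/dev/null 2>&1; then
    prereq_install_cmd redhat
else
    echo "No apt-get or yum found; install the prerequisites manually."
fi
```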


Zlib
----

This code makes heavy use of the Deflate algorithm, assuming a Zlib
interface.  The native Zlib bundled with most systems is now rather
old and better optimised versions exist for certain platforms
(e.g. using the SSE instructions on Intel and AMD CPUs).

Therefore the --with-zlib=/path/to/zlib configure option may be used
to point to a different Zlib.  I have tested it with the vanilla zlib,
Intel's zlib and CloudFlare's Zlib.  Of the three it appears the
CloudFlare one has the quickest implementation, but mileage may vary
depending on OS and CPU.

    CloudFlare: https://github.com/cloudflare/zlib
    Intel:      https://github.com/jtkukunas/zlib
    Zlib-ng:    https://github.com/Dead2/zlib-ng

The Zlib-ng one needs configuring with --zlib-compat and when you
build Io_lib you will need to define -DWITH_GZFILEOP too.  It also
doesn't work well when used in conjunction with LD_PRELOAD. Therefore
I wouldn't recommend it for now.

If you are using the CloudFlare implementation, you may also want to
disable the CRC implementation in this code if your CloudFlare zlib
was built with PCLMUL support as their implementation is faster.
Otherwise the CRC here is quicker than Zlib's own version.
Building io_lib with the internal CRC code disabled is done
with ./configure --disable-own-crc (or CFLAGS=-UIOLIB_CRC).


Libdeflate
----------

The BAM reading and writing also has optional support for the
libdeflate library (https://github.com/ebiggers/libdeflate).  This can
be used instead of an optimised zlib (see above), and generally is
slightly faster.  Build using:

    ./configure --with-libdeflate=/path
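
The Zlib and Libdeflate choices above can be combined on one configure line. The sketch below assembles such a line from the three options documented here and prints it for review. The `configure_cmd` helper and the `/opt/...` paths are hypothetical and stand in for wherever you installed the libraries.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a ./configure command line.
# Arguments: zlib directory, libdeflate directory, "yes" to disable own CRC.
configure_cmd() {
    zlib_dir=$1
    libdeflate_dir=$2
    disable_own_crc=$3
    cmd="./configure"
    [ -n "$zlib_dir" ] && cmd="$cmd --with-zlib=$zlib_dir"
    [ -n "$libdeflate_dir" ] && cmd="$cmd --with-libdeflate=$libdeflate_dir"
    # Only worthwhile when the chosen zlib was built with PCLMUL support.
    [ "$disable_own_crc" = "yes" ] && cmd="$cmd --disable-own-crc"
    printf '%s\n' "$cmd"
}

# CloudFlare zlib with its own faster CRC (path is an assumption).
configure_cmd /opt/cloudflare-zlib "" yes
# System zlib, with libdeflate used for BAM (path is an assumption).
configure_cmd "" /opt/libdeflate no
```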


Git clone
---------

We recommend building from a release tarball, which has the configure
script already created for you.  However if you wish to build from the
latest code and have done a "git clone" then you will need to create
the configure script yourself using autotools:

    autoreconf -i

This program may not be on your system.  If it fails, then install
autoconf, automake and libtool packages; see above for example
OS-specific installation commands.
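
A quick way to see which autotools pieces are missing before running autoreconf is sketched below. The `missing_tools` helper is a hypothetical name, and the sketch assumes the libtool package installs its bootstrap helper as `libtoolize`, which is the usual name on Linux but may differ elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: libtool provides "libtoolize" on this system.
missing=$(missing_tools autoreconf automake libtoolize)
if [ -z "$missing" ]; then
    echo "autotools found; run: autoreconf -i"
else
    # Unquoted on purpose so the names print on one line.
    echo "install the packages providing:" $missing
fi
```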


Linux
-----

We use the GNU autoconf build mechanism.

To build:

1. ./configure

"./configure --help" will give a list of the options for GNU autoconf. For
modifying the compiler options or flags you may wish to redefine the CC or
CFLAGS variable.

E.g. (in sh or bash):

    CC=cc CFLAGS=-g ./configure

2. make (or gmake)

This will build the sources.

CFLAGS may also be changed at build time using (e.g.):

    make 'CFLAGS=-g ...'

3. make install

The default installation location is /usr/local/bin and /usr/local/lib. These
can be changed with the --prefix option to "configure".


Windows
-------

Under Microsoft Windows we recommend the use of MSYS and MINGW as a
build environment.

These contain enough tools to build using the configure script as per
Linux. The latest msys can be downloaded here:

    http://repo.msys2.org/distrib/msys2-x86_64-latest.exe

Once installed and set up ("pacman -Syu"; close window & relaunch msys;
"pacman -Syu" again), install mingw64 compilers via "pacman -S
--needed man base-devel git mingw-w64-x86_64-toolchain".

This should then be sufficient to configure and compile.  However note
that you may need to use "./configure --disable-shared" for the test
harness to work due to deficiencies in the libtool wrapper script.

If you wish to use Microsoft Visual Studio you may need to add the
MSVC_includes subdirectory to your C include search path.  This
adds several missing header files (e.g. unistd.h and sys/time.h) needed
to build this software.  We do not have an MSVC project file available
and have not tested the build under this environment for a number of
years.

In this case you will also need to copy io_lib/os.h.in to io_lib/os.h
and either remove the @SET_ENDIAN@ and adjacent @ lines (as these are
normally filled out for you by autoconf) or add -DNO_AUTOCONF to your
compiler options.

The code should also build cleanly under a cross-compiler.  This has
not been tested recently, but a past successful invocation was:

    ./configure \
            --host=x86_64-w64-mingw32 \
            --prefix=$DIST \
            --with-io_lib=$DIST \
            --with-tcl=$DIST/lib \
            --with-tk=$DIST/lib \
            --with-tklib=$DIST/lib/tklib0.5 \
            --with-zlib=$DIST \
            LDFLAGS=-L$DIST/lib

with $DIST being pre-populated with already built and installed 3rd
party dependencies, some from MSYS mentioned above.


Libbsc
------

This is experimental, just to see what we can get with a high quality
compression engine in CRAM.  It's hard to build right now, especially
given it's a C++ library and our code is C.  The hacky solution for
now (on Linux) is e.g.:

    ../configure \
      CPPFLAGS=-I$HOME/ftp/compression/libbsc \
      LDFLAGS="-L$HOME/ftp/compression/libbsc -fopenmp" \
      LIBS=-lstdc++

Enable it using scramble -J, but note this requires experimental CRAM
versions 3.1 or 4.0.

** Neither of these should be used for production data. **


MacOS X
-------

The configure script should work by default, but if you are attempting
to build FAT binaries to work on both i386 and ppc targets you'll need
to disable dependency tracking, i.e.:

    CFLAGS="-arch i386 -arch ppc" LDFLAGS="-arch i386 -arch ppc" \
      ../configure --disable-dependency-tracking
306