# peak-classifier

## Description

peak-classifier classifies ChIP/ATAC-Seq peaks based on features provided in
a GFF file.

Peaks are provided in a BED file sorted by chromosome and position.  Typically
these are output from a peak caller such as MACS2, or the differential
analysis that follows.  The GFF must also be sorted by chromosome, position,
and subfeature, which is the default for common data sources.
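
If a peak file is not already sorted, the standard sort command is usually
enough.  Below is a minimal sketch, assuming a tab-separated BED file; the file
names and contents are made up for illustration.

```sh
# Hypothetical three-peak BED file, deliberately out of order
printf 'chr2\t500\t900\nchr1\t3000\t3400\nchr1\t100\t450\n' > unsorted-peaks.bed

# Sort by chromosome (column 1, as a string), then by start (column 2, numeric)
LC_ALL=C sort -k1,1 -k2,2n unsorted-peaks.bed > sorted-peaks.bed

cat sorted-peaks.bed
```

Chromosome names are compared as plain strings here, so chr10 sorts before
chr2.  It's worth checking that this matches the chromosome order in your GFF.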

Peak-classifier generates features that are not explicitly identified in the
GFF, such as introns and potential promoter regions, and outputs the augmented
feature list to a BED file.  It then identifies overlapping features by
running bedtools intersect on the augmented feature list and peak list,
outputting an annotated BED-like TSV file with additional columns to describe
the feature.  If a peak overlaps multiple features, a separate line is output
for each.
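
The idea can be illustrated with a toy awk sketch.  This is not
peak-classifier's actual code (which is written in C and calls bedtools
intersect); the file names, the sample coordinates, and the simplification to
BED-style half-open intervals are all assumptions made for the example.

```sh
# Hypothetical exons of one gene, BED-style (0-based, half-open), sorted
printf 'chr1\t100\t200\texon\nchr1\t300\t400\texon\nchr1\t600\t700\texon\n' \
    > toy-exons.bed

# Hypothetical peaks, also sorted
printf 'chr1\t150\t250\tpeak1\nchr1\t450\t500\tpeak2\nchr1\t800\t900\tpeak3\n' \
    > toy-peaks.bed

# Step 1: augment the feature list.  Before each exon, emit the gap since
# the previous exon on the same chromosome as an intron.
awk 'BEGIN { OFS = "\t" }
{
    if (NR > 1 && $1 == chrom && $2 + 0 > prev_end)
        print chrom, prev_end, $2, "intron"
    print $1, $2, $3, $4
    chrom = $1
    prev_end = $3 + 0
}' toy-exons.bed > toy-features.bed

# Step 2: print one line per (peak, feature) overlap.  Two half-open
# intervals overlap when each one starts before the other ends.
awk 'BEGIN { OFS = "\t"; n = 0 }
NR == FNR {
    fc[n] = $1; fs[n] = $2 + 0; fe[n] = $3 + 0; ft[n] = $4
    n++
    next
}
{
    for (i = 0; i < n; i++)
        if ($1 == fc[i] && $2 + 0 < fe[i] && fs[i] < $3 + 0)
            print $1, $2, $3, $4, ft[i]
}' toy-features.bed toy-peaks.bed > toy-annotated.tsv

cat toy-annotated.tsv
```

Here peak1 spans an exon and the following intron, so it appears on two lines,
while peak3 overlaps nothing and is not reported.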

Alternative approaches to this problem include R scripting with a tool such
as ChIPpeakAnno or multistage processing of the GFF using awk and bedtools.

In contrast, peak-classifier is a simple Unix command that takes a BED file
and a GFF file as inputs and reports all peak classifications in a matter of
seconds.

Admittedly, an optimal C program isn't really necessary to solve this problem,
since the crappiest implementation I can imagine would take hours at most
to run for a typical ATAC-Seq peak set.  However:

* It's an opportunity to develop and test biolibc code that will be
  useful for other problems and bigger data
* It's more about making peak classification convenient than fast
* It never hurts to hone your C skills
* There's no such thing as a program that's too fast

## Design and Implementation

The code is organized following basic object-oriented design principles, but
implemented in C to minimize overhead and keep the source code accessible to
scientists who don't have time to master the complexities of C++.

Structures are treated as classes, with accessor and mutator functions
(or macros) provided, so dependent applications and libraries need not access
structure members directly.  Since the C language cannot enforce this, it's
up to application programmers to exercise self-discipline.

## Building and installing

peak-classifier is intended to build cleanly in any POSIX environment on
any CPU architecture.  Please don't hesitate to open an issue if you
encounter problems on any Unix-like system.

Primary development is done on FreeBSD with clang, but the code is frequently
tested on CentOS, MacOS, and NetBSD as well.  MS Windows is not supported,
unless using a POSIX environment such as Cygwin or Windows Subsystem for Linux.

The Makefile is designed to be friendly to package managers, such as
[Debian packages](https://www.debian.org/distrib/packages),
[FreeBSD ports](https://www.freebsd.org/ports/),
[MacPorts](https://www.macports.org/), [pkgsrc](https://pkgsrc.org/), etc.
End users should install via one of these if at all possible.

I maintain a FreeBSD port and a pkgsrc package.

### Installing peak-classifier on FreeBSD

FreeBSD is a highly underrated platform for scientific computing, with over
1,900 scientific libraries and applications in the FreeBSD ports collection
(of more than 30,000 total), a modern clang compiler, a fully integrated ZFS
filesystem, and renowned security, performance, and reliability.
FreeBSD has a somewhat well-earned reputation for being difficult to set up
and manage compared to user-friendly systems like [Ubuntu](https://ubuntu.com/).
However, if you're a little bit Unix-savvy, you can very quickly set up a
workstation, laptop, or VM using
[desktop-installer](http://www.acadix.biz/desktop-installer.php).  If
you're new to Unix, you can also reap the benefits of FreeBSD by running
[GhostBSD](https://ghostbsd.org/), a FreeBSD distribution augmented with a
graphical installer and management tools.  GhostBSD does not offer as many
options as desktop-installer, but it may be more comfortable for Unix novices.

```
pkg install peak-classifier
```

### Installing via pkgsrc

pkgsrc is a cross-platform package manager that works on any Unix-like
platform. It is native to [NetBSD](https://www.netbsd.org/) and well-supported
on [Illumos](https://illumos.org/), [MacOS](https://www.apple.com/macos/),
[RHEL](https://www.redhat.com)/[CentOS](https://www.centos.org/), and
many other Linux distributions.
Using pkgsrc does not require admin privileges.  You can install a pkgsrc
tree in any directory to which you have write access and easily install any
of the nearly 20,000 packages in the collection.  The
[auto-pkgsrc-setup](http://netbsd.org/~bacon/) script can assist you with
basic setup.

First bootstrap pkgsrc using auto-pkgsrc-setup or any
other method.  Then run the following commands:

```
cd pkgsrc-dir/biology/peak-classifier
bmake install clean
```

There may also be binary packages available for your platform.  If this is
the case, you can install by running:

```
pkgin install peak-classifier
```

See the [Joyent Cloud Services Site](https://pkgsrc.joyent.com/) for
available package sets.

### Building peak-classifier locally

Below are caveman install instructions for development purposes, not
recommended for regular use.

peak-classifier depends on [biolibc](https://github.com/auerlab/biolibc).
Install biolibc before attempting to build peak-classifier.

1. Clone the repository
2. Run "make depend" to update Makefile.depend
3. Run "make install"

The default install prefix is ../local.  Clone peak-classifier, biolibc, and
dependent apps into sibling directories so that ../local represents a common
path to all of them.

To facilitate incorporation into package managers, the Makefile respects
standard make/environment variables such as CC, CFLAGS, PREFIX, etc.

Add-on libraries required for the build, such as biolibc, should be found
under ${LOCALBASE}, which defaults to ../local.
The library, headers, and man pages are installed under
${DESTDIR}${PREFIX}.  DESTDIR is empty by default and is primarily used by
package managers to stage installations.  PREFIX defaults to ${LOCALBASE}.
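
To make the interaction of these three variables concrete, here is a small
shell sketch of the defaults described above.  The show_target function is
hypothetical and only mimics the documented defaults; it is not taken from
the Makefile.

```sh
# show_target DESTDIR PREFIX LOCALBASE
# Hypothetical helper: prints ${DESTDIR}${PREFIX} after applying the
# defaults described above.  Pass "" for any variable left unset.
show_target() (
    destdir="$1"                    # empty by default
    localbase="${3:-../local}"      # LOCALBASE defaults to ../local
    prefix="${2:-$localbase}"       # PREFIX defaults to LOCALBASE
    printf '%s%s\n' "$destdir" "$prefix"
)

show_target "" "" ""                    # prints ../local
show_target "" "" /myprefix             # prints /myprefix
show_target /tmp/stage /usr/local ""    # prints /tmp/stage/usr/local
```

The last call shows the package-manager case: files are staged under
/tmp/stage but laid out as if installed to /usr/local.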

To install directly to /myprefix, assuming biolibc is installed there as well,
using a make variable:

```
make LOCALBASE=/myprefix clean depend install
```

Using an environment variable:

```
# C-shell and derivatives
setenv LOCALBASE /myprefix
make clean depend install

# Bourne shell and derivatives
LOCALBASE=/myprefix
export LOCALBASE
make clean depend install
```

View the Makefile for full details.
