[![Build Status](https://travis-ci.com/broadinstitute/gatk.svg?branch=master)](https://travis-ci.com/broadinstitute/gatk)
[![Maven Central](https://img.shields.io/maven-central/v/org.broadinstitute/gatk.svg)](https://maven-badges.herokuapp.com/maven-central/org.broadinstitute/gatk)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

***Please see the [GATK website](http://www.broadinstitute.org/gatk), where you can download a precompiled executable, read documentation, ask questions, and receive technical support. For GitHub basics, see [here](https://software.broadinstitute.org/gatk/documentation/article?id=23405).***

### GATK 4

This repository contains the next generation of the Genome Analysis Toolkit (GATK). The contents
of this repository are 100% open source and released under the Apache 2.0 license (see [LICENSE.TXT](https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT)).

GATK4 aims to bring together well-established tools from the [GATK](http://www.broadinstitute.org/gatk) and
[Picard](http://broadinstitute.github.io/picard/) codebases under a streamlined framework,
and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using
[Apache Spark](http://spark.apache.org/). It also contains many newly developed tools not present in earlier
releases of the toolkit.
## Table of Contents
* [Requirements](#requirements)
* [Quick Start Guide](#quickstart)
* [Downloading GATK4](#downloading)
* [Building GATK4](#building)
* [Running GATK4](#running)
    * [Passing JVM options to gatk](#jvmoptions)
    * [Passing a configuration file to gatk](#configFileOptions)
    * [Running GATK4 with inputs on Google Cloud Storage](#gcs)
    * [Running GATK4 Spark tools locally](#sparklocal)
    * [Running GATK4 Spark tools on a Spark cluster](#sparkcluster)
    * [Running GATK4 Spark tools on Google Cloud Dataproc](#dataproc)
    * [Using R to generate plots](#R)
    * [GATK Tab Completion for Bash](#tab_completion)
* [For GATK Developers](#developers)
    * [General guidelines for GATK4 developers](#dev_guidelines)
    * [Testing GATK4](#testing)
    * [Using Git LFS to download and track large test data](#lfs)
    * [Creating a GATK project in the IntelliJ IDE](#intellij)
    * [Setting up debugging in IntelliJ](#debugging)
    * [Updating the IntelliJ project when dependencies change](#intellij_gradle_refresh)
    * [Setting up profiling using JProfiler](#jprofiler)
    * [Uploading Archives to Sonatype](#sonatype)
    * [Building GATK4 Docker images](#docker_building)
    * [Releasing GATK4](#releasing_gatk)
    * [Generating GATK4 documentation](#gatkdocs)
    * [Generating GATK4 WDL Wrappers](#gatkwdlgen)
    * [Using Zenhub to track github issues](#zenhub)
* [Further Reading on Spark](#spark_further_reading)
* [How to contribute to GATK](#contribute)
* [Discussions](#discussions)
* [Authors](#authors)
* [License](#license)
## <a name="requirements">Requirements</a>
* To run GATK:
    * Java 8 is needed to run or build GATK. We recommend either of the following:
        * OpenJDK 8 with HotSpot from [AdoptOpenJDK](https://adoptopenjdk.net/)
        * [OracleJDK 8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html),
          which requires an Oracle account to download and comes with restrictive [license conditions](https://www.oracle.com/downloads/licenses/javase-license1.html).
    * Python 2.6 or greater (required to run the `gatk` frontend script)
    * Python 3.6.2, along with a set of additional Python packages, is required to run some tools and workflows.
      See [Python Dependencies](#python) for more information.
    * R 3.2.5 (needed for producing plots in certain tools)
* To build GATK:
    * A Java 8 JDK
    * Git 2.5 or greater
    * [git-lfs](https://git-lfs.github.com/) 1.1.0 or greater. Required to download the large files used to build GATK, and
      the test files required to run the test suite. Run `git lfs install` after downloading, followed by `git lfs pull` from
      the root of your git clone to download all of the large files, including those required to run the test suite. The
      full download is approximately 5 gigabytes. Alternatively, if you are just building GATK and not running the test
      suite, you can skip this step: the build itself will use git-lfs to download the minimal set of large `lfs`
      resource files required to complete the build. The test resources will not be downloaded, which greatly reduces
      the size of the download.
    * Gradle 5.6. We recommend using the `./gradlew` script, which will
      download and use an appropriate gradle version automatically (see examples below).
    * R 3.2.5 (needed for running the test suite)
* Pre-packaged Docker images with all needed dependencies installed can be found on
  [our dockerhub repository](https://hub.docker.com/r/broadinstitute/gatk/). This requires a recent version of the
  docker client, which can be found on the [docker website](https://www.docker.com/get-docker).
* Python Dependencies:<a name="python"></a>
    * GATK4 uses the [Conda](https://conda.io/docs/index.html) package manager to establish and manage the
      Python environment and dependencies required by GATK tools that have a Python dependency. This environment also
      includes the R dependencies used for plotting in some of the tools. The `gatk` environment
      requires hardware with AVX support for tools that depend on TensorFlow (e.g. CNNScoreVariants). The GATK Docker image
      comes with the `gatk` environment pre-configured.
        * At this time, the only supported platforms are 64-bit Linux distributions. The required Conda environment is not
          currently supported on OS X/macOS.
    * To establish the environment when not using the Docker image, a conda environment must first be "created", and
      then "activated":
        * First, make sure [Miniconda or Conda](https://conda.io/docs/index.html) is installed (Miniconda is sufficient).
        * To "create" the conda environment:
            * If running from a zip or tar distribution, run the command `conda env create -f gatkcondaenv.yml` to
              create the `gatk` environment.
            * If running from a cloned repository, run `./gradlew localDevCondaEnv`. This generates the Python
              package archive and conda yml dependency file(s) in the build directory, and also creates (or updates)
              the local `gatk` conda environment.
        * To "activate" the conda environment (the conda environment must be activated within the same shell from which
          GATK is run):
            * Execute the shell command `source activate gatk` to activate the `gatk` environment.
        * See the [Conda](https://conda.io/docs/user-guide/tasks/manage-environments.html) documentation for
          additional information about using and managing Conda environments.

## <a name="quickstart">Quick Start Guide</a>

* Build the GATK: `./gradlew bundle` (creates `gatk-VERSION.zip` in `build/`)
* Get help on running the GATK: `./gatk --help`
* Get a list of available tools: `./gatk --list`
* Run a tool: `./gatk PrintReads -I src/test/resources/NA12878.chr17_69k_70k.dictFix.bam -O output.bam`
* Get help on a particular tool: `./gatk PrintReads --help`

## <a name="downloading">Downloading GATK4</a>

You can download and run pre-built versions of GATK4 from the following places:

* A zip archive with everything you need to run GATK4 can be downloaded for each release from the [github releases page](https://github.com/broadinstitute/gatk/releases). We also host unstable archives generated nightly in the Google bucket gs://gatk-nightly-builds.

* You can download a GATK4 docker image from [our dockerhub repository](https://hub.docker.com/r/broadinstitute/gatk/). We also host unstable nightly development builds on [this dockerhub repository](https://hub.docker.com/r/broadinstitute/gatk-nightly/).
    * Within the docker image, run gatk commands as usual from the default startup directory (`/gatk`).

## <a name="building">Building GATK4</a>

* **To do a full build of GATK4, first clone the GATK repository using "git clone", then run:**

        ./gradlew bundle

  Equivalently, you can just type:

        ./gradlew

    * This creates a zip archive in the `build/` directory with a name like `gatk-VERSION.zip` containing a complete standalone GATK distribution, including our launcher `gatk`, both the local and spark jars, and this README.
    * You can also run GATK commands directly from the root of your git clone after running this command.
    * Note that you *must* have a full git clone in order to build GATK, including the git-lfs files in `src/main/resources`. The zipped source code alone is not buildable.

* **Other ways to build:**
    * `./gradlew installDist`
        * Does a *fast* build that only lets you run GATK tools from inside your git clone, and locally only (not on a cluster). Good for developers!
    * `./gradlew installAll`
        * Does a *semi-fast* build that only lets you run GATK tools from inside your git clone, but works both locally and on a cluster. Good for developers!
    * `./gradlew localJar`
        * Builds *only* the GATK jar used for running tools locally (not on a Spark cluster). The resulting jar will be in `build/libs` with a name like `gatk-package-VERSION-local.jar`, and can be used outside of your git clone.
    * `./gradlew sparkJar`
        * Builds *only* the GATK jar used for running tools on a Spark cluster (rather than locally). The resulting jar will be in `build/libs` with a name like `gatk-package-VERSION-spark.jar`, and can be used outside of your git clone.
        * This jar will not include Spark and Hadoop libraries, in order to allow the versions of Spark and Hadoop installed on your cluster to be used.

* **To remove previous builds, run:**

        ./gradlew clean

* For faster gradle operations, add `org.gradle.daemon=true` to your `~/.gradle/gradle.properties` file.
  This will keep a gradle daemon running in the background and avoid the ~6s gradle start-up time on every command.

* Gradle keeps a cache of dependencies used to build GATK. By default this goes in `~/.gradle`. If there is insufficient free space in your home directory, you can change the location of the cache by setting the `GRADLE_USER_HOME` environment variable.

* The version number is automatically derived from the git history using `git describe`; you can override it by setting the `versionOverride` property
  (`./gradlew -DversionOverride=my_weird_version printVersion`).
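
The two Gradle tweaks above (enabling the daemon and relocating the cache) can be combined. A minimal sketch, using a scratch `GRADLE_USER_HOME` so the example is self-contained; point it at any disk with enough free space:

```shell
# Use a scratch cache location (illustrative path) and enable the daemon there.
export GRADLE_USER_HOME=/tmp/gradle-cache-demo
mkdir -p "$GRADLE_USER_HOME"
echo "org.gradle.daemon=true" >> "$GRADLE_USER_HOME/gradle.properties"
# Subsequent ./gradlew invocations in this shell will pick up both settings.
```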

## <a name="running">Running GATK4</a>

* The standard way to run GATK4 tools is via the **`gatk`** wrapper script located in the root directory of a clone of this repository.
    * Requires Python 2.6 or greater (this includes Python 3.x)
    * You need to have built GATK as described in the [Building GATK4](#building) section above before running this script.
    * There are several ways `gatk` can be run:
        * Directly from the root of your git clone after building
        * By extracting the zip archive produced by `./gradlew bundle` to a directory, and running `gatk` from there
        * By manually putting the `gatk` script in the same directory as the fully-packaged GATK jars produced by `./gradlew localJar` and/or `./gradlew sparkJar`
        * By defining the environment variables `GATK_LOCAL_JAR` and `GATK_SPARK_JAR`, and setting them to the paths of the GATK jars produced by `./gradlew localJar` and/or `./gradlew sparkJar`
    * `gatk` can run non-Spark tools as well as Spark tools, and can run Spark tools locally, on a Spark cluster, or on Google Cloud Dataproc.
    * ***Note:*** running with `java -jar` directly and bypassing `gatk` causes several important system properties to not get set, including the htsjdk compression level!
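
The environment-variable method from the list above might look like the following sketch (the jar paths are placeholders; substitute the actual paths produced by your build):

```shell
# Point the gatk wrapper at explicitly chosen jars (placeholder paths).
export GATK_LOCAL_JAR="$HOME/jars/gatk-package-VERSION-local.jar"
export GATK_SPARK_JAR="$HOME/jars/gatk-package-VERSION-spark.jar"
# The gatk script will now use these jars rather than searching its own directory.
```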

* For help on using `gatk` itself, run **`./gatk --help`**

* To print a list of available tools, run **`./gatk --list`**.
    * Spark-based tools will have a name ending in `Spark` (e.g., `BaseRecalibratorSpark`). Most other tools are non-Spark-based.

* To print help for a particular tool, run **`./gatk ToolName --help`**.

* To run a non-Spark tool, or to run a Spark tool locally, the syntax is: **`./gatk ToolName toolArguments`**.

* Tool arguments that allow multiple values, such as `-I`, can be supplied on the command line using a file with the extension ".args". Each line of the file should contain a
  single value for the argument.
* Examples:

  ```
  ./gatk PrintReads -I input.bam -O output.bam
  ```

  ```
  ./gatk PrintReadsSpark -I input.bam -O output.bam
  ```
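
The `.args` file mechanism described above can be sketched as follows (the file name and bam names are hypothetical):

```shell
# One value per line; the file extension must be .args.
cat > inputs.args <<'EOF'
sample1.bam
sample2.bam
EOF
# The file is then passed where the repeated argument would go (sketch;
# requires a built GATK and real bam files):
#   ./gatk PrintReads -I inputs.args -O all_reads.bam
```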

#### <a name="jvmoptions">Passing JVM options to gatk</a>

* To pass JVM arguments to GATK, run `gatk` with the `--java-options` argument:

    ```
    ./gatk --java-options "-Xmx4G" <rest of command>

    ./gatk --java-options "-Xmx4G -XX:+PrintGCDetails" <rest of command>
    ```

#### <a name="configFileOptions">Passing a configuration file to gatk</a>

* To pass a configuration file to GATK, run `gatk` with the `--gatk-config-file` argument:

    ```
    ./gatk --gatk-config-file GATKProperties.config <rest of command>
    ```

    An example GATK configuration file is packaged with each release as `GATKConfig.EXAMPLE.properties`.
    This example file contains all current options that are used by GATK and their default values.

#### <a name="gcs">Running GATK4 with inputs on Google Cloud Storage:</a>

* Many GATK4 tools can read BAM or VCF inputs from a Google Cloud Storage bucket. Just use the "gs://" prefix:
  ```
  ./gatk PrintReads -I gs://mybucket/path/to/my.bam -L 1:10000-20000 -O output.bam
  ```
* ***Important:*** You must set up your credentials first for this to work! There are three options:
    * Option (a): run in a Google Compute Engine VM
        * If you are running in a Google VM then your credentials are already in the VM and will be picked up by GATK; you don't need to do anything special.
    * Option (b): use your own account
        * Install the [Google Cloud SDK](https://cloud.google.com/sdk/)
        * Log into your account:
        ```
        gcloud auth application-default login
        ```
        * Done! GATK will use the application-default credentials you set up there.
    * Option (c): use a service account
        * Create a new service account on the Google Cloud web page and download the JSON key file
        * Install the [Google Cloud SDK](https://cloud.google.com/sdk/)
        * Tell gcloud about the key file:
        ```
        gcloud auth activate-service-account --key-file "$PATH_TO_THE_KEY_FILE"
        ```
        * Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to the file:
        ```
        export GOOGLE_APPLICATION_CREDENTIALS="$PATH_TO_THE_KEY_FILE"
        ```
        * Done! GATK will pick up the service account. You can also do this in a VM if you'd like to override the default credentials.

#### <a name="sparklocal">Running GATK4 Spark tools locally:</a>

* GATK4 Spark tools can be run in local mode (without a cluster). In this mode, Spark will run the tool
  in multiple parallel execution threads using the cores in your CPU. You can control how many threads
  Spark will use via the `--spark-master` argument.

* Examples:

  Run `PrintReadsSpark` with 4 threads on your local machine:
  ```
  ./gatk PrintReadsSpark -I src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam \
      -- \
      --spark-runner LOCAL --spark-master 'local[4]'
  ```
  Run `PrintReadsSpark` with as many worker threads as there are logical cores on your local machine:
  ```
  ./gatk PrintReadsSpark -I src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam \
      -- \
      --spark-runner LOCAL --spark-master 'local[*]'
  ```

* Note that the Spark-specific arguments are separated from the tool-specific arguments by a `--`.

#### <a name="sparkcluster">Running GATK4 Spark tools on a Spark cluster:</a>

**`./gatk ToolName toolArguments -- --spark-runner SPARK --spark-master <master_url> additionalSparkArguments`**
* Examples:

  ```
  ./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
      -- \
      --spark-runner SPARK --spark-master <master_url>
  ```

  ```
  ./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
      -- \
      --spark-runner SPARK --spark-master <master_url> \
      --num-executors 5 --executor-cores 2 --executor-memory 4g \
      --conf spark.executor.memoryOverhead=600
  ```

* You can also omit the `--num-executors` argument to enable [dynamic allocation](https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation) if you configure the cluster properly (see the Spark website for instructions).
* Note that the Spark-specific arguments are separated from the tool-specific arguments by a `--`.
* Running a Spark tool on a cluster requires Spark to have been installed from http://spark.apache.org/, since
  `gatk` invokes the `spark-submit` tool behind the scenes.
* Note that the examples above use YARN but we have successfully run GATK4 on Mesos as well.

#### <a name="dataproc">Running GATK4 Spark tools on Google Cloud Dataproc:</a>
  * You must have a [Google cloud services](https://cloud.google.com/) account, and have spun up a Dataproc cluster
    in the [Google Developer's console](https://console.developers.google.com). You may need to have the "Allow API access to all Google Cloud services in the same project" option enabled (settable when you create a cluster).
  * You need to have installed the Google Cloud SDK from [here](https://cloud.google.com/sdk/), since
    `gatk` invokes the `gcloud` tool behind the scenes. As part of the installation, be sure
    that you follow the `gcloud` setup instructions [here](https://cloud.google.com/sdk/gcloud/). As this library is frequently updated by Google, we recommend updating your copy regularly to avoid any version-related difficulties.
  * Your inputs to GATK when running on Dataproc are typically in Google Cloud Storage buckets, and should be specified on
    your GATK command line using the syntax `gs://my-gcs-bucket/path/to/my-file`
  * You can run GATK4 jobs on Dataproc from your local computer or from the VM (master node) on the cloud.

  Once you're set up, you can run a Spark tool on your Dataproc cluster using a command of the form:

  **`./gatk ToolName toolArguments -- --spark-runner GCS --cluster myGCSCluster additionalSparkArguments`**

  * Examples:

      ```
      ./gatk PrintReadsSpark \
          -I gs://my-gcs-bucket/path/to/input.bam \
          -O gs://my-gcs-bucket/path/to/output.bam \
          -- \
          --spark-runner GCS --cluster myGCSCluster
      ```

      ```
      ./gatk PrintReadsSpark \
          -I gs://my-gcs-bucket/path/to/input.bam \
          -O gs://my-gcs-bucket/path/to/output.bam \
          -- \
          --spark-runner GCS --cluster myGCSCluster \
          --num-executors 5 --executor-cores 2 --executor-memory 4g \
          --conf spark.yarn.executor.memoryOverhead=600
      ```
  * When using Dataproc you can access the web interfaces for YARN, Hadoop and HDFS by opening an SSH tunnel and connecting with your browser. This can be done easily using the included `dataproc-cluster-ui` script.

    ```
    scripts/dataproc-cluster-ui myGCSCluster
    ```
    Or see [these instructions](https://cloud.google.com/dataproc/cluster-web-interfaces) for more details.
  * Note that the Spark-specific arguments are separated from the tool-specific arguments by a `--`.
  * If you want to avoid uploading the GATK jar to GCS on every run, set the `GATK_GCS_STAGING`
    environment variable to a bucket you have write access to (e.g., `export GATK_GCS_STAGING=gs://<my_bucket>/`)
  * Dataproc Spark clusters are configured with [dynamic allocation](https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation), so you can omit the `--num-executors` argument and let YARN handle it automatically.

#### <a name="R">Using R to generate plots</a>
Certain GATK tools may optionally generate plots using the R installation provided within the conda environment. Even if you are not interested in plotting, R is still required by several of the unit tests. Plotting is currently untested and should be viewed as a convenience rather than a primary output.

#### <a name="tab_completion">Bash Command-line Tab Completion (BETA)</a>

* A tab completion bootstrap file for the bash shell is now included in releases. This file allows the command-line shell to complete GATK run options in a manner equivalent to built-in command-line tools (e.g. grep).

* This tab completion functionality has only been tested in the bash shell, and is released as a beta feature.

* To enable tab completion for the GATK, open a terminal window and source the included tab completion script:

```
source gatk-completion.sh
```

* Sourcing this file will allow you to press the tab key twice to get a list of options available to add to your current GATK command. By default you will have to source this file once in each command-line session; tab completion will then be available for the rest of that session only.

* Note that you must have already started typing an invocation of the GATK (using `gatk`) for tab completion to initiate:

```
./gatk <TAB><TAB>
```

* We recommend adding a line to your bash settings file (i.e. your `~/.bashrc` file) that sources the tab completion script. To add this line to your bashrc file you can use the following command:

```
echo "source <PATH_TO>/gatk-completion.sh" >> ~/.bashrc
```

* Where `<PATH_TO>` is the fully qualified path to the `gatk-completion.sh` script.

## <a name="developers">For GATK Developers</a>

#### <a name="dev_guidelines">General guidelines for GATK4 developers</a>

* **Do not put private or restricted data into the repo.**

* **Try to keep datafiles under 100kb in size.** Larger test files should go into `src/test/resources/large` (and subdirectories) so that they'll be stored and tracked by git-lfs as described [above](#lfs).

* GATK4 is Apache 2.0 licensed. The license is in the top-level LICENSE.TXT file. Do not add any additional license text or accept files with a license included in them.

* Each tool should have at least one good end-to-end integration test with a check for expected output, plus high-quality unit tests for all non-trivial utility methods/classes used by the tool. Although we have no specific coverage target, coverage should be extensive enough that if tests pass, the tool is guaranteed to be in a usable state.

* All newly written code must have good test coverage (>90%).

* All bug fixes must be accompanied by a regression test.

* All pull requests must be reviewed before merging to master (even documentation changes).

* Don't issue or accept pull requests that introduce warnings. Warnings must be addressed or suppressed.

* Don't issue or accept pull requests that significantly decrease coverage (a less than 1% decrease is sort of tolerable).

* Don't use `toString()` for anything other than human consumption (i.e., don't base the logic of your code on the results of `toString()`).

* Don't override `clone()` unless you really know what you're doing. If you do override it, document thoroughly. Otherwise, prefer other means of making copies of objects.

* For logging, use [org.apache.logging.log4j.Logger](https://logging.apache.org/log4j/2.0/log4j-api/apidocs/org/apache/logging/log4j/Logger.html)

* We mostly follow the [Google Java Style guide](https://google.github.io/styleguide/javaguide.html)

* Git: Don't push directly to master - make a pull request instead.

* Git: Rebase and squash commits when merging.

* If you push to master or mess up the commit history, you owe us 1 growler or tasty snacks at happy hour. If you break the master build, you owe 3 growlers (or lots of tasty snacks). Beer may be replaced by wine (in the color and vintage of buyer's choosing) in proportions of 1 growler = 1 bottle.

#### <a name="testing">Testing GATK</a>

* Before running the test suite, be sure that you've installed `git lfs` and downloaded the large test data, following the [git lfs setup instructions](#lfs)

* To run the test suite, run **`./gradlew test`**.
    * The test report is in `build/reports/tests/test/index.html`.
    * What will happen depends on the value of the `TEST_TYPE` environment variable:
       * unset or any other value: run non-cloud unit and integration tests; this is the default
       * `cloud`, `unit`, `integration`, `conda`, `spark`: run only the cloud, unit, integration, conda (python + R), or Spark tests
       * `all`: run the entire test suite
    * Cloud tests require being logged into `gcloud` and authenticated with a project that has access
      to the cloud test data. They also require setting several environment variables:
      * `HELLBENDER_JSON_SERVICE_ACCOUNT_KEY` : path to a local JSON file with [service account credentials](https://cloud.google.com/storage/docs/authentication#service_accounts)
      * `HELLBENDER_TEST_PROJECT` : your google cloud project
      * `HELLBENDER_TEST_STAGING` : a gs:// path to a writable location
      * `HELLBENDER_TEST_INPUTS` : path to cloud test data, ex: gs://hellbender/test/resources/
    * Setting the environment variable `TEST_VERBOSITY=minimal` will produce much less output from the test suite
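
Put together, a cloud-test environment might be configured like this (every value below is a placeholder; substitute your own project, bucket, and key file):

```shell
# Placeholder credentials/paths for the cloud test environment.
export HELLBENDER_JSON_SERVICE_ACCOUNT_KEY="$HOME/keys/service-account.json"
export HELLBENDER_TEST_PROJECT=my-gcp-project
export HELLBENDER_TEST_STAGING=gs://my-bucket/staging/
export HELLBENDER_TEST_INPUTS=gs://hellbender/test/resources/
# Then run only the cloud tests (requires a built clone and gcloud auth):
#   TEST_TYPE=cloud ./gradlew test
```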

* To run a subset of tests, use gradle's test filtering (see [gradle doc](https://docs.gradle.org/current/userguide/java_plugin.html)):
    * You can use `--tests` with a wildcard to run a specific test class, method, or to select multiple test classes:
        * `./gradlew test --tests *SomeSpecificTestClass`
        * `./gradlew test --tests *SomeTest.someSpecificTestMethod`
        * `./gradlew test --tests all.in.specific.package*`

* To run tests and compute coverage reports, run **`./gradlew jacocoTestReport`**. The report is then in `build/reports/jacoco/test/html/index.html`.
  (IntelliJ has a good coverage tool that is preferable for development.)

* We use [Travis-CI](https://travis-ci.org/broadinstitute/gatk) as our continuous integration provider.

    * Before merging any branch, make sure that all required tests pass on travis.
    * Every travis build will upload the test results to our GATK Google Cloud Storage bucket.
      A link to the uploaded report will appear at the very bottom of the travis log.
      Look for the line that says `See the test report at`.
      If TestNG itself crashes, there will be no report generated.

* We use [Broad Jenkins](https://gatk-jenkins.broadinstitute.org/view/Performance/) for our long-running tests and performance tests.
    * To add a performance test (requires a Broad ID), you need to make a "new item" in Jenkins and make it a "copy" instead of a blank project. You need to base it on either the "-spark-" jobs or the other kind of jobs and alter the command line.

* To output stack traces for `UserException`, set the environment variable `GATK_STACKTRACE_ON_USER_EXCEPTION=true`

#### <a name="lfs">Using Git LFS to download and track large test data</a>

We use [git-lfs](https://git-lfs.github.com/) to version and distribute test data that is too large to check into our repository directly. You must install and configure it in order to be able to run our test suite.

* After installing [git-lfs](https://git-lfs.github.com/), run `git lfs install`
    * This adds hooks to your git configuration that will cause git-lfs files to be checked out for you automatically in the future.

* To manually retrieve the large test data, run `git lfs pull` from the root of your GATK git clone.
    * The download size is approximately 5 gigabytes.

* To add a new large file to be tracked by git-lfs, simply:
    * Put the new file(s) in `src/test/resources/large` (or a subdirectory)
    * `git add` the file(s), then `git commit -a`
    * That's it! Do ***not*** run `git lfs track` on the files manually: all files in `src/test/resources/large` are tracked by git-lfs automatically.

#### <a name="intellij">Creating a GATK project in the IntelliJ IDE (last tested with version 2016.2.4):</a>

* Ensure that you have `gradle` and the Java 8 JDK installed

* You may need to install the TestNG and Gradle plugins (in preferences)

* Clone the GATK repository using git

* In IntelliJ, click on "Import Project" in the home screen or go to File -> New... -> Project From Existing Sources...

* Select the root directory of your GATK clone, then click on "OK"

* Select "Import project from external model", then "Gradle", then click on "Next"

* Ensure that "Gradle project" points to the build.gradle file in the root of your GATK clone

* Select "Use auto-import" and "Use default gradle wrapper".

* Make sure the Gradle JVM points to Java 1.8. You may need to set this manually after creating the project: find the gradle settings by clicking the wrench icon in the gradle tab on the right bar, and from there edit the "Gradle JVM" setting to point to Java 1.8.

* Click "Finish"

* After downloading project dependencies, IntelliJ should open a new window with your GATK project

* Make sure that the Java version is set correctly by going to File -> "Project Structure" -> "Project". Check that the "Project SDK" is set to your Java 1.8 JDK, and "Project language level" to 8 (you may need to add your Java 8 JDK under "Platform Settings" -> SDKs if it isn't there already). Then click "Apply"/"Ok".

#### <a name="debugging">Setting up debugging in IntelliJ</a>

* Follow the instructions above for creating an IntelliJ project for GATK

* Go to Run -> "Edit Configurations", then click "+" and add a new "Application" configuration

* Set the name of the new configuration to something like "GATK debug"

* For "Main class", enter `org.broadinstitute.hellbender.Main`

* Ensure that "Use classpath of module:" is set to use the "gatk" module's classpath

* Enter the arguments for the command you want to debug in "Program Arguments"

* Click "Apply"/"Ok"

* Set breakpoints, etc., as desired, then select "Run" -> "Debug" -> "GATK debug" to start your debugging session

* In future debugging sessions, you can simply adjust the "Program Arguments" in the "GATK debug" configuration as needed

#### <a name="intellij_gradle_refresh">Updating the IntelliJ project when dependencies change</a>
If there are dependency changes in `build.gradle`, it is necessary to refresh the Gradle project. This is easily done with the following steps:

* Open the Gradle tool window ("View" -> "Tool Windows" -> "Gradle")
* Click the refresh button in the Gradle tool window. It is in the top left of the Gradle view and is represented by two blue arrows.

#### <a name="jprofiler">Setting up profiling using JProfiler</a>

   * Running JProfiler standalone:
       * Build a full GATK4 jar using `./gradlew localJar`
       * In the "Session Settings" window, select the GATK4 jar, e.g. `~/gatk/build/libs/gatk-package-4.alpha-196-gb542813-SNAPSHOT-local.jar`, for "Main class or executable JAR" and enter the right "Arguments"
       * Under "Profiling Settings", select "sampling" as the "Method call recording" method.

   * Running JProfiler from within IntelliJ:
       * JProfiler has great integration with IntelliJ (we're using IntelliJ Ultimate edition), so the setup is trivial.
       * Follow the instructions [above](#intellij) for creating an IntelliJ project for GATK
       * Right click on a test method/class/package and select "Profile"

#### <a name="sonatype">Uploading Archives to Sonatype (to make them available via maven central)</a>
To upload snapshots to Sonatype you'll need the following:

* You must have a registered account on the Sonatype JIRA (and be approved as a gatk uploader)
* You need to configure several additional properties in your `~/.gradle/gradle.properties` file

* If you want to upload a release instead of a snapshot, you will additionally need access to the gatk signing key and password

```
#needed for snapshot upload
sonatypeUsername=<your sonatype username>
sonatypePassword=<your sonatype password>

#needed for signing a release
signing.keyId=<gatk key id>
signing.password=<gatk key password>
signing.secretKeyRingFile=/Users/<username>/.gnupg/secring.gpg
```

To perform an upload, use
```
./gradlew uploadArchives
```

Builds are considered snapshots by default. You can mark a build as a release build by setting `-Drelease=true`.
The archive name is based on the output of `git describe`.

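Putting it together, assuming the properties above are configured, a snapshot and a signed release upload would look like this:

```
# snapshot upload (the default)
./gradlew uploadArchives

# release upload, signed with the gatk key
./gradlew uploadArchives -Drelease=true
```
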
#### <a name="docker_building">Building GATK4 Docker images</a>

Please see [the Docker README](scripts/docker/README.md) in ``scripts/docker``. It has instructions for the Dockerfile in the root directory.

#### <a name="releasing_gatk">Releasing GATK4</a>

Please see the [How to release GATK4](https://github.com/broadinstitute/gatk/wiki/How-to-release-GATK4) wiki article for instructions on releasing GATK4.

#### <a name="gatkdocs">Generating GATK4 documentation</a>

To generate GATK documentation, run `./gradlew gatkDoc`

* Generated docs will be in the `build/docs/gatkdoc` directory.

#### <a name="gatkwdlgen">Generating GATK4 WDL Wrappers</a>

* A WDL wrapper can be generated for any GATK4 tool that is annotated for WDL generation (see the wiki article
[How to Prepare a GATK tool for WDL Auto Generation](https://github.com/broadinstitute/gatk/wiki/How-to-Prepare-a-GATK-tool-for-WDL-Auto-Generation)
to learn more about WDL annotations).

* To generate the WDL wrappers, run `./gradlew gatkWDLGen`. The generated WDLs and accompanying JSON input files can
be found in the `build/docs/wdlGen` folder.

* To generate WDL wrappers and validate the resulting outputs, run `./gradlew gatkWDLGenValidation`.
Running this task requires a local [cromwell](https://github.com/broadinstitute/cromwell) installation, and the environment
variables `CROMWELL_JAR` and `WOMTOOL_JAR` must be set to the full pathnames of the `cromwell` and `womtool` jar files.
If no local installation is available, this task will run automatically on Travis in a separate job whenever a PR is submitted.

* WDL wrappers for each GATK release are published to the [gatk-tool-wdls](https://github.com/broadinstitute/gatk-tool-wdls) repository.
Only tools that have been annotated for WDL generation will show up there.

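For example, a local validation run might look like this (the jar paths are placeholders for your local installation):

```
export CROMWELL_JAR=/path/to/cromwell.jar
export WOMTOOL_JAR=/path/to/womtool.jar
./gradlew gatkWDLGenValidation
```
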
#### <a name="zenhub">Using Zenhub to track GitHub issues</a>

We use [Zenhub](https://www.zenhub.com/) to organize and track GitHub issues.

* To add Zenhub to GitHub, go to the [Zenhub home page](https://www.zenhub.com/) while logged in to GitHub, and click "Add Zenhub to Github"

* Zenhub allows the GATK development team to assign time estimates to issues, and to mark issues as Triaged/In Progress/In Review/Blocked/etc.

## <a name="spark_further_reading">Further Reading on Spark</a>

[Apache Spark](https://spark.apache.org/) is a fast and general engine for large-scale data processing.
GATK4 can run on any Spark cluster, such as an on-premise Hadoop cluster with HDFS storage and the Spark
runtime, as well as on the cloud using Google Dataproc.

In a cluster scenario, your input and output files reside on HDFS, and Spark will run in a distributed fashion on the cluster.
The Spark documentation has a good [overview of the architecture](https://spark.apache.org/docs/latest/cluster-overview.html).

Note that if you don't have a dedicated cluster, you can run Spark in
[standalone mode](https://spark.apache.org/docs/latest/spark-standalone.html) on a single machine, which exercises
the distributed code paths, albeit on a single node.

While your Spark job is running, the [Spark UI](http://spark.apache.org/docs/latest/monitoring.html) is an excellent place to monitor its progress.
Additionally, if you're running tests, then by adding `-Dgatk.spark.debug=true` you can run a single Spark test and
look at the Spark UI (at [http://localhost:4040/](http://localhost:4040/)) as it runs.
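
For example, a single Spark test might be run with the UI enabled like this (the test class name is a hypothetical example):

```
./gradlew test -Dgatk.spark.debug=true --tests "*ExampleSparkIntegrationTest"
```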

You can find more information about tuning Spark and choosing good values for important settings such as the number
of executors and memory settings at the following:

* [Tuning Spark](https://spark.apache.org/docs/latest/tuning.html)
* [How-to: Tune Your Apache Spark Jobs (Part 1)](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/)
* [How-to: Tune Your Apache Spark Jobs (Part 2)](http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/)

## <a name="contribute">How to contribute to GATK</a>
(Note: section inspired by, and some text copied from, [Apache Parquet](https://github.com/apache/parquet-mr))

We welcome all contributions to the GATK project. A contribution can be an [issue report](https://github.com/broadinstitute/gatk/issues)
or a [pull request](https://github.com/broadinstitute/gatk/pulls). If you're not a committer, you will
need to [make a fork](https://help.github.com/articles/fork-a-repo/) of the gatk repository
and [issue a pull request](https://help.github.com/articles/be-social/) from your fork.

For ideas on what to contribute, check issues labeled ["Help wanted (Community)"](https://github.com/broadinstitute/gatk/issues?q=is%3Aopen+is%3Aissue+label%3A%22Help+Wanted+%28Community%29%22). Comment on the issue to indicate that you're interested in contributing code and to share your questions and ideas.

To contribute a patch:
* Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
* Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on [forking a repo](https://help.github.com/articles/fork-a-repo/) and [sending a pull request](https://help.github.com/articles/be-social/). If applicable, include the issue number in the pull request name.
* Make sure that your code passes all our tests. You can run the tests with `./gradlew test` in the root directory.
* Add tests for all new code you've written. We prefer unit tests, but high-quality integration tests that use small amounts of data are acceptable.
* Follow the [**General guidelines for GATK4 developers**](https://github.com/broadinstitute/gatk#general-guidelines-for-gatk4-developers).

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some things to consider:
* Write tests for all new code.
* Document all classes and public methods.
* For all public methods, check the validity of the arguments and throw `IllegalArgumentException` if invalid.
* Use braces for control constructs: `if`, `for`, etc.
* Make classes, variables, parameters, etc. `final` unless there is a strong reason not to.
* Give your operators some room: not `a+b` but `a + b`, and not `foo(int a,int b)` but `foo(int a, int b)`.
* Generally speaking, stick to the [Google Java Style guide](https://google.github.io/styleguide/javaguide.html).

Thank you for getting involved!

## <a name="discussions">Discussions</a>
* [GATK forum](https://gatk.broadinstitute.org/hc/en-us/community/topics) for general discussions on how to use the GATK and support questions.
* [Issue tracker](https://github.com/broadinstitute/gatk/issues) to report errors and enhancement ideas.
* Discussions also take place in [GATK pull requests](https://github.com/broadinstitute/gatk/pulls).

## <a name="authors">Authors</a>
The authors list is maintained in the [AUTHORS](https://github.com/broadinstitute/gatk/edit/master/AUTHORS) file.
See also the [Contributors](https://github.com/broadinstitute/gatk/graphs/contributors) list on GitHub.

## <a name="license">License</a>
Licensed under the Apache 2.0 License. See the [LICENSE.TXT](https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT) file.