ENCODE ATAC-seq analysis pipeline: a hands-on tutorial

This pipeline is very practical. Installation and setup are a hassle, but once you have installed it and prepared all the files, you can just sit back and wait for the results. It takes ATAC-seq FASTQ read files as input and outputs mapped BAM files, peak calls, and more. Details about the pipeline can be found here.


Installing the pipeline with Conda is recommended, and I think it is the easiest way. The detailed installation steps are listed here; please follow them step by step. Below I copy the important steps and point out possible errors. (These instructions are for the Python3 version.)

If you installed the Conda environment some time ago, the croo package may be outdated. Please update it within your environment:

conda uninstall croo
pip install croo==0.3.4

Since FASTQ files are large and the pipeline is computationally heavy, these instructions are for installation on a Linux cluster.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh  # run the installer that wget just downloaded

# Use default except for the following two questions:

Do you accept the license terms? [yes|no]
[no] >>> yes

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes

IMPORTANT: Close your session and log in again. If you skip this step, the pipeline’s Conda environment will get mixed up with the base Conda environment.

# Disable auto-activation of base Conda environment.
conda config --set auto_activate_base false

IMPORTANT: Close your session and re-login.

# Install the environment (run from the root of the downloaded pipeline repository)
bash scripts/uninstall_conda_env.sh  # remove any previous environment for a clean install
bash scripts/install_conda_env.sh

After everything is finished, the message below will show up.

=== All done successfully ===

Prepare files before running the pipeline (slurm)

– JSON file
– ~/.caper/default.conf
– backends folder
– atac.wdl
– atac.croo.v4.json

The JSON file is a config file in which you specify all genomic data files, parameters, and metadata for running the pipeline. Please see this link for the detailed documentation; here are the short version and full version of the template.


### An excerpt of what the JSON file looks like (not complete) ###
# Here are the lines you need to modify in the JSON file.

"atac.genome_tsv" : "https://storage.googleapis.com/encode-pipeline-genome-data/hg38_caper.tsv",
"atac.blacklist": "/your/path/ENCFF419RSJ_blackList.bed.gz",
"atac.paired_end" : true,
"atac.fastqs_rep1_R1" : [ "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz" ],
"atac.fastqs_rep1_R2" : [ "rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz" ],
"atac.fastqs_rep2_R1" : [ "rep2_R1_L1.fastq.gz", "rep2_R1_L2.fastq.gz" ],
"atac.fastqs_rep2_R2" : [ "rep2_R2_L1.fastq.gz", "rep2_R2_L2.fastq.gz" ],
"atac.title" : "Your atac-seq title",
"atac.description" : "Your atac-seq description",

To save you the trouble: if you are mapping to hg38, just use the link in the template above for atac.genome_tsv.
The blacklist file can be downloaded from ENCODE (hg38). [link]
Note that every path in the JSON file should be an absolute path.
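Relative FASTQ paths are easy to miss in a long JSON file. The following is a minimal sketch of a check, not part of the pipeline: the function names is_absolute_or_url and list_relative_fastqs are hypothetical, and it assumes that FASTQ entries end in .fastq.gz and contain no spaces.

```shell
# Succeeds when the argument is an absolute local path or an http(s) URL.
is_absolute_or_url() {
  case "$1" in
    /*|http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# list_relative_fastqs INPUT_JSON
# Prints every quoted *.fastq.gz entry that is neither absolute nor a URL.
# Returns non-zero when at least one such entry is found.
# Assumption: paths contain no whitespace (they are split on whitespace here).
list_relative_fastqs() {
  bad=0
  for p in $(grep -o '"[^"]*\.fastq\.gz"' "$1" | tr -d '"'); do
    if ! is_absolute_or_url "$p"; then
      echo "not absolute: $p"
      bad=1
    fi
  done
  [ "$bad" -eq 0 ]
}

# Usage (the path is a placeholder):
#   list_relative_fastqs /your/path/Z.json
```

This only does a text search for quoted strings, so it does not validate the JSON itself.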

~/.caper/default.conf is the config file for caper. Please see the detailed documentation [link] for other cluster systems; here I will show you how to edit it for slurm.
My logic is this: I submit one slurm job per sample. Once a job has been assigned a compute node, that node becomes my new ‘local’, and all my input/tmp/output files and the actual computing live in the assigned $TMPDIR folder. Below is exactly what the file looks like; just change your home directory path.


# DO NOT use /tmp here
# Caper stores all important temp files and cached big data files here
# If not defined, Caper will make .caper_tmp/ on your local output directory
# which is defined by out-dir, --out-dir or $CWD
# Use a local absolute path here
tmp-dir=    # <-- left blank intentionally


The files below are all in the ATAC-seq pipeline GitHub repository, so you can just download the whole thing. In case you installed the pipeline using Conda and can’t find them, I have added individual links as well.
The backends folder can be downloaded at this link.
atac.wdl is the main pipeline file; it can be downloaded at this link.
atac.croo.v4.json is a JSON file that defines how the results are organized in the output; it can be downloaded here.

Sometimes errors occur because no backends folder is specified or Cromwell is not installed properly.
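Before submitting a job, it can help to confirm that the five items listed above are actually in place. This is a minimal sketch under assumptions: check_pipeline_files is a hypothetical helper, and it assumes that atac.wdl, atac.croo.v4.json, and the backends folder all sit in one working directory. It does not check whether Cromwell is installed.

```shell
# check_pipeline_files WORKDIR INPUT_JSON [CAPER_CONF]
# Reports every missing item on stderr; succeeds only when all are present.
# CAPER_CONF defaults to ~/.caper/default.conf.
check_pipeline_files() {
  workdir=$1
  input_json=$2
  caper_conf=${3:-$HOME/.caper/default.conf}
  missing=0

  # Regular files the pipeline run needs
  for f in "$input_json" "$caper_conf" \
           "$workdir/atac.wdl" "$workdir/atac.croo.v4.json"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      missing=1
    fi
  done

  # The backends folder must be a directory
  if [ ! -d "$workdir/backends" ]; then
    echo "missing folder: $workdir/backends" >&2
    missing=1
  fi

  [ "$missing" -eq 0 ]
}

# Usage (Z.json is a placeholder for your sample's input JSON):
#   check_pipeline_files "$PWD" "$PWD/Z.json" || echo "fix the items above first"
```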

Running the pipeline and example slurm submission file

Finally, after everything is in place, here is an example slurm script for running one sample. You can refer to this blog for how to loop through multiple samples.

Here are some notes on the slurm script:
– caper init local is a must, as I explained previously. Many people run into errors on slurm because they skip it.
– Always rsync all the files into $TMPDIR. This makes your job run much faster and avoids constant writes between the local node and the compute node, which can crash the whole system.
– Please carefully specify the intermediate and final paths in both the caper and croo commands. They should always be sub-folders of $TMPDIR, except for the final output, which is defined by --out-dir $SLURM_SUBMIT_DIR/pipelineOutput/Z. Z stands for your sample name; change it as needed.

The last step is to submit your slurm script.

sbatch atacSeqPipeline.sh
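For several samples, one option is to submit the same script once per sample and pass the sample name in through the job environment. The sketch below assumes that atacSeqPipeline.sh reads a variable named SAMPLE in place of Z. That variable, the function submit_samples, and the SBATCH_CMD dry-run switch are all hypothetical and not part of the pipeline. The sbatch options --job-name and --export are standard.

```shell
# submit_samples JOB_SCRIPT SAMPLE...
# Submits JOB_SCRIPT once per sample. Inside each job the sample name is
# available as $SAMPLE (a hypothetical variable standing in for "Z").
submit_samples() {
  job_script=$1
  shift
  for sample in "$@"; do
    # SBATCH_CMD defaults to sbatch; set SBATCH_CMD=echo for a dry run
    ${SBATCH_CMD:-sbatch} --job-name="atac_${sample}" \
      --export=ALL,SAMPLE="${sample}" "$job_script" || return 1
  done
}

# Dry run: print the arguments that would be passed to sbatch, one line per sample
SBATCH_CMD=echo submit_samples atacSeqPipeline.sh sampleA sampleB
```

Drop the SBATCH_CMD=echo prefix to submit for real. Because each sample gets its own job, each one is also assigned its own $TMPDIR.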