ENCODE ATAC-seq analysis pipeline hands-on tutorial

This pipeline is very practical. Although installation and setup are a hassle, once you have successfully installed it and prepared all the files, you can just sit back and wait for the results. It takes ATAC-seq FASTQ read files as input and outputs mapped BAMs, peak calls, and more. Details about the pipeline can be found here.

Installation

It’s recommended to install the pipeline using Conda, which I think is the easiest way. Detailed installation steps are listed here; please follow them step by step. I copy/paste the important steps below and point out possible errors. (The whole instruction is for the Python 3 version.)

If you installed the Conda environment a while ago, the package croo may be outdated; please update it within your environment.

conda uninstall croo
pip install croo==0.3.4

IMPORTANT: DO NOT USE A GLOBALLY INSTALLED CONDA. INSTALL YOUR OWN MINICONDA3 UNDER YOUR HOME DIRECTORY BY FOLLOWING THE INSTRUCTION HERE.
Since FASTQ files are large and the pipeline is computationally heavy, the installation instructions below are for Linux clusters.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Use default except for the following two questions:

Do you accept the license terms? [yes|no]
[no] >>> yes

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes

IMPORTANT: Close your session and re-login. If you skip this step, the pipeline’s Conda environment will get mixed up with the base Conda environment.

# Disable auto-activation of base Conda environment.
conda config --set auto_activate_base false

IMPORTANT: Close your session and re-login.

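The install scripts below live in the pipeline repository itself, so clone it first if you haven’t already (the repository URL is also listed in the sources at the end of this section):

git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
cd atac-seq-pipeline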
# Install the environment
bash scripts/uninstall_conda_env.sh  # uninstall it for clean-install
bash scripts/install_conda_env.sh

After everything finishes, the message below will show up.

=== All done successfully ===

Prepare files before running the pipeline (Slurm)

– JSON file
– ~/.caper/default.conf
– backends folder
– atac.wdl
– atac.croo.v4.json

The JSON file is a config file where you specify all genomic data files, parameters, and metadata for running the pipeline. Please see this link for detailed documentation; here are the short version and full version of the template.

IMPORTANT: ALWAYS USE ABSOLUTE PATHS.

### What the JSON file looks like (not complete) ###
# Here are the lines you need to modify in the JSON file.

"atac.genome_tsv" : "https://storage.googleapis.com/encode-pipeline-genome-data/hg38_caper.tsv",
"atac.blacklist": "/your/path/ENCFF419RSJ_blackList.bed.gz",
"atac.paired_end" : true,
"atac.fastqs_rep1_R1" : [ "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz" ],
"atac.fastqs_rep1_R2" : [ "rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz" ],
"atac.fastqs_rep2_R1" : [ "rep2_R1_L1.fastq.gz", "rep2_R1_L2.fastq.gz" ],
"atac.fastqs_rep2_R2" : [ "rep2_R2_L1.fastq.gz", "rep2_R2_L2.fastq.gz" ],
"atac.title" : "Your atac-seq title",
"atac.description" : "Your atac-seq description",

To save you the trouble, if you are mapping to hg38, just use this link in the above template for atac.genome_tsv.
The blacklist file can be downloaded from ENCODE (hg38). [link]
As mentioned above, all paths should be absolute.

The ~/.caper/default.conf file is the config file for Caper. Please see the detailed documentation [link] for other cluster systems; here I will show you how to edit it for Slurm.
My logic is: I submit one Slurm job per sample. Once a compute node is assigned, that node becomes my new ‘local’, and all input/tmp/output files and the actual computation live in the assigned $TMPDIR folder. Below is exactly what the file looks like; just change the home directory paths.

backend=local

# DO NOT use /tmp here
# Caper stores all important temp files and cached big data files here
# If not defined, Caper will make .caper_tmp/ on your local output directory
# which is defined by out-dir, --out-dir or $CWD
# Use a local absolute path here
tmp-dir=    # <-- left blank intentionally

cromwell=/your/home/directory/.caper/cromwell_jar/cromwell-47.jar
womtool=/your/home/directory/.caper/womtool_jar/womtool-47.jar

The files below are all in the ATAC-seq pipeline GitHub repository, so you can just download the whole thing. But in case you installed the pipeline using Conda and can’t find them, I added the individual links as well.
The backends folder can be downloaded at this link.
The atac.wdl file is the whole pipeline definition and can be downloaded at this link.
The atac.croo.v4.json file is a JSON file that defines how the outputs are organized; it can be downloaded here.

Sometimes errors occur because no backends folder is specified or Cromwell is not installed properly.

Running the pipeline and example slurm submission file

Finally, after everything is in place, here is an example Slurm script for running one sample; you can refer to this blog for how to loop through multiple samples.

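Below is a minimal sketch of what such a submission file (atacSeqPipeline.sh) might look like. The Conda environment name, the resource requests, and the per-sample JSON name Z.json are assumptions, so adjust them for your cluster and sample.

#!/bin/bash
#SBATCH --job-name=atac_Z           # Z = your sample name
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=48:00:00

source ~/.bashrc
conda activate encode-atac-seq-pipeline   # environment name assumed from install_conda_env.sh

# Work entirely on the compute node's local scratch space
cd $TMPDIR
rsync -av $SLURM_SUBMIT_DIR/atac.wdl $SLURM_SUBMIT_DIR/atac.croo.v4.json \
          $SLURM_SUBMIT_DIR/backends $SLURM_SUBMIT_DIR/Z.json $TMPDIR/
rsync -av /your/path/to/fastqs/Z*.fastq.gz $TMPDIR/
# (make sure the fastq paths inside Z.json point to where the files now sit)

# Treat the compute node as the new 'local' (see ~/.caper/default.conf above)
caper init local
caper run atac.wdl -i Z.json --out-dir $TMPDIR/atacOutput/Z

# Organize the final outputs into the submit directory
METADATA=$(find $TMPDIR/atacOutput/Z -name metadata.json | head -n 1)
croo $METADATA --out-def-json atac.croo.v4.json --out-dir $SLURM_SUBMIT_DIR/pipelineOutput/Z
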
Here are some notes for the Slurm script:
– caper init local is a must, as explained previously. Many of the errors people encounter when using Slurm come from skipping this.
– Always rsync all the files into $TMPDIR. This makes your job run much faster and avoids constant I/O between the submit node and the compute node, which can crash the whole system.
– Carefully specify the intermediate and final paths in both the caper and croo commands. Everything stays in sub-folders of $TMPDIR except for the final output defined by --out-dir $SLURM_SUBMIT_DIR/pipelineOutput/Z, where Z stands for your sample name; change it as needed.

And the last step is submitting your Slurm command file.

sbatch atacSeqPipeline.sh

Source:
https://www.encodeproject.org/atac-seq/#overview
https://github.com/ENCODE-DCC/atac-seq-pipeline
https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/install_conda.md
https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md
https://encode-dcc.github.io/wdl-pipelines/install.html


Unix tar: compress and decompress tar.gz files or folders

Compress multiple files:

tar -czvf name-of-archive.tar.gz file1 file2 folder1 folder2 

Compress folder:

tar -czvf name-of-archive.tar.gz /path/to/directory-or-file

Compress a folder while excluding some files:

tar -czvf archive.tar.gz --exclude='*.mp4' /home/ubuntu

Decompress tar.gz file:

tar -xzvf archive.tar.gz
tar -xzvf archive.tar.gz -C targetDir
tar -xzvf archive.tar.gz --directory targetDir

Source:
https://www.howtogeek.com/248780/how-to-compress-and-extract-files-using-the-tar-command-on-linux/

Funseq2 hands-on tutorial

Only the most essential steps are kept, mostly explanations of installation errors and how to fix them. For detailed instructions, please refer to the Funseq2 Protocol.
It’s recommended to install everything into a designated Conda environment; for some tools, I will only list the Conda installation command.
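
For reference, here is a minimal sketch of creating and activating such an environment; the name yourEnvr is just a placeholder matching the Conda commands below.

conda create -n yourEnvr
conda activate yourEnvr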

Dependencies to install first

– TFM-Pvalue [link]
– Perl package Parallel::ForkManager [link]
– bedtools/tabix/bigwigAverageOverBed
– VAT (snpMapper, indelMapper modules) [link]
– sed/awk/grep (these should be pre-installed already)

TFM-Pvalue
Download TFM-Pvalue.tar.gz from its website, decompress the package with tar -xf TFM-Pvalue.tar.gz, and cd into the folder. To compile the package, type the following.

tar -xf TFM-Pvalue.tar.gz
cd TFM-Pvalue
make

After compiling the tool successfully, either copy the compiled binaries into your /usr/bin folder, or add their path to your PATH environment variable as below.

export PATH=$PATH:/where/you/put/binary/files

Possible Error:
g++ -O3 -DJASPAR=1 -DPROGRAM=0 TFMpvalue.cpp Matrix.cpp ArgumentException.cpp FileException.cpp ParseException.cpp -o TFMpvalue-pv2sc
TFMpvalue.cpp: In function ‘void arguments(int, char* const*)’:
TFMpvalue.cpp:503:45: error: ‘getopt’ was not declared in this scope

Solution:
In the TFM-Pvalue folder, find the file TFMpvalue.cpp, uncomment line 16 in that file, and change GetOpt.h to getopt.h.
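
If you prefer to apply this edit from the command line, here is a rough sketch using sed; it assumes line 16 is a //-commented #include <GetOpt.h>, so check the file before running it.

# run inside the TFM-Pvalue folder
sed -i 's|^//[[:space:]]*#include <GetOpt.h>|#include <getopt.h>|' TFMpvalue.cpp
make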

Perl package Parallel::ForkManager
Here is the installation using Conda.

conda install -n yourEnvr -c bioconda perl-parallel-forkmanager

bedtools/tabix/bigwigAverageOverBed
Installation using Conda.

conda install -n yourEnvr -c bioconda bedtools
conda install -n yourEnvr -c bioconda tabix
conda install -n yourEnvr -c bioconda ucsc-bigwigaverageoverbed

VAT
The easiest way is to download the pre-built binaries, save them into your local /usr/bin/ or add their path to your PATH environment variable, and make them executable.

export PATH=$PATH:/where/you/put/binary/files
chmod +x snpMapper
chmod +x indelMapper # <- to make them executable

Download the Funseq2 and pre-processed data

Download the newest version of Funseq2 from:
https://github.com/khuranalab/FunSeq2_DC

Download the latest data needed for Funseq2, put everything into the data_context folder, and unzip the XXX.tar.gz archives within data_context:
https://khuranalab.med.cornell.edu/data_DC3.html

You can also download an older build of the data set from https://khuranalab.med.cornell.edu/data.html
The content can be downloaded in a compressed folder from that page.
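
Here is a minimal sketch of that unpacking step, assuming the downloaded XXX.tar.gz archives have already been placed in data_context:

cd /your/destination/of/data_context
for f in *.tar.gz; do tar -xzvf "$f"; done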

Prepare funseq2.sh and config.txt file to run Funseq2

In your FunSeq2_DC folder, you need to modify two files before starting: funseq2.sh and config.txt.

### In funseq2.sh ###

# keep "data_context/user_annotations" intact
user_anno=/your/destination/of/data_context/user_annotations

### In config.txt ###

# Change the file path first and then change the annotation files you want to use accordingly.
file_path=/your/destination/of/data_context

The destination user_anno can be specified in the running command by using the option -ua. But if you don’t want to type that every time, you can just change it in funseq2.sh.

If you are running on a cluster and using Conda, remember to add $TMPDIR to $PERL5LIB:

export PERL5LIB=$PERL5LIB:$TMPDIR

Running FunSeq2

After all the preparation, we are finally here. Running FunSeq2 is simple.

funseq2.sh -f file -maf MAF -m <1/2> -len length_cut -inf <bed/vcf> -outf <bed/vcf> -nc -o path -g file -exp file -cls file -exf <rpkm/raw> -p int -cancer cancer_type -s score -uw -ua user_annotations_directory -db
FunSeq2 options
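
For illustration, here is a hypothetical invocation with placeholder values (input.bed, the bed input/output formats, and the output directory are only examples); consult the options above for what fits your data.

./funseq2.sh -f input.bed -maf 0 -m 1 -inf bed -outf bed -o ./funseq2_out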

Source:
https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpbi.23
https://github.com/gersteinlab/FunSeq2/issues/4
http://info.gersteinlab.org/Funseq2#E._Input_files
https://anaconda.org/bioconda/perl-parallel-forkmanager
https://metacpan.org/pod/release/SZABGAB/Parallel-ForkManager-1.03/lib/Parallel/ForkManager.pm
https://anaconda.org/bioconda/bedtools
https://anaconda.org/bioconda/tabix
https://anaconda.org/bioconda/ucsc-bigwigaverageoverbed
https://github.com/khuranalab/FunSeq2_DC
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0480-5
https://science.sciencemag.org/content/342/6154/1235587.long

Extract/subset bigwig file for a given genomic region

This is a Python (3.0+) solution using a package called pyBigWig to extract a given genomic region from a whole-genome bigwig file.

Prepare your input bigwig file:

import pyBigWig

# First open bigwig file
bwIn = pyBigWig.open('input.bw')

# check bigwig file header
print(bwIn.header())

Prepare the output. Since your output file doesn’t have a header yet, you need to add one using the chosen chromosome size; here I’m using a region from chr18 as an example.

bwOutput = pyBigWig.open('output.bw','w')
bwOutput.addHeader([('chr18',80373285)]) # chromosome size

# intervals() yields (start, end, value) tuples for the region
for x in bwIn.intervals('chr18',62926563,63516911):
    bwOutput.addEntries(['chr18'],[x[0]],ends=[x[1]],values=[x[2]])

bwOutput.close()
bwIn.close()