Python matplotlib: all about fonts

To have fonts exported into the PDF as editable text rather than outlines, so they can be recognized in Illustrator:

import matplotlib
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['ps.fonttype'] = 42

Change the font for everything by changing the default:

import matplotlib as mpl
mpl.rcParams['font.family'] = 'Arial'

Change the font for labels, tick marks, and titles (per plot, without changing the default):

plt.xlabel('XXXX', fontsize=12, fontname='Arial')
plt.ylabel('XXXXX', fontsize=12, fontname='Arial')

plt.xticks(fontname='Arial')
plt.yticks(fontname='Arial')

plt.title('XXXXX', fontsize=14, fontname='Arial')

Change the legend font (per plot, without changing the default):

from matplotlib.font_manager import FontProperties
plt.legend(prop=FontProperties(family='Arial', size=12, weight='bold', style='normal'))

Source:
https://jonathansoma.com/lede/data-studio/matplotlib/exporting-from-matplotlib-to-open-in-adobe-illustrator/
https://stackoverflow.com/questions/20753782/default-fonts-in-seaborn-statistical-data-visualization-in-ipython
https://stackoverflow.com/questions/47112522/matplotlib-how-to-set-legends-font-type
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html

Liftover bam files

The most straightforward way is to use CrossMap.

Taking hg19 to hg38 as an example:

pip install CrossMap

CrossMap.py bam -a hg19ToHg38.over.chain input.bam output
#.bam extension will be added automatically

Genome liftover chain files can be downloaded here: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/ (choose the chain file that matches your genome builds)

The CrossMap documentation suggests always using the '-a' option.

Source:
http://crossmap.sourceforge.net/#convert-bam-cram-sam-format-files

ENCODE ATAC-seq analyzing pipeline hands-on tutorial

This pipeline is very practical. Although installation and setup are a hassle, once you have installed it and prepared all the files you can just sit and wait for the results. It takes ATAC-seq FASTQ files as input and outputs mapped BAM files, peak calls, etc. Details about the pipeline can be found here.

Installation

It’s recommended to install the pipeline using Conda, which I think is the easiest way. Detailed installation steps are listed here; please follow them step by step. I copy/paste the important steps below and point out possible errors. (The whole instruction is for the Python 3 version.)

The croo package that comes with the Conda environment may be outdated at the time you install it; if so, please update it within your environment:

conda uninstall croo
pip install croo==0.3.4

IMPORTANT: DO NOT USE A GLOBALLY INSTALLED CONDA. INSTALL YOUR OWN MINICONDA3 UNDER YOUR HOME DIRECTORY BY FOLLOWING THE INSTRUCTION HERE.
Since FASTQ files are large and the pipeline is computationally heavy, the installation instructions below are for Linux clusters.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Use the defaults except for the following two questions:

Do you accept the license terms? [yes|no]
[no] >>> yes

Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes

IMPORTANT: Close your session and re-login. If you skip this step, the pipeline's Conda environment will get mixed up with the base Conda environment.

# Disable auto-activation of base Conda environment.
conda config --set auto_activate_base false

IMPORTANT: Close your session and re-login.

# Install the environment
bash scripts/uninstall_conda_env.sh  # uninstall it for clean-install
bash scripts/install_conda_env.sh

After everything finishes, the message below will show up.

=== All done successfully ===

Prepare files before running the pipeline (slurm)

– JSON file
– ~/.caper/default.conf
– backends folder
– atac.wdl
– atac.croo.v4.json

The JSON file is a config file where you specify all genomic data files, parameters, and metadata for running the pipeline. Please see this link for detailed documentation; here are the short version and full version of the template.

IMPORTANT: ALWAYS USE ABSOLUTE PATHS.

### What the JSON file looks like (not complete) ###
# Here are the lines you need to modify in the JSON file.

"atac.genome_tsv" : "https://storage.googleapis.com/encode-pipeline-genome-data/hg38_caper.tsv",
"atac.blacklist": "/your/path/ENCFF419RSJ_blackList.bed.gz",
"atac.paired_end" : true,
"atac.fastqs_rep1_R1" : [ "rep1_R1_L1.fastq.gz", "rep1_R1_L2.fastq.gz", "rep1_R1_L3.fastq.gz" ],
"atac.fastqs_rep1_R2" : [ "rep1_R2_L1.fastq.gz", "rep1_R2_L2.fastq.gz", "rep1_R2_L3.fastq.gz" ],
"atac.fastqs_rep2_R1" : [ "rep2_R1_L1.fastq.gz", "rep2_R1_L2.fastq.gz" ],
"atac.fastqs_rep2_R2" : [ "rep2_R2_L1.fastq.gz", "rep2_R2_L2.fastq.gz" ],
"atac.title" : "Your atac-seq title",
"atac.description" : "Your atac-seq description",

To save you some trouble: if you are mapping to hg38, just use this link in the above template for atac.genome_tsv.
The blacklist file can be downloaded from ENCODE (hg38). [link]
As mentioned above, all paths should use the absolute path.

The ~/.caper/default.conf file is the config file for Caper; please see the detailed documentation [link] for other cluster systems. Here I will show you how to edit it for Slurm.
My logic is: I submit a Slurm job for each of my samples, and once a computing node is assigned, that node becomes my new 'local'; all input/tmp/output files and the actual computation live in the assigned $TMPDIR folder. Below is exactly what the file looks like; just change the home directory path.

backend=local

# DO NOT use /tmp here
# Caper stores all important temp files and cached big data files here
# If not defined, Caper will make .caper_tmp/ on your local output directory
# which is defined by out-dir, --out-dir or $CWD
# Use a local absolute path here
tmp-dir=    # <-- left blank intentionally

cromwell=/your/home/directory/.caper/cromwell_jar/cromwell-47.jar
womtool=/your/home/directory/.caper/womtool_jar/womtool-47.jar

The files below are all in the ATAC-seq pipeline GitHub repository, so you can just download the whole thing. But in case you installed the pipeline using Conda and can't find them, I added the individual links as well.
The backends folder can be downloaded at this link.
The atac.wdl file is the whole pipeline, which can be downloaded at this link.
The atac.croo.v4.json file is a JSON file that defines how the results are organized; it can be downloaded here.
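
If you would rather grab everything at once, cloning the pipeline repository should give you atac.wdl, the backends folder, and atac.croo.v4.json in one step. A quick sketch (exact paths inside the repo may vary by release):

git clone https://github.com/ENCODE-DCC/atac-seq-pipeline.git
cd atac-seq-pipeline
# atac.wdl sits at the top level; look for the backends folder
# and atac.croo.v4.json in the repo as well
ls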

Sometimes errors occur because no backends folder was specified or Cromwell was not installed properly.

Running the pipeline and example slurm submission file

Finally, after everything is in place, here is an example Slurm setup for running one sample (a sketch follows the notes below); you can refer to this blog for how to loop through multiple samples.

Here are some notes for the Slurm script:
– caper init local is a must, as I explained previously. Many of the errors people encounter when using Slurm are because of this.
– Always rsync all the files into $TMPDIR. This makes your job run much faster and avoids constant writing between the login node and the computing node, which can crash the whole system.
– Please carefully specify the intermediate and final paths in both the caper and croo commands. Everything lives in sub-folders of $TMPDIR except for the final output defined by --out-dir $SLURM_SUBMIT_DIR/pipelineOutput/Z (Z stands for your sample name; change as needed).
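
Since the exact script depends on your cluster, here is only a minimal sketch of what atacSeqPipeline.sh could look like for one sample Z. The resource requests, environment name, staged paths, and the metadata.json location are assumptions (and the caper/croo flag spellings should be double-checked against your installed versions); adapt them to your setup.

#!/bin/bash
#SBATCH --job-name=atac_Z
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# Activate the pipeline's Conda environment
# (name assumed from install_conda_env.sh; check with `conda env list`)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate encode-atac-seq-pipeline

# Work on the node-local scratch space; rsync inputs there (never cp)
cd $TMPDIR
mkdir -p fastq
rsync -av $SLURM_SUBMIT_DIR/Z.json $SLURM_SUBMIT_DIR/atac.wdl $SLURM_SUBMIT_DIR/atac.croo.v4.json .
rsync -av $SLURM_SUBMIT_DIR/fastq/Z* ./fastq/
# (make sure the absolute paths inside Z.json point at these staged copies)

# The computing node is the new 'local'
caper init local

# Run the pipeline; intermediate files stay under $TMPDIR
caper run atac.wdl -i Z.json --out-dir $TMPDIR/atac

# Organize the final outputs into the submit directory
# (the metadata.json location below is a guess -- check where Caper wrote it)
croo $TMPDIR/atac/atac/*/metadata.json --out-def-json atac.croo.v4.json --out-dir $SLURM_SUBMIT_DIR/pipelineOutput/Z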

And the last step is submitting your slurm command file.

sbatch atacSeqPipeline.sh

Source:
https://www.encodeproject.org/atac-seq/#overview
https://github.com/ENCODE-DCC/atac-seq-pipeline
https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/install_conda.md
https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md
https://encode-dcc.github.io/wdl-pipelines/install.html


Funseq2 hands-on tutorial

Only the most essential parts are kept, mostly explanations of installation errors and how to fix them. For detailed instructions, please refer to the Funseq2 Protocol.
It's recommended to install things into a dedicated Conda environment; for some tools, I will only list the Conda installation command.

Dependencies to install first

– TFM-Pvalue [link]
– Perl package Parallel::ForkManager [link]
– bedtools/tabix/bigwigAverageOverBed
– VAT (snpMapper, indelMapper modules) [link]
– sed/awk/grep (these should be pre-installed already)

TFM-Pvalue
Download TFM-Pvalue.tar.gz from its website, decompress the package with tar -xf TFM-Pvalue.tar.gz, and cd into the folder. To compile the package, type the following.

tar -xf TFM-Pvalue.tar.gz
cd TFM-Pvalue
make

After compiling the tool successfully, either copy the compiled binaries directly into your /usr/bin folder, or add their location to your PATH environment variable as below.

export PATH=$PATH:/where/you/put/binary/files

Possible Error:
g++ -O3 -DJASPAR=1 -DPROGRAM=0 TFMpvalue.cpp Matrix.cpp ArgumentException.cpp FileException.cpp ParseException.cpp -o TFMpvalue-pv2sc
TFMpvalue.cpp: In function ‘void arguments(int, char* const*)’:
TFMpvalue.cpp:503:45: error: ‘getopt’ was not declared in this scope

Solution:
In the TFM-Pvalue folder, open the file TFMpvalue.cpp, uncomment line 16, and change GetOpt.h to getopt.h.
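
If you prefer to patch it from the command line, a one-liner along these lines should work (assuming the include really sits on line 16 and is commented out with //, as in the copy I downloaded):

# uncomment line 16 and switch GetOpt.h to getopt.h, then rebuild
sed -i '16s|//||; 16s|GetOpt\.h|getopt.h|' TFMpvalue.cpp
make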

Perl package Parallel::ForkManager
Here is the installation using Conda.

conda install -n yourEnvr -c bioconda perl-parallel-forkmanager

bedtools/tabix/bigwigAverageOverBed
Installation using Conda.

conda install -n yourEnvr -c bioconda bedtools
conda install -n yourEnvr -c bioconda tabix
conda install -n yourEnvr -c bioconda ucsc-bigwigaverageoverbed

VAT
The easiest way is to download the pre-built binaries, save them into your local /usr/bin/ (or add their path to your PATH environment variable), and make them executable.

export PATH=$PATH:/where/you/put/binary/files
chmod +x snpMapper
chmod +x indelMapper # <- to make them executable

Download the Funseq2 and pre-processed data

Download the latest version of FunSeq2:
https://github.com/khuranalab/FunSeq2_DC

Download the latest data needed for FunSeq2, put everything into the data_context folder, and decompress the XXX.tar.gz archives within data_context:
https://khuranalab.med.cornell.edu/data_DC3.html

You can also download an older build of the data set from https://khuranalab.med.cornell.edu/data.html; the content is available there as a single compressed folder.
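
Roughly, the download and unpacking steps look like this; the archive file names are not shown here, so download each archive listed on the data_DC3 page yourself before running the tar loop.

git clone https://github.com/khuranalab/FunSeq2_DC.git
mkdir -p FunSeq2_DC/data_context

# after saving the archives from the data_DC3 page into data_context:
cd FunSeq2_DC/data_context
for f in *.tar.gz; do tar -xzf "$f"; done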

Prepare funseq2.sh and config.txt file to run Funseq2

In your FunSeq2_DC folder, you need to modify two files before starting: funseq2.sh and config.txt.

### In funseq2.sh ###

# keep "data_context/user_annotations" intact
user_anno=/your/destination/of/data_context/user_annotations

### In config.txt ###

# Change the file path first and then change the annotation files you want to use accordingly.
file_path=/your/destination/of/data_context

The user_anno destination can also be specified on the command line with the -ua option, but if you don't want to type that every time, just change it in funseq2.sh.

If running on a cluster and using Conda, remember to add $TMPDIR to $PERL5LIB:

export PERL5LIB=$PERL5LIB:$TMPDIR

Running FunSeq2

After all the preparation, we are finally here. Running FunSeq2 is simple.

funseq2.sh -f file -maf MAF -m <1/2> -len length_cut -inf <bed/vcf> -outf <bed/vcf> -nc -o path -g file -exp file -cls file -exf <rpkm/raw> -p int -cancer cancer_type -s score -uw -ua user_annotations_directory -db
The FunSeq2 options are described in detail in the protocol linked in the sources below.
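
As a concrete illustration, a minimal run on a BED file of somatic variants might look like the sketch below; the input name, output directory, and option values are illustrative only, not recommended settings.

# illustrative values -- consult the protocol for the options your data need
./funseq2.sh -f variants.bed -inf bed -outf bed -m 1 -maf 0.01 \
             -o ./funseq2_out -nc \
             -ua /your/destination/of/data_context/user_annotations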

Source:
https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpbi.23
https://github.com/gersteinlab/FunSeq2/issues/4
http://info.gersteinlab.org/Funseq2#E._Input_files
https://anaconda.org/bioconda/perl-parallel-forkmanager
https://metacpan.org/pod/release/SZABGAB/Parallel-ForkManager-1.03/lib/Parallel/ForkManager.pm
https://anaconda.org/bioconda/bedtools
https://anaconda.org/bioconda/tabix
https://anaconda.org/bioconda/ucsc-bigwigaverageoverbed
https://github.com/khuranalab/FunSeq2_DC
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0480-5
https://science.sciencemag.org/content/342/6154/1235587.long

Extract/subset bigwig file for a given genomic region

This is a Python (3.0+) solution using a package called pyBigWig to extract a given genomic region from a whole-genome bigwig file.

Prepare your input bigwig file:

import pyBigWig

# First open bigwig file
bwIn = pyBigWig.open('input.bw')

# check bigwig file header
print(bwIn.header())

Prepare the output. Since the output file doesn't have a header yet, you need to add one using the chosen chromosome's size; here I'm using a region from chr18 as an example.

bwOutput = pyBigWig.open('output.bw','w')
bwOutput.addHeader([('chr18',80373285)]) # chromosome size

# copy the intervals in the region of interest into the new file
for x in bwIn.intervals('chr18', 62926563, 63516911):
    bwOutput.addEntries(['chr18'], [x[0]], ends=[x[1]], values=[x[2]])

bwOutput.close()
bwIn.close()

Liftover bigwig files

There are two ways to lift over bigwig files from one genome build to another: use CrossMap, or go step by step as below. (CrossMap does essentially the same thing as the broken-down steps.)

CrossMap method (taking hg19 to hg38 as an example):

pip install CrossMap

CrossMap.py bigwig hg19ToHg38.over.chain input.bw output.bw

Genome liftover chain files can be downloaded here: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/ (choose the chain file that matches your genome builds)

Step-by-step method (bw -> bedGraph -> liftOver -> bw):

bigWigToBedGraph input.bw input.bedGraph

liftOver input.bedGraph hg19ToHg38.over.chain input_hg38.bedgraph unMapped

fetchChromSizes hg38 > hg38.chrom.sizes

LC_COLLATE=C sort -k1,1 -k2,2n input_hg38.bedgraph > input_hg38.sorted.bedgraph

bedGraphToBigWig input_hg38.sorted.bedgraph hg38.chrom.sizes output.bw

bigWigToBedGraph, liftOver, fetchChromSizes, and bedGraphToBigWig are all UCSC utilities, which can be downloaded here: http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads

Source:
http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads
http://crossmap.sourceforge.net/#

Slurm system multiple job submission template

There are many ways to submit Slurm jobs in parallel; here I will share the one I use the most. This template loops through a list of entries and submits them all at once. It is especially practical when you need to run hundreds of samples at the same time.

Pay attention to the limits imposed by your cluster admins; 500 jobs is usually the limit we are given.

You can loop within your Slurm submission script to request multiple sessions, or parallelize within your code, but when dealing with a large number of samples I like my way better: I have finer control over individual jobs, and combining it with parallelization inside each job makes it even more powerful. If one node mysteriously fails (which can happen, especially when you run hundreds of samples), I can easily see which one and resubmit it. Please feel free to choose whatever you like; whichever way works for you is the best way.

You will need two files: one is the loop script, the other is your Slurm template. Here is the usage:

– Have your sample list as a text file with one column containing your sample names; in this template it is called sampleList.txt.
– Have your yourSlurmScript.sh written, with "Z" in every place where the sample name should go. (You can use any character that does not otherwise appear in yourSlurmScript.sh; I find that a capital "Z" never appears in my code, and "X" is another common choice.)
– Put your yourSlurmScript.sh file name into the batchSubmit.sh script, and run it as below:

./batchSubmit.sh  # you can rename the script to whatever you want

1. The loop script, batchSubmit.sh:
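
The gist linked at the end has my full version; the core of it is just a loop like the sketch below, assuming one sample name per line in sampleList.txt and "Z" as the placeholder in yourSlurmScript.sh.

#!/bin/bash
# batchSubmit.sh -- minimal sketch of the loop, not the exact gist contents

while read -r sample; do
    # replace every "Z" placeholder with the sample name, then submit
    sed "s/Z/${sample}/g" yourSlurmScript.sh > "${sample}_slurm.sh"
    sbatch "${sample}_slurm.sh"
done < sampleList.txt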

2. Prepare yourSlurmScript.sh

Something very important here: ALWAYS rsync your files into the node's assigned tmp folder and run your job there; don't use cp, especially when your jobs are "heavy". Otherwise, I promise your server admin will ask you in for a serious talk…
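
A skeleton of what yourSlurmScript.sh might look like (the resource requests, staged paths, and analysis command are placeholders; "Z" is the token that batchSubmit.sh substitutes with each sample name):

#!/bin/bash
#SBATCH --job-name=Z
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G

# work on the node-local scratch space
cd $TMPDIR

# stage inputs with rsync, not cp (input location is a placeholder)
rsync -av $SLURM_SUBMIT_DIR/data/Z*.fastq.gz .

# ... run the actual analysis for sample Z here ...

# copy the results back to the submission directory
# (assuming your analysis wrote into ./output)
mkdir -p $SLURM_SUBMIT_DIR/results/Z
rsync -av ./output/ $SLURM_SUBMIT_DIR/results/Z/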

The two scripts above can be downloaded here:
https://gist.github.com/b533a6151d8fb607a51b397ad0eb2b2c.git
https://gist.github.com/f51685c1c6277de8785374b09cffb5b5.git