04/16/2020 (updated 10/21/2021) by bioinfocore

Slurm system multiple job submission template

Blogs, Unix
Blogs, slurm, Unix

There are many ways to submit Slurm jobs in parallel; here I share the one I use the most. This template loops through a list of entries and submits them all at once. It is especially practical when you need to run hundreds of samples at the same time.

Pay attention to the job limits imposed by your cluster admin; on our cluster the limit is usually 500 jobs.
You can also loop within your Slurm submission script to request multiple sessions, or parallelize within your code. When dealing with a large number of samples, though, I like this approach better because it gives me finer control over individual jobs, and combining it with parallelization inside each job makes it even more powerful. If one node mysteriously fails (which can happen, especially when you run hundreds of samples), I can easily see which job it was and resubmit it. Feel free to choose whichever way works best for you.

You will need two files: the loop script and your Slurm template. Usage:

– Put your sample names in a one-column txt file, referred to in this template as sampleList.txt;
– Write your yourSlurmScript.sh and put “Z” wherever the sample name should go. (Any character that does not otherwise appear in yourSlurmScript.sh will do; I find that a capital “Z” never appears in my code, and “X” is another common choice.)
– Put the file name of yourSlurmScript.sh into the batchSubmit.sh script, and run it as below:

./batchSubmit.sh # you can change the name to whatever you want
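The substitute-and-submit idea described above can be sketched in a few lines. This is a minimal Python sketch, not the original batchSubmit.sh (which is a shell script in the gists): the function names render_job_script and submit_all, the jobs output folder, and the dry_run flag are all my own assumptions for illustration. It assumes the placeholder really does not occur anywhere else in the template.

```python
import subprocess
from pathlib import Path


def render_job_script(template_text, sample, placeholder="Z"):
    """Return the template with every placeholder replaced by the sample name."""
    return template_text.replace(placeholder, sample)


def read_samples(sample_list_path):
    """Read one sample name per line, skipping blank lines."""
    lines = Path(sample_list_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def submit_all(sample_list_path, template_path, out_dir="jobs", dry_run=True):
    """Write one job script per sample and (optionally) submit it with sbatch.

    out_dir and dry_run are hypothetical conveniences, not part of the
    original template. With dry_run=True nothing is submitted.
    """
    template_text = Path(template_path).read_text()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for sample in read_samples(sample_list_path):
        job_path = out / (sample + ".sh")
        job_path.write_text(render_job_script(template_text, sample))
        if not dry_run:
            # One sbatch call per sample, so each sample is its own Slurm job
            subprocess.run(["sbatch", str(job_path)], check=True)
        written.append(job_path)
    return written
```

Calling submit_all("sampleList.txt", "yourSlurmScript.sh") first with dry_run=True lets you inspect the generated scripts before anything reaches the queue; switch to dry_run=False once they look right.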

1. Loop function, batchSubmit.sh:

2. Prepare yourSlurmScript.sh

One very important thing: ALWAYS rsync your files into the tmp folder assigned to your node and run your job there; don’t use cp, especially when your jobs are “heavy”. Otherwise, I promise your server admin will ask you in for a serious talk…

The two scripts above can be downloaded here:
https://gist.github.com/b533a6151d8fb607a51b397ad0eb2b2c.git
https://gist.github.com/f51685c1c6277de8785374b09cffb5b5.git

04/16/2020 (updated 10/21/2021) by bioinfocore

Pickle UnicodeDecodeError incompatible between python2 and python3

When you load a pickle file that was saved with python2 into python3, you might get an error like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position X: ordinal not in range(128)

This is caused by an incompatibility between python2 and python3: python2 pickles store str objects as raw bytes, and python3 tries to decode them as ASCII by default. The easiest way to fix it is to add encoding='latin1':

pickle.load(file, encoding='latin1')

There are definitely other ways, but this is the simplest one.
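To see the error and the fix side by side without needing a python2 install, here is a small sketch. The PY2_PICKLE bytes are hand-written for this example (a protocol-2 pickle of a 4-byte python2 str starting with 0x90), and the helper names are hypothetical.

```python
import io
import pickle

# Hand-written protocol-2 pickle of the kind python2 writes for a str:
# PROTO 2, SHORT_BINSTRING of length 4 ('\x90abc'), BINPUT 0, STOP.
PY2_PICKLE = b"\x80\x02U\x04\x90abcq\x00."


def default_load_fails(data):
    """Return True if loading with python3's default (ASCII) decoding fails."""
    try:
        pickle.loads(data)
    except UnicodeDecodeError:
        return True
    return False


def load_py2_pickle(stream, encoding="latin1"):
    """Load a python2 pickle from a binary stream with an explicit encoding."""
    return pickle.load(stream, encoding=encoding)


# latin1 maps every byte to a character, so decoding can never fail
as_text = load_py2_pickle(io.BytesIO(PY2_PICKLE))
# encoding='bytes' keeps python2 str objects as raw bytes instead
as_bytes = load_py2_pickle(io.BytesIO(PY2_PICKLE), encoding="bytes")
```

With a real file, open it in binary mode ("rb") and pass the file object to load_py2_pickle in the same way.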

Source:
https://stackoverflow.com/questions/11305790/pickle-incompatibility-of-numpy-arrays-between-python-2-and-3

04/15/2020 (updated 10/21/2021) by bioinfocore

Pandas filter a dataframe by the sum of rows or columns

Python
Pandas, Python

Filter a dataframe by the sum of rows (the two lines are equivalent):

df = df[df.sum(axis=1) > 0]
df = df.loc[df.sum(axis=1) > 0,:]

Filter by the sum of columns (again, the two lines are equivalent):

df = df.loc[:,df.sum() > 0]
df = df.loc[:,df.sum(axis=0) > 0]
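Here is a tiny worked example of both filters. The count table and its gene/sample labels are made up for illustration.

```python
import pandas as pd

# Toy count table: rows are genes, columns are samples
counts = pd.DataFrame(
    {"sampleA": [0, 3, 0], "sampleB": [0, 1, 0], "sampleC": [0, 0, 0]},
    index=["gene1", "gene2", "gene3"],
)

# Keep rows whose sum across columns is positive (only gene2 has counts)
rows_kept = counts.loc[counts.sum(axis=1) > 0, :]

# Keep columns whose sum down the rows is positive (sampleC is all zeros)
cols_kept = counts.loc[:, counts.sum(axis=0) > 0]
```

Both filters build a boolean Series from the sums and use it as a mask, so any other condition on the sums (for example >= 10) works the same way.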

Source:
https://stackoverflow.com/questions/40425484/filter-dataframe-in-pandas-on-sum-of-rows

04/14/2020 (updated 10/21/2021) by bioinfocore

Pandas remove rows or columns with null/nan/missing values

Python
Pandas, Python

Remove rows with nan/null/missing values:

df = df.dropna(axis=0, how='any') # Remove if any value is na
df = df.dropna(axis=0, how='all') # Remove if all values are na

Remove columns with nan/null/missing values:

df = df.dropna(axis=1, how='any') # Remove if any value is na
df = df.dropna(axis=1, how='all') # Remove if all values are na

By default, dropna returns a new dataframe (inplace=False); if you want to modify the dataframe in place, add inplace=True.
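A small example makes the difference between how='any' and how='all' easier to see. The frame below is made up: r0 is complete, r1 has one missing value, and r2 is entirely missing. The transposed copy is used to show the column case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, np.nan, np.nan], "b": [4.0, 5.0, np.nan]},
    index=["r0", "r1", "r2"],
)

rows_any = df.dropna(axis=0, how="any")  # drops r1 and r2
rows_all = df.dropna(axis=0, how="all")  # drops only r2

wide = df.T  # now r0, r1, r2 are columns
cols_any = wide.dropna(axis=1, how="any")  # drops columns r1 and r2
cols_all = wide.dropna(axis=1, how="all")  # drops only column r2

# With inplace=True the frame itself is modified and None is returned
in_place = df.copy()
returned = in_place.dropna(axis=0, how="any", inplace=True)
```

Because inplace=True returns None, writing df = df.dropna(..., inplace=True) would overwrite df with None; use one style or the other, not both.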

Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

04/13/2020 (updated 10/21/2021) by bioinfocore

Change Unix character encoding

Unix
html, Unix

This usually happens when you transfer files between systems, for example when you “scp” or “rsync” a file from your local machine to a Linux server. The difference shows up when your file contains special characters (e.g. ø, ó, ä …); in an HTML file in particular, those characters are displayed as garbled text.

iconv -f iso-8859-1 -t utf-8 input.html > fixed_input.html
mv fixed_input.html input.html

Sometimes, even if you have made sure that your file was encoded in utf-8 locally, it will still be treated as iso-8859-1 after the transfer. This happens.
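If you would rather do the conversion from Python, the same re-encoding can be sketched as below. The helper names are hypothetical, and the is_valid_utf8 check is my own addition: converting a file that is already valid utf-8 a second time would garble it further, so it is worth checking first.

```python
def is_valid_utf8(raw):
    """Return True if the bytes already decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(raw):
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path, dst_path):
    """Write a UTF-8 copy of src_path; copy unchanged if it is already UTF-8."""
    with open(src_path, "rb") as src:
        raw = src.read()
    converted = raw if is_valid_utf8(raw) else latin1_to_utf8(raw)
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

For example, convert_file("input.html", "fixed_input.html") mirrors the iconv command above, except that it leaves a file that is already utf-8 untouched.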

04/09/2020 (updated 10/21/2021) by bioinfocore

Sort chromosome

Unix
bedfile, Unix

There are two orders you might want to sort your chromosomes by:
– Case-sensitive lexicographical order (e.g. chr1, chr10, chr11 …), usually required by bedtools, UCSC utilities (e.g. bedGraphToBigWig) and so on;
– Numeric order, for other purposes (e.g. chr1, chr2, chr3 …).

Sort in lexicographical order:

sort -k1,1 -k2,2n input.bed > sorted.bed
sortBed input.bed > sorted.bed # bedtools function
LC_COLLATE=C sort -k1,1 -k2,2n input.bedgraph > sorted.bedgraph

Sort in numeric order:

./sort_chr.sh input.bed

Download the code from here:
https://github.com/bioinfocore/bashCore/blob/master/sort_chr.sh
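For comparison, numeric chromosome order can also be sketched in Python with a custom sort key. This is not the logic of sort_chr.sh, just one way to get the same kind of ordering; the function names are hypothetical, and it assumes tab-separated BED lines with no header lines and chromosome names of the form chr1, chrX and so on.

```python
def chrom_key(name):
    """Sort key: numbered chromosomes first (numerically), then the rest A-Z."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


def sort_bed_lines(lines):
    """Sort BED lines by chromosome (numeric order), then by start position."""
    def line_key(line):
        fields = line.rstrip("\n").split("\t")
        return (chrom_key(fields[0]), int(fields[1]))

    return sorted(lines, key=line_key)
```

With this key, chr2 sorts before chr10, and names without a number (chrM, chrX, chrY) come after all numbered chromosomes in alphabetical order.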

Source:
https://www.biostars.org/p/150036/
http://seqanswers.com/forums/showthread.php?t=63932

04/08/2020 (updated 10/21/2021) by bioinfocore

Unix wait or delay before next command

Unix

To avoid a memory surge in Unix loops, you can wait/delay/sleep for a certain amount of time:

sleep 2s # seconds (default)
sleep 2m # minutes
sleep 2h # hours
sleep 2d # days

To embed in loops:

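for sample in "${sample_list[@]}"
do
(
    # the two scripts run one after the other inside a background subshell
    yourScript_1.sh "${sample}"
    yourScript_2.sh "${sample}"
)&
    sleep 10m # stagger the start of the next sample
done
wait # block until every background subshell has finished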

Source: https://www.cyberciti.biz/faq/linux-unix-sleep-bash-scripting/

04/06/2020 (updated 10/21/2021) by bioinfocore

Unix count how many columns in a csv or txt table

Unix

head -n1 input.csv | awk -F',' '{print NF}' # comma-separated; NF is the number of fields
head -n1 input.txt | awk '{print NF}' # whitespace-separated (spaces or tabs)
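Counting separators breaks down when a quoted field contains the delimiter itself (e.g. "last, first"). If that can happen in your table, a small Python sketch with the standard csv module handles the quoting for you; count_columns is a hypothetical helper name.

```python
import csv
import io


def count_columns(handle, delimiter=","):
    """Count the fields in the first row of an open text file or stream."""
    reader = csv.reader(handle, delimiter=delimiter)
    first_row = next(reader, [])  # empty input gives zero columns
    return len(first_row)


# A quoted field containing a comma still counts as one column
example = io.StringIO('id,"last, first",age\n1,"Doe, J",30\n')
n_columns = count_columns(example)
```

For a real file, open it with open("input.csv", newline="") and pass the handle to count_columns; use delimiter="\t" for tab-separated tables.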

Source: https://stackoverflow.com/questions/5761212/count-number-of-columns-in-bash

04/02/2020 (updated 10/21/2021) by bioinfocore

Unix read command output into a variable

Unix

Wrap your command in $(), as in the example below:

var=$(head -n 1 inputFile)

04/01/2020 (updated 10/21/2021) by bioinfocore

Pandas if any NA in the DataFrame

Python
Pandas, Python

df.isnull().any().any()
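The first any() reduces each column to a single True/False, and the second any() reduces those to one value for the whole dataframe. A short example with two made-up frames:

```python
import numpy as np
import pandas as pd

complete = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
with_gap = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# One boolean for the whole frame
has_na_complete = bool(complete.isnull().any().any())
has_na_gap = bool(with_gap.isnull().any().any())

# To see where the missing values are, count them per column instead
na_per_column = with_gap.isnull().sum()
```

Here has_na_complete is False and has_na_gap is True; na_per_column shows that the only missing value is in column a.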

bioinfo core

Index & solution of bioInfo utilities