05/15/2020, 10/21/2021 by bioinfocore

Unix skip/remove the first or last line

Unix
Unix

Skip or remove the first line:

tail -n +2 file.txt

Skip the last line (a negative line count requires GNU head; sed '$d' file.txt is a portable alternative):

head -n -1 file.txt
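
When the same trimming is needed inside a Python script, it can be done lazily without loading the whole file. The following is a minimal sketch; the helper names `skip_first` and `skip_last` and the sample lines are made up for illustration.

```python
from itertools import islice

def skip_first(lines):
    """Return an iterator over every line except the first (like tail -n +2)."""
    return islice(lines, 1, None)

def skip_last(lines):
    """Yield every line except the last (like head -n -1)."""
    it = iter(lines)
    try:
        held = next(it)
    except StopIteration:
        return  # empty input: nothing to yield
    for line in it:
        yield held   # safe to emit: another line follows it
        held = line

lines = ["header", "row1", "row2", "footer"]  # made-up content
print(list(skip_first(lines)))             # ['row1', 'row2', 'footer']
print(list(skip_last(lines)))              # ['header', 'row1', 'row2']
print(list(skip_last(skip_first(lines))))  # ['row1', 'row2']
```

Both helpers accept any iterable of lines, including an open file object.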

Source:
https://unix.stackexchange.com/questions/55755/print-file-content-without-the-first-and-last-lines

05/05/2020, 10/21/2021 by bioinfocore

Python run loops in parallel

When the loop does not need to return anything:

from joblib import Parallel, delayed
import multiprocessing

# Number of cores available to use
num_cores = multiprocessing.cpu_count()

# If your function takes only one argument
# (x and inputList are used instead of input and list, which shadow Python built-ins)
def yourFunction(x):
    # anything in your loop
    return XXX

Parallel(n_jobs=num_cores)(delayed(yourFunction)(x) for x in inputList)


# If your function takes more than one argument
def yourFunction(input1, input2):
    # anything in your loop
    return XXX

# The nested for runs yourFunction on every combination of input1 and input2
Parallel(n_jobs=num_cores)(delayed(yourFunction)(input1, input2) for input1 in list1 for input2 in list2)
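
Note that a generator with two `for` clauses visits every combination of `list1` and `list2` (a Cartesian product), which is different from pairing the two lists element by element. The sketch below shows the difference with the standard library only; `list1`, `list2` and the body of `yourFunction` are made-up placeholders.

```python
from itertools import product

# Hypothetical inputs, for illustration only
list1 = ["chr1", "chr2"]
list2 = [10, 20, 30]

def yourFunction(input1, input2):
    # placeholder for the real loop body
    return f"{input1}:{input2}"

# Same iteration order as "for input1 in list1 for input2 in list2"
all_pairs = [yourFunction(a, b) for a, b in product(list1, list2)]

# Element-wise pairing stops at the shorter list
zipped_pairs = [yourFunction(a, b) for a, b in zip(list1, list2)]

print(len(all_pairs))   # 6
print(zipped_pairs)     # ['chr1:10', 'chr2:20']
```

If element-wise pairs are what you want, the same `zip(list1, list2)` expression can be used inside the `Parallel(...)` call in place of the nested `for`.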

When the function returns a value, assign the call to a variable; the results are collected in a list, in the same order as the inputs:

results = Parallel(n_jobs=num_cores)(delayed(yourFunction)(x) for x in inputList)

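When each call returns a DataFrame and the pieces are concatenated afterwards, use mp.Pool:

import multiprocessing as mp
import pandas as pd

num_cores = mp.cpu_count()
# Leave one core free, but always use at least one worker
with mp.Pool(processes=max(1, num_cores - 1)) as pool:
    resultList = pool.map(yourFunction, argvList)

results_df = pd.concat(resultList)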
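
One detail worth knowing about the final step: `pd.concat` keeps the original index of each piece, so a result list whose pieces all start at row 0 ends up with repeated index labels. The following is a minimal sketch with a made-up `resultList` standing in for what `pool.map` would return.

```python
import pandas as pd

# Stand-in for the list of DataFrames returned by pool.map (made-up data)
resultList = [
    pd.DataFrame({"sample": ["a", "b"], "count": [1, 2]}),
    pd.DataFrame({"sample": ["c"], "count": [3]}),
]

# Default: original row labels are kept, so 0 appears twice
stacked = pd.concat(resultList)
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True renumbers the rows 0..n-1
results_df = pd.concat(resultList, ignore_index=True)
print(results_df.index.tolist())   # [0, 1, 2]
```

Whether to pass `ignore_index=True` depends on whether the row labels carry meaning in your data.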

Source:
https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop
https://blog.dominodatalab.com/simple-parallelization/
https://stackoverflow.com/questions/36794433/python-using-multiprocessing-on-a-pandas-dataframe

05/05/2020, 10/21/2021 by bioinfocore

Unix remove repeated rows

Unix
Unix

One simple line to remove repeated rows from a text file, keeping the first occurrence of each row and the original order:

awk '!seen[$0]++' fileIn.txt > fileOut.txt
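
The awk idiom works because `seen[$0]++` returns the count from before the increment: 0 the first time a line appears, so `!` makes the condition true and the line is printed, and non-zero on every repeat. The same logic, written out as a small Python sketch (the function name `unique_lines` and the sample rows are made up for illustration):

```python
def unique_lines(lines):
    """Yield each line the first time it is seen, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:   # same test as awk's !seen[$0]++
            seen.add(line)
            yield line

rows = ["a\t1", "b\t2", "a\t1", "c\t3", "b\t2"]
print(list(unique_lines(rows)))  # ['a\t1', 'b\t2', 'c\t3']
```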

05/04/2020, 10/21/2021 by bioinfocore

Unix regex grep extract substring

Unix
Unix

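Using grep to extract substrings (the -P option requires GNU grep):

grep -oP 'G*_\K(.+)(?=\.bw)'

# \K drops everything matched so far, so the output starts right after it; the lookahead (?=...) marks where the match ends without being included.
# The dot in \.bw is escaped so that it matches a literal dot.
# e.g. extract bigwig file names
ls -lah folder/* | cut -d' ' -f 10 | grep -E 'bw' | grep -oP 'G*_\K(.+)(?=\.bw)'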
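
For comparison, Python's built-in `re` module does not support `\K`, so the usual substitute is a capture group around the part to keep, with the lookahead left unchanged. A minimal sketch, using hypothetical file names:

```python
import re

# Hypothetical file names, for illustration only
names = ["G1_sampleA.bw", "GSM123_liver_rep1.bw", "notes.txt"]

# Python's re has no \K, so the part to keep goes in a capture group
pattern = re.compile(r"G*_(.+)(?=\.bw)")

extracted = [m.group(1) for name in names if (m := pattern.search(name))]
print(extracted)  # ['sampleA', 'liver_rep1']
```

The match starts at the first underscore in each name, so everything between that underscore and the `.bw` extension is kept.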

Source:
https://unix.stackexchange.com/questions/437405/opposite-of-k-to-keep-the-stuff-right

04/28/2020, 10/21/2021 by bioinfocore

Pandas rename one column or one index

Python
Pandas, Python

Rename only one or a few pandas dataframe columns:

df.rename(columns={"A": "a", "B": "c"}, inplace=True)

Rename one or a few index labels:

df.rename(index={"A": "a", "B": "c"}, inplace=True)
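
Two behaviours of `rename` are easy to miss: without `inplace=True` it returns a new DataFrame and leaves the original untouched, and keys that are not present in the frame are ignored by default. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]}, index=["A", "B"])

# Without inplace=True, rename returns a new DataFrame and leaves df untouched.
# "Z" is not a column of df and is silently ignored.
renamed = df.rename(columns={"A": "a", "B": "c", "Z": "z"},
                    index={"A": "a"})

print(list(renamed.columns))  # ['a', 'c', 'C']
print(list(renamed.index))    # ['a', 'B']
print(list(df.columns))       # ['A', 'B', 'C']
```

Columns and index labels can be renamed in the same call, as shown.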

Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

04/22/2020, 10/21/2021 by bioinfocore

Unix store a list of files into an array variable

Unix
Unix

Store a list of files in an array variable, and loop through the list:

files=(yourFolder/*) # the glob expands to one array element per file, so names with spaces stay intact
for item in "${files[@]}"
do
  echo "$item"
done

Or read the list from a file, then loop:

while read -r sample
do
    sample_list="$sample_list $sample"
done < sampleList.txt

for sample in $sample_list
do
    echo "$sample"
done

Source:
https://stackoverflow.com/questions/9954680/how-to-store-directory-files-listing-into-an-array

04/22/2020, 10/21/2021 by bioinfocore

Python list files from a directory

Use os.listdir() to list the entries (files and folders) in the current directory:

import os
arr = os.listdir()
print(arr)

Use glob to list files matching a shell-style wildcard pattern (not a regular expression):

import glob

# glob.glob already returns a list, so no loop is needed
listFiles = glob.glob("*.txt")
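
Patterns passed to `glob` use shell-style wildcards (`*`, `?`, `[...]`). When a real regular expression is needed, one option is to filter the result of `os.listdir()` with `re`. The sketch below compares the two on a temporary folder with made-up file names:

```python
import glob
import os
import re
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty, made-up files to list
    for name in ["s1.txt", "s2.txt", "s10.txt", "readme.md"]:
        open(os.path.join(folder, name), "w").close()

    # Wildcard: "?" matches exactly one character, so s10.txt is excluded
    by_glob = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(folder, "s?.txt"))
    )

    # Regular expression: filter the os.listdir() result with re
    by_regex = sorted(
        f for f in os.listdir(folder) if re.fullmatch(r"s\d+\.txt", f)
    )

print(by_glob)   # ['s1.txt', 's2.txt']
print(by_regex)  # ['s1.txt', 's10.txt', 's2.txt']
```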

The source below has an excellent answer to this question and covers further usage.

Source:
https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory

04/17/2020, 10/21/2021 by bioinfocore

Pandas add an empty row or column to a dataframe with index

Python
Pandas, Python

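Add an empty row, with or without a name (DataFrame.append returns a new DataFrame, so assign the result; it was removed in pandas 2.0):

df = df.append(pd.Series(name='NameOfNewRow', dtype=float)) # name the new row
df = df.append(pd.Series(dtype=float), ignore_index=True) # do not name the new row

Add an empty column:

df['new'] = pd.Series(dtype=float)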
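
On pandas 2.0 and later, where `DataFrame.append` no longer exists, one alternative is label assignment with `.loc`. This is a different technique from the one above, shown as a minimal sketch on a small made-up frame, and it assumes that "empty" means filled with NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["r1", "r2"])

# Assigning to a new label with .loc adds a row of NaN in place
df.loc["NameOfNewRow"] = np.nan

# Assigning a scalar to a new column name adds a column of NaN
df["new"] = np.nan

print(df.shape)                             # (3, 3)
print(df.loc["NameOfNewRow"].isna().all())  # True
```

Unlike `append`, both assignments modify `df` directly, so nothing needs to be reassigned.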

Source:
https://stackoverflow.com/questions/39998262/append-an-empty-row-in-dataframe-using-pandas
https://stackoverflow.com/questions/16327055/how-to-add-an-empty-column-to-a-dataframe

04/16/2020, 10/21/2021 by bioinfocore

Python NumPy replace nan in array to 0 or a number

Python
numpy, Python

Replace nan in a numpy array with zero or any other number:

import numpy as np

a = np.array([1,2,3,4,np.nan])

# nan is replaced with 0 by default; copy=True (the default) returns a new array and leaves a unchanged
b = np.nan_to_num(a, copy=True)

# if you want it changed to another number, e.g. 10; copy=False replaces in place
np.nan_to_num(a, copy=False, nan=10)

Replace inf or -inf with the largest or most negative finite floating-point value, or with any number:

a = np.array([1,2,3,4,np.inf])

# inf becomes the largest finite floating-point value by default (-inf the most negative one)
b = np.nan_to_num(a, copy=True)

# if you want inf changed to another number, e.g. 10
b = np.nan_to_num(a, copy=True, posinf=10)

# the same goes for -inf, using neginf
b = np.nan_to_num(a, copy=True, posinf=10, neginf=-10)

The nan, posinf and neginf parameters only work when your numpy version is 1.17 or higher.
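
The defaults and the in-place behaviour can be checked quickly. The following is a small sketch on a made-up array; the replacement values 10, 99 and -99 are arbitrary.

```python
import numpy as np

a = np.array([1.0, np.nan, np.inf, -np.inf])

# Defaults: nan -> 0.0, +inf / -inf -> largest / most negative finite float
default = np.nan_to_num(a)
print(default[1] == 0.0)                     # True
print(default[2] == np.finfo(a.dtype).max)   # True

# Custom replacements (needs NumPy >= 1.17)
custom = np.nan_to_num(a, nan=10, posinf=99, neginf=-99)
print(custom.tolist())                       # [1.0, 10.0, 99.0, -99.0]

# copy=False overwrites the input array itself
np.nan_to_num(a, copy=False)
print(np.isfinite(a).all())                  # True
```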

Source:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html

04/16/2020, 10/21/2021 by bioinfocore

Delete a series of jobs in slurm

Unix
slurm, Unix

Delete a series of jobs by job id. In the script, change xxxxxxx to the first and the last job id of the range. Jobs in between that are not yours are skipped automatically, since you are not authorized to delete them.

./delJobs.sh

The script can be downloaded here:
https://gist.github.com/fa5a9bdfa9192339259100019afcee0a.git
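
The gist is not reproduced here, so the following is only a sketch of the same idea and not the script's actual contents. It assumes the Slurm `scancel` command is on the PATH; the function names and the job ids are made up, and the default is a dry run that only prints the commands.

```python
import subprocess

def scancel_commands(start_id, end_id):
    """Build one scancel command per job id in the inclusive range."""
    return [["scancel", str(job_id)] for job_id in range(start_id, end_id + 1)]

def cancel_range(start_id, end_id, dry_run=True):
    """Cancel every job id in the range, or just print the commands."""
    for cmd in scancel_commands(start_id, end_id):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: if scancel reports an error for a job that is
            # not yours, keep going with the next id
            subprocess.run(cmd, check=False)

cancel_range(1000001, 1000003)  # dry run: only prints the three commands
```

Calling `cancel_range(start, end, dry_run=False)` would actually invoke `scancel` for each id.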

bioinfo core

Index & solution of bioInfo utilities

Author / bioinfocore