Extract/subset a bigwig file for a given genomic region

This is a Python (3.0+) solution that uses the pyBigWig package to extract a given genomic region from a whole-genome bigwig file.

Prepare your input bigwig file:

import pyBigWig

# First open bigwig file
bwIn = pyBigWig.open('input.bw')

# check bigwig file header
print(bwIn.header())

Prepare the output. Since the new file doesn't have a header yet, you need to add one with the chosen chromosome's size; here I'm using a region from chr18 as an example.

bwOutput = pyBigWig.open('output.bw', 'w')
bwOutput.addHeader([('chr18', 80373285)])  # (chromosome name, chromosome size)

# Copy each (start, end, value) interval in the region to the output
for x in bwIn.intervals('chr18', 62926563, 63516911):
    bwOutput.addEntries(['chr18'], [x[0]], ends=[x[1]], values=[x[2]])

bwOutput.close()
bwIn.close()
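
Note that intervals overlapping the window edges may extend past the requested coordinates. If you want the output trimmed exactly to the region, a small pure-Python helper can clip the (start, end, value) tuples first (the function name and example coordinates here are illustrative, not part of pyBigWig):

```python
def clip_intervals(intervals, region_start, region_end):
    """Trim (start, end, value) tuples to the requested window.

    Intervals that merely overlap the window edges are clipped;
    intervals entirely outside the window are dropped.
    """
    clipped = []
    for start, end, value in intervals:
        start = max(start, region_start)
        end = min(end, region_end)
        if start < end:  # keep only intervals with remaining width
            clipped.append((start, end, value))
    return clipped

# Example with synthetic intervals around a window of [100, 200)
print(clip_intervals([(50, 150, 1.0), (150, 250, 2.0), (300, 400, 3.0)], 100, 200))
# → [(100, 150, 1.0), (150, 200, 2.0)]
```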

Python pickle save/load dictionary into a file

Sometimes you want to save a dictionary into a file so you can reuse it later. This is especially useful when you need to save a color scheme for a large set of categories and keep it consistent across your whole project.

Output it as a file:

import pickle

colorDicOut = open('colorOut_dic.pkl', 'wb')
pickle.dump(colorDic, colorDicOut)
colorDicOut.close()

Or equivalently, using a with statement so the file is closed automatically:

with open('colorOut_dic.pkl', 'wb') as colorDicOut:
    pickle.dump(colorDic, colorDicOut)

Load it into a variable:

with open('colorOut_dic.pkl', 'rb') as colorDicIn:
    colorDic = pickle.load(colorDicIn)

If a UnicodeDecodeError comes up, it is likely an incompatibility between pickles written by Python 2 and read by Python 3; please refer to this post for solutions.
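
The whole round trip, as a self-contained sketch using a temporary file and a toy colorDic (the dictionary contents here are made up for illustration):

```python
import pickle
import tempfile
import os

colorDic = {"tumor": "#d62728", "normal": "#1f77b4"}  # toy color scheme

# Dump the dictionary into a temporary pickle file
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump(colorDic, tmp)

# Load it back into a new variable
with open(tmp.name, "rb") as fh:
    restored = pickle.load(fh)

os.remove(tmp.name)  # clean up the temporary file
print(restored == colorDic)  # → True
```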

Source:
https://stackoverflow.com/questions/11218477/how-can-i-use-pickle-to-save-a-dict

Python run loops in parallel

When you don't need to return anything:

from joblib import Parallel, delayed
import multiprocessing

# Number of cores available to use
num_cores = multiprocessing.cpu_count()

# If your function takes only one argument
# (avoid naming variables `input` or `list`, which shadow Python builtins)
def yourFunction(item):
    # anything in your loop body
    return XXX

Parallel(n_jobs=num_cores)(delayed(yourFunction)(item) for item in inputList)


# If your function takes more than one argument
def yourFunction(input1, input2):
    # anything in your loop body
    return XXX

# Note: the double for iterates over every (input1, input2) pair
Parallel(n_jobs=num_cores)(delayed(yourFunction)(input1, input2) for input1 in list1 for input2 in list2)

When you need to return results, simply assign the call to a variable; the outputs will be collected into a list, in the same order as the inputs:

results = Parallel(n_jobs=num_cores)(delayed(yourFunction)(item) for item in inputList)
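
If joblib is not installed, the same collect-into-a-list pattern can be sketched with the standard library; this stand-in uses threads rather than processes, and the function and variable names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def yourFunction(x):
    # stand-in for the per-item work done in your loop
    return x * x

inputs = [1, 2, 3, 4]

# executor.map preserves input order, like joblib's Parallel
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(yourFunction, inputs))

print(results)  # → [1, 4, 9, 16]
```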

When each call returns a DataFrame that you later concatenate together, use mp.Pool:

import multiprocessing as mp
import pandas as pd

# Leave one core free for the main process
with mp.Pool(processes=num_cores - 1) as pool:
    resultList = pool.map(yourFunction, argvList)

results_df = pd.concat(resultList)
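
A runnable sketch of the same map-then-concatenate pattern, using the thread-backed multiprocessing.dummy.Pool (same API as mp.Pool) and plain lists instead of DataFrames so it runs without pandas; for the real workflow, swap in mp.Pool and pd.concat:

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as mp.Pool

def yourFunction(n):
    # each call returns a list of rows (stand-in for a DataFrame)
    return [(n, n * n)]

argvList = [1, 2, 3]

with Pool(processes=2) as pool:
    resultList = pool.map(yourFunction, argvList)

# concatenate the per-task results (stand-in for pd.concat)
rows = [row for part in resultList for row in part]
print(rows)  # → [(1, 1), (2, 4), (3, 9)]
```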

Source:
https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop
https://blog.dominodatalab.com/simple-parallelization/
https://stackoverflow.com/questions/36794433/python-using-multiprocessing-on-a-pandas-dataframe