Extract/subset bigwig file for a given genomic region

This is a Python (3.0+) solution using the pyBigWig package to extract a given genomic region from a whole-genome bigwig file.

Prepare your input bigwig file:

import pyBigWig

# First open bigwig file
bwIn = pyBigWig.open('input.bw')

# Check the bigwig file header and chromosome sizes
print(bwIn.header())
print(bwIn.chroms())

Prepare the output file. Since the output doesn't have a header yet, you need to add one with the chosen chromosome's size; here I'm using a region from chr18 as an example.

bwOutput = pyBigWig.open('output.bw','w')
bwOutput.addHeader([('chr18',80373285)]) # chromosome size

for x in bwIn.intervals('chr18', 62926563, 63516911):
    # each interval is a (start, end, value) tuple
    bwOutput.addEntries(['chr18'], [x[0]], ends=[x[1]], values=[x[2]])

bwOutput.close()
bwIn.close()

Python pickle save/load dictionary into a file

Sometimes you want to save a dictionary into a file and reuse it later. This is especially useful when you need to save a color scheme for a large set of categories and want to keep it consistent across your whole project.

Output it as a file:

import pickle

with open('colorOut_dic.pkl', 'wb') as colorDicOut:
    pickle.dump(colorDic, colorDicOut)

Load it into a variable:

with open('colorOut_dic.pkl', 'rb') as colorDicIn:
    colorDic = pickle.load(colorDicIn)

If a UnicodeDecodeError comes up, it is likely an incompatibility between pickles written in Python 2 and read in Python 3; please refer to this post for solutions.
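
As a minimal end-to-end sketch of the save/load round trip (the category names, hex colors, and file path here are made up for illustration):

```python
import os
import pickle
import tempfile

# Hypothetical color scheme: category -> hex color
colorDic = {'T-cell': '#1f77b4', 'B-cell': '#ff7f0e', 'NK': '#2ca02c'}

path = os.path.join(tempfile.gettempdir(), 'colorOut_dic.pkl')

# Save the dictionary
with open(path, 'wb') as fh:
    pickle.dump(colorDic, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back
with open(path, 'rb') as fh:
    loadedDic = pickle.load(fh)

print(loadedDic == colorDic)  # True
```

Note that `pickle.HIGHEST_PROTOCOL` gives the most efficient format for the running interpreter; if the file must also be readable by Python 2, pin `protocol=2` instead.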


Python run loops in parallel

When you don't need to return anything:

from joblib import Parallel, delayed
import multiprocessing

# Number of cores available to use
num_cores = multiprocessing.cpu_count()

# If your function takes only 1 argument
def yourFunction(inputItem):
    # anything in your loop
    return XXX

Parallel(n_jobs=num_cores)(delayed(yourFunction)(inputItem) for inputItem in inputList)

# If your function takes more than 1 argument
def yourFunction(input1, input2):
    # anything in your loop
    return XXX

Parallel(n_jobs=num_cores)(delayed(yourFunction)(input1, input2) for input1 in list1 for input2 in list2)

When you need to return things, simply assign the result to a variable; it will be collected as a list:

results = Parallel(n_jobs=num_cores)(delayed(yourFunction)(inputItem) for inputItem in inputList)
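
For a concrete runnable sketch of this pattern (`squareIt` is a made-up stand-in for `yourFunction`):

```python
from joblib import Parallel, delayed

def squareIt(x):
    # stand-in for yourFunction: any per-item work goes here
    return x * x

# run the loop body across 2 worker processes;
# results come back in the same order as the inputs
results = Parallel(n_jobs=2)(delayed(squareIt)(x) for x in [1, 2, 3, 4])
print(results)  # [1, 4, 9, 16]
```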

When you need to return DataFrames and later concatenate them together, use mp.Pool:

import multiprocessing as mp
import pandas as pd

with mp.Pool(processes=num_cores - 1) as pool:
    resultList = pool.map(yourFunction, argvList)

results_df = pd.concat(resultList)
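
The concatenation step can be sketched on its own (`makeChunk` is a hypothetical worker; the list comprehension below is a serial stand-in for `pool.map`):

```python
import pandas as pd

def makeChunk(i):
    # hypothetical worker: returns one small DataFrame per input
    return pd.DataFrame({'chunk': [i], 'value': [i * 10]})

# serial stand-in for pool.map(makeChunk, range(3))
resultList = [makeChunk(i) for i in range(3)]

# stack the per-worker frames into one; ignore_index renumbers rows 0..n-1
results_df = pd.concat(resultList, ignore_index=True)
print(results_df.shape)  # (3, 2)
```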