To run a Unix command in Python:
import os
os.system('your unix command')  # returns the command's exit status
os.system('ls')
Source:
https://code.tutsplus.com/articles/how-to-run-unix-commands-in-your-python-program--cms-25926
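os.system only returns the exit status and prints the output straight to the terminal. When you need to capture the command's output, the standard-library subprocess module is the usual tool; a minimal sketch:

```python
import subprocess

# Run a command and capture its output as text instead of printing it
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)

print(result.returncode)      # 0
print(result.stdout.strip())  # hello
```

Passing the command as a list of arguments avoids invoking a shell, which is generally safer than os.system for anything built from user input.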
Remove rows with duplicated index values, keeping only the first or the last occurrence:
df = df.loc[~df.index.duplicated(keep='first')]
df = df.loc[~df.index.duplicated(keep='last')]
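A quick sketch of the two keep options on a toy DataFrame (the frame and its index values here are made up for illustration):

```python
import pandas as pd

# Index value "a" appears twice
df = pd.DataFrame({"val": [1, 2, 3]}, index=["a", "a", "b"])

first = df.loc[~df.index.duplicated(keep="first")]  # keeps val=1 for "a"
last = df.loc[~df.index.duplicated(keep="last")]    # keeps val=2 for "a"

print(first["val"].tolist())  # [1, 3]
print(last["val"].tolist())   # [2, 3]
```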
When there is no need to return anything:
from joblib import Parallel, delayed
import multiprocessing
# Number of cores available to use
num_cores = multiprocessing.cpu_count()
# If your function takes only 1 argument
def yourFunction(input):
    # anything in your loop
    return XXX
Parallel(n_jobs=num_cores)(delayed(yourFunction)(input) for input in inputList)
# If your function takes more than 1 argument
def yourFunction(input1, input2):
    # anything in your loop
    return XXX
Parallel(n_jobs=num_cores)(delayed(yourFunction)(input1, input2) for input1 in list1 for input2 in list2)
When you need to return results, simply assign the call to a variable; the results are collected as a list:
results = Parallel(n_jobs=num_cores)(delayed(yourFunction)(input) for input in inputList)
When each worker returns a data.frame that you later concatenate together, use mp.Pool:
import multiprocessing as mp
import pandas as pd
with mp.Pool(processes=num_cores - 1) as pool:
    resultList = pool.map(yourFunction, argvList)
results_df = pd.concat(resultList)
Source:
https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop
https://blog.dominodatalab.com/simple-parallelization/
https://stackoverflow.com/questions/36794433/python-using-multiprocessing-on-a-pandas-dataframe
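A minimal runnable sketch of the Pool-plus-concat pattern above. It uses ThreadPool, which shares the same .map API as mp.Pool, so it runs in any session; swap in mp.Pool for CPU-bound work. The make_frame worker and its inputs are made up for illustration:

```python
from multiprocessing.pool import ThreadPool

import pandas as pd

def make_frame(n):
    # Hypothetical worker: builds a one-row DataFrame per input
    return pd.DataFrame({"n": [n], "square": [n * n]})

# ThreadPool has the same .map interface as multiprocessing.Pool
with ThreadPool(processes=4) as pool:
    resultList = pool.map(make_frame, [1, 2, 3])

results_df = pd.concat(resultList, ignore_index=True)
print(results_df["square"].tolist())  # [1, 4, 9]
```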
Rename only one or a few pandas dataframe columns:
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
Rename one or a few index:
df.rename(index={"A": "a", "B": "c"}, inplace=True)
Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
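A toy example (columns made up) showing that rename only touches the keys you pass; everything else is left alone:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# Only "A" and "B" are renamed; "C" is untouched
df = df.rename(columns={"A": "a", "B": "c"})
print(list(df.columns))  # ['a', 'c', 'C']
```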
Use os.listdir(), listing files in the current directory:
import os
arr = os.listdir()
print(arr)
Use glob, listing by wildcard (glob) pattern — note these are shell-style patterns, not full regular expressions:
import glob
listFiles = []
for file in glob.glob("*.txt"):
    listFiles.append(file)
The linked source has an excellent answer to this question; further usage can be found there.
Source:
https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
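Both approaches side by side in a throwaway temporary directory (the file names are made up):

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a few empty files to list
    for name in ("a.txt", "b.txt", "notes.md"):
        open(os.path.join(tmpdir, name), "w").close()

    everything = sorted(os.listdir(tmpdir))                       # all entries
    txt_only = sorted(glob.glob(os.path.join(tmpdir, "*.txt")))   # wildcard match

print(everything)                                  # ['a.txt', 'b.txt', 'notes.md']
print([os.path.basename(p) for p in txt_only])     # ['a.txt', 'b.txt']
```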
Add an empty row, with or without a name (note that append is not inplace, so assign the result; DataFrame.append was removed in pandas 2.0, where pd.concat is the replacement):
df = df.append(pd.Series(name='NameOfNewRow'))    # name the new row
df = df.append(pd.Series(), ignore_index=True)    # do not name the new row
Add empty column:
df['new'] = pd.Series(dtype='float64')
Source:
https://stackoverflow.com/questions/39998262/append-an-empty-row-in-dataframe-using-pandas
https://stackoverflow.com/questions/16327055/how-to-add-an-empty-column-to-a-dataframe
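On pandas 2.0 and later, where append is gone, the same effect can be sketched with pd.concat; the frame here is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0]})

# Named empty row: concat a one-row all-NaN frame with the desired index label
empty_row = pd.DataFrame({"x": [np.nan]}, index=["NameOfNewRow"])
df = pd.concat([df, empty_row])

# Empty (all-NaN) column
df["new"] = np.nan

print(df.shape)  # (3, 2)
```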
Replace nan in a numpy array with zero or any number:
import numpy as np
a = np.array([1, 2, 3, 4, np.nan])
# with copy=True (the default) a new array is returned; copy=False replaces in place; nan becomes 0 by default
a = np.nan_to_num(a, copy=True)
# if you want it changed to any other number, e.g. 10:
np.nan_to_num(a, copy=False, nan=10)
Replace inf or -inf with the most positive or negative finite floating-point value, or any number:
a = np.array([1, 2, 3, 4, np.inf])
# changed to the largest finite floating-point value by default
a = np.nan_to_num(a, copy=True)
# if you want it changed to any other number, e.g. 10:
a = np.nan_to_num(a, copy=True, posinf=10)
# the same goes for neginf
a = np.nan_to_num(a, copy=True, posinf=10, neginf=-10)
The parameters posinf and neginf only work when your numpy version is 1.17 or higher.
Source:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html
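Both behaviors on one made-up array (assuming numpy >= 1.17 for the keyword arguments):

```python
import numpy as np

a = np.array([1.0, np.nan, np.inf, -np.inf])

# Default: nan -> 0, inf -> largest finite float, -inf -> most negative finite float
cleaned = np.nan_to_num(a)
print(cleaned[1])  # 0.0

# Custom fill values (requires numpy >= 1.17)
custom = np.nan_to_num(a, nan=0.0, posinf=10.0, neginf=-10.0)
print(custom.tolist())  # [1.0, 0.0, 10.0, -10.0]
```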
When loading a pickle file that was saved with Python 2 into Python 3, you might get errors like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position X: ordinal not in range(128)
This is due to an incompatibility between Python 2 and Python 3. The easiest fix is to add encoding='latin1':
pickle.load(file, encoding='latin1')
There are definitely other ways, but this is the simplest one.
Filter a dataframe by the sum of rows:
df = df[df.sum(axis=1) > 0]
df = df.loc[df.sum(axis=1) > 0,:]
Filter by sum of columns:
df = df.loc[:,df.sum() > 0]
df = df.loc[:,df.sum(axis=0) > 0]
Source:
https://stackoverflow.com/questions/40425484/filter-dataframe-in-pandas-on-sum-of-rows
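Both filters on a toy frame (values made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 2], "b": [0, 0, -2]})

rows_kept = df[df.sum(axis=1) > 0]         # row sums are [1, 0, 0] -> only row 0 survives
cols_kept = df.loc[:, df.sum(axis=0) > 0]  # column sums are a=3, b=-2 -> only "a" survives

print(rows_kept.index.tolist())    # [0]
print(cols_kept.columns.tolist())  # ['a']
```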
Remove rows with nan/null/missing values:
df = df.dropna(axis=0, how='any') # Remove if any value is na
df = df.dropna(axis=0, how='all') # Remove if all values are na
Remove columns with nan/null/missing values:
df = df.dropna(axis=1, how='any') # Remove if any value is na
df = df.dropna(axis=1, how='all') # Remove if all values are na
The default is inplace=False; if you want to remove in place, add inplace=True.
Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
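The four variants differ only in axis and how; here is how any vs all behave on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],  # entirely missing column
})

any_rows = df.dropna(axis=0, how="any")  # every row has at least one NaN, so all are dropped
all_cols = df.dropna(axis=1, how="all")  # only column "c" is all-NaN, so only it is dropped

print(len(any_rows))              # 0
print(all_cols.columns.tolist())  # ['a', 'b']
```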