I have a matrix of dimensions 183,223,040 x 4 with the variables shown below. There are 140 distinct values in 'REG' and 1145 distinct values in both 'SAMAC' and 'SAMAC.1'.
I want to iterate over 'REG' to get 140 matrices of size 1145 x 1145, each filled with the corresponding 'Value'.
I have tried the following:
-loop over the countries (the values of 'REG')
-create an empty 1145 x 1145 matrix, indexed with SAMAC and with column names SAMAC.1
-go line by line through the current dataframe
-check the value of SAMAC (rows) and SAMAC.1 (columns)
-locate SAMAC and SAMAC.1 in the empty matrix and assign the corresponding 'Value'
import pandas as pd
import dask.dataframe as dd
all_sam=dd.read_csv(r'C:\GP2\all_sams_trial.csv',skiprows=1)
all_sam.head()
SAMAC SAMAC.1 REG Value
0 m_pdr m_pdr aus 0.0
1 m_wht m_pdr aus 0.0
2 m_gro m_pdr aus 0.0
3 m_v_f m_pdr aus 0.0
4 m_osd m_pdr aus 0.0
countries=list(all_sam["REG"].unique().compute())
col_names=list(all_sam["SAMAC"].unique().compute())
for country in countries:
    df=pd.DataFrame(0,index=col_names,columns=col_names)
    sam=all_sam[all_sam["REG"]==country].compute()
    for index,row in sam.iterrows():
        row_index=str(row["SAMAC"])
        col_index=str(row["SAMAC.1"])
        df.loc[row_index,col_index]=row['Value']
        print(index)
    df.to_csv(country+"_SAM.csv")
The problem is that it takes way too long to compute (around 2 days). Is there a way to speed this up?
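For comparison, here is a minimal sketch of a vectorized alternative I have been considering, which replaces the row-by-row assignment with one pivot per region (the sample data below is made up and only mirrors the structure of the real file):

```python
import pandas as pd

# Hypothetical toy data with the same columns as all_sams_trial.csv
df = pd.DataFrame({
    "SAMAC":   ["m_pdr", "m_wht", "m_pdr", "m_wht"],
    "SAMAC.1": ["m_pdr", "m_pdr", "m_wht", "m_wht"],
    "REG":     ["aus", "aus", "aus", "aus"],
    "Value":   [1.0, 2.0, 3.0, 4.0],
})

# Full list of account codes, so every matrix has the same 1145 x 1145 shape
codes = sorted(df["SAMAC"].unique())

for region, group in df.groupby("REG"):
    # pivot builds the whole matrix at once instead of assigning cell by cell
    sam = (group.pivot(index="SAMAC", columns="SAMAC.1", values="Value")
                .reindex(index=codes, columns=codes)  # enforce common row/column order
                .fillna(0))                           # missing pairs become 0
    # sam.to_csv(region + "_SAM.csv")
```

I am unsure whether this is the idiomatic way to do it with a dask dataframe, or whether each region's slice should be `.compute()`-d first and then pivoted in plain pandas as above.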