
I have a matrix of dimensions 183,223,040x4 with the variables shown below. There are 140 distinct values in 'REG' and 1145 distinct values in both 'SAMAC' and 'SAMAC.1'.

I want to iterate over REG to get 140 matrices of size 1145x1145, with the corresponding 'Value' entries filled in.

I have tried the following:

-loop over countries

-create an empty 1145x1145 matrix, indexed by SAMAC with column names SAMAC.1

-go line by line through the current dataframe

-check the value of SAMAC (rows) and SAMAC.1 (columns)

-locate SAMAC and SAMAC.1 in the empty matrix and assign the corresponding Value

import pandas as pd
import dask.dataframe as dd

all_sam=dd.read_csv(r'C:\GP2\all_sams_trial.csv',skiprows=1)

all_sam.head()
   SAMAC SAMAC.1  REG  Value
0  m_pdr   m_pdr  aus    0.0
1  m_wht   m_pdr  aus    0.0
2  m_gro   m_pdr  aus    0.0
3  m_v_f   m_pdr  aus    0.0
4  m_osd   m_pdr  aus    0.0

countries=list(all_sam["REG"].unique().compute())
col_names=list(all_sam["SAMAC"].unique().compute())

for country in countries:
    # empty 1145x1145 matrix for this country, all zeros to start
    df=pd.DataFrame(0,index=col_names,columns=col_names)

    # pull this country's rows out of dask into pandas
    sam=all_sam[all_sam["REG"]==country].compute()

    # fill the matrix one cell at a time
    for index,row in sam.iterrows():
        row_index=str(row["SAMAC"])
        col_index=str(row["SAMAC.1"])
        df.loc[row_index,col_index]=row['Value']
        print(index)

    df.to_csv(country+"_SAM.csv")

The problem is that it takes way too long to compute (around 2 days). Is there a way to speed this up?

1 Answer


Update 1: Since OP's slow computation is caused by the sheer size of the dataframe, here's an update on how to shrink it first.

  1. Check the dtypes of the columns using all_sam.dtypes and the size (in MB) of your dataframe using:

    all_sam.memory_usage(deep=True) / 1024 ** 2
    
  2. Consider renaming the column 'SAMAC.1' to 'SAMAC_1', as the dot in the name can cause errors in the lines below. Before processing, change the dtypes of 'REG', 'SAMAC' and 'SAMAC_1' to 'category':

    all_sam = all_sam.rename(columns={'SAMAC.1': 'SAMAC_1'})
    all_sam.REG = all_sam.REG.astype('category')
    all_sam.SAMAC = all_sam.SAMAC.astype('category')
    all_sam.SAMAC_1 = all_sam.SAMAC_1.astype('category')
    
  3. Depending on your requirements, you can downcast the dtype of the 'Value' column to float16, int16, int8, etc. (note that float16 keeps only about three significant decimal digits, so check that this precision is acceptable):

    all_sam.Value = all_sam.Value.astype('float16')
    
  4. Check the size again:

    all_sam.memory_usage(deep=True) / 1024 ** 2
    

Hopefully, this will enable faster computation. The four steps are combined into a single sketch below.
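Putting the four steps together into one runnable sketch (the path and skiprows=1 come from the question; everything else follows the steps above):

    import dask.dataframe as dd

    all_sam = dd.read_csv(r'C:\GP2\all_sams_trial.csv', skiprows=1)

    # Rename so the column name is a valid identifier (step 2)
    all_sam = all_sam.rename(columns={'SAMAC.1': 'SAMAC_1'})

    # Categoricals store each distinct label once instead of one string per row
    all_sam['REG'] = all_sam['REG'].astype('category')
    all_sam['SAMAC'] = all_sam['SAMAC'].astype('category')
    all_sam['SAMAC_1'] = all_sam['SAMAC_1'].astype('category')

    # Downcast the values (float16 as in step 3; check the precision is acceptable)
    all_sam['Value'] = all_sam['Value'].astype('float16')

    # Per-column footprint in MB, computed lazily by dask
    print(all_sam.memory_usage(deep=True).compute() / 1024 ** 2)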

Ref: towardsdatascience.com

To illustrate a solution to your problem, I have taken a small example dataframe.

import pandas as pd
import numpy as np

df = pd.DataFrame({'REG':    ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'SAMAC1': ['a','a','a','b','b','b','c','c','c','d','d','d'],
                   'SAMAC':  ['p','q','r','p','q','r','p','q','r','p','q','r'],
                   'value':  [0,0,0,0,0,0,0,0,0,0,0,0]})

# index the values by (REG, SAMAC1, SAMAC) so they can be reshaped in one call
array_ = df[['REG','SAMAC1','SAMAC']].values.transpose()
index = pd.MultiIndex.from_arrays(array_, names=('REG', 'SAMAC1','SAMAC'))
df2 = df['value']
df2.index=index

# one matrix per REG: take that REG's slice and unstack SAMAC into the columns
country_labels = df2.index.get_level_values(0)
country_unique = country_labels.unique()
result_arr = []
for c in country_unique:
    df3 = df2[df2.index.get_level_values(0) == c]
    result_arr.append(df3.unstack().values)
result_arr = np.array(result_arr)
print(result_arr.shape)

Output: (2,2,3)
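For the full dataset, the same reshape idea replaces the row-by-row loop entirely. A sketch under the question's setup (path, skiprows=1, column names, with 'SAMAC.1' renamed to 'SAMAC_1'), assuming each (SAMAC, SAMAC_1) pair appears once per country so that pivot_table's default mean aggregation is a no-op:

import dask.dataframe as dd

all_sam = dd.read_csv(r'C:\GP2\all_sams_trial.csv', skiprows=1)
all_sam = all_sam.rename(columns={'SAMAC.1': 'SAMAC_1'})

# fixed 1145-label ordering shared by every output matrix
col_names = sorted(all_sam['SAMAC'].unique().compute())

for country in all_sam['REG'].unique().compute():
    # only this country's rows are materialised in pandas
    sam = all_sam[all_sam['REG'] == country].compute()

    # one vectorised reshape instead of millions of .loc assignments
    mat = sam.pivot_table(index='SAMAC', columns='SAMAC_1',
                          values='Value', fill_value=0)

    # align every matrix to the same 1145x1145 layout
    mat = mat.reindex(index=col_names, columns=col_names, fill_value=0)
    mat.to_csv(str(country) + "_SAM.csv")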

  • Many thanks @jatin! The time issue appears to remain however. Each step takes very long (simply running array_ = df[['REG','SAMAC1','SAMAC']].values.transpose() takes about 12 minutes). This is because the initial matrix is very big – Cuisilopez Mar 25 '19 at 14:02
  • Would you mind sharing your computer specs? Thanks :) – Jatin Verma Mar 25 '19 at 15:31
  • Can you run this command to display the storage memory usage by your dataframe? all_sam.memory_usage(deep=True)/ 1024 ** 2 – Jatin Verma Mar 25 '19 at 16:27
  • Loading the dataframe with pandas takes about 10 mins. And my computer is OK (8GB RAM, Intel i5 5th generation). The csv from which the data is loaded weighs 4.4GB, and once the dataframe is loaded: `all_sam.memory_usage(deep=True)/1024**2` `Out[4]:` `Index 0.000076` `SAMAC 11126.227417` `SAMAC.1 11126.227417` `REG 10484.106445` `Value 1397.880859` – Cuisilopez Mar 26 '19 at 09:42
  • I am afraid it is not working, as a "memory error" arises. – Cuisilopez Mar 26 '19 at 15:07
  • I realize that using dask is more efficient. – Jatin Verma Mar 26 '19 at 17:39
  • How would that be implemented? – Cuisilopez Mar 27 '19 at 19:12
  • A bit busy these days. Try the following after renaming 'SAMAC.1' to 'SAMAC_1'. Will post complete answer later. `import dask.dataframe as dd` `all_sam=dd.read_csv(r'C:\GP2\all_sams_trial.csv',skiprows=1)` `all_sam.REG = all_sam.REG.astype('category')` `all_sam.SAMAC = all_sam.SAMAC.astype('category')` `all_sam.SAMAC_1 = all_sam.SAMAC_1.astype('category')` `all_sam.Value = all_sam.Value.astype('float16')` `all_sam=all_sam.compute()` `all_sam.memory_usage(deep=True) / 1024 ** 2` and then maybe you can try following my above answer. – Jatin Verma Mar 29 '19 at 00:24