0

I read data from MariaDB using pandas read_sql. What I am trying to do is give each data frame an index with same continuation in each chunk.

Index of first chunk should 1 to 2000 and for second chunk to should 2001 to 4000. For that I have added a list while creating df. But after first iteration values comes as Nan.

query = 'select * from big_2L_csv'
chunk_size = 2000

start_point = 1
end_point = chunk_size+1
for chunk in pd.read_sql(query, conn,chunksize=chunk_size):
    indexes = list(range(start_point,end_point))
    file_data_df = pd.DataFrame(chunk,index = indexes)
    print(start_point,end_point)
    start_point += chunk_size
    end_point += chunk_size
    print(file_data_df.head(5))
    print(file_data_df.tail(5))

Output :

   PAY_AMT4  PAY_AMT5  PAY_AMT6  default payment next month  
1    1000.0       0.0    2000.0                         1.0  
2    1000.0    1000.0    5000.0                         0.0  
3    1100.0    1069.0    1000.0                         0.0  
4    9000.0     689.0     679.0                         0.0  
5    1000.0    1000.0     800.0                         0.0  


      PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  default payment next month  
1996    2000.0    1000.0    2000.0    1000.0                         0.0  
1997     504.0    2770.0     613.0      14.0                         0.0  
1998    4000.0    4500.0    4000.0    4200.0                         0.0  
1999       0.0    8000.0       0.0       0.0                         0.0  
2000       NaN       NaN       NaN       NaN                         NaN 


      PAY_AMT3  PAY_AMT4  PAY_AMT5  PAY_AMT6  default payment next month  
2001       NaN       NaN       NaN       NaN                         NaN  
2002       NaN       NaN       NaN       NaN                         NaN  
2003       NaN       NaN       NaN       NaN                         NaN  
2004       NaN       NaN       NaN       NaN                         NaN  
2005       NaN       NaN       NaN       NaN                         NaN  

There is still more data in table, but in output it shows Nan.

BetaTester
  • 25
  • 8
  • What do you hope to do after reading all chunks? Only last iteration will be assigned to `file_data_df`. – Parfait Apr 01 '22 at 19:43
  • I am reading data in the chuck, creating dataframe and storing it in the database. To avoid memory issues. – BetaTester Apr 03 '22 at 06:16

1 Answers1

0

According to docs of the pandas.DataFrame constructor, when a data object contains an existing index, the constructor will use that existing index. Therefore, by using index you are specifying what index values to use in that existing object.

Because each chunk is a rendered DataFrame and because you did not specify an index_col in pandas.read_sql, each chunk maintains the default RangeIndex starting from zero to number of rows (i.e., chunk size). So by passing a range into index argument that exceeds 0-200, you will return back all missing rows since such indices do not exist.

To fix, avoid the DataFrame constructor and simply re-assign index. Below uses enumerate to keep the start and end counters.

query = 'select * from big_2L_csv'
chunk_size = 2000

dfs = []
for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=200)):
    start = chunksize*i; end = chunksize*i + len(chunk))
    print(start, end)
    chunk.index = pd.RangeIndex(start, end)
    
    dfs.append(chunk)      # APPEND CHUNKS TO LIST OF DATA FRAMES

    print(chunk.head(5))
    print(chunk.tail(5))

By the way, iteratively re-assigining indexes is not needed for pandas.concat with ignore_index to build a new RangeIndex across length of all data frames.

master_df = pd.concat(dfs, ignore_index=True)
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Can you explain more on this line:- **So by passing a range into an index argument that exceeds 0-200, you will return back all empty rows since such indices do not exist.** – BetaTester Apr 03 '22 at 06:11
  • I should have wrote *return back all **missing** rows*. Since each `chunk` has a RangeIndex from 0-200 and named columns, `DataFrame()` will use them. So the `index` and `columns` arguments will not assign but subset data. Any misaligned index/column will then return as missing (i.e., all NaNs). Simply avoid `DataFrame` in this use case. – Parfait Apr 03 '22 at 14:22