
My code is:

import wrds

# Connect to WRDS and pull the entire Compustat quarterly fundamentals table
db = wrds.Connection()
compustatq = db.raw_sql("""
              select *
              from comp.fundq
              """)

# Write the full DataFrame to disk in one go
compustatq.to_csv('path.csv', index=False)

The problem is that the dataset is so large that I run out of RAM. Is it possible to download/save the data directly to disk, avoiding the two-step procedure of first loading everything into a DataFrame and then exporting it to a CSV file?

Chris
  • You could save the result of the query directly to a CSV using the -e option with mysql on the command line. Also, if the data is too big, try chunking and storing it in multiple CSVs, or appending to the same CSV (a chunked sketch follows these comments). Also, where are you running out of memory? Are you sure it is because the query is too big? – skybunk Jul 04 '18 at 19:45
  • @skybunk yes, I am sure it is too big. We are talking many, many gigabytes. Could you provide a code example where you use chunking on my example? – Chris Jul 05 '18 at 06:10
  • This is utterly inefficient; I don't see a reason to use Python as a data exporter when you can use various SQL client tools to format and export to a CSV file directly, without worrying about memory limits. – mootmoot Jul 05 '18 at 10:01
  • When using the WRDS library you don't have many options for the choice of program. – Chris Jul 05 '18 at 10:14
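A minimal sketch of the chunking skybunk suggests, assuming raw_sql accepts plain PostgreSQL with LIMIT/OFFSET and that gvkey and datadate are valid ordering columns in comp.fundq; the chunk size and output path are arbitrary:

import wrds

db = wrds.Connection()

chunk_size = 100000   # rows per query; tune to available RAM
offset = 0
first_chunk = True

while True:
    # Pull one slice of comp.fundq per round trip instead of the whole table
    chunk = db.raw_sql("""
        select *
        from comp.fundq
        order by gvkey, datadate
        limit {} offset {}
        """.format(chunk_size, offset))
    if chunk.empty:
        break
    # Append each slice to the same CSV, writing the header only once
    chunk.to_csv('path.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False
    offset += chunk_size

Each chunk is discarded once the next iteration overwrites it, so only one slice ever sits in memory at a time.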

1 Answer


You can use the csv library to do it; it's a Python built-in library. I don't know what type your compustatq is, but find a way to read it row by row.

Use it like this:

import csv

# Open in text mode; newline='' prevents blank lines between rows on Windows
with open('path.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    for row in compustatq:
        writer.writerow([str(v) for v in row])

ref: csv
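Since compustatq turns out to be a pandas DataFrame (see below), one way to feed the writer row by row would be itertuples; a minimal sketch, assuming the DataFrame is already in memory:

import csv

with open('path.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    # Write the column names first
    writer.writerow(compustatq.columns)
    # itertuples yields one row at a time without copying the whole frame
    for row in compustatq.itertuples(index=False):
        writer.writerow([str(v) for v in row])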


I just read the source code of wrds.Connection.raw_sql: it shows that your compustatq is an instance of pandas.DataFrame. So you can still use the to_csv method, but in a slightly different way:

f = open('path.csv', 'w')
for i in range(len(compustatq)):
    # Write one row at a time; emit the header only for the first row
    f.write(compustatq[i:i+1].to_csv(header=(i == 0), index=False))
f.close()
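A note on the design choice: calling to_csv once per row is slow for a table this size. A variant that appends larger slices (the slice size is arbitrary) would be, as a sketch:

chunk_rows = 100000  # arbitrary slice size
with open('path.csv', 'w') as f:
    for start in range(0, len(compustatq), chunk_rows):
        block = compustatq[start:start + chunk_rows]
        # Write the header only in front of the first block
        f.write(block.to_csv(header=(start == 0), index=False))

Both versions still assume the whole DataFrame already fits in memory; if the fetch itself is what runs out of RAM, chunking the query as sketched under the question's comments is the way around it.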
Yun Luo
  • Thank you for the help. I am not sure I can do this. My problem at the moment is that I run out of memory before all the data is stored in compustatq (which is a DataFrame, by the way). So instead of storing the data in a temporary variable I would like to just write it to CSV directly. – Chris Jul 05 '18 at 06:56
  • @Chris if you write the CSV file row by row, as my code shows, you will avoid the memory error. That way you don't build a big CSV file in memory before writing it to disk; you write directly to disk. – Yun Luo Jul 05 '18 at 08:08
  • But I get an error when I try to use your code. The compustatq part of your code should in effect be replaced by the SQL expression (db.raw_sql)... or am I missing something? – Chris Jul 05 '18 at 10:14
  • @Chris I didn't know the exact type of `compustatq`, so I assumed it could be treated like a nested iterator (like a Python DB-API cursor object). You got an error when doing that, so apparently it's not a cursor-like object. I think you need to find a way to read its rows one by one. Try reading the related API documentation, or try `dir(compustatq)` to see which attributes look iterable. Maybe `rows`? – Yun Luo Jul 05 '18 at 11:07
  • I am clueless what to do. – Chris Jul 05 '18 at 18:07