
My code is:

import wrds

# Connect to WRDS and pull the entire Compustat quarterly fundamentals table
db = wrds.Connection()
compustatq = db.raw_sql("""
              select *
              from comp.fundq
              """)

# Write the full DataFrame to disk in one go
compustatq.to_csv('path.csv', index=False)

The problem is that the dataset is so large that I run out of RAM. Is it possible to download/save the data directly to disk, avoiding the two-step procedure of first loading everything into a DataFrame and then exporting it to a CSV file?

Chris
  • You could save the result of the query directly to a CSV using the -e option with mysql on the command line. Also, if the data is too big, try chunking and storing it in multiple CSVs, or appending to the same CSV (a chunked sketch follows these comments). Also, where are you running out of memory? Are you sure it is because the query is too big? – skybunk Jul 04 '18 at 19:45
  • @skybunk yes, I am sure it is too big. We are talking many, many gigabytes. Could you provide a code example where you use chunking on my example? – Chris Jul 05 '18 at 06:10
  • This is utterly inefficient; I don't see a reason to use Python as a data exporter when you can use various SQL client tools to format and export to a CSV file directly, without worrying about memory limits. – mootmoot Jul 05 '18 at 10:01
  • When using the WRDS library you don't have many options for the choice of program. – Chris Jul 05 '18 at 10:14
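A minimal sketch of the chunking skybunk suggests, assuming raw_sql accepts plain PostgreSQL with LIMIT/OFFSET and that gvkey and datadate are valid ordering columns in comp.fundq; the chunk size and output path are arbitrary:

import wrds

db = wrds.Connection()

chunk_size = 100000   # rows per query; tune to available RAM
offset = 0
first_chunk = True

while True:
    # Pull one slice of comp.fundq per round trip instead of the whole table
    chunk = db.raw_sql("""
        select *
        from comp.fundq
        order by gvkey, datadate
        limit {} offset {}
        """.format(chunk_size, offset))
    if chunk.empty:
        break
    # Append each slice to the same CSV, writing the header only once
    chunk.to_csv('path.csv', mode='w' if first_chunk else 'a',
                 header=first_chunk, index=False)
    first_chunk = False
    offset += chunk_size

Each chunk is discarded once the next iteration overwrites it, so only one slice ever sits in memory at a time.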

1 Answer


You can use the csv library to do it; it's a Python built-in library. I don't know what type your compustatq is, but find a way to read it row by row.

Use it like this:

import csv

# Open in text mode; newline='' prevents blank lines between rows on Windows
with open('path.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    for row in compustatq:
        writer.writerow([str(v) for v in row])

ref: csv
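Since compustatq turns out to be a pandas DataFrame (see below), one way to feed the writer row by row would be itertuples; a minimal sketch, assuming the DataFrame is already in memory:

import csv

with open('path.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    # Write the column names first
    writer.writerow(compustatq.columns)
    # itertuples yields one row at a time without copying the whole frame
    for row in compustatq.itertuples(index=False):
        writer.writerow([str(v) for v in row])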


I just read the source code of wrds.Connection.raw_sql: it shows that your compustatq is an instance of pandas.DataFrame. So you can still use the to_csv method, but in a slightly different way:

f = open('path.csv', 'w')
for i in range(len(compustatq)):
    # Write one row at a time; emit the header only for the first row
    f.write(compustatq[i:i+1].to_csv(header=(i == 0), index=False))
f.close()
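A note on the design choice: calling to_csv once per row is slow for a table this size. A variant that appends larger slices (the slice size is arbitrary) would be, as a sketch:

chunk_rows = 100000  # arbitrary slice size
with open('path.csv', 'w') as f:
    for start in range(0, len(compustatq), chunk_rows):
        block = compustatq[start:start + chunk_rows]
        # Write the header only in front of the first block
        f.write(block.to_csv(header=(start == 0), index=False))

Both versions still assume the whole DataFrame already fits in memory; if the fetch itself is what runs out of RAM, chunking the query as sketched under the question's comments is the way around it.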
Yun Luo
  • Thank you for the help. I am not sure I can do this. My problem at the moment is that I run out of memory before all the data is stored in compustatq (which is a DataFrame, by the way). So instead of storing the data in a temporary variable I would like to just write it to CSV directly. – Chris Jul 05 '18 at 06:56
  • @Chris if you write the CSV file row by row, as my code shows, you will avoid the memory error. That way you don't build a big CSV file in memory before writing it to disk; you write directly to disk. – Yun Luo Jul 05 '18 at 08:08
  • But I get an error when I try to use your code. The compustatq part of your code should in effect be replaced by the SQL expression (db.raw_sql)... or am I missing something? – Chris Jul 05 '18 at 10:14
  • @Chris I didn't know the exact type of `compustatq`, so I assumed it could be treated like a nested iterator (like a Python DB-API cursor object). You got an error when doing that, so apparently it's not a cursor-like object. I think you need to find a way to read its rows one by one. Try reading the related API documentation, or try `dir(compustatq)` to see which attributes look iterable. Maybe `rows`? – Yun Luo Jul 05 '18 at 11:07
  • I am clueless what to do. – Chris Jul 05 '18 at 18:07