
I have a massive CSV file (1.4 GB, over 1 million rows) of stock market data that I will process using R.

The table looks roughly like this. For each ticker, there are thousands of rows of data.

+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low  |
+--------+------+-------+------+------+
| A      |  121 |   121 |  212 | 2434 |
| A      |   32 |    23 |   43 |  344 |
| A      |  121 |   121 |  212 | 2434 |
| A      |   32 |    23 |   43 |  344 |
| A      |  121 |   121 |  212 | 2434 |
| B      |   32 |    23 |   43 |  344 |
+--------+------+-------+------+------+

To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?

The script would output files such as data_a.csv, data_b.csv, etc.

But, I would also like to create index.csv which simply lists all the unique stock ticker names.

E.g.

+---------+
| Ticker  |
+---------+
| A       |
| B       |
| C       |
| D       |
| ...     |
+---------+

Can anybody recommend an efficient way of doing this in R or Python, given the huge file size?

matt
  • Assuming R can load the entire file, then the file size isn't really the big issue, but rather how you answer your questions. – Tim Biegeleisen Mar 07 '18 at 01:46
  • If the rows are really grouped by ticker, this is trivial. Create a `csv.reader` and a `csv.writer`, then `for row in reader: if row[0] != last_ticker: writer.writerow(row); last_ticker = row[0]`. That's it. – abarnert Mar 07 '18 at 01:51
  • Are you creating the output files `"data_a.csv", "data_b.csv"` before writing the `"index.csv"`? If so, you can just iterate through the file output names and write that to `"index.csv"`. Nothing complicated here. – serk Mar 07 '18 at 01:51
  • @serk I can do, yes. Such a bozo, I didn't even think of this. (Palm-faces self) – matt Mar 07 '18 at 01:52
  • I recommend `dask.dataframe` for an out-of-memory solution. – jpp Mar 07 '18 at 01:55
  • Instead of splitting the file into multiple pieces, you could adjust your I/O buffer size and only process the file in chunks. The processed data could then be saved into e.g. sqlite3, which will handle the data much better than what you are trying to do here. – Arkadiusz Tymieniecki Mar 07 '18 at 02:01
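
Building on abarnert's comment above, a minimal single-pass sketch in Python could look like the following. It assumes the big file is named data.csv, has a header row, and keeps the ticker in the first column (the file name and column position are assumptions; adjust them to your data). A set tracks what has already been seen, so it works even if the rows are not perfectly grouped:

import csv

tickers = []
seen = set()

# stream the big file once, collecting each ticker the first time it appears
with open('data.csv', newline='') as infile:
    reader = csv.reader(infile)
    next(reader, None)                 # skip the header row
    for row in reader:
        if row and row[0] not in seen:
            seen.add(row[0])
            tickers.append(row[0])     # preserves first-seen order

# write the index file with a single Ticker column
with open('index.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Ticker'])
    writer.writerows([t] for t in tickers)

Because it reads one row at a time, memory use stays small regardless of the 1.4 GB file size.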

2 Answers


You could loop through each file, grabbing the index of each and creating a set union of all indices.

import glob
import pandas as pd

tickers = set()
for csvfile in glob.glob('*.csv'):
    # use header=0 instead if your files have a header row
    data = pd.read_csv(csvfile, index_col=0, header=None)
    tickers.update(data.index.tolist())

pd.Series(list(tickers)).to_csv('index.csv', index=False)
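
If the split files are themselves large, a variant of the same idea (again a sketch, assuming the ticker sits in the first column, the files have no header row, and the split files match data_*.csv) reads only that one column to keep memory use down:

import glob
import pandas as pd

tickers = set()
for csvfile in glob.glob('data_*.csv'):
    # read only the first column of each split file
    col = pd.read_csv(csvfile, usecols=[0], header=None)
    tickers.update(col.iloc[:, 0].unique())

pd.Series(sorted(tickers)).to_csv('index.csv', index=False)

Since each split file should hold exactly one ticker, passing nrows=1 to read_csv would even be enough to grab it from the first row.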
Jorden

You can retrieve the index from the file names:

(index <- data.frame(Ticker = toupper(gsub("^.*_(.*)\\.csv",
                                           "\\1",
                                           list.files()))))
##   Ticker
## 1      A
## 2      B

write.csv(index, "index.csv", row.names = FALSE)
Ista