I have a massive CSV file (1.4 GB, over 1 million rows) of stock market data that I will process using R.
The table looks roughly like this. For each ticker, there are thousands of rows of data.
+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low  |
+--------+------+-------+------+------+
| A      | 121  | 121   | 212  | 2434 |
| A      | 32   | 23    | 43   | 344  |
| A      | 121  | 121   | 212  | 2434 |
| A      | 32   | 23    | 43   | 344  |
| A      | 121  | 121   | 212  | 2434 |
| B      | 32   | 23    | 43   | 344  |
+--------+------+-------+------+------+
To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?
The script outputs files such as data_a.csv, data_b.csv, etc.
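For reference, the split step is roughly along these lines (a simplified Python sketch of my own, not the exact script from that question; it assumes the source file is called data.csv, that the ticker is the first column, and it keeps one output file handle open per ticker):

    import csv

    # Rough sketch (my own, not the exact script from the linked question).
    # Assumes the ticker is the first column; keeps one output file handle
    # open per ticker and streams the big CSV exactly once.
    open_files = {}
    with open("data.csv", newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for row in reader:
            ticker = row[0]
            if ticker not in open_files:
                out = open("data_%s.csv" % ticker.lower(), "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                open_files[ticker] = (out, writer)
            open_files[ticker][1].writerow(row)

    for out, _ in open_files.values():
        out.close()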
But I would also like to create index.csv, which simply lists all the unique stock ticker names.
E.g.
+--------+
| Ticker |
+--------+
| A      |
| B      |
| C      |
| D      |
| ...    |
+--------+
Can anybody recommend an efficient way of doing this in R or Python, given the huge file size?
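To show what I mean, here is a naive single-pass approach in Python (my own sketch; it again assumes the source file is data.csv, that the ticker is the first column, and that the set of unique tickers fits comfortably in memory). I'm wondering whether there is something faster or more memory-friendly:

    import csv

    # Naive single-pass approach (assumes the ticker is the first column and
    # that the set of unique tickers fits in memory).
    tickers = set()
    with open("data.csv", newline="") as src:
        reader = csv.reader(src)
        next(reader)  # skip the header row
        for row in reader:
            tickers.add(row[0])

    with open("index.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["Ticker"])
        for ticker in sorted(tickers):
            writer.writerow([ticker])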