4

I have large data set from the World Bank Millenium Development goals as a CSV. The data is displayed like this:

Country Code   Country Name   Indicator
ABW            Aruba          % Forest coverage
ADO            Andorra        % Forest coverage
AFG            Afghanistan    % Forest coverage
...
ABW            Aruba          % Literacy rate
ADO            Andorra        % Literacy rate
AFG            Afghanistan    % Literacy rate
...
ABW            Aruba          % Another indicator
ADO            Andorra        % Another indicator
AFG            Afghanistan    % Another indicator

The file is currently 8.2MB. I'm going to program a web interface for this data, and I'd like to slice the data by country so I can make an ajax request so I can load an individual CSV for each country.

I'm lost on how to do this programmatically or with any tool. I don't necessarily need Python but it's what I know best. I don't necessarily need a complete script, a general pointer on how to approach this problem is appreciated.

The original data source I'm working with is located here:

http://duopixel.com/stack/data.csv

methodofaction
  • 70,885
  • 21
  • 151
  • 164

4 Answers4

4

You can use Python's csv module and itertools.groupby.
The following example was tested on Python 2.7.1
Edit: updated answer to account for new information added to question.

import csv, itertools as it, operator as op

csv_contents = []
with open('yourfile.csv', 'rb') as fin:
  dict_reader = csv.DictReader(fin)   # default delimiter is comma
  fieldnames = dict_reader.fieldnames # save for writing
  for line in dict_reader:            # read in all of your data
    csv_contents.append(line)         # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value 
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('Country Name'))

for groupkey, groupdata in it.groupby(sorted_csv_contents, 
                                      key=op.itemgetter('Country Name')):
  with open('slice_{:s}.csv'.format(groupkey), 'wb') as fou:
    dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
    dict_writer.writeheader()         # new method in 2.7; use writerow() in 2.6-
    dict_writer.writerows(groupdata)

Other notes:

  • You could use a regular csv reader and writer, but the DictReader and DictWriter are nice because you can reference columns by name.
  • Always use the 'b' flag when reading or writing .csv files because on Windows that makes a difference in how line-endings are handled.
  • If anything isn't clear let me know and I'll explain further!
mechanical_meat
  • 163,903
  • 24
  • 228
  • 223
4

One-liner:

awk -F "," 'NF>1 && NR>1 {print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

This creates new files named data_ABW, etc., containing the appropriate information. The NR>1 part skips the header line. Then, for each line, it appends that entire line ($0) to the file named Data_$1, where $1 is replaced with the text in the first column of that line. Finally, the close statement makes sure there aren't too many open files. If you didn't have so many countries, you could get rid of this and significantly increase the speed of the command.

In answer to @Lenwood's comment below, to include the header in each output file, you can do this:

awk -F "," 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$1]) {print header >> ("data_" $1 ".csv"); files[$1]=1}; print $0 >> ("data_" $1 ".csv"); close("data_" $1 ".csv")}' data.csv

(You may have to escape the exclamation point...) The first new part NR==1 {header=$0}; just stores the first line of the input file as the variable header. Then, the other new part if(! files[$1]) ... files[$1]=1}; uses the associative array files to keep track of all whether or not it has put the header into a given file, and if not, it puts it in there.

Note that this appends the files, so if those files already exist, they'll just get added to. Therefore, if you get new data in your main file, you'll probably want to delete those other files before you run this command again.

(In case it's not obvious, if you want the files to be named like data_Aruba you can change $1 to $2.)

Community
  • 1
  • 1
Mike
  • 19,114
  • 12
  • 59
  • 91
  • Thanks! This looks like the simplest way to achieve what I want, but `awk 'NR>1 {print $0 >> ("Data_" $1)}' data.csv -F ","` outputs `Data_BFA,Burkina makes too many open files`. I've included the sample csv in the question. – methodofaction Mar 06 '12 at 21:42
  • Sorry, I just looked at the data you included, and missed the link. I've updated the answer so that it deals with that issue by closing the file it just opened. It's a little slow, but it works -- especially if you only have to do it once. – Mike Mar 06 '12 at 22:59
  • 1
    I included the after your response. Thanks, it works like a charm and it's not slow for my needs. – methodofaction Mar 07 '12 at 02:34
  • 1
    +1, worked beautifully for me. Is it possible to include the header row in each child CSV that gets created? – Lenwood Mar 15 '13 at 20:17
  • 1
    Excellent point. I've edited the answer to include a second version that does that. (I've also added `NF>1`, which excludes the lines that have just one field, so that the lines `...` in the original question don't get processed. – Mike Mar 15 '13 at 20:46
  • For me just first file of child CSV gets the header. What am I missing? – Mitendra Nov 17 '15 at 01:26
  • Got it! if(! files[$1]) ... files[$1]=1}; In my case, $1 was not changing but some other field. – Mitendra Nov 17 '15 at 02:06
2

It is very simple with pandas Python data analysis library:

from pandas.io.parsers import read_csv

df = read_csv(input_file, header=1, sep='\t', index_col=[0,1,2])
for (country_code, country_name), group in df.groupby(level=[0,1]):
    group.to_csv(country_code+'.csv')

Result

$ for f in *.csv ; do echo $f; cat $f; echo; done

ABW.csv
Country Code,Country Name,Indicator
ABW,Aruba,% Forest coverage
ABW,Aruba,% Literacy rate
ABW,Aruba,% Another indicator

ADO.csv
Country Code,Country Name,Indicator
ADO,Andorra,% Forest coverage
ADO,Andorra,% Literacy rate
ADO,Andorra,% Another indicator

AFG.csv
Country Code,Country Name,Indicator
AFG,Afghanistan,% Forest coverage
AFG,Afghanistan,% Literacy rate
AFG,Afghanistan,% Another indicator
jfs
  • 399,953
  • 195
  • 994
  • 1,670
1

In shell scripting.

First,awk '{print $1}' | sort | uniq > code.lst will give you a list of country codes. Then you can iterate through country code and select all the lines of youfilename.csv that matches the code using grep.

for c in `ls code.lst` do
   grep $c youfilename.csv > youfilename_$c.csv
done