Merging CSV's using only selected columns

Question

I have three CSV's that I am attempting to merge together. The three CSV's have a range of different columns though all 3 have a set of columns that are consistent between them.

I have the column headers for these columns stored in a Python List as strings. These columns are the only ones I want to merge.

As a side note, I have attempted to implement an answer in this thread using a library called brewery though am recieving an error message

'CSVDataSource' object has no attribute 'field_names'

I feel that this could be easily done using the CSV module though am unsure how to go about searching each CSV for the right columns and then only merging those columns. Another issue I am unsure on how to approach is ensuring the correct positioning of each column throughout the merged CSV. IE if columnA is the 5th across in the 1st CSV and then is the 4th across in the 2nd CSV how will I ensure they are in the same position in the merged CSV?

I am using Python 2.7 and unfortunately don't have access to Pandas.

Code example:

Columns_to_Merge = ['ColumnA','ColumnB','ColumnC']

# CSV1

ColumnA,ColumnB,ColumnF,ColumnC
2,3,4,9
8,2,5,7
1,2,3,4

# CSV2

ColumnD,ColumnA,ColumnC,ColumnB,ColumnH
2,3,4,9,12
8,2,5,7,2
1,2,3,4,5

# CSV3

ColumnH,ColumnJ,ColumnA,ColumnB,ColumnC
2,3,4,99,12
8,5,5,7,2
1,55,3,70,5

Please provide a [Minimal working example](https://stackoverflow.com/help/mcve) from which we can start. — languitar, Mar 02 '17 at 09:17
You could read the first line of each csv file and determine the position of the column, i.e. the index. Knowing the position you know which values to search for. — elena, Mar 02 '17 at 09:25

score 0 · Accepted Answer · answered Mar 02 '17 at 10:18

Since pandas cannot be used, I would use numpy as follows:

# first get all the columns of each csv file as lists
csv1_cols = ['ColumnA','ColumnB','ColumnF','ColumnC']
csv2_cols = ['ColumnD','ColumnA','ColumnC','ColumnB','ColumnH']
csv3_cols = ['ColumnH','ColumnJ','ColumnA','ColumnB','ColumnC']

# then get the indices of the columns that you want to keep
idxs_colA = [csv1_cols.index('ColumnA'), csv2_cols.index('ColumnA'), csv3_cols.index('ColumnA')]
idxs_colB = [csv1_cols.index('ColumnB'), csv2_cols.index('ColumnB'), csv3_cols.index('ColumnB')]
idxs_colC = [csv1_cols.index('ColumnC'), csv2_cols.index('ColumnC'), csv3_cols.index('ColumnC')]

# get the columns as vectors and flatten them
colA = np.array([csv1[:,idxs_colA[0]], csv2[:,idxs_colA[1]], csv3[:,idxs_colA[2]]]).flatten()
colB = np.array([csv1[:,idxs_colB[0]], csv2[:,idxs_colB[1]], csv3[:,idxs_colB[2]]]).flatten()
colC = np.array([csv1[:,idxs_colC[0]], csv2[:,idxs_colC[1]], csv3[:,idxs_colC[2]]]).flatten()

# finally, create a new np array (with the cols in the order you want)
# and transpose it
new_csv = np.array([colA, colB, colC]).T

Quite ugly, but it works.

Thanks very much! Should the variables csv1, csv2 and csv3 be filenames to their respective csv's? Furthermore, how do I then take the variable `new_csv` and export that as a CSV file? — Jamie, Mar 02 '17 at 10:32
csv1, csv2 and csv3 should contain the data of your csv files. You can read the csv files into numpy arrays like this: `csv1 = np.genfromtext('my_csv1.csv', delimiter=',')`. Then you can write the `new_csv` to a file like this: `np.savetxt('my_new_csv.csv', new_csv, delimiter=',')`. — Glrs, Mar 02 '17 at 10:42
In case your data are just integers, you may find it helpful to save them as integers. To do so, just add this `fmt='%i'` argument inside the `np.savetxt()`. — Glrs, Mar 02 '17 at 10:57

Merging CSV's using only selected columns

1 Answers1