1

I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.

CSV1

reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433

CSV2

reference
234 8A
RT4 VV8

CODE

 import pandas as pd
    df1 = pd.read_csv(r'd:\temp\data1.csv')
    df2 = pd.read_csv(r'd:\temp\data2.csv')
    df3 = pd.merge(df1,df2, on= 'reference', how='inner')
    df3.to_csv('outpt.csv')

I am getting a keyerror for reference when I run it, could it be the spaces in the data that is causing the issue? The data is comma delimited.

fightstarr20
  • 11,682
  • 40
  • 154
  • 278

1 Answers1

1

most probably you have either leading or trailing white spaces in reference column after reading your CSV files.

you can check it in this way:

print(df1.columns.tolist())
print(df2.columns.tolist())

you can "fix" it by adding sep=r'\s*,\s*' parameter to your pd.read_csv() calls

Example:

In [74]: df1
Out[74]:
  reference    name  house
0    234 8A    john     37
1   564 68R    bill      3
2   RT4 VV8    kate     88
3      76AA  harry     433

In [75]: df2
Out[75]:
  reference
0     234 8A
1    RT4 VV8

In [76]: df2.columns.tolist()
Out[76]: ['reference ']

In [77]: df1.columns.tolist()
Out[77]: ['reference', 'name', 'house']

In [78]: df1.merge(df2, on='reference')

...

KeyError: 'reference'

fixing df2:

data = """\
reference 
234 8A
RT4 VV8"""
df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')

now it works:

In [80]: df1.merge(df2, on='reference')
Out[80]:
  reference  name  house
0    234 8A  john     37
1   RT4 VV8  kate     88
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419