0

I have two large csv files and I want to compare column1 in csv1 with column1 in csv2. I was able to do this using Python List where I read csv1 and throw column1 in list1, do the same thing with csv2 and then check to see if element in list1 is present in list2

olist = []
def oldList(self):
    for row in self.csvreaderOld:
        self.olist.append(row[1])

nlist = []
def newList(self):
    for row in self.csvreaderNew:
        self.nlist.append(row[1])

def new_list(self):
    return [item for item in self.olist if item not in self.nlist]

the code works but can a long time to complete. I am trying to see if I can use dictionary instead, see if that would be faster, so I can compare keys in dictionary1 exist in dictionary2 but so far havent been successfully owing to my limited knowledge.

Ronron
  • 69
  • 1
  • 2
  • 11

3 Answers3

1

If it's a big CSV file or your'e planning to continue working with tables, I would suggest doing it with the Pandas module.

To be honest, even if it's a small file, or you're not going to continue working with tables, Pandas is an excellent module.

From what I understand (and I might be mistaken), for reading CSV files, Pandas is one of the quickest libraries to do so.

import pandas as pd

df = pd.read_csv("path to your csv file", use_cols = ["column1", "column2"])

def new_list(df):
    return [item for item in df["column2"].values if item not in df["column1"].values]

It's important to use .values when checking for an item in a pandas series (when you're extracting a column in a DataFrame you're getting a pandas series)

You could also use list(df["column1"]) and the other methods suggested in How to determine whether a Pandas Column contains a particular value for determining whether a value is contains in a pandas column

for example :

df = pd.DataFrame({"column1":[1,2,3,4], "column2":[2,3,4,5]})

the data frame would be

column1   column2
1   2
2   3
3   4
4   5

and new_line would return [5]

  • Hi @'Guy vandam', you have to read both files, according to the question requirement. – Itzik Chaimov May 24 '21 at 14:06
  • 1
    Hey @ItzikChaimov, You're right, I got confused with the column and file numbering. It's a similar answer, you can read 2 dataframes, df1 and df2 and get `column1` from each. also, you should use `usecols=` instead of `columns=` when reading from a csv in pandas. My bad.. – Guy Van-Dam May 25 '21 at 07:32
0

You can read both files into objects and compare in a single loop. Here is a short code snippet for the idea (not class implementation):

fsOld = open('oldFile.csv', 'r')
fsNew = open('newFile.csv', 'r')
fsLinesOld = fsOld.readlines()
fsLinesNew = fsNew.readlines()
outList = []

# assumes lines are same for both files data:
for i in range(0, fsLinesOld.__len__(), 1):
    if ( fsLinesOld[i] == fsLinesNew[i]):
        outList.append(fsLinesOld[i])
Itzik Chaimov
  • 89
  • 1
  • 10
0

First of all, change the way of reading the CSV files, if you want just one column mention that in usecols, like this

df = pd.read_csv("sample_file.csv", usecols=col_list)

And second, you can use set difference if you are not comparing row to row, like this

set(df.col.to_list()).difference(set(df2.col.to_list()))
Pawan Jain
  • 815
  • 3
  • 15