0

im struggling with 2 csv files which I have imported

the csv files look like this:

csv1

planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, black
...

csv2

planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, blue
...

in both csv files, there are the same planets but in a different order and sometimes with different values (a mismatch)

the data for each planet (diameter, discovered and color) has been entered independently. I wanted to Cross-check the two sheets and find all the fields that are mismatched. Then I want to generate a new file that contains one line per error with a description of the error.

for example: sceptri: mismatch (black/blue)

here is my code so far


with open('planets1.csv') as csvfile:
    a = csv.reader(csvfile, delimiter=',')
    data_a= list(a)
    for row in a:
        print(row)

with open('planets2.csv') as csvfile:
    b = csv.reader(csvfile, delimiter=',')
    data_b= list(b)
    for row in b:
        print(row)

print(data_a)
print(data_b)

c= [data_a]
d= [data_b]```


thank you in advance for your help!
Shuzo
  • 31
  • 1
  • 1
  • 5

3 Answers3

0

Assuming the name of planets are correct in both files, here is my proposal

# Working with list of list, which could be get csv file reading:
csv1 = [["sceptri",33.41685587,"28-11-1611 05:15", "black"],
        ["foo",35.41685587,"29-11-1611 05:15", "black"],
        ["bar",38.7,"29-11-1611 05:15", "black"],]
csv2 = [["foo",35.41685587,"29-11-1611 05:15", "black"],
        ["bar",38.17,"29-11-1611 05:15", "black"],
        ["sceptri",33.41685587,"28-11-1611 05:15", "blue"]]

# A list to contain the errors:
new_file = []
# A dict to check if a planet has already been processed:
a_dict ={}
# Let's read all planet data:
for planet in csv1+csv2:
    # Check if planet is already as a key in a_dict:
    if planet[0] in a_dict:
        # Yes, sir, need to check discrepancies.
        if a_dict[planet[0]] != planet[1:]:
            # we have some differences in some values.
            # Put both set of values in python sets to differences:
            error = set(planet[1:]) ^ set(a_dict[planet[0]])
            # Append [planet_name, diff.param1, diff_param2] to new_file:
            new_file.append([planet[0]]+list(error))
    else:
        # the planet name becomes a dict key, other param are key value:
        a_dict[planet[0]] = planet[1:]

print(new_file)
# [['bar', 38.17, 38.7], ['sceptri', 'black', 'blue']]

The list new_file may be saved as new file, see Writing a list to file

Eric Frigade
  • 128
  • 6
0

I'd suggest using Pandas for a task like this.

Firstly, you'll need to read the csv contents into dataframe objects. This can be done as follows:

import pandas as pd

# make a dataframe from each csv file
df1 = pd.read_csv('planets1.csv')
df2 = pd.read_csv('planets2.csv')

You may want to declare names for each column if your CSV file doesn't have them.

colnames = ['col1', 'col2', ..., 'coln']

df1 = pd.read_csv('planets1.csv', names=colnames, index_col=0) 
df2 = pd.read_csv('planets2.csv', names=colnames, index_col=0) 
# use index_col=0 if csv already has an index column


For the sake of reproducible code, I will define dataframe objects without a csv below:
import pandas as pd

# example column names
colnames = ['A','B','C']

# example dataframes
df1 = pd.DataFrame([[0,3,6], [4,5,6], [3,2,5]], columns=colnames)
df2 = pd.DataFrame([[1,3,1], [4,3,6], [3,6,5]], columns=colnames)

Note that df1 looks like this:

    A   B   C
---------------
0   0   3   6
1   4   5   6
2   3   2   5

And df2 looks like this:

    A   B   C
---------------
0   1   3   1
1   4   3   6
2   3   6   5

The following code compares dataframes, concatenate the comparison to a new dataframe, and then saves the result to a CSV:

# define the condition you want to check for (i.e., mismatches)
mask = (df1 != df2)

# df1[mask], df2[mask] will replace matched values with NaN (Not a Number), and leave mismatches
# dropna(how='all') will remove rows filled entirely with NaNs
errors_1 = df1[mask].dropna(how='all')
errors_2 = df2[mask].dropna(how='all')

# add labels to column names
errors_1.columns += '_1' # for planets 1
errors_2.columns += '_2' # for planets 2

# you can now combine horizontally into one big dataframe
errors = pd.concat([errors_1,errors_2],axis=1)

# if you want, reorder the columns of `errors` so compared columns are next to each other
errors = errors.reindex(sorted(errors.columns), axis=1)

# if you don't like the clutter of NaN values, you can replace them with fillna()
errors = errors.fillna('_')

# save to a csv
errors.to_csv('mismatches.csv')

The final result looks something like this:

    A_1 A_2 B_1 B_2 C_1 C_2
-----------------------------
0   0   1   _   _   6   1
1   _   _   5   3   _   _
2   _   _   2   6   _   _

Hope this helps.

bug_spray
  • 1,445
  • 1
  • 9
  • 23
0

This kind of problem can be solved by sorting the rows from the csv files, and then comparing the corresponding rows to see if there are differences.

This approach uses a functional style to perform the comparisons and will compare any number of csv files.

It assumes that the csvs contain the same number of records, and that the columns are in the same order.

import contextlib
import csv


def compare_files(readers):
    colnames = [next(reader) for reader in readers][0]
    sorted_readers = [sorted(r) for r in readers]
    for gen in [compare_rows(colnames, rows) for rows in zip(*sorted_readers)]:
        yield from gen


def compare_rows(colnames, rows):
    col_iter = zip(*rows)
    # Be sure we're comparing the same planets.
    planets = set(next(col_iter))
    assert len(planets) == 1, planets
    planet = planets.pop()
    for (colname, *vals) in zip(colnames, col_iter):
        if len(set(*vals)) > 1:
            yield f"{planet} mismatch {colname} ({'/'.join(*vals)})"


def main(outfile, *infiles):
    with contextlib.ExitStack() as stack:
        csvs = [stack.enter_context(open(fname)) for fname in infiles]
        readers = [csv.reader(f) for f in csvs]
        with open(outfile, 'w') as out:
            for result in compare_files(readers):
                out.write(result + '\n')


if __name__ == "__main__":
    main('mismatches.txt', 'planets1.csv', 'planets2.csv')
snakecharmerb
  • 47,570
  • 11
  • 100
  • 153