
I am new to Python. I'm trying to read a CSV with 700 lines (including a header) and get a list with the unique values of the first column.

Sample CSV:

SKU;PRICE;SUPPLIER
X100;100;ABC
X100;120;ADD
X101;110;ABV
X102;100;ABC
X102;105;ABV
X100;119;ABG

I used the example here

How to create a list in Python with the unique values of a CSV file?

so I did the following:

import csv
mainlist=[]
with open('final_csv.csv', 'r', encoding='utf-8') as csvf:
    rows = csv.reader(csvf, delimiter=";")
    for row in rows:
        if row[0] not in rows:
            mainlist.append(row[0])
print(mainlist)

I noticed while debugging that rows shows as a single item, not 700 lines, and all I get is

['SKU']

What did I do wrong?

Thank you

dzang

3 Answers


A solution using pandas. Call the unique method on the correct column; this returns a pandas Series with the unique values in that column, which you can then convert to a list with the tolist method.

An example on the SKU column below.

import pandas as pd 

df = pd.read_csv('final_csv.csv', sep=";")
sku_unique = df['SKU'].unique().tolist()

If you don't know or don't care about the column name, you can use iloc with the column's position instead. Note that indexing starts at 0:

df.iloc[:,0].unique().tolist()

If the question intends to get only the values occurring exactly once, then you can use the value_counts method. This creates a Series whose index holds the SKU values and whose values hold the counts; you then convert the filtered index to a list in a similar manner. Building on the first example:

import pandas as pd 

df = pd.read_csv('final_csv.csv', sep=";")
sku_counts = df['SKU'].value_counts()
sku_single_counts = sku_counts[sku_counts == 1].index.tolist()
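
For a quick end-to-end check, both snippets can be run against the sample data from the question (wrapped in io.StringIO in place of the file; this demo is a sketch, not part of the original answer):

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of final_csv.csv
data = io.StringIO(
    "SKU;PRICE;SUPPLIER\n"
    "X100;100;ABC\n"
    "X100;120;ADD\n"
    "X101;110;ABV\n"
    "X102;100;ABC\n"
    "X102;105;ABV\n"
    "X100;119;ABG\n"
)

df = pd.read_csv(data, sep=";")

# Unique values, duplicates removed, order of first appearance kept
sku_unique = df['SKU'].unique().tolist()
print(sku_unique)   # ['X100', 'X101', 'X102']

# Values occurring exactly once
sku_counts = df['SKU'].value_counts()
sku_single_counts = sku_counts[sku_counts == 1].index.tolist()
print(sku_single_counts)   # ['X101']
```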
B.C
  • Even though I imported the pandas module version 0.24.2, I get "unable to resolve module pandas" in Visual Studio –  Mar 27 '19 at 10:44
  • This sounds like an issue with your setup, not with pandas itself. Once you get that set up, it will make handling these types of CSV files much easier. – B.C Mar 27 '19 at 10:50
  • Did you actually install pandas @Nikos? – Underoos Mar 27 '19 at 11:24
  • I removed it and installed it again, same result... maybe it's Visual Studio? –  Mar 27 '19 at 11:29
  • You need to do `pd.read_csv('final_csv.csv', sep=";")` or the semicolon isn't taken as a separator and the whole line goes into one column; but still, that doesn't return the right answer either – Nenri Mar 27 '19 at 13:18
  • but the question asks for "values occurring only once", not removing duplicates – Nenri Mar 27 '19 at 13:25
  • @Nenri both cases are now answered above – B.C Mar 27 '19 at 13:36

If you want the unique values of the first column, you could modify your code to use a set instead of a list. Maybe like this:

import collections
import csv
filename = 'final_csv.csv'

sku_list = []
with open(filename, 'r', encoding='utf-8') as f:
    csv_reader = csv.reader(f, delimiter=";")

    for i, row in enumerate(csv_reader):
        if i == 0:
            # skip the header
            continue

        try:
            sku = row[0]
            sku_list.append(sku)
        except IndexError:
            pass

print('All SKUs:')
print(sku_list)

sku_set = set(sku_list)
print('SKUs after removing duplicates:')
print(sku_set)

c = collections.Counter(sku_list)
sku_list_2 = [k for k, v in c.items() if v == 1]
print('SKUs that appear only once:')
print(sku_list_2)

with open('output.csv', 'w') as f:
    for sku in sorted(sku_set):
        f.write('{}\n'.format(sku))
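
As mentioned in the comments, the plain f.write loop at the end can be swapped for `csv.writer` if you'd rather stay within the csv module; a minimal sketch, with the set of SKUs hard-coded here for illustration (in the answer above it comes from sku_set):

```python
import csv

sku_set = {'X100', 'X101', 'X102'}  # stands in for the set built above

# newline='' is recommended by the csv docs when writing on any platform
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for sku in sorted(sku_set):
        writer.writerow([sku])  # one field per row, same output as the f.write version
```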
Ralf
  • I get `sku = row[0]` "list index out of range" again, why? –  Mar 27 '19 at 10:41
  • @Nikos it suggests that `row` is empty. What do you get with `print(i, row)`? – Ralf Mar 27 '19 at 10:43
  • @Nikos I edited my answer; you can wrap it inside a try/except block to account for empty lines in the CSV file. – Ralf Mar 27 '19 at 10:45
  • Yes it seems that it is working can I export it to a csv to check it better? –  Mar 27 '19 at 10:49
  • @Nikos I added lines to export the output to a new file; I did it without the csv module since there is only one field per line, but feel free to use `csv.writer` if you want. – Ralf Mar 27 '19 at 10:52
  • Thank you all for your responses, big thank you Ralf –  Mar 27 '19 at 10:55
  • You are aware this doesn't answer your question, right @Nikos? – Nenri Mar 27 '19 at 11:44
  • @Nenri could you elaborate on how this does not answer the question? The first paragraph of the question says "get a list with the unique values of the first csv column", and I think this answer does just that. – Ralf Mar 27 '19 at 12:20
  • @Ralf this code gets all values while removing the duplicates, which is not what was asked. For example, with `['X100', 'X100', 'X101', 'X102', 'X102', 'X100']` your code returns `X100 X102 X101` when the expected result is `X101` (the value occurring only once); look at my answer (or B.C's answer) for working code – Nenri Mar 27 '19 at 13:08
  • @Nenri Ah, I see how the question could also be interpreted that way. I edited my answer to include both options (using `collections.Counter`). – Ralf Mar 27 '19 at 14:18

A solution using neither pandas nor csv:

lines = open('file.csv', 'r').read().splitlines()[1:]

col0 = [v.split(';')[0] for v in lines]

uniques = list(filter(lambda x: col0.count(x) == 1, col0))  # filter returns an iterator in Python 3

or, using map (but less readable):

col0 = list(map(lambda line: line.split(';')[0], open('file.csv', 'r').read().splitlines()[1:]))

uniques = list(filter(lambda x: col0.count(x) == 1, col0))  # filter returns an iterator in Python 3
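
As a quick check, running the first variant on the sample rows from the question (hard-coded here, header already stripped) picks out the SKU that occurs exactly once:

```python
# Sample rows from the question, header line already removed
lines = ["X100;100;ABC", "X100;120;ADD", "X101;110;ABV",
         "X102;100;ABC", "X102;105;ABV", "X100;119;ABG"]

# First column of every row
col0 = [v.split(';')[0] for v in lines]

# Keep only values whose total count in the column is exactly 1;
# wrap in list() since filter returns an iterator in Python 3
uniques = list(filter(lambda x: col0.count(x) == 1, col0))
print(uniques)  # ['X101']
```

Note that `col0.count(x)` rescans the list for every element, so this is quadratic; fine for 700 lines, but collections.Counter scales better for large files.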
Nenri