
I am new to Python. I'm trying to read a CSV with 700 lines (including a header) and get a list with the unique values of the first column.

Sample CSV:

SKU;PRICE;SUPPLIER
X100;100;ABC
X100;120;ADD
X101;110;ABV
X102;100;ABC
X102;105;ABV
X100;119;ABG

I used the example here

How to create a list in Python with the unique values of a CSV file?

so I did the following:

import csv
mainlist=[]
with open('final_csv.csv', 'r', encoding='utf-8') as csvf:
    rows = csv.reader(csvf, delimiter=";")
    for row in rows:
        if row[0] not in rows:
            mainlist.append(row[0])
print(mainlist)

I noticed while debugging that rows shows as a single item, not 700 lines, and all I get is

['SKU']

What did I do wrong?

Thank you

dzang

3 Answers


A solution using pandas. Call the unique method on the correct column; this returns a pandas Series with the unique values in that column, which you can then convert to a list with the tolist method.

An example on the SKU column below.

import pandas as pd 

df = pd.read_csv('final_csv.csv', sep=";")
sku_unique = df['SKU'].unique().tolist()

If you don't know or don't care about the column name, you can use iloc with the column's position instead. Note that indexing starts at 0:

df.iloc[:,0].unique().tolist()

If the question intends to get only the values occurring exactly once, then you can use the value_counts method. This creates a Series whose index holds the SKU values and whose values hold the counts; you then convert the filtered index to a list in a similar manner. Building on the first example:

import pandas as pd 

df = pd.read_csv('final_csv.csv', sep=";")
sku_counts = df['SKU'].value_counts()
sku_single_counts = sku_counts[sku_counts == 1].index.tolist()
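
For a quick end-to-end check, both snippets can be run against the sample data from the question (wrapped in io.StringIO in place of the file; this demo is a sketch, not part of the original answer):

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of final_csv.csv
data = io.StringIO(
    "SKU;PRICE;SUPPLIER\n"
    "X100;100;ABC\n"
    "X100;120;ADD\n"
    "X101;110;ABV\n"
    "X102;100;ABC\n"
    "X102;105;ABV\n"
    "X100;119;ABG\n"
)

df = pd.read_csv(data, sep=";")

# Unique values, duplicates removed, order of first appearance kept
sku_unique = df['SKU'].unique().tolist()
print(sku_unique)   # ['X100', 'X101', 'X102']

# Values occurring exactly once
sku_counts = df['SKU'].value_counts()
sku_single_counts = sku_counts[sku_counts == 1].index.tolist()
print(sku_single_counts)   # ['X101']
```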
B.C
  • Even though I imported the pandas module version 0.24.2, I get "unable to resolve module pandas" in Visual Studio –  Mar 27 '19 at 10:44
  • This sounds like an issue with your setup, not with pandas itself. Once you get that set up, it will make handling these types of CSV files much easier. – B.C Mar 27 '19 at 10:50
  • Did you actually install pandas @Nikos? – Underoos Mar 27 '19 at 11:24
  • I removed it and installed it again, same result... maybe it's Visual Studio? –  Mar 27 '19 at 11:29
  • You need to do `pd.read_csv('final_csv.csv', sep=";")` or the semicolon isn't taken as a separator and the whole line goes into one column; but still, that doesn't return the right answer either – Nenri Mar 27 '19 at 13:18
  • but the question asks for "values occurring only once", not removing duplicates – Nenri Mar 27 '19 at 13:25
  • @Nenri both cases are now answered above – B.C Mar 27 '19 at 13:36

If you want the unique values of the first column, you could modify your code to use a set instead of a list. Maybe like this:

import collections
import csv
filename = 'final_csv.csv'

sku_list = []
with open(filename, 'r', encoding='utf-8') as f:
    csv_reader = csv.reader(f, delimiter=";")

    for i, row in enumerate(csv_reader):
        if i == 0:
            # skip the header
            continue

        try:
            sku = row[0]
            sku_list.append(sku)
        except IndexError:
            pass

print('All SKUs:')
print(sku_list)

sku_set = set(sku_list)
print('SKUs after removing duplicates:')
print(sku_set)

c = collections.Counter(sku_list)
sku_list_2 = [k for k, v in c.items() if v == 1]
print('SKUs that appear only once:')
print(sku_list_2)

with open('output.csv', 'w') as f:
    for sku in sorted(sku_set):
        f.write('{}\n'.format(sku))
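
As mentioned in the comments, the plain f.write loop at the end can be swapped for `csv.writer` if you'd rather stay within the csv module; a minimal sketch, with the set of SKUs hard-coded here for illustration (in the answer above it comes from sku_set):

```python
import csv

sku_set = {'X100', 'X101', 'X102'}  # stands in for the set built above

# newline='' is recommended by the csv docs when writing on any platform
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for sku in sorted(sku_set):
        writer.writerow([sku])  # one field per row, same output as the f.write version
```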
Ralf
  • I get `sku = row[0]` "list index out of range" again, why? –  Mar 27 '19 at 10:41
  • @Nikos it suggests that `row` is empty. What do you get with `print(i, row)`? – Ralf Mar 27 '19 at 10:43
  • @Nikos I edited my answer; you can wrap it inside a try/except block to account for empty lines in the CSV file. – Ralf Mar 27 '19 at 10:45
  • Yes it seems that it is working can I export it to a csv to check it better? –  Mar 27 '19 at 10:49
  • @Nikos I added lines to export the output to a new file; I did it without the csv module since there is only one field per line, but feel free to use `csv.writer` if you want. – Ralf Mar 27 '19 at 10:52
  • Thank you all for your responses, big thank you Ralf –  Mar 27 '19 at 10:55
  • You are aware this doesn't answer your question, right @Nikos? – Nenri Mar 27 '19 at 11:44
  • @Nenri could you elaborate on how this does not answer the question? The first paragraph of the question says "get a list with the unique values of the first csv column", and I think this answer does just that. – Ralf Mar 27 '19 at 12:20
  • @Ralf this code gets all values while removing the duplicates, which is not what was asked. For example, with `['X100', 'X100', 'X101', 'X102', 'X102', 'X100']` your code returns `X100 X102 X101` when the expected result is `X101` (the value occurring only once); look at my answer (or B.C's answer) for working code – Nenri Mar 27 '19 at 13:08
  • @Nenri Ah, I see how the question could also be interpreted that way. I edited my answer to include both options (using `collections.Counter`). – Ralf Mar 27 '19 at 14:18

A solution using neither pandas nor csv:

lines = open('file.csv', 'r').read().splitlines()[1:]

col0 = [v.split(';')[0] for v in lines]

uniques = list(filter(lambda x: col0.count(x) == 1, col0))  # filter returns an iterator in Python 3

or, using map (but less readable):

col0 = list(map(lambda line: line.split(';')[0], open('file.csv', 'r').read().splitlines()[1:]))

uniques = list(filter(lambda x: col0.count(x) == 1, col0))  # filter returns an iterator in Python 3
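
As a quick check, running the first variant on the sample rows from the question (hard-coded here, header already stripped) picks out the SKU that occurs exactly once:

```python
# Sample rows from the question, header line already removed
lines = ["X100;100;ABC", "X100;120;ADD", "X101;110;ABV",
         "X102;100;ABC", "X102;105;ABV", "X100;119;ABG"]

# First column of every row
col0 = [v.split(';')[0] for v in lines]

# Keep only values whose total count in the column is exactly 1;
# wrap in list() since filter returns an iterator in Python 3
uniques = list(filter(lambda x: col0.count(x) == 1, col0))
print(uniques)  # ['X101']
```

Note that `col0.count(x)` rescans the list for every element, so this is quadratic; fine for 700 lines, but collections.Counter scales better for large files.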
Nenri