
I have a semicolon-separated CSV file of the following form:

indx1; string1; char1; entry1 
indx2; string1; char2; entry2 
indx3; string2; char2; entry3 
indx4; string1; char1; entry4 
indx5; string3; char2; entry5 

I want to get the unique entries of the 2nd and 3rd columns of this file (indices 1 and 2 when counting from zero) as lists, without using pandas or numpy. In particular, these are the lists I want:

[string1, string2, string3] 
[char1, char2]

The order doesn't matter, and I would like the operation to be fast.

Presently, I am reading the file (say 'data.csv') using the command

import csv

with open('data.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')

I am using Python 2.7. What is the fastest way to do this? I would appreciate any help.

Ji Won Song
  • Do you want unique combinations of `(col1, col2)` or all unique `col1` and all unique `col2` values? – nbwoodward Oct 29 '18 at 14:54
  • Possible duplicate of [How to create a list in Python with the unique values of a CSV file?](https://stackoverflow.com/questions/24441606/how-to-create-a-list-in-python-with-the-unique-values-of-a-csv-file) – jtweeder Oct 29 '18 at 15:14

2 Answers


You could use sets to keep track of the already seen values in the needed columns. Since you say that the order doesn't matter, you could just convert the sets to lists after processing all rows:

import csv

col1, col2 = set(), set()

with open('data.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';', skipinitialspace=True)        
    for row in csv_reader:
        col1.add(row[1])
        col2.add(row[2])

print list(col1), list(col2)  # ['string1', 'string3', 'string2'] ['char2', 'char1']
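
The same set-based idea can be wrapped in a small reusable function. This is only a sketch: the name `unique_column_values` and the in-memory `sample` rows are made up for illustration, and it assumes every non-empty row has all the requested columns. It should run unchanged on Python 2.7 and Python 3.

```python
import csv


def unique_column_values(lines, columns, delimiter=';'):
    """Return one set of distinct values per requested column index.

    `lines` can be an open file or any iterable of strings.
    """
    seen = [set() for _ in columns]
    reader = csv.reader(lines, delimiter=delimiter, skipinitialspace=True)
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines
            continue
        for bucket, index in zip(seen, columns):
            bucket.add(row[index].strip())
    return seen


# Hypothetical sample mirroring the rows in the question.
sample = [
    "indx1; string1; char1; entry1 ",
    "indx2; string1; char2; entry2 ",
    "indx3; string2; char2; entry3 ",
    "indx4; string1; char1; entry4 ",
    "indx5; string3; char2; entry5 ",
]

strings, chars = unique_column_values(sample, columns=(1, 2))
print(sorted(strings))  # ['string1', 'string2', 'string3']
print(sorted(chars))    # ['char1', 'char2']
```

For a real file, pass the open file object in place of `sample`, for example `with open('data.csv') as f: strings, chars = unique_column_values(f, (1, 2))`.
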
Eugene Yarmash

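This should work. Dictionary keys are unique, so each dictionary ends up holding the distinct values of one column. You can use it as a benchmark.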

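import csv

# Dict keys are unique, so each dict collects the distinct values of one column.
myDict1 = {}
myDict2 = {}
with open('data.csv') as csv_file:
    # skipinitialspace drops the blank that follows each ';' in the sample data
    csv_reader = csv.reader(csv_file, delimiter=';', skipinitialspace=True)
    for row in csv_reader:
        myDict1[row[1]] = 0
        myDict2[row[2]] = 0

# list() gives a plain list on Python 3 as well, where keys() returns a view
x = list(myDict1.keys())
y = list(myDict2.keys())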
jimifiki
  • Thanks jimifiki, your solution was very helpful. It worked. =) – Ji Won Song Oct 29 '18 at 15:47
  • hi @jimifiki, I am getting the output like `dict_keys(['blla1','blla2'])` is there any way of printing only the keys without the `dict_keys` so to print only `['blla1','blla2']` – asha Jul 16 '20 at 09:46
  • sure @AlbionShala `list(myDict.keys())` constructs a list out of the dict_keys. So I would write `print(list(myDict.keys()))`, this should be fine. Have fun with Python's data structures ;-) – jimifiki Jul 16 '20 at 16:42
  • Actually in that way it prints all of them, so there are not only unique keys. What I did was just to iterate through `x` in your previous example, so `for y in x` ... `print(y)`. Thanks – asha Jul 17 '20 at 08:02
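
To illustrate the `dict_keys` point from the comments: a dict can hold each key only once, so converting its keys to a list yields each value a single time. The sketch below uses a made-up `values` list rather than a real file, and uses `collections.OrderedDict` so that first-seen order is kept on Python 2.7 as well (plain dicts only guarantee insertion order from Python 3.7 on).

```python
from collections import OrderedDict

# Hypothetical column values with repeats, as they might be read from the file.
values = ['char1', 'char2', 'char2', 'char1', 'char2']

# fromkeys() stores each distinct value once, in first-seen order.
unique = OrderedDict.fromkeys(values)

# list() turns the keys view (Python 3) into a plain list.
keys = list(unique.keys())
print(keys)  # ['char1', 'char2']
```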