How to get unique data from strings

Question

I have data like this. The strings are separated by commas.

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

What I am trying to do is put them all in one column (one under the other), so that it becomes this:

India1
India2
myIndia
Where
Here
Here
Here
Where
India
uyete
AFD
TTT

Then I keep only the unique ones, which leads to this:

India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT

The original data is in a .txt file, and I have tried to use numpy for this.

This is my code

#!/usr/bin/python
import numpy as np

# give a name to my data 
file_name = 'path to my data/test.txt'
# set my output 
with open ( 'output.txt' , 'w' ) as out:
    # read all the lines
    for n , line in enumerate ( open ( file_name ).readlines ( ) ):
        # split each stirg from another one by a comma
        item1 = file_name.split ( ',' )
    myList = ','.join ( map ( str , item1 ) )
    item2 = np.unique ( myList , return_inverse=True )
    # save the data into out
    out.write ( item2 )

I was getting TypeError: expected a character buffer object.

I searched for it and found several posts, such as "TypeError: expected a character buffer object - while trying to save integer to textfile".

Adding out.seek ( 0 ) still gave the same error,

but after changing it to out.write ( str(item2 )), thanks to "TypeError: expected a character buffer object", I get no error. However, the output shows this:

(array(['/path to the file/test.txt'], dtype='|S29'), array([0]))

Below is a solution that I tried to use:

import csv

data = []
def remove_quotes(file):
    for line in file:
        yield line.strip ( '"\n' )
with open ( 'test.txt' ) as f:
    reader = csv.reader ( remove_quotes ( f ) )
    for row in reader:
        data.extend ( row )

There is no error, but no data is generated either.

See **`unique_everseen`** in the [recipes](https://docs.python.org/2/library/itertools.html#recipes) section of the [**`itertools`**](https://docs.python.org/2/library/itertools.html) documentation. — Peter Wood, Dec 21 '16 at 13:55
File name cannot contain commas. You used the wrong variable — OneCricketeer, Dec 21 '16 at 13:57
@Peter Wood nice point, I was not aware of this `unique_everseen` — nik, Dec 21 '16 at 13:57
Also `myList` is a string, not a list. You joined `item1` back on the commas that you split by, therefore essentially recreating the comma separated `line` — OneCricketeer, Dec 21 '16 at 13:59
@cricket_007 so you mean I should not have used `item1 = file_name.split ( ',' )` — nik, Dec 21 '16 at 13:59
@cricket_007 I thought I should split the data because I want to have each string speerated from the other one `'path to my data/test.txt'.split(',')` also if i use `myList = " ".join ( map ( str , item1 ) ) the same empty output` — nik, Dec 21 '16 at 14:04
No, you want to split the *lines from the file*, not the *file name*. You used the wrong variable, as I've said a few times now. Also, the lines under `item1` should probably be indented — OneCricketeer, Dec 21 '16 at 14:11
@cricket_007 I don't get any indent problem. I moved them forward but nothing changes, the output is the same — nik, Dec 21 '16 at 14:14
I know you don't get an error, but think about the indentation you have. `myList` only gets the **last** `item1` because it is **outside** the for loop — OneCricketeer, Dec 21 '16 at 14:17
@cricket_007 if i indent it as you say, then I get an error `ValueError: I/O operation on closed file` — nik, Dec 21 '16 at 16:47
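
Putting the comments above together, a corrected version of the script might look like the sketch below. It assumes every line is wrapped in double quotes as in the question. `unique_items` is a hypothetical helper name, and the `sample` list stands in for the contents of test.txt.

```python
def unique_items(lines):
    """Return the comma-separated items of all lines, first occurrence only."""
    seen = set()
    result = []
    for line in lines:
        # Drop the newline, then the surrounding double quotes
        cleaned = line.strip().strip('"')
        # Split the *line* (not the file name) on commas
        for item in cleaned.split(','):
            item = item.strip()  # remove padding spaces
            if item and item not in seen:
                seen.add(item)
                result.append(item)
    return result


# Sample lines standing in for the contents of test.txt (assumption)
sample = [
    '"India1,India2,myIndia     "\n',
    '"Where,Here,Here   "\n',
    '"Here,Where,India,uyete"\n',
    '"AFD,TTT"\n',
]
unique = unique_items(sample)
print('\n'.join(unique))
```

With a real file, an open file object can be passed in place of `sample`, since iterating over a file yields its lines. Writing `'\n'.join(unique)` to the output file avoids the TypeError, because `write` expects a string rather than a tuple of arrays.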

blacksite · Accepted Answer · 2016-12-22T12:58:39.763

stack.txt below contains this:

"India1,India2,myIndia"
"Where,Here,Here"
"Here,Where,India,uyete"
"AFD,TTT"

Here you go:

from collections import OrderedDict

with open("stack.txt", "r") as f:
    # read the data from the file; strip() drops the newline and eval turns each quoted line into a plain string
    data = [eval(line.strip()) for line in f.readlines()]
    # get individual words into a list
    individual_elements = [word for row in data for word in row.split(",")]
    # remove duplicates and preserve order
    uniques = OrderedDict.fromkeys(individual_elements)   
    # convert from OrderedDict object to plain list
    final = [word for word in uniques]

print(final)

Which yields this:

['India1', 'India2', 'myIndia', 'Where', 'Here', 'India', 'uyete', 'AFD', 'TTT']

Edit: To get your desired output, just print the list in the format you want:

print("\n".join(final))

Which is equivalent, from an output standpoint, to this:

for x in final:
    print(x)

Which yields this:

India1
India2
myIndia
Where
Here
India
uyete
AFD
TTT
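
If you would rather not call `eval` on lines read from a file, the quotes can be stripped as plain text instead. This is only a sketch: it assumes Python 3.7 or later (where a plain `dict` keeps insertion order) and uses an in-memory `io.StringIO` as a stand-in for stack.txt.

```python
import io

# Stand-in for open("stack.txt"); same contents as above (assumption)
f = io.StringIO(
    '"India1,India2,myIndia"\n'
    '"Where,Here,Here"\n'
    '"Here,Where,India,uyete"\n'
    '"AFD,TTT"\n'
)

# Strip the newline and the surrounding double quotes without eval
rows = [line.strip().strip('"') for line in f]
# Flatten into individual words, trimming any padding spaces
words = [word.strip() for row in rows for word in row.split(",")]
# dict.fromkeys drops duplicates and keeps first-seen order (Python 3.7+)
final = list(dict.fromkeys(words))

print("\n".join(final))
```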

I like your answer already ! just one thing, would it be possible to have the output as a column without any `,`and `'` one under the other ? if so, I accept and like your answer — nik, Dec 22 '16 at 07:54
`final` is a `list` object, hence why it has `'` and `,` characters separating its string elements. Will update. — blacksite, Dec 22 '16 at 12:55

JDB · Answer 2 · 2016-12-21T13:57:57.610

-1

Why use numpy? Also, I'm not sure whether you want to use the same file for input and output.

#!/usr/bin/env python


# give a name to my data 
inputData = """India1,India2,myIndia
Where,Here,Here   
Here,Where,India,uyete
AFD,TTT"""

# if you want to read the data from a file, use read() so that
# inputData is a single string and .split("\n") below still works
#inputData = open(fileName, 'r').read()

outputData = ""
tempData = list()
for line in inputData.split("\n"):
    lineStripped = line.strip()
    lineSplit = lineStripped.split(',')
    lineElementsStripped = [element.strip() for element in lineSplit]
    tempData.extend( lineElementsStripped )
tempData = set(tempData)
outputData = "\n".join(tempData)
print("\nInputdata: \n%s" % inputData)
print("\nOutputdata: \n%s" % outputData)

edited Dec 21 '16 at 13:57

answered Dec 21 '16 at 13:56

Is maintaining order important? You should probably ask for clarification of the question before providing an answer. – Peter Wood Dec 21 '16 at 13:57
Everything which is not explicitly requested is not important to me. – JDB Dec 21 '16 at 14:03
Something like `from collections import OrderedDict; tempData = OrderedDict.fromkeys(tempData).keys()` should preserve the order. – blacksite Dec 21 '16 at 14:04
@not_a_robot what is this `tempData` – nik Dec 21 '16 at 14:09
@JDB the order is important for me. Look at my question above. I showed how the output looks like – nik Dec 21 '16 at 14:10
@not_a_robot I want to accept your answer if you correct it. I did not down vote your answer. The only problem I have is that the data structure you use is not the same as what I asked. Here I shared the data https://gist.github.com/anonymous/63b1a70e913c1453b0de9d7027b5973a if you correct your answer I liked and accept your answer – nik Dec 21 '16 at 22:32
@JDB I am going to accept your answer. Please read my comment above. By the way, I did not downvote your answer – nik Dec 21 '16 at 22:44
@nik I've added an answer above with the `collections.OrderedDict` implementation. – blacksite Dec 22 '16 at 00:29
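
To make the `OrderedDict` suggestion from these comments concrete: `tempData` is the list built inside the loop of the answer above, and the de-duplication has to be applied to that list in place of the `set(...)` call, because a set does not guarantee any order. A small sketch, with `tempData` written out by hand for illustration and `uniqueData` as a hypothetical name:

```python
from collections import OrderedDict

# The list as it looks after the loop, before any de-duplication
tempData = ['India1', 'India2', 'myIndia', 'Where', 'Here', 'Here',
            'Here', 'Where', 'India', 'uyete', 'AFD', 'TTT']

# Keys of an OrderedDict are unique and stay in insertion order
uniqueData = list(OrderedDict.fromkeys(tempData))
outputData = "\n".join(uniqueData)
print(outputData)
```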

Wayne Werner · Answer 3 · 2016-12-21T14:18:11.097

-1

It sounds like you probably have a csv file. You don't need numpy for that, the included batteries are all you need.

 import csv

 data = []
 with open('test.txt') as f:
     reader = csv.reader(f)
     for row in reader:
         data.extend(row)

You can .extend lists rather than .append to them. It's basically like saying

for thing in row:
    data.append(thing)

That will still leave the duplicates, though. If you don't care about order you can just make it a set and call .update() instead of extend:

 data = set()
 with open('test.txt') as f:
     reader = csv.reader(f)
     for row in reader:
         data.update(row)

And now everything is unique. But if you care about order, keep data as a list (as in the first snippet) and filter it down a bit:

unique_data = []
for thing in data:
    if thing not in unique_data:
        unique_data.append(thing)

If your test.txt file contains this text:

"India1,India2,myIndia     "
"Where,Here,Here   "
"Here,Where,India,uyete"
"AFD,TTT"

And not

India1,India2,myIndia     
Where,Here,Here   
Here,Where,India,uyete
AFD,TTT

Then you don't quite have a csv. You can either fix what's generating your csv or manually remove the quotes or just fix it on the fly.

def remove_quotes(file):
    for line in file:
        yield line.strip('"\n')

reader = csv.reader(remove_quotes(f))
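
For reference, the pieces above could be combined roughly as follows. This is a sketch rather than a drop-in script: it uses an in-memory `io.StringIO` as a stand-in for the opened test.txt, and the extra `strip()` on each item is an addition to remove the padding spaces that sit inside the quotes.

```python
import csv
import io


def remove_quotes(file):
    # Yield each line without the surrounding double quotes and newline
    for line in file:
        yield line.strip('"\n')


# In-memory stand-in for open('test.txt') (assumption)
f = io.StringIO(
    '"India1,India2,myIndia     "\n'
    '"Where,Here,Here   "\n'
    '"Here,Where,India,uyete"\n'
    '"AFD,TTT"\n'
)

data = []
# csv.reader accepts any iterable of strings, so the generator is passed directly
for row in csv.reader(remove_quotes(f)):
    # Trim the padding spaces left inside the quotes
    data.extend(item.strip() for item in row)

# Keep the first occurrence of each item, preserving order
unique_data = []
for thing in data:
    if thing not in unique_data:
        unique_data.append(thing)

print('\n'.join(unique_data))
```

Nothing is shown unless the result is printed or written to a file, which is why the earlier snippets appear to produce no output when run on their own.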

edited Dec 21 '16 at 14:18

answered Dec 21 '16 at 14:03

Does your file literally contain `"foo,bar,thing,quux"\n"next,line,goes,here"\n`? If so you'll want to either fix your csv or wrap the file. – Wayne Werner Dec 21 '16 at 14:09
yes I share an example here https://gist.github.com/anonymous/63b1a70e913c1453b0de9d7027b5973a – nik Dec 21 '16 at 14:15
BTW, the quotes are included in the file, apparently... There's been multiple questions from OP containing this data – OneCricketeer Dec 21 '16 at 14:18
@nik well then you definitely want the `remove_quotes` wrapper. – Wayne Werner Dec 21 '16 at 14:18
@Wayne Werner where to put the last `remove_quotes?`my mean merging the remove_quotes with the first solution you gave , how to pass the reader to that ? – nik Dec 21 '16 at 15:57
@Wayne Werner even if I correct my data, without quotation, your first method does not produce any output – nik Dec 21 '16 at 16:37
@cricket_007 I posted how I used your solution above which does not generate anything – nik Dec 21 '16 at 16:43
@nik I didn't give you a solution - I pointed out a logical flaw in your code – OneCricketeer Dec 21 '16 at 16:56
@nik I assumed that you would put this in your existing code and do your own output. – Wayne Werner Dec 21 '16 at 21:19
@Wayne Werner even without the quotation your solution does not work – nik Dec 21 '16 at 22:12
"does not work" is not [mcve]. If you can't explain what you expect then how do you expect anyone to help you? – Wayne Werner Dec 21 '16 at 22:15
@Wayne Werner does not work means I showed you , I shared with you a data and I showed which output I am looking for. Print the output of the first code you gave , you see what I mean with not working? – nik Dec 21 '16 at 22:30
@Wayne Werner **This** is the data https://gist.github.com/anonymous/63b1a70e913c1453b0de9d7027b5973a I want to have Unique Strings . Is it not clear enough??? – nik Dec 21 '16 at 22:31
No, because my code produces unique strings and you say that it doesn't. So apparently you mean something *entirely different than what you're saying*. – Wayne Werner Dec 21 '16 at 22:35
@Wayne Werner print the input and output in your answer then – nik Dec 21 '16 at 22:43
If you can't figure out how to add a print function you may want to go back to the [official python tutorial](https://docs.python.org/3/tutorial/index.html) – Wayne Werner Dec 22 '16 at 08:57
