
I have a .txt document with this type of text:

[(“Vazhdo”,”verb”),(“të”,”particle”),(“ecësh”,”verb”),(“!”,”excl.”)]

(which represents a sentence and the Parts of Speech tags for each word)

I want to have a list of lists in Python, like this:

[[(“Vazhdo”,”verb”),(“të”,”particle”),(“ecësh”,”verb”),(“!”,”excl.”)]]

But I obtain this:

['[(“Vazhdo”,”verb”),(“të”,”particle”),(“ecësh”,”verb”),(“!”,”excl.”)]\n']

The code I'm using is:

import io
f=io.open("test.txt", mode="r", encoding="utf-8-sig")
f_list = list(f)

How can I avoid the ['[ .... ]\n'] ?

Thank you!

Kena
  • The best solution here might be to modify whatever *produces* that file so that it writes, for example, proper JSON instead. Obviously, that is not always possible. – Ture Pålsson Feb 09 '22 at 06:04
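As a rough illustration of the JSON suggestion above, here is a minimal sketch of what the producing side could write and what the reading side would get back. The variable names are made up for this example, and it assumes you control the code that generates the file.

```python
import json

# Hypothetical data on the producing side: one tagged sentence.
sentence = [("Vazhdo", "verb"), ("të", "particle"), ("ecësh", "verb"), ("!", "excl.")]

# json.dumps writes tuples as JSON arrays; ensure_ascii=False keeps "ë" readable.
line = json.dumps(sentence, ensure_ascii=False)
print(line)

# JSON has no tuple type, so loading gives lists; convert back if tuples matter.
restored = [tuple(pair) for pair in json.loads(line)]
print(restored)
```

Note that the pairs come back as lists, so the conversion to tuples is only needed if the rest of the code expects tuples.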

3 Answers

3

it looks like you can just do

import json
data = json.load(open('test.txt'))

This answer was wrong, sorry: [("word","QQ")] is NOT valid JSON, because JSON has no tuple syntax (parentheses are not allowed).

Instead, you should be able to do:

import ast
import io

with io.open("test.txt", mode="r", encoding="utf-8-sig") as f:
    data = ast.literal_eval(f.read())

here is my version

import io,ast,requests

#text file available at
text_url = "https://gist.githubusercontent.com/joranbeasley/a50d940d9ac47e8458f027d3cc88e236/raw/3a65169d30e653e085284de16b1ee715f3596c95/example.txt"
with open("example.txt","wb") as f:
    # download and save textfile
    f.write(requests.get(text_url).content)

data = ast.literal_eval(io.open('example.txt',encoding='utf8').read())
print(data)
print(data[0])
print(data[0][0])

results in

[('Vazhdo', 'verb'), ('të', 'particle'), ('ecësh', 'verb'), ('!', 'excl.')]
('Vazhdo', 'verb')
Vazhdo
Joran Beasley
  • I tried it, but I receive this error because of the encoding: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 16: character maps to <undefined> – Kena Feb 09 '22 at 05:38
  • I guess you need `io.open("test.txt", mode="r", encoding="utf-8-sig")` instead of just `open` :/ – Joran Beasley Feb 09 '22 at 05:40
  • It gives this error after adding it: JSONDecodeError: Expecting value: line 1 column 2 (char 1) – Kena Feb 09 '22 at 05:42
  • @BrikenaLiko yeah ... oops it was my bad (see answer edit) – Joran Beasley Feb 09 '22 at 05:55
  • What does it give on your computer if you replicate the txt file with the text above? Because if it works for you, I don't know what I'm doing wrong. I just copy-pasted it and I get the error: SyntaxError: invalid syntax – Kena Feb 09 '22 at 06:05
  • Can you post your actual text file? I copied and pasted it and it worked fine, but these are probably encoding issues that don't show up with copy and paste. If the file actually contains those funny quotes, that might be the problem (a lot of the time those are introduced by copy-paste errors). Do you have any ability to change how the file is generated, even slightly? This looks like someone was trying to write JSON but did it wrong. – Joran Beasley Feb 09 '22 at 06:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241849/discussion-between-brikena-liko-and-joran-beasley). – Kena Feb 09 '22 at 06:17
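The SyntaxError discussed in the comments above is consistent with the file using typographic quotes (“ and ”) as string delimiters, which Python does not accept. Byte 0x9d in the earlier UnicodeDecodeError is also the last byte of ” in UTF-8. A minimal sketch of one possible workaround, assuming the file really looks like the sample in the question and that the words themselves never contain quote characters, is to replace the typographic quotes with straight ones before calling ast.literal_eval. The helper name parse_tagged_sentence is made up for this example.

```python
import ast

# Assumption: typographic quotes (U+201C, U+201D) are used only as string
# delimiters, as in the sample from the question.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})


def parse_tagged_sentence(text):
    """Parse one '[(word, tag), ...]' string into a list of tuples."""
    normalized = text.strip().translate(CURLY_TO_STRAIGHT)
    return ast.literal_eval(normalized)


# The line from the question, including the trailing newline.
sample = '[(“Vazhdo”,”verb”),(“të”,”particle”),(“ecësh”,”verb”),(“!”,”excl.”)]\n'
parsed = parse_tagged_sentence(sample)
print(parsed)
print(parsed[0])
```

This sketch would break if a word or tag contained a straight or typographic double quote, so fixing whatever generates the file remains the more robust option.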
1

io.open() returns a file object, and calling list() on it gives one string per line. To get a list of lists instead of a list of strings, you need to evaluate each line of the .txt file.

Here's how you can accomplish that:

temp = ['[("Vazhdo","verb"),("të","particle"),("ecësh","verb"),("!","excl.")]\n']

# eval() executes arbitrary code, so only use it on input you trust
f_list = []
for line in temp:
    f_list.append(eval(line.strip()))

print(f_list)
# [[('Vazhdo', 'verb'), ('të', 'particle'), ('ecësh', 'verb'), ('!', 'excl.')]]

# OR, as a list comprehension over the same input
f_list = [eval(line.strip()) for line in temp]
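The same per-line idea can be sketched with ast.literal_eval, which only accepts Python literals and is therefore a safer choice than eval for file contents. This is a sketch under assumptions: it assumes every non-empty line is a valid Python literal with straight quotes, and the function name load_sentences and the temporary demo file are made up for this example.

```python
import ast
import io
import os
import tempfile


def load_sentences(path):
    """Return one parsed list per non-empty line of the file."""
    sentences = []
    with io.open(path, mode="r", encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                sentences.append(ast.literal_eval(line))
    return sentences


# Demo with a throwaway file; its name and contents are invented for this example.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.txt")
    with io.open(demo_path, mode="w", encoding="utf-8") as f:
        f.write('[("Vazhdo","verb"),("të","particle"),("ecësh","verb"),("!","excl.")]\n')
    result = load_sentences(demo_path)

print(result)
```

Each line of the file becomes one inner list, so a file with several sentences gives a list with several entries.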
0

You can remove the trailing newline with the rstrip method (the element is still a string afterwards, so it still needs to be parsed), like:

f_list[0] = f_list[0].rstrip()
  • I tried it, it gives me the same beginning again: ['[(“Vazhdo”, – Kena Feb 09 '22 at 05:44
  • When I try it: `f_list = ['[(“Vazhdo”,”verb”),(“të”,”particle”),(“ecësh”,”verb”),(“!”,”excl.”)]\n']; print('\n' in f_list[0]); f_list[0] = f_list[0].rstrip(); print('\n' in f_list[0])`, the first output is True, so there is a trailing newline, but after rstrip the output is False, so the newline was removed. –  Feb 09 '22 at 05:47