
I have a text file containing coordinates in the form of:

[-1.38795678, 54.90352965]
[-3.2115, 55.95530556] 
[0.00315428, 51.50285246]

I want to iterate through each coordinate and check which polygon it falls in (UK counties from a shapefile); however, I am not sure how to tokenise the numbers so that I can write code along the lines of:

for point in coordinates:
    for poly in polygons:
        if point in poly:
            print(poly)
            break

At the moment they are strings, but I want each line to become a pair of floats so that the program can try to locate that point in a polygon.
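For context, here is a minimal sketch of both steps together. It is not taken from the answers below; it assumes the `shapely` and `fiona` packages, a hypothetical `uk_counties.shp` shapefile, and the `sundayCoordinates.txt` file mentioned in the comments:

from ast import literal_eval

import fiona
from shapely.geometry import Point, shape

# Parse each non-blank line of the text file into a [lon, lat] pair.
with open("sundayCoordinates.txt") as f:
    coords = [literal_eval(line) for line in f if line.strip()]

# Load the county polygons once, keeping each feature's attributes for printing.
with fiona.open("uk_counties.shp") as src:
    counties = [(feat["properties"], shape(feat["geometry"])) for feat in src]

# Check every point against every polygon, as in the pseudocode above.
for lon, lat in coords:
    point = Point(lon, lat)
    for props, poly in counties:
        if poly.contains(point):
            print(props)
            break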

JTH

2 Answers


You could turn the string into a tuple using literal_eval.

>>> from ast import literal_eval
>>> s = "[-1.38795678, 54.90352965], [-3.2115, 55.95530556], [0.00315428, 51.50285246]"
>>> seq = literal_eval(s)
>>> print(seq[0][1])
54.90352965

Edit: if the coordinates are on separate lines with no commas,

from ast import literal_eval

s = """[-1.38795678, 54.90352965]
[-3.2115, 55.95530556]
[0.00315428, 51.50285246]"""

seq = [literal_eval(line) for line in s.split("\n")]
# or
seq = literal_eval(s.replace("\n", ","))
print(seq[0][1])
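Applied directly to the file (filename taken from the comments below), a sketch of the same per-line approach would be:

from ast import literal_eval

# Parse each non-blank line of the coordinate file separately.
with open("sundayCoordinates.txt") as f:
    seq = [literal_eval(line) for line in f if line.strip()]

print(seq[0][1])  # -> 54.90352965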
Kevin
  • If I pass the file object without reading it I get `ValueError: malformed node or string: <_io.TextIOWrapper name='/Users/JoshuaHawley/sundayCoordinates.txt' mode='r' encoding='US-ASCII'>`, and if I do read it and then call `seq = literal_eval(s)`, it raises `SyntaxError: invalid syntax` at line 2 (`[-3.2115, 55.95530556]`). I just took the coordinates from tweets and want to map densities across the UK, but this part has taken forever to work out. – JTH Dec 01 '15 at 14:22
  • @JoshuaHawley: it looks like your data file doesn't have commas between lines; try applying `literal_eval` to each line separately. – Hugh Bothwell Dec 01 '15 at 14:24
  • The file just has the coordinates listed one per line, `[-1.38795678, 54.90352965]` `[-3.2115, 55.95530556]` `[0.00315428, 51.50285246]`, with no commas between the lines. – JTH Dec 01 '15 at 14:26
  • What happens if you do `seq = literal_eval(file.read().replace("\n", ","))`? – Kevin Dec 01 '15 at 14:28
  • @Kevin No errors occur when running that; the code you posted before also seemed to work. – JTH Dec 01 '15 at 14:33

You can also use a regex, which is considerably faster than `ast.literal_eval`:

import re
with open("in.txt") as f:
    r = re.compile(r"[-]?\d+\.\d+")
    data = [list(map(float, r.findall(line))) for line in f]
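For one of the sample lines from the question, the pattern pulls out the two numbers as strings, which `float` then converts (a quick illustrative check):

>>> import re
>>> r = re.compile(r"[-]?\d+\.\d+")
>>> r.findall("[-1.38795678, 54.90352965]")
['-1.38795678', '54.90352965']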

Some timings:

In [14]: %%timeit
with open("test.txt") as f:
    data = [literal_eval(line) for line in f]
   ....: 
100 loops, best of 3: 2.01 ms per loop

In [15]: %%timeit
with open("test.txt") as f:
    r = re.compile(r"[-]?\d+\.\d+")
    data = [list(map(float, r.findall(line))) for line in f]
   ....: 
1000 loops, best of 3: 403 µs per loop

 with open("test.txt") as f:
    r = re.compile("[-]?\d+\.\d+")
    data = [list(map(float, r.findall(line))) for line in f]
   ....:     

In [38]: with open("test.txt") as f:
           data2 = [literal_eval(line) for line in f]
   ....:     

In [39]: data == data2
Out[39]: True

Just stripping and splitting would be faster again:

In [40]: %%timeit
   ....: with open("test.txt") as f:
   ....:     data = [list(map(float, line.strip("[]\n").split(","))) for line in f]
   ....: 
1000 loops, best of 3: 249 µs per loop
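Outside of IPython, a sketch of the same strip-and-split parsing (with a space added to the strip set to tolerate trailing whitespace) would be:

with open("in.txt") as f:
    data = [list(map(float, line.strip("[] \n").split(","))) for line in f]

print(data[0][1])  # -> 54.90352965

This only works because every line has exactly the `[lon, lat]` layout shown in the question; anything less regular is better handled by the regex or `literal_eval` approaches above.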
Padraic Cunningham