I have a file, file.txt, which has integers in its first two columns:

col_a,col_b
1001021,1010045
2001021,2010045
3001021,3010045
4001021,4010045 and so on

Now, using Python, I get a variable var_a = 2002000.

How can I find the range in "file.txt" within which var_a lies?

Expected output: 2001021,2010045

I have tried the following:

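with open("file.txt", "r") as a:
    a_line = a.readlines()
    for line in a_line[1:]:  # skip the col_a,col_b header row
        line_sp = line.split(',')
        # convert to int; comparing an int with a str raises TypeError
        start, stop = int(line_sp[0]), int(line_sp[1])
        if start < var_a < stop:
            print('%d,%d' % (start, stop))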

Since the file has more than a million records, this is time-consuming. Is there a better way to do the same without a for loop?

Praveen P
  • No, there's no way to avoid looping through the lines of the file. If the values are in ascending order, you could `break` out of the loop as soon as you encounter one that's bigger than what you're looking for. If they aren't sorted, it might be worthwhile to sort them first (since you're reading the entire file into memory). – martineau Dec 13 '19 at 08:57
  • I suggest using a specialized library for this, like csv or Pandas. – AMC Dec 13 '19 at 10:43

3 Answers

> Since the file has more than a million records, this is time-consuming. Is there a better way to do the same without a for loop?

Unfortunately, in the worst case you have to iterate over all records in the file, and the only way you can achieve that is some kind of for loop. So the complexity of a one-off search through the file will always be at least O(n).
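A linear scan does not always have to touch every row, though. As a minimal sketch, assuming the rows are sorted by their first column in ascending order (as in the question's sample, and as suggested in the comments), the loop can stop as soon as a row starts above the target. The function name `find_range_sorted` and the in-memory `io.StringIO` sample are illustrative and not from the original post; the bounds are treated as inclusive here.

```python
import io

def find_range_sorted(lines, target):
    """Return (start, stop) of the first row containing target, or None.

    Assumes rows are sorted by start value in ascending order.
    """
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header and any malformed row
        start, stop = int(parts[0]), int(parts[1])
        if start > target:
            break  # sorted input: no later row can contain target
        if target <= stop:
            return start, stop
    return None

# In-memory stand-in for file.txt (hypothetical sample data)
sample = io.StringIO("col_a,col_b\n1001021,1010045\n2001021,2010045\n3001021,3010045\n")
print(find_range_sorted(sample, 2002000))
```

The worst case is still O(n), for example when the target lies beyond the last row, but values near the start of the file are resolved after reading only a few lines.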

Maciej M

It is better to read your file line by line (not all into memory at once) and store its content as range objects, so you can look up multiple numbers. Ranges are stored quite efficiently, and you only have to read your file once to check more than one number.

Since Python 3.7, dictionaries preserve insertion order. If your file is sorted, you only iterate the dictionary until the first range that contains the number; for numbers that are not in any range, you iterate the whole dictionary.

Create file:

fn = "n.txt"

with open(fn, "w") as f: 
    f.write("""1001021,1010045
2001021,2010045
3001021,3010045

garbage
4001021,4010045""")

Process file:

fn = "n.txt"

# read in
data = {}

with open(fn) as f:
    for nr,line in enumerate(f):
        line = line.strip()
        if line:
            try:
                start,stop = map(int, line.split(","))
                data[nr] = range(start,stop+1)
            except ValueError as e:
                pass # print(f"Bad data ({e}) in line {nr}")


look_for_nums = [800, 1001021, 3001039, 4010043, 9999999]

for look_for in look_for_nums:
    items_checked = 0
    for nr,rng in data.items():
        items_checked += 1
        if look_for in rng:
            print(f"Found {look_for} it in line {nr} in range: {rng.start},{rng.stop-1}", end=" ")
            break
    else:
        print(f"{look_for} not found", end=" ")
    print(f"after {items_checked } checks")    

Output:

800 not found after 4 checks
Found 1001021 it in line 0 in range: 1001021,1010045 after 1 checks
Found 3001039 it in line 2 in range: 3001021,3010045 after 3 checks
Found 4010043 it in line 5 in range: 4001021,4010045 after 4 checks
9999999 not found after 4 checks

There are better ways to store such a file of ranges, for example in a tree-like data structure. Research k-d trees to get even faster results if you need them. They partition the ranges in a smarter way, so you do not need a linear search to find the right bucket.
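For ranges that are sorted and do not overlap, as in the question's sample, a full tree structure may not even be needed: a binary search over the start values already avoids the linear scan. The following is a minimal sketch under those two assumptions, using the standard-library `bisect` module. `load_ranges` and `find_range` are hypothetical helper names, and the file still has to be read once in O(n) before each lookup becomes O(log n).

```python
import bisect
import io

def load_ranges(lines):
    """Parse 'start,stop' rows into two parallel lists, sorted by start."""
    pairs = []
    for line in lines:
        try:
            start, stop = map(int, line.strip().split(","))
        except ValueError:
            continue  # header, blank line, or garbage
        pairs.append((start, stop))
    pairs.sort()
    starts = [p[0] for p in pairs]
    stops = [p[1] for p in pairs]
    return starts, stops

def find_range(starts, stops, target):
    """Return the inclusive (start, stop) containing target, or None.

    Assumes the ranges do not overlap.
    """
    # Index of the last range that starts at or before target
    i = bisect.bisect_right(starts, target) - 1
    if i >= 0 and target <= stops[i]:
        return starts[i], stops[i]
    return None

# Same sample content as n.txt above, kept in memory for the sketch
text = "1001021,1010045\n2001021,2010045\n3001021,3010045\n\ngarbage\n4001021,4010045"
starts, stops = load_ranges(io.StringIO(text))
for n in [800, 1001021, 3001039, 4010043, 9999999]:
    print(n, find_range(starts, stops, n))
```

If the ranges can overlap, this simple approach may miss matches, and an interval-oriented structure would be the safer choice.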

This answer to "Data Structure to store Integer Range, Query the ranges and modify the ranges" provides more things to research.

Patrick Artner

Assuming each line in the file has the correct format, you can do something like the following.

var_a = 2002000
with open("file.txt") as file:
    for l in file:
        a,b = map(int, l.split(',', 1))  # each line must have only two comma separated numbers
        if a < var_a < b:
            print(l)  # use the line as you want
            break  # if you need only the first occurrence, break the loop now

Note that you'll have to do additional verifications/workarounds if the file format is not guaranteed.

Obviously, you have to iterate through all the lines in the worst case. But we don't load all the lines into memory at once, so as soon as the answer is found, the rest of the file is never read (assuming you are looking only for the first match).
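The extra verification mentioned above could look roughly like the following sketch, which uses the standard-library `csv` module (also suggested in the comments) and skips the header row and any malformed line. The function name `first_matching_row` is hypothetical, and the exclusive bounds simply mirror the `a < var_a < b` comparison used in this answer.

```python
import csv
import io

def first_matching_row(f, target):
    """Stream rows from an open text file; return the first (a, b) with a < target < b."""
    for row in csv.reader(f):
        if len(row) < 2:
            continue  # blank or incomplete line
        try:
            a, b = int(row[0]), int(row[1])
        except ValueError:
            continue  # header row or non-numeric data
        if a < target < b:
            return a, b  # stop reading as soon as a match is found
    return None

# In-memory stand-in for file.txt (hypothetical sample data)
sample = io.StringIO("col_a,col_b\n1001021,1010045\n2001021,2010045\nbad,row\n")
print(first_matching_row(sample, 2002000))
```

With a real file, the same function can be called as `first_matching_row(open_file, var_a)` inside a `with open("file.txt", newline="") as open_file:` block.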

Anubis