
I have a huge (1.5 GB) TSV (tab-separated values) file that I'm processing with Python. The file is line-based, but it has some ill-formatted lines that I wish to skip. My code is as follows:

fo = open(output, 'w')    
with open(filename) as f:
    i = 0
    for line in f:
        print i
        try:  #to account for the ill-formatted lines
            user_hash, artist_hash, artist, playcount = line.split('\t')
            fo.write('{0}\t{1}\t{2}'.format(hash_map[user_hash], artist, playcount))
            i = i+1
        except:
            print "error in user_hash : " + user_hash
            continue

Now the problem is that the program stops as soon as it catches the first exception: it just prints "error in user_hash" and then exits. It should have continued, because I know that the file has 17 million+ lines and `i` only reached 433919.

Why is this happening?

Thanks for reading.

hshihab
  • Something probably happened to the file and you can't write to it anymore, but it's hard to see since you're just catching all exceptions and not displaying anything. That's why you should never use a bare `except:`. Remove that and see what the exception is. – Oin Jun 25 '14 at 12:57
  • @Oin The exception that was raised was ValueError: user_hash, artist_hash, artist, playcount = line.split('\t') ValueError: need more than 3 values to unpack – hshihab Jun 25 '14 at 13:01
  • http://stackoverflow.com/questions/24053900/efficient-way-to-aggregate-and-remove-duplicates-from-very-large-password-list. i think this will be helpful – sundar nataraj Jun 25 '14 at 13:01
  • @omu_negru as this : fo = open(output, 'w') – hshihab Jun 25 '14 at 13:02
  • I would try to print out the error as well, just to see what type it is: `except Exception as e:` – omu_negru Jun 25 '14 at 13:05
  • @omu_negru the exception was this : user_hash, artist_hash, artist, playcount = line.split('\t') ValueError: need more than 3 values to unpack – hshihab Jun 25 '14 at 13:06
  • also..what use is the `continue` if that's the last instruction in the loop anyway ? Just remove that and print a "Done" message outside the loop...see if that gets called – omu_negru Jun 25 '14 at 13:27
  • You are right .. in my original code i didn't use it but that didn't make any difference, also "Done" gets printed outside the loop – hshihab Jun 25 '14 at 13:40
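The advice in the comments (catch only the exception you expect, and show it) can be sketched roughly as below. This is only an illustration, not the asker's actual fix: `parse_line` is a hypothetical helper name, and the code is written so that it runs under both Python 2 and Python 3. Note that in the original loop, `user_hash` inside the `except` block still holds the value from the previous good line (or is undefined on the first line), so the sketch reports the offending line itself instead.

```python
def parse_line(line):
    """Return (user_hash, artist_hash, artist, playcount), or None if malformed.

    parse_line is a hypothetical helper, not part of the original script.
    """
    try:
        # Unpacking raises ValueError when the line does not have exactly 4 fields.
        user_hash, artist_hash, artist, playcount = line.rstrip('\n').split('\t')
    except ValueError as e:
        # Only the unpacking error is caught; any other exception still propagates.
        print('skipping malformed line {0!r}: {1}'.format(line, e))
        return None
    return user_hash, artist_hash, artist, playcount
```

Inside the reading loop, a `None` result would simply mean "skip this line", while unexpected errors (for example a missing key in `hash_map`) would still surface with a full traceback.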

1 Answer


I think I found what's causing the problem: the file I'm reading contains a lot of '^Z' characters, which I think is what causes the program to terminate.

So what's the best way to detect lines that contain these characters and skip them while processing the file?

hshihab
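One plausible explanation, assuming the script runs on Windows: a file opened there in text mode can treat the byte 0x1A (shown as ^Z) as end of file, so the loop ends quietly instead of raising an error. Under that assumption, a minimal sketch of one way to skip such lines is to open the file in binary mode and filter on the raw byte. `iter_good_rows` and `expected_fields` are hypothetical names introduced for illustration, and the fields come back as byte strings.

```python
CTRL_Z = b'\x1a'  # the byte that many editors display as ^Z


def iter_good_rows(path, expected_fields=4):
    """Yield lists of byte-string fields from well-formed lines only.

    iter_good_rows is a hypothetical helper, not part of the original script.
    """
    # Binary mode: the bytes are read as-is, with no text-mode end-of-file handling.
    with open(path, 'rb') as f:
        for line in f:
            if CTRL_Z in line:
                continue  # skip lines that contain a ^Z byte
            fields = line.rstrip(b'\r\n').split(b'\t')
            if len(fields) != expected_fields:
                continue  # skip ill-formatted lines
            yield fields
```

Each yielded row could then be unpacked as `user_hash, artist_hash, artist, playcount = row` and written to the output file, which would also need to be opened in binary mode (or the fields decoded first).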