2

I have a bunch of files in a directory named with nameid_cityid.txt, nameid and cityid being the ids of name (integer(10)) and city (integer(10)) in mydata table.

While the following solution works, I am doing type conversions since fetchall fetches 'L' and the file name tuple of nameid, cityid are strings,..

If you can suggest a pythonic or more elegant way of doing the same, that will be awesome, for me and the communtiy!

What I am trying to achieve : Find those files from a directory that don't have a record in the database and then do something with that file, like parse/move/delete it.

MySQL table mydata :

nameid  cityid
15633   45632
2354    76894

Python :

for pdffile in os.listdir(filepath):
    cityid, nameid = pdffile.strip('.txt').split('_')[0], pdffile.strip('.txt').split('_')[1]    
    cursor.execute("select cityid, nameid from mydata")
    alreadyparsed = cursor.fetchall()
    targetvalues = ((str(cityid), str(nameid)) for cityid, nameid in alreadyparsed)
    if (int(cityid), int(nameid)) in alreadyparsed:
        print cityid, nameid, "Found"
    else:
        print cityid, nameid, "Not found"
ThinkCode
  • 7,841
  • 21
  • 73
  • 92

2 Answers2

1

You could perform the concatenation in SQL, which will return a string:

SELECT CONCAT(nameid, '_', cityid, '.txt') FROM mydata
eggyal
  • 122,705
  • 18
  • 212
  • 237
  • And iterate through each file in the directory to see which one isn't found in the table by matching the .txt file? I am trying to do it in the least number of iterations. I will time them and see which one takes less time I guess. Thank you! – ThinkCode Jun 05 '12 at 16:06
  • @ThinkCode: Sorry, which are you trying to do - find files for which there is no database record, or vice versa? – eggyal Jun 05 '12 at 16:07
  • Find files which don't exist in the database. – ThinkCode Jun 05 '12 at 16:08
  • @ThinkCode: Perhaps build a single query `SELECT * FROM mydata WHERE (nameid, cityid) NOT IN ((15633,45632), (2354,76894), etc)`? – eggyal Jun 05 '12 at 16:10
  • I am trying to see if a file doesn't exist in the database and then parse that file. So, I think I should do the opposite - find if a file exists in the database and if not, then parse it. Like you said, I can also construct a list of all files in the directory and match it against the database list. I can then open the resulting file directly and parse it. – ThinkCode Jun 05 '12 at 16:15
  • @ThinkCode: the approach in my previous comment will give you a list of all the files you need to parse. You would build it by iterating over the directory. – eggyal Jun 05 '12 at 16:16
  • SELECT * FROM mydata WHERE (nameid, cityid) NOT IN ((15633,45632), (2354,76894), etc) - It should be the other way around cos' the files I am looking for shouldn't be in the file_directory_list. – ThinkCode Jun 05 '12 at 16:24
1

I'd use a set for quick and easy testing:

cursor.execute("select CONCAT(nameid, '_', cityid, '.txt') from mydata")
present = set([r[0] for r in cursor])

for pdffile in os.listdir(filepath):
    nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))
    print nameid, cityid,
    print "Found" if pdffile in present else "Not found"

First, I've pulled the query outside of the filename loop; no point in querying the same set of rows each time.

Secondly, I'll let MySQL generate filenames for me using CONCAT for ease of collecting the information into a set.

Thirdly, because we now have a set of filenames, testing each individual filename against the set is a simple pdffile in present test.

And finally, I've simplified your filename splitting logic to one line.

Now, if all you want is a set of filenames that are not present yet in the database (rather than enumerate which ones are and which ones are not), just use a set operation:

cursor.execute("select CONCAT(nameid, '_', cityid, '.txt') from mydata")
present = set([r[0] for r in cursor])

for pdffile in (set(os.listdir(filepath)) - present):
    nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))
    print nameid, cityid, "Found"

Here we use the .difference operation (with the - operator) to remove all the filenames for which there are already rows in the database, in one simple operation.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • I like the set based approach. I won't even have to do this 'nameid, cityid = map(int, pdffile.rsplit('.', 1)[0].split('_'))' since we already have the desired list of files (since the existing ones are subtracted already). Thank you! – ThinkCode Jun 05 '12 at 16:27