
So I've built a function that looks through all the XML files in a folder, finds nodes with a given attribute (the speaker name), and writes each match as a row in a CSV file. Note that at the moment it writes everything to the same CSV file; I plan to change the file name per speaker once I've figured out the next step.

The next step that I was trying to do is to supply those speaker names from a list in a text file (I've also tried a csv file, and a list of dictionaries) and have the function applied to each of those speaker names individually.

I'm doing it with a function because I figured a for-loop iterating through a set of items within another for-loop iterating through a different set of items was kind of chancy, and a preliminary test I did with that, didn't prove that worry wrong.

When I paste in any of the items in this list individually as the argument to the function, it works. When I print the list after reading it in any of the ways I've tried, that works too. I just can't seem to get the two to talk.

I've tried to apply the function to each of the items in the following way, but all it does is print out the error I gave to my except statement and write the header row in the CSV (so I know it's at least calling the function):

speaker_list = open("UAS_Speakers.csv","r").readlines()
for item in speaker_list:
    look_for_speaker_in_files(item) 

or

with open("speaking.txt","r") as f:
    for x in f:       
        look_for_speaker_in_files(x)

For the heck of it, I even tried to open it as a list of dictionaries, since the data already had curly brackets around it. No change.

speaker_list = open("speaking.py","r")
for x in speaker_list:
    look_for_speaker_in_files(x)

I also tried this, modeled on a script of mine that took URLs from a list and ran a couple of urllib functions on them:

def main():
    with open("speaking.py","r") as speaker_list:
        for x in speaker_list:
            look_for_speaker_in_files(x)
if __name__ == "__main__":
    main()

I'm not sure whether the issue is that the whole list is being fed into the function at once when I do any of these, but in case something in the function itself is preventing this from working, here it is:

def look_for_speaker_in_files(speakerAttrib):
    c = csv.writer(open("allspeakers.csv","w"))
    c.writerow(["Name", "Filename", "Text"])
    for cr_file in glob.iglob('parsed/*.xml'):
        try:
          tree = etree.parse(cr_file)
          for node in tree.iter('speaking'):
             if node.attrib == speakerAttrib:  
                c.writerow([node.attrib, cr_file, node.text])
             else:
                 continue
        except:
          print "bad string " + cr_file
          continue
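
One thing that may help with debugging: the bare except above hides the reason a file failed, so every failure prints the same "bad string" message. A minimal sketch of a narrower handler follows. It is written in Python 3 and uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made so the sketch is self-contained), and try_parse is a hypothetical helper name, not something from the original script.

```python
import xml.etree.ElementTree as ET


def try_parse(path):
    """Parse one XML file.

    Returns (tree, None) on success, or (None, message) when the file
    is missing, unreadable, or not well-formed XML.
    """
    try:
        return ET.parse(path), None
    except (ET.ParseError, OSError) as err:
        # Keep the underlying error text so the cause is visible.
        return None, "bad file {}: {}".format(path, err)
```

With this, the loop can print the returned message and move on to the next file, which shows whether the problem is the XML itself or something else.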

Any help on this would be greatly appreciated, otherwise I'll just be stuck sorting this out by hand from OpenRefine or copy and pasting from a spreadsheet by the hundreds, and the thought of that makes my eyeballs burn.

Sample list items:

{'name': 'Mr. BEGICH'}
{'name': 'The SPEAKER pro tempore (Mr. Miller of Florida)'}
{'name': 'The Acting CHAIR'}
{'name': 'Mr. McKINLEY'}
{'quote': 'true', 'speaker': 'recorder'}
{'name': 'Mr. WAXMAN'}
{'name': 'Mr. MORAN'}
{'name': 'Mr. McKEON'}
{'quote': 'true', 'speaker': 'The Acting CHAIR'}
{'name': 'Mr. RIGELL'}
{'name': 'Mr. SMITH of Washington'}
{'name': 'Mr. KILMER'}
{'name': 'Mr. LAMBORN'}
{'name': 'Mr. CLEAVER'}
{'name': 'Mr. MICA'}
{'name': 'Ms. SPEIER'}
{'name': 'Mrs. ELLMERS'}

Sample files are in this folder: https://drive.google.com/folderview?id=0B7lGA34vOZItREhRbmF6Z3YtTnM&usp=sharing

  • Please provide us with a sample data inside the different input files you have listed in the query. – pmaniyan Apr 29 '16 at 16:48
  • Sure, just added, see above. – lisa_simpson_lp Apr 29 '16 at 17:39
  • The main problem was that you were passing speakerAttrib as a string and comparing it with the node's attributes. I converted it into a dictionary. The second change was to get the node's attributes as a list, rather than working with the lxml Element. – pmaniyan Apr 29 '16 at 19:32
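
The string-versus-dictionary point in the last comment can be illustrated with a small sketch. Each line read from the speaker file is a str that merely looks like a dict (and ends in a newline), so it never compares equal to a node's attributes; ast.literal_eval turns it into a real dict. The sketch uses Python 3 syntax, and parse_speaker_line is a hypothetical helper name.

```python
import ast


def parse_speaker_line(line):
    """Convert one line of the speaker list into a dict; return None for blank lines."""
    text = line.strip()
    if not text:
        return None
    # literal_eval only accepts Python literals, so it is safer than eval().
    return ast.literal_eval(text)


raw = "{'name': 'Mr. BEGICH'}\n"
speaker = parse_speaker_line(raw)

print(raw == {'name': 'Mr. BEGICH'})      # False: raw is still a str
print(speaker == {'name': 'Mr. BEGICH'})  # True: speaker is a dict
```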

1 Answer

Please see if this works for you.

I believe you need to open the allspeakers.csv file in append mode; otherwise it will be overwritten on each iteration of the for loop in main(). The alternative is to write to a new file on each iteration.

import csv
import glob
import ast
from os.path import isfile
from lxml import etree

def look_for_speaker_in_files(speakerAttrib):
    speakerDict = ast.literal_eval(speakerAttrib)
    l_file_exists = False
    if isfile("allspeakers.csv"):
        l_file_exists = True
    c = csv.writer(open("allspeakers.csv","a"))
    if not l_file_exists:
        c.writerow(["Name", "Filename", "Text"])
    lparser = etree.XMLParser(recover=True)
    for cr_file in glob.iglob('parsed/*.xml'):
        try:
          tree = etree.parse(cr_file,parser=lparser)
          for node in tree.iter('speaking'):
             if node.keys() == speakerDict.keys():
                c.writerow([node.attrib, cr_file, node.text])
             else:
                 continue
        except:
          print "bad string " + cr_file
          raise

def main():
    with open("UAS_speakers.txt","r") as speaker_list:
        for x in speaker_list:
            print x
            look_for_speaker_in_files(x)
if __name__ == "__main__":
    main()
pmaniyan
  • Thanks a million, that totally works! I do actually want to have it make a different file for each speaker at some point, but was doing baby steps and at least trying to iterate through the list first. Hopefully I'll get it from there and won't be back on here when I try for that. – lisa_simpson_lp Apr 29 '16 at 22:13
  • You are welcome. I would really appreciate it if you could accept my answer as the right one. Thanks. – pmaniyan Apr 29 '16 at 22:23
  • is that the check? just did that – lisa_simpson_lp Apr 29 '16 at 22:49
  • So I'm trying to get the code to start a different file for each speaker as it looks for that speaker through the files. So far it either creates files that are blank except for the header, or creates files that contain all the info from the XML files. I posted that as a separate question, but since the two are related, I'm wondering if you'd mind taking another look. http://stackoverflow.com/questions/36962273/how-to-create-different-files-for-each-time-a-function-is-performed-on-an-item-i - thanks! – lisa_simpson_lp May 01 '16 at 03:14
  • Hang on - this original code doesn't look for specific speakers, it just pops the whole thing in a csv file by speaker. Please look again – lisa_simpson_lp May 02 '16 at 13:31
  • This thread is closed now, it would be better if you ask a fresh question so that it gets the right attention – pmaniyan May 02 '16 at 14:24
  • The first answer I gave met your expectation, which is why you commented that it worked. Since you now have a different requirement, I do not see why my answer to this question was incorrect. I have answered your new requirement in the other thread. I hope you will reconsider your decision to revoke the Best Answer on this question. Thanks. – pmaniyan May 02 '16 at 15:51
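
On the point raised in the comments that the code does not look for specific speakers: comparing node.keys() with speakerDict.keys() checks only the attribute names, so every node with a name attribute matches. Below is a rough sketch of comparing the full attribute dictionaries and writing one CSV per speaker. It is not the code from the follow-up thread. It assumes Python 3, uses the standard library's xml.etree.ElementTree in place of lxml (so there is no recover=True option), and assumes the XML has speaking elements like those in the question. The names speaker_filename and write_speaker_csv and the file-naming scheme are hypothetical.

```python
import csv
import os
import re
import xml.etree.ElementTree as ET


def speaker_filename(speaker):
    """Build a file name from the speaker dict's values (hypothetical naming scheme)."""
    label = "_".join(speaker[key] for key in sorted(speaker))
    return re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_") + ".csv"


def write_speaker_csv(speaker, xml_paths, out_dir="."):
    """Write the 'speaking' nodes whose attributes equal `speaker` to that speaker's CSV."""
    out_path = os.path.join(out_dir, speaker_filename(speaker))
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "Filename", "Text"])
        for path in xml_paths:
            for node in ET.parse(path).iter("speaking"):
                # Compare names AND values, not just the attribute names.
                if dict(node.attrib) == speaker:
                    writer.writerow([node.attrib.get("name", ""), path, node.text])
    return out_path
```

Because each speaker gets a separate file, opening it in "w" mode is fine here; the header is written once per file and nothing from another speaker is overwritten. It could be called once per parsed line of the speaker list, for example write_speaker_csv({'name': 'Mr. BEGICH'}, glob.glob('parsed/*.xml')) after importing glob.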