I have a Python script that parses the appendix of a PDF and compares the data elements it finds to a JSON file, in order to figure out which elements we are missing.
The end result is a pandas DataFrame with all the information I then need to compare with the existing JSON structure. Right now I am doing the comparison like this:
# All the PDF processing and JSON loading is not included here
# Convert pandas DataFrame columns to arrays for the comparison
tags = tagList.iloc[:, 0].array
tagName = tagList.iloc[:, 1].array
tagDescription = tagList.iloc[:, 3].array
numNewTags = 0  # counter for tags that are not in the known list
print("\n\nNEW TAGS found:\n===============")
for idx, tagItem in enumerate(tags):
    if tagItem not in tagListKnown:
        print("new tag: %s\t%s\t%s" % (tagItem, stripLineBreaks(tagName[idx]), tagDescription[idx]))
        numNewTags += 1
print("\n%d\tTags total\n%d\tknown tags\n%d\tnew tags" % (len(tags), len(known_tags_from_json), numNewTags))
I use the tags array to find out which indexes are not in the known list, and then I want to write those rows to a new JSON file, but it seems to me that I am overcomplicating things. Is there a way to address the pandas DataFrame directly with the indexes I found, or even to avoid the initial array comparison altogether?
(I am aware that this code is most likely not very Pythonic.)
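To make clearer what I mean by addressing the DataFrame directly, here is a rough sketch of the kind of thing I am after. It uses a boolean mask built with `isin` instead of looping over arrays. The sample data, the column names, and the plain `str.replace` standing in for my `stripLineBreaks` helper are all made up for illustration, so this is an untested idea against my real data rather than a working solution:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame produced by the PDF parsing
tagList = pd.DataFrame(
    [
        ["A001", "Patient\nName", "PN", "Name of the patient"],
        ["A002", "Study Date", "DA", "Date the study started"],
        ["A003", "Modality", "CS", "Type of equipment"],
        ["A004", "Series\nNumber", "IS", "Number of the series"],
    ],
    columns=["tag", "name", "type", "description"],  # assumed column names
)

# Hypothetical stand-in for the tags already present in the JSON file
tagListKnown = ["A001", "A003"]

# Boolean mask: True for rows whose first column is NOT a known tag
is_new = ~tagList.iloc[:, 0].isin(tagListKnown)

# Select the new rows and keep only columns 0, 1 and 3
newTags = tagList.loc[is_new].iloc[:, [0, 1, 3]].copy()

# Assumed replacement for stripLineBreaks: turn line breaks into spaces
newTags["name"] = newTags["name"].str.replace("\n", " ", regex=False)

print("%d\tTags total\n%d\tnew tags" % (len(tagList), len(newTags)))

# Serialize the new rows as a JSON string, one object per row
new_json = newTags.to_json(orient="records")
```

If something like this is sound, `to_json` could presumably be given a file path to write the new JSON file directly, but I do not know whether this is the recommended way to do it.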