
I have a Python script that parses the appendix of a PDF and compares the data elements it finds against a file, in order to figure out which elements we are missing.

The end result is a pandas dataframe with all the information I then need to compare with the existing JSON structure. Right now I am doing the comparison like this:

# all the processing of the pdf and the loading of the json is not included here
# (tagList, tagListKnown, known_tags_from_json and stripLineBreaks are defined there)

# Convert pandas dataframe columns to arrays for the comparison
tags = tagList.iloc[:, 0].array
tagName = tagList.iloc[:, 1].array
tagDescription = tagList.iloc[:, 3].array

numNewTags = 0
print("\n\nNEW TAGS found:\n===============")
for idx, tagItem in enumerate(tags):
    if tagItem not in tagListKnown:
        print("new tag: %s\t%s\t%s" % (tagItem, stripLineBreaks(tagName[idx]), tagDescription[idx]))
        numNewTags += 1
print("\n%d\tTags total\n%d\tknown tags\n%d\tnew tags" % (len(tags), len(known_tags_from_json), numNewTags))


I use the tags array to find out which indexes are not in the known list, and then I want to write a new JSON file from those rows, but it seems to me like I am overcomplicating things. Is there a way to address the pandas dataframe directly with the indexes I found, and maybe even avoid the first array comparison altogether?

(I am aware that this code is most likely not very pythonic)
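
Roughly, I imagine something like the sketch below; the column position, the tagListKnown list and the output filename are just assumptions carried over from my code above:

# Boolean mask: True for every row whose tag (column 0) is not in the known list
mask = ~tagList.iloc[:, 0].isin(tagListKnown)

# Address the dataframe directly with that mask instead of looping over an array
newTags = tagList.loc[mask]
print("%d new tags out of %d total" % (len(newTags), len(tagList)))

# Write only the new rows to a new json file ("new_tags.json" is a placeholder name)
newTags.to_json("new_tags.json", orient="records", indent=2)

Would isin be the right tool here? As far as I understand, it builds the mask over the whole column at once, so the per-row loop and the separate name/description arrays would no longer be needed.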

JoSSte
  • You should convert the pdf (tabula-py) and the json (pandas.read_json or pandas.json_normalize) into two pandas tables and work with those, since vectorized operations outperform loops on large amounts of data; see the sketch after these comments. – Сергей Кох Sep 23 '22 at 12:31
  • OK, so I guess I have to recode and rephrase my question – JoSSte Sep 23 '22 at 20:28
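
A minimal sketch of the two-table approach suggested in the comment above, assuming the pdf appendix can be read with tabula-py and the known tags sit under a "tag" key in the json (file names and the column name are placeholders):

import pandas as pd
import tabula  # tabula-py

# Read all tables from the pdf appendix into one dataframe
pdfTables = tabula.read_pdf("appendix.pdf", pages="all")
tagList = pd.concat(pdfTables, ignore_index=True)

# Read the known tags from the json into a second dataframe
knownTags = pd.read_json("known_tags.json")

# Vectorized comparison: keep only the rows whose tag is not already known
newTags = tagList[~tagList.iloc[:, 0].isin(knownTags["tag"])]
newTags.to_json("new_tags.json", orient="records", indent=2)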

0 Answers