0

I am using the eldar package in Python to write boolean queries. The dataframes on which I am running my queries contain a lot of empty cells, which are read as NoneType objects. Eldar, however, raises an error as one of the operations it performs is to set all cases to "lower", so it expects a string.

The error notification reads:

'NoneType' object has no attribute 'lower'

I have thus tried the following:

  1. Make sure that the input dataframe df is read as string:

df = pd.read_excel(file, sheet_name='FactoidList', axis=1, ignore_index=False, sort=False, dtype=str)

  1. Replace empty fields for None (two options):

df = df.where((pd.notnull(df)), None) df = df.replace({np.nan: None})

  1. Drop "na":

dropna(inplace = True)

These solutions are based on similar issues I found in various forums, but none of them do the trick. My empty cells are either still read as NoneTpye, or the entire data frame is empty.

If I was only running eldar on one column at a time, I could easily exclude empty cells that are empty, but I want to integrate eldar into a numpy condition list to check several columns per row.

My latest attempt to work with try and except unfortunately excludes the entire column from being considered if one cell is empty.

The relevant section of my much longer script looks like this:

# read all excel files in directory as one data frame

frame_list=[]
for item in os.listdir(filenames):
    file = os.path.join(filenames, item)
    df = pd.read_excel(file, sheet_name='FactoidList', axis=1, ignore_index=False, sort=False, dtype=str)
    df = df.replace({np.nan: None})
    #df = df.where((pd.notnull(df)), None) # replace empty fields for None
    frame_list.append(df)

f = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)
    
# read factoids from data frame

pers_f=(f[['pers_name']]) # retrieve data from selected column
pers_list=pers_f.values.tolist() # convert data frame to sorted list
pers_list_flat=[item for sublist in pers_list for item in sublist] # flatten list
pers_unique=pers_f.drop_duplicates() # remove duplicates
pers_unique_list=pers_unique.values.tolist() # write unique values to list
    
print("\n\nYour factoid list contains", len(pers_f), "entries.") # count data in selected column

#for i in [item for sublist in pers_unique_list for item in sublist]: # count person occurrences
    #print("\n", i, " / ", "Häufigkeit:", pers_list_flat.count(i), "\n") # print name and occurrences
    
### STEP 2: LET USER SELECT SEARCH CRITERIA

print("Query format :", '("gandalf" OR "frodo") AND NOT ("movie" OR "adaptation")')
# queried names
print("Enter person names or wildcard *.")
qn=input()
# queried year  
print("Date(s):")
ex_year=input() 
# select type of time processing
print("No date selected (0), exact dates (1), data range (2), BEFORE date (3) or AFTER date (4)?")
z=input() 
# queried institution
print("Enter institutions or wildcard *:")
qi=input() 
# queried title
print("Enter person titles or wildcard *:")
qt=input() 
# queried function
print("Enter person functions or wildcard *:")
qf=input() 
# queried related person
print("Enter related persons or wildcard *:")
qr=input() 

# Eldar Queries for boolean search

eldar_n = Query(qn, ignore_case=True, ignore_accent=False, match_word=True)
print(eldar_n) # <class 'eldar.query.Query'>
print(f['pers_name'].apply(eldar_n))

try:
    eldar_i = Query(qi, ignore_case=True, ignore_accent=False, match_word=True)
    print(f['inst_name'].apply(eldar_i))
except AttributeError as er:
    print(er.args)
    pass
    
try:    
    eldar_t = Query(qt, ignore_case=True, ignore_accent=False, match_word=True)
    print(f['pers_title'].apply(eldar_t))
except AttributeError as er:
    print(er.args)
    pass

try:
    eldar_f = Query(qf, ignore_case=True, ignore_accent=False, match_word=True)
    print(f['pers_function'].apply(eldar_f))
except AttributeError as er:
    print(er.args)
    pass

try:
    eldar_r = Query(qr, ignore_case=True, ignore_accent=False, match_word=True)
    print(f['rel_pers'].apply(eldar_r))
except AttributeError as er:
    print(er.args)
    pass

I would appreciate ideas for solving the problem!

furas
  • 134,197
  • 12
  • 106
  • 148
OnceUponATime
  • 450
  • 4
  • 12
  • Have you tried `df.fillna`? Perhaps you could fill all missing values with a placeholder that doesn't throw an error (and which you'd then handle downstream). – Paddy Alton Apr 19 '22 at 11:38
  • I thought about that but was worried this would get in the way with the wildcard * in `eldar` and other special characters we are using as part of our data. We have no mail addresses, so I could put in @ as a placeholder for now. If there was a way to process the empty cells directly, however, I would very much prefer that. – OnceUponATime Apr 19 '22 at 11:40
  • The placeholder could be anything, so maybe you could make it something like "@NO DATA@" to avoid clashes? I haven't used eldar so I don't quite know your requirements. Though I wonder: is this actually an issue with their library that you might report to them? Or is this atypical usage of their code? – Paddy Alton Apr 19 '22 at 11:44
  • I have already opened an issue on their GITHUB. I think I am using eldar for more complex queries than they anticipated -- but I am also wondering if it makes sense to assume that all incoming data would be strings. I have put in "@" as a placeholder now, but I have a suspicion that that triggers some problem in the condition list as the error I am now getting is `ValueError: list of cases must be same length as list of conditions`. I will try and have a computer science colleague check this out for me and post the final solution as an anwer. I am merely a historian messing with code... – OnceUponATime Apr 19 '22 at 12:20
  • always put full error message (starting at word "Traceback") in question (not in comments) as text (not screenshot, not link to external portal). There are other useful information. – furas Apr 19 '22 at 12:34

0 Answers0