Python: Extract unique sentences from string and place them in new column concatenated with ;

Question

Afternoon Stackoverflowers,

I have been challenged with some extract/, as I am trying to prepare data for some users. As I was advised, it is very hard to do it in SQL, as there is no clear pattern, I tried some things in python, but without success (as I am still learning python).

Problem statement:

My SQL query output is either excel or text file (depends on how I publish it but can do it both ways). I have a field (fourth column in excel or text file), which contains either one or multiple rejection reasons (see example below), separated by a comma. And at the same time, a comma is used within errors (sometimes).

Field example without any modification

INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]

Desired output:

Invalid Address information; Company Id on the Invoice does not match with the Company Id registered for the Code in System; Incorrect Purchase order

Python code:

import pandas

excel_data_df = pandas.read_excel('rejections.xlsx')

# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())

excel_data_df['Invoice_Issues'].split(":", 1)

Output:

INVOICE_CREATION_FAILED[Invalid Address information:

I tried split string, but it doesn't work properly. It deletes everything after the colon, which is acceptable because it is coded that way, however, I would need to trim the string of the unnecessary data and keep only the desired output for each line. I would be very thankful for any code suggestions on how to trim the output in a way that I will extract only what is needed from that string - if the substring is available.

In the excel, I would normally use list of errors, and nested IFs function with FIND and MATCH. But I am not sure how to do it in Python...

Many thanks,

Greg

Hi Greg, is there a finite list of errors that could be present? You could list the string value of those, and then construct a list for each row. Currently you are limited because you're restricting your output to one string.split(), i.e. using the second parameter in .split(":", 1). — spen.smith, Jun 22 '21 at 16:31
Hi Smith, it is a very particular case, but normally I have a list of errors, and the complete list contains 39 unique errors. E01 = "Invalid Address information", E02 = "Incorrect Purchase order", E03 = "Invalid VAT ID" ....... E39 = "No tax line available. And each invoice can have one of them/all of them/few of them. I would like to avoid split option, and try more something with IF error E01 is within the string, copy only the "Invalid Address information". — GregLu, Jun 22 '21 at 16:43
Yes, create a dictionary of the errors and what you would want them to map to, then iterate over the row and use the dictionary to see if that key is present. — spen.smith, Jun 22 '21 at 16:52
Could you please indicate how can I do the dictionary of errors? And for the part " to see if that key is present ", could you please elaborate? I am very new to python. — GregLu, Jun 22 '21 at 16:56
Hi Smith! First of all, you are heaven-sent. Apologies for late reply, but it took me some time to exchange data-frame for the excel itself with path etc. What I would like to ask you as enhancement, would it be possible to in the end 1. Change the delimiter in the excel_data_df['Errors'] - because it uses comma and I would like it to be ";". 2. Secondly, I struggle with the option save the output as another workbook. Would you guide me to resolve it? — GregLu, Jun 25 '21 at 14:40

spen.smith · Answer 1 · 2021-06-25T21:38:20.900

This isn't the fastest way to do this, but in Python, speed is rarely the most important thing.

Here, we manually create a dictionary of the errors and what you would want them to map to, then we iterate over the values in Invoice_Issues, and use the dictionary to see if that key is present.

import pandas

# create a standard dataframe with some errors
excel_data_df = pandas.DataFrame({
    'Invoice_Issues': [
        'INVOICE_CREATION_FAILED[Invalid Address information: Company Name, Limited: Blueberry Street 10+City++09301+SK|SK000001111|BLD at line 1 , Company Id on the Invoice does not match with the Company Id registered for the Code in System: [AL12345678901]|ABC1D|DL0000001 at line 2 , Incorrect Purchase order: VTB2R|ADLAVA9 at line 1 ]']
})

# build a dictionary 
# (just like "words" to their "meanings", this maps 
#  "error keys" to their "error descriptions".
errors_dictionary = {
    'E01': 'Invalid Address information',
    'E02': 'Incorrect Purchase order',
    'E03': 'Invalid VAT ID',
    # ....
    'E39': 'No tax line available'
}


def extract_errors(invoice_issue):
    # using the entry in the Invoice_Issues column, 
    # then loop over the dictionary to see if this error is in there.
    # If so, add it to the list_of_errors.
    list_of_errors = []
    for error_number, error_description in errors_dictionary.items():
        if error_description in invoice_issue:
            list_of_errors.append(error_description)
    return ';'.join(list_of_errors)
    

# print whole sheet data
print(excel_data_df['Invoice_Issues'].tolist())


# for every row in the Invoice_Isses column, run the extract_errors function, and store the value in the 'Errors' column.
excel_data_df['Errors'] = excel_data_df['Invoice_Issues'].apply(extract_errors)

# display the dataframe with the extracted errors
print(excel_data_df.to_string())
excel_data_df.to_excel('extracted_errors.xlsx)

Hey Smith! Just wanted to tell you the response is valid and the functionality works:) thx Greg! — GregLu, Aug 04 '21 at 09:04

Python: Extract unique sentences from string and place them in new column concatenated with ;

1 Answers1