0

I'm attempting to clean an excel file prior to sending it up to the database for calculations. By default when the Excel Report is exported out of our system (NextGen) it attaches a row that calculates a Sum Total of data throughout the report based on different locations.

I'm wanting to write a python script that'll go through and remove any row that begins with 'Totals' in the 'Loc' (short for location) column.

In a smaller test I was able to use the following code to remove lines based on Cell Value:

`path = ('FILE_NAME')`
`path = path[~path['Loc'].astype(str).str.contains('^Total')]
`

When Loc has a Cell that starts with Total (full cell value would be Total's of [LOCATION]) I want it to remove the line. I have a follow up block of code that removes blank rows, so that's fine.

The error code is: TypeError: string indices must be integers

Thank you for any and all assistance.

EDIT:

Below is the entire code with specific files names stripped. The System I'm exporting the reports from exports into .XLS

#-----------------------------------------------------------------------------------------------------------------------------------


# Purpose:  The purpose of this script is to pull quantitative and qualitative data 
# and clean said data prior to the report being sent to the Microsoft SQL Server for processing.
# The process will pull from file - convert file format - clean spreadsheet - format into SQL - send to DB


#-----------------------------------------------------------------------------------------------------------------------------------


import pandas as pd
import openpyxl
import time

# This script requires the openpyxl library to be installed
# https://openpyxl.readthedocs.io/en/stable/

#time library is used to add load time
#-----------------------------------------------------------------------------------------------------------------------------------

print('Begin Data Cleaning...')
# read an excel file and convert 
# into a dataframe object
# Specifically Convert .XLS into .XLSX 
# NextGen exports as Excel 93-03 .XLS, we need it to be .xlsx for ease of use


read_file = pd.read_excel('IMPORTED_FILE.xls')
#input name and location of base file


#-----------------------------------------------------------------------------------------------------------------------------------
#print("-" * 20) this is used to add a break between text blocks on the IDE. Strictly for organizational methods and documentation sake

print("-" * 20)
read_file.to_excel('CONVERTED_FILE.XLSX',
                  index = False,
                  header = True)
#input changed file format here


#-----------------------------------------------------------------------------------------------------------------------------------


#Set file path here // Gives file address alias of "PATH"

path = ('CONVERTED_FILE.XLSX')

#this block counts the rows before the edit
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
print('Rows Initial: ', sheet_obj.max_row)
print('Columns Initial: ', sheet_obj.max_column)

max_col = sheet_obj.max_column
print("-" * 20)


#loader
time.sleep(1)


#-----------------------------------------------------------------------------------------------------------------------------------


#this block of code will clean up the auto titles that Next Gen makes
wb = openpyxl.load_workbook(path)
sheet = wb['Sheet1']
status = sheet.cell(sheet.min_row, 1).value
#Replease the number BELOW to change the amount of first rows that are deleted
sheet.delete_rows(sheet.min_row, 4)
#Replease the number ABOVE to change the amount of first rows that are deleted
wb.save(path)


#-----------------------------------------------------------------------------------------------------------------------------------


#this block counts the rows after the edit
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
print('Rows After Sweep Alpha: ', sheet_obj.max_row)
print('Columns After Sweep Alpha: ', sheet_obj.max_column)

max_col = sheet_obj.max_column
#Alpha pass is first pass


#loader
time.sleep(1)


#-----------------------------------------------------------------------------------------------------------------------------------


print("-" * 20)

#This block removes blank rows
def remove(sheet):
  # iterate the sheet by rows
  for row in sheet.iter_rows():
 
    # all() return False if all of the row value is None
    if not all(cell.value for cell in row):
 
      # detele the empty row
      sheet.delete_rows(row[0].row, 1)
 
      # recursively call the remove() with modified sheet data
      remove(sheet)
      return
print('Removing Null rows from Alpha') 
  


#-----------------------------------------------------------------------------------------------------------------------------------


print("-" * 20)

#This block is loading the file into and editable style.
#This is needed because Column A is undefined and needs a title
#the code here will be naming cell A1 as "Location" = "LOC'
wb = openpyxl.load_workbook(path)
ws = wb['Sheet1']
ws['A1'] = 'Loc'
print('Null Column (A1) has been updated to: '+ws['A1'].value)
wb.save(path)
#this saves the worksheet with the new column name



#-----------------------------------------------------------------------------------------------------------------------------------


#this blocl will adjust the column dimesions
#it's crucial so the machine can work within each value
#will need to add more columns for future project
#yes, we could also just do something like A:Z, but at current state, I'm not sure on how many columns will
#be used, so I want to hold off on simplifying that. this way I can see everythinhg more clearly. adding that to future bug fixes
def format_col_width(ws):
    ws.column_dimensions['A'].width = 20
    ws.column_dimensions['B'].width = 20
    ws.column_dimensions['C'].width = 20
    ws.column_dimensions['D'].width = 20
    ws.column_dimensions['E'].width = 20
    ws.column_dimensions['F'].width = 20
    ws.column_dimensions['G'].width = 20
    ws.column_dimensions['H'].width = 20
    ws.column_dimensions['I'].width = 20
    ws.column_dimensions['J'].width = 20
    ws.column_dimensions['K'].width = 20
    ws.column_dimensions['L'].width = 20
    ws.column_dimensions['M'].width = 20
wb.save(path)

print("-" * 20)


#-----------------------------------------------------------------------------------------------------------------------------------


#This block will clean up any empty rows that may have fallen into the chart
print('Removing Null Rows from Bravo')
def remove(sheet, row):
    # iterate the row object
    for cell in row:
          # check the value of each cell in
        # the row, if any of the value is not
        # None return without removing the row
        if cell.value != None:
              return
    # get the row number from the first cell
    # and remove the row





#-----------------------------------------------------------------------------------------------------------------------------------


print("-" * 20)
#This block will clean up any Total Rows in the chart
print('PLACEHOLDER FOR SUM ROW REMOVAL')
                                       
                                       
#-----------------------------------------------------------------------------------------------------------------------------------


print("-" * 20)

#this block counts the rows after the edit
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
print('Rows After Sweep Bravo: ', sheet_obj.max_row)
print('Columns After Sweep Bravo: ', sheet_obj.max_column)
wb.save('wb')

max_col = sheet_obj.max_column
#Bravo pass is second pass




#-----------------------------------------------------------------------------------------------------------------------------------


print("-" * 20)

#this block will print the name of each column
print('Column Names After:')
for i in range(1, max_col + 1):
    cell_obj = sheet_obj.cell(row = 1, column = i)
    #replace the row = number with the number of rows down your titles start
    print(cell_obj.value) 
wb.save(path)




    
print("-" * 20)
print('End Program')
#-----------------------------------------------------------------------------------------------------------------------------------
jarodmwk
  • 1
  • 4

2 Answers2

0

It seems easiest to use the csv lib for this task (you can export your excel file to this format or possibly even read it directly), and create a new output file (as editing a file while reading it is error-prone). The next code block reads every row and writes it to the output file unless the "Loc" column value is "Total".

import csv

input_file = "input.csv"
output_file = "output.csv"

with open(input_file, "r") as inp:
    reader = csv.reader(inp)
    headers = next(reader)
    loc_col = headers.index("Loc")
    with open(output_file, "w") as out:
        writer = csv.writer(out)
        writer.writerow(headers)
        for line in reader:
            if line[loc_col] != "Total":
                writer.writerow(line)
Tzahi T
  • 346
  • 3
  • 11
0

Solved it using pandas and data frames. Instead of doing

Path = path[~path['Loc'].astype(str).str.contains('^Total')]

I did:

df = pd.read_excel(r'C:\\Users\'FILE_LOCATION', skiprows=3)

df = df[~df['Loc'].astype(str).str.contains('^Total')]

df.to_excel(r'C:\\Users\'FILE_LOCATION.xlsx')

This worked perfectly for what I needed.

I didn't realize I needed to put it into a df/panda set instead of just working directly into the file.

Note: I kept the file format as .xlsx for this and immediately followed up with

df.to_csv(r'C:\\Users\FILE_LOCATION.csv')

This changed the file format to a .csv for input into my MS SQL Server. Also, the ^ acts as a wildcard so any cell in the LOC row that starts with "Total" (ie: Totals) will be dropped.

My report writing software exports into .xls, I converted into .xlsx for cleaning, then converted into .csv for export.

For my specific project this works the best. I have no doubt down the road I'll come across a different solution or a better way to do this. For now, however, this works well.

Thank you everyone for the help.

jarodmwk
  • 1
  • 4