10

See my code below. This code works very well, but I would like to do two things. One thing is I made if statement with or much shorter than actual for example. I have many columns like this, not all next to each other. I would like it to be shorter. Also, sometimes I may not know exact column letter.

So I want to know if there is a way to know the column name or header. Like the values that would be in very top row. So I can test to see if it is one of those values to always perform function on that cell if it's in the specified column. I can't find openpyxl function to do column name. Not sure if it understands that first row is different than rest. I think maybe if not I can try to do test on first row, but don't understand how to make this.

So is there a way to call column name? or if there is no way to call column name to test, can someone help me with doing check on first row to see if it has value? then do change on correct row I'm in? Does this make sense.

So instead of code saying:

if cellObj.column == 'H' or ...

It would say:

if cellObj.column_header == 'NameOfField or ...

Or if not possible to do that, then:

if this cell has column where first row value is 'NameOfField' ...

Please help with best way to do this. I have looked on stackoverflow and in book and blog site, but does not seem to be a way to call column name (not the letter of column).

for row in sheet.iter_rows():
 for cellObj in row:
    if cellObj.column == 'H' or cellObj.column == 'I' or cellObj.column == 'L' or cellObj.column == 'M':
        print(cellObj.value),
        if cellObj.value.upper() == 'OldValue1':
            cellObj.value = 1
            print(cellObj.value)
        elif cellObj.value.upper() == 'OldValue2':
            cellObj.value = 2
            print(cellObj.value)
Mike Müller
  • 82,630
  • 20
  • 166
  • 161
stahna
  • 193
  • 1
  • 1
  • 9

6 Answers6

8

EDIT

Assuming these are the header names you are looking for:

colnames = ['Header1', 'Header2', 'Header3']

Find the indices for these columns:

col_indices = {n for n, cell in enumerate(sheet.rows[0]) if cell.value in colnames}

Now iterate over the remain rows:

for row in sheet.rows[1:]:
    for index, cell in enumerate(row):
         if index in col_indices:
             if cell.value.upper() == 'OldValue1':
                  cell.value = 1
                  print(cell.value)
             elif cell.value.upper() == 'OldValue2':
                 cell.value = 2
                 print(cell.value)

Use a dictionary instead of a set to keep the column names around:

col_indices = {n: cell.value for n, cell in enumerate(sheet.rows[0]) 
               if cell.value in colnames}

for row in sheet.rows[1:]:
    for index, cell in enumerate(row):
        if index in col_indices:
            print('col: {}, row: {}, content: {}'.format(
                   col_indices[index], index, cell.value))
            if cell.value.upper() == 'OldValue1':
                 cell.value = 1
            elif cell.value.upper() == 'OldValue2':
                 cell.value = 2

Old answer

This makes your if statement shorter:

if cellObj.column in 'HILM':
    print(cellObj.value),

For multi letter column coordinates you need to use a list:

if cellObj.column in ['H', 'AA', 'AB', 'AD']:
    print(cellObj.value),
Mike Müller
  • 82,630
  • 20
  • 166
  • 161
  • Thank you again. I didn't think of that. I have seen that before. The problem still remains that in this case I still have to know the specific columns beforehand. But this does help. – stahna Dec 15 '15 at 18:20
  • Oh wait! What if I use that syntax but the column is AB, or AA? I don't want to change column A or column B. I tried to do this and it behaves as I would expect and seems to get confused on whether AB means A and B or column AB. So like this: if cellObj.column in 'HILMPQTUXYABAC': But this still works though: if cellObj.column in 'HILMPQTUXY' or cellObj.column == 'AB' or cellObj.column == 'AC' – stahna Dec 15 '15 at 18:23
  • FWIW there is no guarantee that `cell.column` will be around as a property so it's probably best working with `for cell in row[6:10]:…` – Charlie Clark Dec 15 '15 at 20:15
  • I've been playing around with the code you provided. Thank you. I still struggle to understand exactly what it's doing. How do I use it to test to know when I'm in the correct column? – stahna Dec 18 '15 at 19:33
2

You can use a dictionary object to store the key-value pairs for your data, where the key will be the header for each column, and the value will be the particular column value. You can then append these dictionary objects to a list and access them using a for loop and normal dictionary syntax.

For example:

Assuming "my_workbook" is an excel workbook with the following column headers and values stored in the first worksheet:

Name Class Age

John 1 12

Andrew 1 12

Jane 2 13

Load the workbook and get values only:

from openpyxl import load_workbook 
wb = load_workbook('./my_workbook.xlsx')
ws  = wb.worksheets[0].values
header = next(ws) #get the header row
my_data = []

Organise the data into a dictionary structure:

for row in ws:
    my_data.append(dict(zip(header, row))

You can then access the columns of each row using the headers as keys:

for data in my_data:
    print(data['Name'], data['Class'], data['Age'])

This will output:

John 1 12
Andrew 1 12
Jane 2 13

As a final note, using a dictionary structure to store and access your data makes your code more readable, as opposed to using indices, and allows you to re-arrange the columns in the excel file without having to modify your code. Hope this helps.

1

You can access cells from the first row and and column using the sheet.cell(row=#, column = #) syntax. For example:

for row in enumerate(sheet.iter_rows()):
    for j, cellObj in enumerate(row):
        header_cell = sheet.cell(row=1, column=j)

        if cellObj.column in ['H', 'I', 'L', 'M', 'AA', 'AB']:
            print(cellObj.value),
            if cellObj.value.upper() == 'OldValue1':
                cellObj.value = 1
                print(cellObj.value)
            elif cellObj.value.upper() == 'OldValue2':
                cellObj.value = 2
                print(cellObj.value)
tornesi
  • 269
  • 2
  • 8
  • Thank you I will try. What does this "enumerate"? – stahna Dec 15 '15 at 18:44
  • Enumerate returns the index of the object in the list as well as the value from the list. – tornesi Dec 15 '15 at 18:54
  • I'm sorry I don't understand this. I can't get it to work. I tried looking here too: http://stackoverflow.com/questions/11427138/python-wildcard-search-in-string So, if I can reference the value in first row, I need to do my if so that it will match either '*RtNL" or "*LftNL". I don't know where your 'j' goes. For tornesi code then the if should be "if header_cell == '*RtNL" ? – stahna Dec 15 '15 at 19:10
  • the regex is okay, I will try this later. I have not read about this yet. But how to use the code to reference column name in if statement still I don't understand. – stahna Dec 15 '15 at 19:16
0

Since row returns a generator, you can easily extract headers in the first iteration, treat them as you need, and then continue to consume it. For instance:

headers = [cell.value for cell in next(sheet.rows)]
# find indexes of targeted columns
cols = [headers.index(header) for header in 'HILM']

conv = {'OldValue1': 1, 'OldValue2': 2}

for row in sheet.rows:
    values = [cell.value for cell in row]
    for col in cols:
        values[col] = conv[values[col]] 
Nuno André
  • 4,739
  • 1
  • 33
  • 46
0

You have many ways to do this. some approach that i used:

1. Brute force

Assuming "sheet" and "workbook" are defined.

header = [cell for cell in sheet['A1:XFD1'][0] if cell.value is not None and cell.value.strip() != ''] #you get all non-null columns
target_values = ['NameOfField', 'NameOfField1', 'NameOfField2'] #filter list
target_header = [cell.column for cell in header if cell.value in target_values] #get column index

data = {'OldValue1': 1, 'OldValue2': 2}

for row in sheet.iter_rows(max_row=sheet.max_row, max_col=sheet.max_column):
 for cell in row:
     if cell.column in target_header and cell.value in data :
         cell.value = data[cell.value]

In this case, the brute force is in "sheet['A1:XFD1']". we have to check for all columns the first time. But you'll get all cells references for columns. After that, we create target_values (our columns names...) and we create a list with column index (target_header). Finally we iterated over sheet. We check if the cell's column is in the column index and check if the cell's value is in data, so we're able to change the value.

Downside:if exists cell with random whitespace outside "data area". max_row and max_column will consider that cells (iterate over blank cells).

2. Check for bundaries

You can use your own max row and max column if the data has table form(no empty space between columns, a column with "id"-> not null, not whitespace).

from openpyxl.utils import get_column_letter 

def find_limit_sheet(direction):
    max_limit_value = 1
    while (direction(max_limit_value).value is not None) and (direction(max_limit_value).value.strip() != ''):
        max_limit_value = max_limit_value + 1
    return (max_limit_value - 1) if max_limit_value != 1 else 1


max_qrow = find_limit_sheet(direction=lambda increment: sheet.cell(row=increment, column=1))
max_qcolumn = find_limit_sheet(direction=lambda increment: sheet.cell(column=increment, row=1))

header = [cell for cell in sheet[f'A1:{get_column_letter(max_qcolumn)}1']] #you get all non-null columns
target_values = ['NameOfField', 'NameOfField1', 'NameOfField2'] #filter list
target_header = [cell.column for cell in header[0] if cell.value in target_values] #get column names

data = {'OldValue1': 1, 'OldValue2': 2}

for row in sheet.iter_rows(max_row=max_qrow, max_col=max_qcolumn):
 for cell in row:
     if cell.column in target_header and cell.value in data :
         cell.value = data[cell.value]

In this case we are inside "data area" only.

3. Optional: Using Pandas

If you need more complex operation on excel data(i have to read a lots of excel in my work :( as data source). I prefer convert to pandas dataframe-> make operation -> save result .

In this case we use all the data.

from openpyxl.utils import get_column_letter 
import pandas as pd

def find_limit_sheet(direction):
    max_limit_value = 1
    while (direction(max_limit_value).value is not None) and (direction(max_limit_value).value.strip() != ''):
        max_limit_value = max_limit_value + 1
    return (max_limit_value - 1) if max_limit_value != 1 else 1


max_qrow = find_limit_sheet(direction=lambda increment: sheet.cell(row=increment, column=1))
max_qcolumn = find_limit_sheet(direction=lambda increment: sheet.cell(column=increment, row=1))

header = [cell.value for cell in sheet[f'A1:{get_column_letter(max_qcolumn)}1'][0]] #you get all non-null columns
raw_data = []
for row in sheet.iter_rows(max_row=max_qrow, max_col=max_qcolumn):
    row_data = [cell.value for cell in row]
    raw_data.append(dict(zip(header, row_data)))

df = pandas.DataFrame(raw_data)
df.columns = df.iloc[0]
df = df[1:]

You can also use a sub-set of columns using target_data for example 2.

...
target_header = [cell.column for cell in header[0] if cell.value in target_values] #get column names
...
raw_data = []
for row in sheet.iter_rows(max_row=max_qrow, max_col=max_qcolumn):
    row_data = [cell.value for cell in row if cell.column in target_header]
    raw_data.append(dict(zip(header, row_data)))

df = pd.DataFrame(raw_data)
df.columns = df.iloc[0]
df = df[1:]
...

INFO

Edvrsoft
  • 1
  • 1
0

I was looking for an answer to this question. I need an easy way to use the column names in the code in stead of the cumbersome column indicators 'A', 'B', etc. I found a pythonic, elegant and concise solution using namedtuples. It makes a namedtuple of the column headers and reads the data of each spreadsheet row, while mapping the value to the column name. Now you can use the dot notation to read the value of a cell.

    from collections import namedtuple
    from openpyxl import load_workbook

    wb = load_workbook(filename=sourcefile, data_only=True)
    ws = wb.active
    reader = ws.iter_rows(values_only=True)
    Data = namedtuple("Data", next(reader))
    for xls_data in map(Data._make, reader):
        value_1 = xls_data.Columnname1
        value_x = xls_data.AnotherColumnname

If you need row numbers, you can do:

    for line_num, xls_data in enumerate(map(Data._make, reader), 2):

You have to start the enumerator with 2, because the first row has been read for the column names with 'next(reader)'.

For the namedtuple to work without errors, the headers must be valid identifiers: no dashes or spaces in the headers.

ambrox
  • 11
  • 4