3

As a learning project for Python, I am attempting to read all Excel files in a directory and extract the names of all the sheets.

I have been trying several available Python modules to do this (pandas in this example), but am running into an issue with most of them depending on openpyxl.

This is my current code:

import os
import pandas

directory_root = 'D:\\testFiles'

# Dict to hold all files, stats
all_files = {}

for _current_path, _dirs_in_path, _files_in_path in os.walk(directory_root):

    # Add all files to this `all_files`
    for _file in _files_in_path:
        # Extract filesystem stats from the file
        _stats = os.stat(os.path.join(_current_path, _file))

        # Add the full file path and its stats to the `all_files` dict.
        all_files[os.path.join(_current_path, _file)] = _stats

# Loop through all found files to extract the sheet names
for _file in all_files:

    # Open the workbook
    xls = pandas.ExcelFile(_file)

    # Loop through all sheets in the workbook
    for _sheet in xls.sheet_names():
        print(_sheet)

This raises an error from openpyxl when calling pandas.ExcelFile(): ValueError: Max value is 14.

From what I can find online, this is because the file contains a font family above 14. How do I read from an Excel (xlsx) file while disregarding any existing formatting?

The only potential solution I could find suggests modifying the original file and removing the formatting, but this is not an option as I do not want to modify the files in any way.

Is there another way to do this that doesn't have this formatting limitation?

Zephyr
  • 9,885
  • 4
  • 28
  • 63
  • Please upload the spreadsheet that is causing this problem. – GordonAitchJay Dec 30 '21 at 13:07
  • Does the file open in Microsoft Excel (without excel changing the file)? You could make a copy of the file and remove formatting from the copy... its formatting shouldn't affect the data in the cells unless the formatting is somehow relevant to your code. The underlying problem is that the file is malformed (does not conform to excel specification). – sytech Dec 30 '21 at 21:55

6 Answers6

5

The issue is that your file does not conform to the Open Office specification. Only certain font families are allowed. Once openpyxl encounters a font out of specification, it throws this error because OpenPyxl only allows spec-conforming excel files.

Some Excel readers may not have an issue with this and are more flexible with non-OpenOffice-spec-conforming files, but openpyxl only implements the Apache Open Office spec.

The xml being parsed will contain information about the font like this:

<font>
  <b/>
  <sz val="11"/>
  <color rgb="FF000000"/>
  <name val="Century Gothic"/>
  <family val="34"/>
</font>

If the family value is over 14, openpyxl throws this ValueError. There is an underlying descriptor in Open Office that controls this.

When other readers like, say, Microsoft Office 365 Excel encounters this, it will change the font family when loading the file to a compliant font (the default, Calibri).

As a workaround, if you don't want to change the value (as Microsoft Excel does), you can monkeypatch the descriptor to allow a larger max font family.

# IMPORTANT, you must do this before importing openpyxl
from unittest import mock
# Set max font family value to 100
p = mock.patch('openpyxl.styles.fonts.Font.family.max', new=100)
p.start()
import openpyxl
openpyxl.open('my-bugged-worksheet.xlsx') # this works now!

This can be reproduced using this excel workbook. Before the patch this will fail to load. After the patch, it loads without error.

sytech
  • 29,298
  • 3
  • 45
  • 86
  • Wow, very insightful. And strange that there is no easy option to fall back to `openpyxl.styles.fonts.DEFAULT_FONT` for instance. Instead of firing the mocking artillery, just want to add the manual possibility: `openpyxl.styles.fonts.Font.family = openpyxl.descriptors.nested.NestedMinMax(min=0, max=100)` – Jonas Hörsch Jan 01 '22 at 13:19
  • @JonasHörsch yeah, that seems like it would be a good feature for openpyxl. I did have the thought, but it wasn't clear to me how to accomplish that (through mocking or otherwise) -- but OP stated they did not want to modify their file in any way. – sytech Jan 01 '22 at 20:26
2

It is easy to detect when family value is out of range by a simple unzip|find in windows or a grep in others. So you could filter out files based on those values. here we see in the bad boy example they are acceptable 2 and an unacceptable 34

enter image description here

However since all platforms (win 10 included) have TAR it is easiest to first expand the file.xlsx as a set and using find by file in the native OS (or python) then ensure you know exactly which file needs adjusting.

enter image description here

So we now know it is styles.xml (that's not surprising as font values should be there)

and at this point we can use string replace to change that entry to say

      <family val="3"/>

if that's more useful for your purpose.

Then repack the adjusted xlsx (NOTE:- it is best to only use a tool to "update" the one style.xls file to maintain the zip relative order) and it should behave just the same as a standard.xlsx that has the standard 1-14 fonts, pre-suming the author did not introduce other errors.

LATER EDIT I wont claim to reinvent a Pythonic Wheel but simply say that from https://stackoverflow.com/a/69360331/10802527 (by a now departed user) this should work for many other interested users. Backup your files and modify accordingly.

import tempfile
from openpyxl import load_workbook
import os
import shutil
from lxml import etree


EXCELFILE = '~/Book1.xlsx'
STYLES = 'xl/styles.xml'
FORMAT = 'zip'


with tempfile.TemporaryDirectory() as tdir:
    os.chdir(tdir)
    shutil.unpack_archive(filename=EXCELFILE, format=FORMAT)
    with open(STYLES, 'r') as styles:
        tree = etree.parse(styles)
        for family in tree.xpath('//*[local-name()="fonts"]//*[local-name()="font"]//*[local-name()="family"]'):
            try:
                if int(family.attrib['val']) > 14:
                    family.set('val', '2')
            except Exception:
                pass
    with open(STYLES, 'wb') as styles:
        tree.write(styles)
    shutil.make_archive(base_name=EXCELFILE, format=FORMAT)
    shutil.move(f'{EXCELFILE}.{FORMAT}', EXCELFILE)

load_workbook(EXCELFILE)

The call to load_workbook() is done merely as a check against the modified spreadsheet's validity

K J
  • 8,045
  • 3
  • 14
  • 36
1

Here's what fixed this error for me. I edited lib\site-packages\openpyxl\descriptors\base.py and added a print statement after line 86 in class Max like so:

def __set__(self, instance, value):
    if ((self.allow_none and value is not None)
        or not self.allow_none):
        value = _convert(self.expected_type, value)
        if value > self.max:
            print(f"value is {value}")
            raise ValueError('Max value is {0}'.format(self.max))
    super(Max, self).__set__(instance, value)

This printed the value of 34 which is obviously higher than the max value of 14.
All I did to make it work was comment out the line that raises the error.
Changing the code to:

def __set__(self, instance, value):
    if ((self.allow_none and value is not None)
        or not self.allow_none):
        value = _convert(self.expected_type, value)
        if value > self.max:
            self.max = value
            # print(f"value is {value}")
            # raise ValueError('Max value is {0}'.format(self.max))
    super(Max, self).__set__(instance, value)

This solved the problem for me.
Or if you need to distribute the file and have to use the orignal library code THEN try the first answer.

# IMPORTANT, you must do this before importing openpyxl
from unittest import mock
# Set max font family value to 100
p = mock.patch('openpyxl.styles.fonts.Font.family.max', new=100)
p.start()
import openpyxl
openpyxl.open('my-bugged-worksheet.xlsx') # this works now!

Before importing openpyxl.

0

This is most probably not because of a font size or family, because it gives ValueError. What I see from this page and this page, it seems that one of your float value in excel file must not be more than 14. That's why it gives the error ValueError: Max value is 14. You may dive into the file and search for a value which is more than 14 and try your code by manipulating the value.

  • I don't see how a spreadsheet holding a value over 14 would cause anything to fail. But the font used in the spreadsheet **is** a value. From the other thread I referenced, it seems that `openpyxl` is not equipped to handle anything but the most basic fonts. Why, I don't know. But if `openpyxl` can't be used for Excel files that contain values greater than 14, that would make it completely useless, no? – Zephyr Dec 22 '21 at 17:35
  • @Zephyr Sure it would be useless. Did you try lowering the font size and loading the workbook again? If so, did it work? By the way, your refence link seems broken. – highlytrainedbadger Dec 22 '21 at 17:45
0

If I right, you want to get all of xlsx sheet name from files in a directory so you can do this :

import pandas as pd
import os
dirpth = './Target Folder/'
for dirpath, dirnames, filenames in os.walk(dirpth):
    file_names = filenames
file_names = [dirpth+file_names[i] for i in range(len(file_names))]
data = []
sheet_names = []
for names in file_names:
    df = pd.ExcelFile(names,engine = 'openpyxl')
    data_sheet = []
    sheet_temp = []
    for name in df.sheet_names:
        data_sheet.append(df.parse(nama,index_col = [0]))
        sheet_temp.append(name)
    data.append(data_sheet)
    sheet_names.append(sheet_temp)

In this way, you will get data from each sheet automatically for each excel file, but it will give error where you have file with difference extension in the same folder (for example in the same folder you have .csv file). So you need to filter all of file name first or you can use try except statement to skip non excel file. If your .py file have difference path with your folder target, just change dirpath, example : 'D:/changeYour Folder Path/Example/Target/'

Note : You need to install openpyxl

Wahyu Hadinoto
  • 198
  • 1
  • 10
0

This problem could be solved by cleaning xlsx styles completely, here is my code how to do it with pandas though openpyxl https://stackoverflow.com/a/71526058/1731460

pymen
  • 5,737
  • 44
  • 35