7

It seems that there is no way of obtaining input tables (from html / xls / etc files) to DataFrame objects as it is 1-to-1 without any field conversions applied internally by pandas.

Assume the following html table saved with the extension of .xls file, how would we get the same representation of this table in Python memory with DataFrame object?

The content of "test_file.xls":

<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>

(1) Straightforward reading of the file

Processing code:

import pandas

df = pandas.read_html('test_file.xls')
print(df[0])
print(df[0].dtypes)

Output:

     Number
0      1.32
1  60000.00

Number    float64
dtype: object

As we can see the numbers were converter to float64 with some predefined logic. I think this logic includes locales settings, maybe some rules inside pandas, etc. Specifying string convertors directly doesn't allow to obtain the initial values.

(2) Applying str function as a convertor for each dimension

Processing code:

converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html(f, converters = converters)
print(df[0])
print(df[0].dtypes)

Output:

    Number
0  1.32000
1    60000

Number    object
dtype: obje

Obviously, the expected output of this problem is:

     Number
0  1.320,00
1    600,00

There could be cases when one file contains numbers typed in different formats (American / European / etc). This numbers differs with decimal mark, thousand mark, etc. So the logical way to handle such files will be to extract the data "as it is" in strings and perform parsing with regexps / other modules separately for each row. Is there a way how to do it in pandas? And are there any other approaches how to handle such file's processing? Thanks guys!

Remarks: Specification of "decimal" and "thousands" parameters for pandas.read_* doesn't look like a reliable solution because it is appled for all fields. Quick example: it can treat date fields in "02.2017" format as numbers and convert it to "022017".

John Coleman
  • 51,337
  • 7
  • 54
  • 119
Pleeea
  • 362
  • 4
  • 12
  • I guess `read_html` should take a `dtype` argument like `read_csv` does. – Stop harming Monica Nov 16 '17 at 11:35
  • 2
    @Goyo, unfortunately it doesn't ... There is [an issue on Pandas GitHub and it is still open](https://github.com/pandas-dev/pandas/issues/10534) – MaxU - stand with Ukraine Nov 16 '17 at 11:37
  • The input value of `,,,2,,,,5,,,,,5,,,,0,,,.,,,7,,,7,,,` (mind the dote!) also converts to `2550.77`. – Pleeea Nov 16 '17 at 11:47
  • @Goyo, for Pandas 0.21.0 it gives me `TypeError: read_html() got an unexpected keyword argument 'dtype'` – MaxU - stand with Ukraine Nov 16 '17 at 12:09
  • I removed the `processing` tag since it is for the programming language called `Processing`, and is not a generic tag for programs which involve file processing. Please read a tag description before using it. – John Coleman Nov 16 '17 at 12:11
  • John, thank a lot! Sorry for confusion with this! – Pleeea Nov 16 '17 at 12:20
  • @MaxU I know, what I meant is that it should be fixed. – Stop harming Monica Nov 16 '17 at 12:27
  • The workaround if you have `100.000,00` formatted numerical values and `01.12.2017` dates is the following: using `decimal = ','`, `thousands = '.'` and passing the convertor dictionary that maps all columns to str: `converters = {column_name: str for column_name in df[0].dtypes.index}` in read_html call. So the numbers will be correct (according to this format) and dates won't be changed to something like `1122017` (remember that leading zero might be removed!) – Pleeea Nov 16 '17 at 12:57

2 Answers2

1

You must specify your thousands and decimal separator. this worked for me:

import pandas as pd

html = """
<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>
"""

df = pd.read_html(html, decimal=",", thousands=".")
print(df)
bravhek
  • 155
  • 5
0

Does this work?

import pandas as pd


with open('test_file.xls') as f:
    raw_html = f.read()


raw_html = raw_html.replace('"tDetail">', """tDetail">'""").replace('</td>', "'</td>")


df = pd.read_html(raw_html)

Returns

       Number
0  '1.320,00'
1    '600,00'
kait
  • 1,327
  • 9
  • 13