
I have a Word file (.docx) containing a table of data, and I am trying to create a pandas DataFrame from that table. I have used the docx and pandas modules, but I could not create a DataFrame.

from docx import Document
document = Document('req.docx')
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

I also tried to read the table as a DataFrame with pd.read_table("path of the file").

I can read the data cell by cell, but I want to read the entire table or a particular column. Thanks in advance.

MaxU - stand with Ukraine
Pyd
  • A word file of all things! Why did `pd.read_table` not work? – cs95 Dec 26 '17 at 10:45
  • No I am getting `CParserError: Error tokenizing data. C error: Expected 2 fields in line 6, saw 5` Am I missing something ? – Pyd Dec 26 '17 at 10:47
  • Could be... I don't know how your data is delimited.... whitespace? Use `delim_whitespace=True` and `engine='python'`. – cs95 Dec 26 '17 at 10:47
  • @cᴏʟᴅsᴘᴇᴇᴅ, i don't think we can use `read_table` for reading directly from a Word Document (.docx)... – MaxU - stand with Ukraine Dec 26 '17 at 10:49
  • @MaxU Oh, alright! Didn't know that. Guess OP will have to iteratively process each cell and do it then. – cs95 Dec 26 '17 at 10:50
  • Is there any other way to read the table @MaxU – Pyd Dec 26 '17 at 10:58
  • @pyd, it's going to be hard to support also legacy `.doc` (Word 97 - 2003) format. I'd suggest you [to convert such files to `.docx` format first](https://softwarerecs.stackexchange.com/questions/11687/library-for-converting-microsoft-doc-to-docx-python)... – MaxU - stand with Ukraine Dec 27 '17 at 12:34
  • Great, thanks I'll read that – Pyd Dec 27 '17 at 12:54
  • @MaxU , I tried the code you mentioned above for converting `.doc to .docx` i am getting `FileNotFoundError: [WinError 2] The system cannot find the file specified` in windows but in ubuntu it works fine. why it is happening – Pyd Jan 02 '18 at 11:24

1 Answer

docx (the python-docx package) always reads data from Word tables as text (strings), so every cell value arrives as a str regardless of what it represents.

If we want to parse data with correct dtypes we can do one of the following:

  • manually specify dtype for all columns (not flexible)
  • write our own code to guess correct dtypes (too difficult, and the Pandas IO methods already do it well)
  • convert data into CSV format and let pd.read_csv() guess/infer correct dtypes (I've chosen this way)
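The third option can be illustrated without a Word file. The following is only a minimal sketch: the helper name `rows_to_dataframe` and the sample rows are made up for illustration, and the rows stand in for the cell text that docx would return. The strings are written to an in-memory CSV buffer and `pd.read_csv()` then infers the dtypes.

```python
import csv
import io

import pandas as pd


def rows_to_dataframe(rows, **kwargs):
    """Hypothetical helper: round-trip rows of strings through CSV text."""
    buf = io.StringIO()
    # csv.writer quotes values that contain commas or quotes for us
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    # kwargs are forwarded to pd.read_csv (e.g. parse_dates, usecols)
    return pd.read_csv(buf, **kwargs)


# Made-up sample data; every value is a string, as in a Word table
rows = [
    ["A", "B", "C,X"],
    ["1", "B1", "C1"],
    ["2", "B2", "C2"],
    ["3", "B3", "val1, val2, val3"],
]
df = rows_to_dataframe(rows)
print(df.dtypes)
```

Because the `csv` module handles the quoting, a comma inside a cell (such as the `C,X` header) does not split the column.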

Many thanks to @Anton vBR for improving the function!


import pandas as pd
import io
import csv
from docx import Document

def read_docx_tables(filename, tab_id=None, **kwargs):
    """
    parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)

    Parameters:
        filename:   file name of a Word Document

        tab_id:     parse a single table with the index: [tab_id] (counting from 0).
                    When [None] - return a list of DataFrames (parse all tables)

        kwargs:     arguments to pass to `pd.read_csv()` function

    Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
    """
    def read_docx_tab(tab, **kwargs):
        vf = io.StringIO()
        writer = csv.writer(vf)
        for row in tab.rows:
            writer.writerow(cell.text for cell in row.cells)
        vf.seek(0)
        return pd.read_csv(vf, **kwargs)

    doc = Document(filename)
    if tab_id is None:
        return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
    else:
        try:
            return read_docx_tab(doc.tables[tab_id], **kwargs)
        except IndexError:
            print('Error: specified [tab_id]: {} does not exist.'.format(tab_id))
            raise

NOTE: you may want to add more checks and exception catching...

Examples:

In [209]: dfs = read_docx_tables(fn)

In [210]: dfs[0]
Out[210]:
   A   B               C,X
0  1  B1                C1
1  2  B2                C2
2  3  B3  val1, val2, val3

In [211]: dfs[0].dtypes
Out[211]:
A       int64
B      object
C,X    object
dtype: object

In [212]: dfs[0].columns
Out[212]: Index(['A', 'B', 'C,X'], dtype='object')

In [213]: dfs[1]
Out[213]:
   C1  C2          C3    Text column
0  11  21         NaN  Test "quotes"
1  12  23  2017-12-31            NaN

In [214]: dfs[1].dtypes
Out[214]:
C1              int64
C2              int64
C3             object
Text column    object
dtype: object

In [215]: dfs[1].columns
Out[215]: Index(['C1', 'C2', 'C3', 'Text column'], dtype='object')

Parsing dates:

In [216]: df = read_docx_tables(fn, tab_id=1, parse_dates=['C3'])

In [217]: df
Out[217]:
   C1  C2         C3    Text column
0  11  21        NaT  Test "quotes"
1  12  23 2017-12-31            NaN

In [218]: df.dtypes
Out[218]:
C1                      int64
C2                      int64
C3             datetime64[ns]
Text column            object
dtype: object
MaxU - stand with Ukraine
  • This logic works for writing **Powerpoint (pptx) files to a DataFrame** as well. Thanks!! – S3DEV Oct 14 '19 at 12:54