
I have to add a table of around 1500 rows and 9 columns (about 75 pages) from a CSV file into a .docx Word document, using python-docx.

I have tried different approaches, reading the CSV with pandas or opening the CSV file directly, but it takes around 150 minutes to finish the job regardless of which way I choose.

My question is whether this is normal behavior, or whether there is some way to speed this task up.

I'm using this for loop to read several CSV files and parse each one into table format:

import pandas as pd
from docx import Document
from docx.shared import Pt

doc = Document()

for toTAB in listBRUTO:  # listBRUTO is the list of CSV file paths
    df = pd.read_csv(toTAB)

    # add a table to the end and create a reference variable;
    # the extra row is so we can add the header row
    t = doc.add_table(df.shape[0] + 1, df.shape[1])
    t.style = 'LightShading-Accent1'  # border

    # add the header row
    for j in range(df.shape[-1]):
        t.cell(0, j).text = df.columns[j]

    # add the rest of the data frame
    for i in range(df.shape[0]):
        for j in range(df.shape[-1]):
            t.cell(i + 1, j).text = str(df.values[i, j])

    # table formatting
    for row in t.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                for run in paragraph.runs:
                    run.font.name = 'Calibri'
                    run.font.size = Pt(7)

    doc.add_page_break()

doc.save('blabla.docx')

Thanks in advance

4 Answers


You'll want to minimize the number of calls to table.cell(). Because of the way cell-merging works, these are expensive operations that really add up when performed in a tight loop.

I would start with refactoring this block and see how much improvement that yields:

# --- add the rest of the data frame ---
for i in range(df.shape[0]):
    for j, cell in enumerate(t.rows[i + 1].cells):
        cell.text = str(df.values[i, j])
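
The formatting loop at the end of your code has the same problem: each row.cells access rebuilds the complete cell list (the next answer's timings show exactly this cost). As a hedged sketch, you could reuse the private _cells property to walk the table only once while formatting:

# format every run with a single walk over the table's cells;
# _cells is a private python-docx property, so it may change between versions
for cell in t._cells:
    for paragraph in cell.paragraphs:
        for run in paragraph.runs:
            run.font.name = 'Calibri'
            run.font.size = Pt(7)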
scanny
  • Wow. My code: [Done] exited with code=0 in 465.551 seconds; your code: [Done] exited with code=0 in 106.166 seconds. Almost 5 times faster. I'm going to run more examples. – Karendon seisysiete Jun 30 '20 at 09:57
  • Glad it's showing improvement :) Don't neglect to accept the answer that best answered your question. That's how you acknowledge the folks who took the time to respond to your question. – scanny Jun 30 '20 at 18:39
  • The DataFrame has RangeIndex: 474 entries, 0 to 473, and 10 columns: Host, Enterprise, Service, Command, Arguments, User Name, Simple identity, Process id (int64), Request Time, OS (all 474 non-null; every column object except Process id). – Karendon seisysiete Jul 02 '20 at 11:56
  • 5 times faster, thanks scanny. – Karendon seisysiete Jul 02 '20 at 11:57

python-docx walks the whole table every single time you access its cells property, so you should call .cell() as little as possible and cache the cells instead.
Here are two examples accessing a table of size 3×1500:

code 1: about 150.0s

for row in table.rows:
    print('processing: {0:30s}'.format(row.cells[0].text),end='\r')

code 2: about 1.4s

clls = table._cells
for row_idx in range(len(clls) // table._column_count):
    print('processing: {0:30s}'.format(
        clls[0 + row_idx * table._column_count].text), end='\r')

clls = table._cells in code 2 uses _cells to resolve the cell merging once up front, so clls[column_idx + row_idx*table._column_count].text works just as well as table.rows[row_idx].cells[column_idx].text, and it doesn't require the table to be exactly rectangular.
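
Applied to the question's DataFrame-filling loop, the same cache looks like this (a sketch, assuming t and df as created in the question, with one extra header row):

cells = t._cells          # one walk over the table's XML
n_cols = t._column_count

# header row
for j, name in enumerate(df.columns):
    cells[j].text = str(name)

# body rows, offset by one for the header row
for i in range(df.shape[0]):
    for j in range(n_cols):
        cells[(i + 1) * n_cols + j].text = str(df.values[i, j])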

kztopia

For a rectangular table without merged cells you can export all cells into a list-of-lists structure and fill them very quickly (less than 0.5 s vs. 15 s for a ~300-row table with 3 columns):

from docx.table import _Cell

def get_cells_grid(table):
    # walk the table's XML once and return its cells as a list of rows
    cells = [[]]
    col_count = table._column_count
    for tc in table._tbl.iter_tcs():
        cells[-1].append(_Cell(tc, table))
        if len(cells[-1]) == col_count:
            cells.append([])
    return cells

cells = get_cells_grid(t)

for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        cells[i + 1][j].text = str(df.values[i, j])  # +1 skips the header row from the question's code

Function based on the table._cells property code: https://github.com/python-openxml/python-docx/blob/da75fcf01f7f322e846e2ac3e1936aedd766acc8/docx/table.py#L162

Stanislav Ivanov

Just to add my experience: if you have to create a huge table, create the whole structure first, meaning all the rows and cells you will need, and then store the cells like so:

table_cells = table._cells (as per @kztopia's answer above)

From there you can manipulate the cells as you wish (merging, adding text, etc.) at a decent speed, since you walk the table only once instead of calling cell() for each operation.

In my use case, for a table that is not even that big in my opinion (~130 rows, 8 cells per row), it used to take 9 s to create the whole thing; now it takes about 0.5 s.

Keep in mind that the bigger the table, the more time each cell() call takes.
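
A minimal sketch of that pattern (the variable names and the merge example are illustrative, not from the original answer):

from docx import Document

doc = Document()
n_rows, n_cols = 130, 8

# create the whole structure first...
table = doc.add_table(rows=n_rows, cols=n_cols)

# ...then walk the table once and keep the flat, row-major cell list
table_cells = table._cells

# example manipulation: merge the first two cells of the top row
heading = table_cells[0].merge(table_cells[1])
heading.text = 'Heading'

# fill the remaining rows by index arithmetic instead of cell() calls
for i in range(1, n_rows):
    for j in range(n_cols):
        table_cells[i * n_cols + j].text = 'row %d, col %d' % (i, j)

doc.save('demo.docx')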

Nicu Tofan
Alex