How to Extract All of the Text out of Tables Inside of a Slide-show Presentation
The following code extracts text from tables in a slide-show presentation. Text in the presentation outside of tables is omitted, but you can modify my code to capture text from non-table objects as well.
from pptx import Presentation

def get_tables_from_presentation(pres):
    """
    The input parameter `pres` should receive
    an object returned by `pptx.Presentation()`

    EXAMPLE:
    ```
    import pptx
    p = "C:\\Users\\user\\Desktop\\power_point_pres.pptx"
    pres = pptx.Presentation(p)
    tables = get_tables_from_presentation(pres)
    ```
    """
    tables = []
    for slide in pres.slides:
        for shp in slide.shapes:
            if shp.has_table:
                tables.append(shp.table)
    return tables

def iter_to_nonempty_table_cells(tbl):
    """
    :param tbl: 'pptx.table.Table'
                input table is NOT modified
    :return: iterator over the stripped text of each non-empty cell
    """
    for ridx in range(len(tbl.rows)):
        for cidx in range(len(tbl.columns)):
            cell = tbl.cell(ridx, cidx)
            txt = cell.text.strip()
            # keep every non-empty cell, including single-character ones
            if txt:
                yield txt

# establish read path
in_file_path = "C:\\Users\\user\\Desktop\\power_point_pres.pptx"

# open slide-show presentation
pres = Presentation(in_file_path)

# extract tables from slide-show presentation
tables = get_tables_from_presentation(pres)

for tbl in tables:
    it = iter_to_nonempty_table_cells(tbl)
    # one cell per line, so that adjacent cells do not run together
    print("\n".join(it))
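As mentioned at the top, text outside of tables is skipped by the code above. One way to pick it up as well is sketched below. It assumes every shape exposes the python-pptx attributes has_text_frame and has_table; the name iter_all_text is made up for this illustration, and shapes nested inside group shapes are not visited.

```python
def iter_all_text(pres):
    """Yield stripped, non-empty text from text frames and table cells."""
    for slide in pres.slides:
        for shape in slide.shapes:
            if getattr(shape, "has_text_frame", False):
                # ordinary text boxes, titles, placeholders, ...
                text = shape.text_frame.text.strip()
                if text:
                    yield text
            elif getattr(shape, "has_table", False):
                # graphic frames that hold a table
                for row in shape.table.rows:
                    for cell in row.cells:
                        text = cell.text.strip()
                        if text:
                            yield text
```

Usage would look like `print("\n".join(iter_all_text(pres)))`, with `pres` opened as shown earlier.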
A Note About One of the Other Answers to This Question
Someone else posted a semi-useful answer to this question written in pseudo-code. They wrote the following:
For r = 1 to tbl.rows.count
    For c = 1 to tbl.columns.count
        tbl.cell(r,c).Shape.Textframe.Text
The problem is that this is not Python.
In Python, it is a syntax error to write For r = 1 to 10.
Instead, we would write something like one of the following:
for r in range(1, 11):
    print(r)

# or, equivalently:
from itertools import count, takewhile
for r in takewhile(lambda k: k <= 10, count(1)):
    print(r)
Additionally, the row indices start at r = 0, not r = 1.
The upper-left corner of the table is tbl.cell(0,0), not tbl.cell(1,1).
There is no such thing as .count for the rows attribute or the columns attribute. For r = 1 to tbl.rows.count makes no sense because there is no such thing as tbl.rows.count; use len(tbl.rows) and len(tbl.columns) instead.
tbl.cell(r,c).Shape won't work, because objects instantiated from the class pptx.table._Cell have no attribute named Shape.
Cell objects have the following attributes:
fill
is_merge_origin
is_spanned
margin_bottom
margin_left
margin_right
margin_top
merge
part
span_height
span_width
split
text
text_frame
vertical_anchor
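A list like the one above can be produced for any object by filtering the output of the built-in dir(). The helper name public_attributes is just for this sketch and is not part of python-pptx; the Demo class is a stand-in so the example runs without a presentation file.

```python
def public_attributes(obj):
    """Return the sorted names of attributes that do not start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


class Demo:
    """Stand-in object; with python-pptx you would pass tbl.cell(0, 0) instead."""
    text = "hello"

    def merge(self, other):
        pass


print(public_attributes(Demo()))  # ['merge', 'text']
```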
A fix is shown below:
# ----------------------------------------
# BEGIN SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# For r = 1 to tbl.rows.count
#     For c = 1 to tbl.columns.count
#         tbl.cell(r,c).Shape.Textframe.Text
# ----------------------------------------
# END SYNTACTICALLY INCORRECT CODE
# ----------------------------------------
# BEGIN SYNTACTICALLY CORRECT CODE
# ----------------------------------------
for r in range(len(tbl.rows)):
    for c in range(len(tbl.columns)):
        print(tbl.cell(r, c).text)
# ----------------------------------------
# END SYNTACTICALLY CORRECT CODE
# ----------------------------------------
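If the row and column indices are not needed, the rows can also be iterated directly. This is only a sketch: it relies on the cells attribute of python-pptx row objects, and table_to_rows is a name chosen for this example.

```python
def table_to_rows(tbl):
    """Return the table's text as a list of rows, each a list of cell strings."""
    return [[cell.text for cell in row.cells] for row in tbl.rows]
```

Each inner list is one table row, so `table_to_rows(tbl)[0][0]` corresponds to `tbl.cell(0, 0).text`.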
A Note About Your Original Code
The continue keyword
In your original source code, you have the following for-loop:
for shape in slide.shapes:
    if not shape.has_text_frame:
        continue
That for-loop does not do anything.
The continue keyword simply means "skip the rest of the loop body and jump to the next iteration of the loop." However, there is no code after your continue and before the end of the loop. That is, the loop would have moved on to the next iteration anyway, without you having to write continue, because it is already at the end of the loop body.
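The continue keyword starts to matter once something follows it in the loop body. Below is a sketch of what the loop might have been meant to do, assuming the goal was to collect the text of every shape that has a text frame. The function name collect_shape_text is hypothetical.

```python
def collect_shape_text(shapes):
    """Return the text of every shape that has a text frame."""
    texts = []
    for shape in shapes:
        if not shape.has_text_frame:
            continue  # skip this shape; the append below never runs for it
        texts.append(shape.text_frame.text)
    return texts
```

Here the continue does real work, because it prevents the append from running for shapes without a text frame.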
To understand more about continue, consider the following example:
for k in [1, 2, 3, 4, 5]:
    print("For k ==", k, "we have k % 2 ==", k % 2)
    if k % 2 != 0:
        continue
    print("For k ==", k, "we got past the `continue`")
The output is:
For k == 1 we have k % 2 == 1
For k == 2 we have k % 2 == 0
For k == 2 we got past the `continue`
For k == 3 we have k % 2 == 1
For k == 4 we have k % 2 == 0
For k == 4 we got past the `continue`
For k == 5 we have k % 2 == 1
The following three pieces of code all print the exact same messages, regardless of the use of the continue keyword:
for k in [1, 2, 3, 4, 5]:
    print(k)

for k in [1, 2, 3, 4, 5]:
    print(k)
    continue

for k in [1, 2, 3, 4, 5]:
    print(k)
    if k % 2 == 0:
        continue
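The claim that the three loops behave identically can be checked rather than taken on faith. The sketch below wraps each loop in a function (the function names are invented for this example), captures what each one prints, and compares the results.

```python
import io
from contextlib import redirect_stdout


def loop_plain():
    for k in [1, 2, 3, 4, 5]:
        print(k)


def loop_trailing_continue():
    for k in [1, 2, 3, 4, 5]:
        print(k)
        continue


def loop_conditional_continue():
    for k in [1, 2, 3, 4, 5]:
        print(k)
        if k % 2 == 0:
            continue


def captured_output(func):
    """Run func and return everything it printed as one string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


loops = (loop_plain, loop_trailing_continue, loop_conditional_continue)
outputs = {captured_output(f) for f in loops}
print(len(outputs) == 1)  # True when all three loops printed the same text
```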