0

I'm trying to use python-pptx to extract XY scatter data for special cases where the reference excel sheet is not present and my existing VBA code fails to read the data. I can get all the y_values but I haven't been able to figure out how to get the x_values. From the documentation, I understand XY Scatter doesn't have "categories" like the other plots. However, I don't see any method or object in the chart or chart.plots[0] that will give me access to the x_values. I only see "categories" and of course, it is empty for XY Scatter.

    def get_chart_data():
    for sld in prs.slides:
        for shape in sld.shapes:
            if shape.has_chart:
                chart = shape.chart
                series_data=[]
                for series in chart.series:
                    y_val = []
                    for value in series.values:
                        y_val.append(value)
Miles Fett
  • 711
  • 4
  • 17
swalters
  • 173
  • 1
  • 5

2 Answers2

0

There's no API support for this yet, so if you want it bad enough you'll have to dive down to lxml level calls to get it.

An example of the XML for an XY-Scatter chart is show in this python-pptx analysis document: https://python-pptx.readthedocs.io/en/latest/dev/analysis/cht-xy-chart.html#xml-specimen. Here's a partial snippet:

<c:chart>
    <c:scatterChart>
      <c:ser>
        ...
        <c:xVal>
          <c:numRef>
            <c:f>Sheet1!$A$2:$A$7</c:f>
            <c:numCache>
              <c:formatCode>General</c:formatCode>
              <c:ptCount val="6"/>
              <c:pt idx="0">
                <c:v>0.7</c:v>
              </c:pt>
              <c:pt idx="1">
                <c:v>1.8</c:v>
              </c:pt>
              <c:pt idx="2">
                <c:v>2.6</c:v>
              </c:pt>
            </c:numCache>
          </c:numRef>
        </c:xVal>
        <c:yVal>
          <c:numRef>
            <c:f>Sheet1!$B$2:$B$7</c:f>
            <c:numCache>
              <c:formatCode>General</c:formatCode>
              <c:ptCount val="6"/>
              <c:pt idx="0">
                <c:v>2.7</c:v>
              </c:pt>
              <c:pt idx="1">
                <c:v>3.2</c:v>
              </c:pt>
              <c:pt idx="2">
                <c:v>0.8</c:v>
              </c:pt>
            </c:numCache>
          </c:numRef>
        </c:yVal>
      </c:ser>
      ...
    <c:scatterChart>
    ...
<c:chart>

The point data is divided between the c:xVal element and c:yVal element. You can use XPath and lxml.etree._Element calls to get it:

for series in chart.series:
    ser = series._ser
    x_pts = ser.xpath(".//c:xVal//c:pt")
    for pt in x_pts:
        print("pt.idx == %s", pt.get("idx"))
        str_value = pt.xpath("./c:v")[0].text
        value = float(str_value)
        print("value == %s" % value)

You'd need to extend this general approach to also get Y-values, match up into (x, y) pairs based on matching idx (don't rely on document order), probably discarding any that don't form full sets; I vaguely remember that can happen.

scanny
  • 26,423
  • 5
  • 54
  • 80
  • I confirmed this does works. Also after I posted the questions I did more research and found @scanny answer from 1yr ago. [link](https://github.com/scanny/python-pptx/issues/393) Would the method in the link be a better way to get the xVal's and yVal's. – swalters Oct 01 '19 at 15:57
0

I wanted to post the final code I have base on the answer from @scanny. My goal with this was to be able to loop through Power Point slides and extract data from XY Scatter plots that no longer had the linked excel sheet.

from pptx import Presentation
from pptx.chart.series import XySeries
import numpy as np

def get_chart_data(prs):
    for sld in prs.slides:
        for shape in sld.shapes:
            if shape.has_chart:
                chart = shape.chart
                series_data = {}
                series_data[shape.name] = {}
                if isinstance(chart.series[0], XySeries): #check if XY Series
                    for series in chart.series:
                        x_values, y_values = read_xy(series)

                        #create dictionary with Chart name and series names
                        series_data[shape.name][series.name] = np.array([x_values, y_values])
                    for c in series_data.keys(): #get chart keys
                        for s in series_data[c].keys(): # get series keys
                            data_final = series_data[c][s].T #retrieve XY data for given chart and series
                            np.savetxt(f'{shape.name}_{s}.csv', data_final, delimiter=',')

def read_xy(series):
    xVal = {}
    yVal = {}
    ser = series._ser
    x_pts = ser.xpath(".//c:xVal//c:pt") # get all xVals from xml with xpath query
    y_pts = ser.xpath(".//c:yVal//c:pt") # get all yVals from xml with xpath query
    for i in range(len(x_pts)): #loop through all xVals
        x_value = get_pt_val(x_pts[i]) #call function to get each x value
        y_value = get_pt_val(y_pts[i]) #call function to get each y value
        xVal[x_pts[i].idx] = x_value #store x value in dictionary
        yVal[y_pts[i].idx] = y_value # store y value in dictionary

    # in case x & y idx don't have matching pairs return keys that are common to both x & y
    key = set.intersection(*tuple(set(d.keys()) for d in [xVal, yVal]))
    xVal = [xVal[x] for x in key] #create xVal list
    yVal = [yVal[x] for x in key] #create yVal list
    return xVal, yVal

def get_pt_val(pt):
    str_value = pt.xpath("./c:v")[0].text #retrieve point value
    value = float(str_value)
    return value

if __name__ == '__main__':
    prs = Presentation('Test.pptx')
    get_chart_data(prs)

swalters
  • 173
  • 1
  • 5