
I am trying to read through ~100 PowerPoint slides and extract the notes section of each slide. I will do some text wrangling and write the result to CSV afterwards, but first I need to get the notes into a workable format.

I am currently using the officer package's read_pptx function, but I am open to whatever packages are needed. It doesn't seem to pull in notes, but I may just be looking at this wrong.

Here is what I've tried:

    library(officer)

    ppt_var <- read_pptx('test_presentation.pptx')
    View(ppt_var)

Ideally, I would get the text of each slide's notes into its own variable so that I can write them to a CSV. I am confident that I can handle the manipulation once the notes are read in, but I cannot seem to get that part working.

Thank you for any pointers or support!

M--
  • You can always read them from the `xml` file. I don't know of a package that will do it for you. – M-- Apr 29 '19 at 14:59
  • All office files are zipped XML. Unzip it, read the xml, you should be able to find the notes. – Gregor Thomas Apr 29 '19 at 15:00
  • Another option would be to write a VBScript and run it from R. Look here for the VBS tip: https://learn.microsoft.com/en-us/office/vba/api/powerpoint.slide.notespage – M-- Apr 29 '19 at 15:02
  • or `C#`: https://stackoverflow.com/questions/2164819/how-to-programmatically-read-and-change-slide-notes-in-powerpoint – M-- Apr 29 '19 at 15:09
  • This question is specifically about the officer package so I have voted to reopen it. – G. Grothendieck Apr 29 '19 at 22:15

2 Answers


How to do that is shown in the code here: https://github.com/davidgohel/officer/issues/117 .

The following is based on that code:

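    library(magrittr)
    library(officer)
    library(xml2)

    # read_pptx() unpacks the file into a temporary directory (package_dir)
    p <- read_pptx("mypresentation.pptx")
    notes_dir <- file.path(p$package_dir, "ppt", "notesSlides")
    # escape the dot so only names ending in ".xml" match
    files <- list.files(path = notes_dir, pattern = "\\.xml$", full.names = TRUE)

    # one character vector of text runs (<a:t> nodes) per notes file
    Notes <- lapply(files,
      . %>%
        read_xml %>%
        xml_find_all("//a:t") %>%
        xml_text
    )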
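
Since the goal is a CSV, here is a rough sketch of how the per-file text runs could be collapsed into one row per notes file. The helper names `read_notes_file` and `notes_to_df` are made up for illustration and are not part of officer or xml2. It sorts the files by the number in their name, because `list.files()` sorts alphabetically and would otherwise put notesSlide10.xml before notesSlide2.xml. It assumes each notes file declares the `a:` namespace, as the files PowerPoint writes normally do.

```r
library(xml2)

# Hypothetical helper: join every <a:t> text run in one notes XML file
# into a single string (empty string if the file has no text runs).
read_notes_file <- function(path) {
  doc <- read_xml(path)
  runs <- xml_text(xml_find_all(doc, "//a:t"))
  paste(runs, collapse = " ")
}

# Hypothetical helper: one row per notesSlideN.xml, ordered by N.
notes_to_df <- function(notes_dir) {
  files <- list.files(notes_dir, pattern = "^notesSlide[0-9]+\\.xml$",
                      full.names = TRUE)
  # pull the number out of the file name so 10 sorts after 2
  idx <- as.integer(gsub("[^0-9]", "", basename(files)))
  files <- files[order(idx)]
  data.frame(
    notes_file = basename(files),
    text = vapply(files, read_notes_file, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Usage, assuming `p` comes from officer::read_pptx() as above:
# notes <- notes_to_df(file.path(p$package_dir, "ppt", "notesSlides"))
# write.csv(notes, "notes.csv", row.names = FALSE)
```

Two caveats: as far as I know, the number in a notesSlide file name is not guaranteed to equal the slide's position in the deck, and the collected text may include the slide-number placeholder, so check a few rows against the presentation before relying on the result.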
G. Grothendieck

Assuming you are using the DocumentFormat.OpenXml package (the Open XML SDK) in C#, a more native way would be:

    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Presentation;

    public static SlidePart GetSlidePart(PresentationDocument pptxDoc, int index)
    {
        // Get the relationship ID of the slide at the given zero-based index.
        PresentationPart presentationPart = pptxDoc.PresentationPart;
        OpenXmlElementList slideIds = presentationPart.Presentation.SlideIdList.ChildElements;
        string relId = (slideIds[index] as SlideId).RelationshipId;

        // Get the slide part from the relationship ID.
        return (SlidePart)presentationPart.GetPartById(relId);
    }

    public static string GetNoteText(PresentationDocument pptxDoc, int index)
    {
        // Get the slide part.
        SlidePart slidePart = GetSlidePart(pptxDoc, index);
        // Extract the note text; NotesSlidePart is null when the slide has no notes.
        return slidePart.NotesSlidePart?.NotesSlide.InnerText ?? string.Empty;
    }
jmerrill2001