I need to extract metadata from PowerPoint presentation files and perform document tagging using R or RapidMiner. How can I read PPT files in both tools and then carry out the text processing?
-
You can't read the metadata directly with R or Rapidminer. But if you extract the data with some additional [tools](http://www.forensicswiki.org/wiki/Document_Metadata_Extraction), you should be able to import them. – David Feb 12 '16 at 09:22
-
Thank you so much, David, for your reply. I have one more question: can this document tagging problem be solved with R and RapidMiner, or are there other approaches or tools for it? Please guide me on this, as I am very new to this concept. – vijay Feb 12 '16 at 09:55
-
I'm not an expert on metadata tagging or the tools for this task. In my experience, if you can extract the data in some structured format (XML, plain text, ...), you can import it into RapidMiner and R. But for the extracting part I can't recommend any tools. – David Feb 12 '16 at 11:36
1 Answer
Just noticed I answered this on your duplicate question, so I'm deleting my answer there and adding it here to be more helpful to other users.
I recently answered a very similar question on the RapidMiner support site: Reading Powerpoint with RapidMiner.
I'll reproduce the answer here: PPTX files are simply ZIP archives containing XML documents that tell PowerPoint where to place each part of the content. All the slide content is stored in /ppt/slides/ as slide1.xml, slide2.xml, etc. (Other directories hold slide notes and other content.)
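For anyone who wants to check this layout outside RapidMiner, here is a minimal Python sketch using only the standard library. The function name `list_slide_entries` is made up for illustration, and it assumes the file is a .pptx (not the older binary .ppt format). Slides are sorted by the number in the file name, which may not always match the order shown in the presentation.

```python
import re
import zipfile


def list_slide_entries(pptx_path):
    """Return the slide XML entry names inside a .pptx archive.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    with zipfile.ZipFile(pptx_path) as archive:
        # Entry names inside the ZIP have no leading slash.
        slides = [
            name
            for name in archive.namelist()
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)
        ]

    def slide_number(name):
        # "ppt/slides/slide10.xml" -> 10
        return int(re.search(r"\d+", name.rsplit("/", 1)[1]).group())

    # Sort numerically so slide10.xml comes after slide9.xml.
    return sorted(slides, key=slide_number)
```

Calling `list_slide_entries("deck.pptx")` on a real presentation (the file name here is a placeholder) should return entries such as `ppt/slides/slide1.xml`.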
To read it with RapidMiner, use the Loop Zip-File Entries operator and set the parameter internal directory to ppt/slides. This will loop through all the XML files mentioned above.
Inside that nested operator, use the Read Document operator set to Extract Text Only, with a content type of XML. This should extract the contents of every slide in the presentation.
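The same loop-and-extract idea can be sketched in plain Python, which may help when checking what the operators should return. This is only an approximation of the RapidMiner process, not its implementation. It assumes the visible slide text sits in DrawingML `<a:t>` elements, and `extract_slide_texts` is a name invented for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) in slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def extract_slide_texts(pptx_path):
    """Map each slide entry name to the text of its <a:t> elements.

    Hypothetical helper; accepts a path or a file-like object.
    """
    texts = {}
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            # Only the slide files, not the _rels entries next to them.
            if not re.fullmatch(r"ppt/slides/slide\d+\.xml", name):
                continue
            root = ET.fromstring(archive.read(name))
            runs = [t.text for t in root.iter(f"{{{A_NS}}}t") if t.text]
            texts[name] = " ".join(runs)
    return texts
```

Joining runs with a space is a simplification: it ignores paragraph boundaries, which is usually fine for tagging but not for reproducing the slide layout.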
That answers the first part of your question. For the second part, once the text is loaded you can use any of the RapidMiner text processing operators.
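As a rough illustration of what a basic tagging step could look like once the slide text is available, here is a small Python sketch that proposes the most frequent non-trivial words as tags. It is a stand-in for the text processing operators, not a description of them. The stopword list, the three-letter minimum and the name `suggest_tags` are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real process would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}


def suggest_tags(text, top_n=5):
    """Return the top_n most frequent words as candidate tags (hypothetical helper)."""
    # Lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        token for token in tokens if token not in STOPWORDS and len(token) > 2
    )
    return [word for word, _ in counts.most_common(top_n)]
```

Raw term frequency is the simplest possible choice here. Weighting schemes such as TF-IDF across many presentations would usually give more distinctive tags.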