
I have a local folder containing 64 individual EVENTLOGSTATE files in XML format that I'm trying to read into R. I can access the folder and list all the files in it, but when I try to read them with xmlParse from library(XML), I get the error "XML content does not seem to be XML".

For reference, here are my list.files line, my xmlParse line and the returned error, followed by an example of the file names in the folder and the data in each file.

list.files(path = "C:\\Users\\OneDrive\\Documents\\XML") #pulls list of file names within the XML folder

xmlParse(list.files(path = "C:\\Users\\OneDrive\\Documents\\XML"))
> xmlParse(list.files(path = "C:\\Users\\OneDrive\\Documents\\XML"))
Error: XML content does not seem to be XML: 'f5e450.eventLogState
EventLog-0e6f76b3-12bc-4d4a-aab6-a97600f5f46b.eventLogState
EventLog-11fbd569-4fd5-4bbe-89aa-a9df01378901.eventLogState
EventLog-151c1acc-0062-4f97-989a-a9d7015233f1.eventLogState

Each EventLog file contains data about a recorded session. I need to pull out the recording start and end times, build a data frame from them, and then calculate the total length and create visuals. The files are all separate, and each contains information in this format:

<?xml version="1.0" encoding="utf-8"?>
<EventLogState xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Panopto.Recorder">
  <AttemptCount>5</AttemptCount>
  <ErrorInfo>Unable to generate event logs</ErrorInfo>
  <FileInfo i:nil="true" />
  <PanoptoSiteFQDN>hosted.panopto.com</PanoptoSiteFQDN>
  <RecordingEndTime>2018-10-11T12:13:38.1115286-04:00</RecordingEndTime>
  <RecordingId>0e6f76b3-12bc-4d4a-aab6-a97600f5f46b</RecordingId>
  <RecordingStartTime>2018-10-11T11:04:04.9321231-04:00</RecordingStartTime>
  <SessionId>c3c84fee-836b-4d30-8115-a97600f85490</SessionId>
  <Status>Error</Status>
</EventLogState>

I tried this loop solution, but it just returns a 0 x 0 tibble:

library(xml2)
library(dplyr)
files <- list.files(path = "C:\\Users\\OneDrive\\Documents\\XML")
dfs <-lapply(files, function(files) {
  page <- read_xml(file)
  id <- xml_find_first(out, "//EventLogState") %>% xml_attr("xmlns:i") 
  end.time <- xml_find_first(out, ".//RecordingEndTime") %>% xml_text()
  start.time <- xml_find_first(out, ".//RecordingStartTime") %>% xml_text()
  data.frame(id, end.time, start.time)
})

#combine all results into 1 data frame
answer <- bind_rows(dfs)
answer

Any ideas on how to get the xmlParse line to recognize each individual file and pull in a combined text version to work with?

Dave2e
data_life
  • `xmlParse()` is not a vectorized function. You will need some type of loop that passes each file name to the function one by one. Here are two similar questions/answers: https://stackoverflow.com/questions/49196674/read-multiple-xml-files-in-r-and-combine-the-data or https://stackoverflow.com/questions/66319733/xml-files-to-dataframe/66394378#66394378 – Dave2e Dec 10 '21 at 20:19
  • I have tried following both of those, and I don't really even know where to start. I'm really new to this and can't find any articles to help that I've been able to get to work. – data_life Dec 11 '21 at 05:44

1 Answer


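That was a good start. These files have a default namespace associated with them, which does throw in a curve ball: an unprefixed XPath such as ".//RecordingId" will not match a namespaced element. The easiest way to handle the namespace is to strip it out.
Also, ensure the correct object is referenced in the xml_find_first() calls: the parsed document is stored in "page", not "out".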

This should now work for you:

library(xml2)
library(dplyr)
# full.names = TRUE returns complete paths, so read_xml() can find the files
# even when the working directory is not the XML folder
files <- list.files(path = "C:\\Users\\OneDrive\\Documents\\XML", full.names = TRUE)
dfs <- lapply(files, function(file) {
   page <- read_xml(file)
   # Check for a namespace:
   #    xml_ns(page)
   # It is easier to work with the file if the namespace is removed
   # (xml_ns_strip() modifies "page" in place)
   xml_ns_strip(page)
   # Extract the text of the three nodes of interest
   id <- xml_find_first(page, ".//RecordingId") %>% xml_text()
   end.time <- xml_find_first(page, ".//RecordingEndTime") %>% xml_text()
   start.time <- xml_find_first(page, ".//RecordingStartTime") %>% xml_text()
   # One row per file
   data.frame(id, end.time, start.time)
})

#combine all results into 1 data frame
answer <- bind_rows(dfs)
answer

The above code assumes only one "EventLogState" node per file.
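
If you would rather not strip the namespace, a minimal sketch of the alternative is to query with a namespace prefix instead. It assumes xml2 exposes the default namespace under the prefix "d1" (which is what xml_ns() reports for documents like this one), and it uses a shortened copy of the sample file from the question as an inline string so it runs without any files on disk; the variable names sample_xml, unqualified and result are just for illustration.

```r
library(xml2)

# Shortened copy of the sample file from the question, kept as a string
sample_xml <- '<?xml version="1.0" encoding="utf-8"?>
<EventLogState xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Panopto.Recorder">
  <RecordingEndTime>2018-10-11T12:13:38.1115286-04:00</RecordingEndTime>
  <RecordingId>0e6f76b3-12bc-4d4a-aab6-a97600f5f46b</RecordingId>
  <RecordingStartTime>2018-10-11T11:04:04.9321231-04:00</RecordingStartTime>
</EventLogState>'

page <- read_xml(sample_xml)

# An unprefixed XPath only matches elements in no namespace,
# so nothing is found here and xml_text() gives NA
unqualified <- xml_text(xml_find_first(page, ".//RecordingId"))

# xml_ns() lists the namespaces; the default one is named "d1"
ns <- xml_ns(page)

# Prefixing each element name with d1: makes the XPath match
id <- xml_text(xml_find_first(page, ".//d1:RecordingId", ns))
end.time <- xml_text(xml_find_first(page, ".//d1:RecordingEndTime", ns))
start.time <- xml_text(xml_find_first(page, ".//d1:RecordingStartTime", ns))

result <- data.frame(id, end.time, start.time)
result
```

Stripping the namespace keeps the XPath expressions shorter, while the prefixed form leaves the parsed document untouched; for files as simple as these, either approach should give the same data frame.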

Dave2e
  • I tried making those updates and it gives me Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "function" – data_life Dec 11 '21 at 23:23
  • @data_life Sorry there was a typo in the `lapply()` definition, which I corrected. One shouldn't name variables with the same name as a function. "file" in this case. Sometimes, I should take my own advice :) – Dave2e Dec 13 '21 at 02:04