6

I have one input xml file.

cat sample.xml

<Text>
    &lt;p&gt;ABC &lt;/p&gt;
</Text>

R script

library(XML)
doc = xmlTreeParse("sample.xml", useInternal = TRUE)
top<-xmlRoot(doc)

sub("&lt;","<",top[[1]])

How can I fix the above problem?

Error Message: Error in as.vector(x, "character") : cannot coerce type 'externalptr' to vector of type 'character'

Edit: The aim is to use the readHTMLTable() function on a particular node in the XML that contains an HTML table. The table uses XML markup (&gt; and &lt;) for > and <, which needs to be replaced first, as readHTMLTable() cannot handle XML markup.

Manish
  • 1
    The XML entity markup GETS replaced by magic by the XML package functions, as I explained. The `xmlValue` function returns this. Feed that into `readHTMLTable` and job done. – Spacedman Jan 31 '13 at 08:56

3 Answers

6

And now the answer to your real question:

sample.xml with encoded table:

<Text>
&lt;table&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
</Text>

Read it in:

> library(XML)
> doc = xmlTreeParse("sample.xml", useInternal = TRUE)
> top<-xmlRoot(doc)

Convert to text:

> table=xmlValue(top)
> table
[1] "\n<table>\n<tr><td>1</td><td>2</td></tr>\n<tr><td>2</td><td>8</td></tr>\n<tr><td>4</td><td>32</td></tr>\n</table>\n"

This is now ready to feed to readHTMLTable. No string conversion needed:

> readHTMLTable(table)
$`NULL`
  V1 V2
1  1  2
2  2  8
3  4 32
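
For readers who want to try this without a file on disk, here is a minimal self-contained sketch of the same steps. The inline string xml_text is my stand-in for sample.xml, and the explicit htmlParse(..., asText = TRUE) call plus which = 1 are my additions, not part of the answer above. They make it unambiguous that the string is markup rather than a file name, and return the data frame directly instead of a list.

```r
library(XML)

# Inline stand-in for the sample.xml shown above (an assumption for this sketch)
xml_text <- paste0(
  "<Text>&lt;table&gt;",
  "&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;",
  "&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;/tr&gt;",
  "&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;/tr&gt;",
  "&lt;/table&gt;</Text>"
)

# Parse the XML; xmlValue() hands back the text with entities already decoded
doc <- xmlParse(xml_text, asText = TRUE)
table_html <- xmlValue(xmlRoot(doc))

# Parse the decoded text as HTML, then read the first (and only) table
html_doc <- htmlParse(table_html, asText = TRUE)
tbl <- readHTMLTable(html_doc, which = 1, stringsAsFactors = FALSE)
print(tbl)
```

With which = 1 the result is a single data frame, so there is no $`NULL` list element to unwrap.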

Howzat?

Spacedman
  • 1
    Perfect answer! Is there any way to replace the HTML table with the new table? $`NULL` V1 V2 1 1 2 2 2 8 3 4 32 – Manish Jan 31 '13 at 09:41
  • 1
    Huh what? Do you want to replace the element with R's output? That's worth asking as another question, and has probably been answered already... – Spacedman Jan 31 '13 at 10:54
  • Yes, I want the same. I want to replace only the table with R's output. – Manish Jan 31 '13 at 11:44
  • 1
    Okay, you are moving the goalposts now. StackOverflow works best with solutions to simple problems. This question/answer has solved one - how to get and manipulate text in XML. Now go ask the next step in a separate question. – Spacedman Jan 31 '13 at 12:00
  • Actually, my aim is not to manipulate but just to print the content of as uploaded here http://textuploader.com/?p=6&id=5ZoIe apart from the HTML table, specially in

    tag
    – Manish Jan 31 '13 at 14:19
5

If your question is how to replace a string in the content of an XML node, then you can check the following code, using the sample.xml file you provided:

library(XML)
## Parse the XML file
doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
## Select the nodes we want to update
nodes <- getNodeSet(doc, "//Text")
## For each node, apply gsub on the content of the node
lapply(nodes, function(n) {
  xmlValue(n) <- gsub("ABC","foobar",xmlValue(n))
})

Which will give you:

R> doc
<?xml version="1.0"?>
<Text>
    &lt;p&gt;foobar &lt;/p&gt;
</Text>

Here you can see that "ABC" has been replaced by "foobar".

But if you try this code with the substitution you want to achieve (replace "&lt;" with "<"), it apparently won't work:

doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
nodes <- getNodeSet(doc, "//Text")
lapply(nodes, function(n) {
  xmlValue(n) <- gsub("&lt;","<",xmlValue(n))
})

will give you:

R> doc
<?xml version="1.0"?>
<Text>
    &lt;p&gt;ABC &lt;/p&gt;
</Text>

Why? If you are working with XML files, you should know that some characters, mainly <, >, & and ", are reserved because they are part of the base XML syntax. In particular, a literal < or & cannot appear in the text content of a node, otherwise parsing would fail. So they are replaced by entities, which are a sort of encoding of these characters. For example, "<" is encoded as "&lt;", "&" is encoded as "&amp;", etc.

So here, the content of your node contains a "<" character, which has been automatically converted to its entity "&lt;". What you try to do with your code is to replace "&lt;" back with "<", which R will gladly do for you, but as it is the text content of a node, the XML package will immediately convert it back to "&lt;".
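
A small sketch that makes this round trip visible. The inline string xml_text is my stand-in for sample.xml; xmlValue() and saveXML() are regular XML package functions. The first call returns the text with real angle brackets, the second serializes the node with entities again.

```r
library(XML)

# Inline stand-in for the question's sample.xml (an assumption for this sketch)
xml_text <- "<Text>&lt;p&gt;ABC &lt;/p&gt;</Text>"
doc <- xmlParse(xml_text, asText = TRUE)
top <- xmlRoot(doc)

# In memory, the text content holds real characters, not entities
decoded <- xmlValue(top)
print(decoded)

# When the node is written out, the reserved characters are escaped again
serialized <- saveXML(top)
print(serialized)
```

So there is no "&lt;" in the value for gsub() to find, and whatever "<" you put into the text comes back out as "&lt;" when the document is printed.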

So, if what you want to achieve is to convert your string "&lt;p&gt;ABC &lt;/p&gt;" into a new XML node "<p>ABC </p>", you can't do it that way. A solution would be to parse your text string, detect the name of the node (here, "p") from it, create a new node with xmlNode(), give it the text content "ABC" and replace the string with the node you just created.
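
A rough sketch of that idea, under a few assumptions of mine: the XML is given inline as xml_text instead of read from sample.xml, the encoded text is a single flat element such as <p>ABC </p>, and newXMLNode() is used rather than xmlNode() because the document is held as internal nodes. Nested markup or attributes would need more work.

```r
library(XML)

# Inline stand-in for the question's sample.xml (an assumption for this sketch)
xml_text <- "<Text>&lt;p&gt;ABC &lt;/p&gt;</Text>"
doc <- xmlParse(xml_text, asText = TRUE)
top <- xmlRoot(doc)

# Parse the decoded text on its own to find the element name and its content
parsed <- xmlRoot(xmlParse(xmlValue(top), asText = TRUE))

# Remove the old text child, then create a real element under <Text>
removeChildren(top, kids = xmlChildren(top))
newXMLNode(xmlName(parsed), xmlValue(parsed), parent = top)

result <- saveXML(top)
print(result)
```

After this, <Text> contains a genuine <p> element instead of escaped text.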

Another quick and dirty way to do it would be first to replace all the entities in your file without parsing the XML. Something like this:

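## Read the raw file as plain text (pass the path directly so no connection is left open)
txt <- readLines("sample.xml")
## Replace the entities literally; fixed = TRUE avoids regex interpretation
txt <- gsub("&lt;", "<", txt, fixed = TRUE)
txt <- gsub("&gt;", ">", txt, fixed = TRUE)
writeLines(txt, "sample2.xml")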
doc2 <- xmlTreeParse("sample2.xml", useInternal = TRUE)

Which gives:

R> doc2
<?xml version="1.0"?>
<Text>
  <p>ABC </p>
</Text>

But this is dangerous, because if there is a "real" "&lt;" entity in your file, parsing will fail.
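
To illustrate the failure mode, here is a small sketch with a made-up document (xml_text is hypothetical, not from the question) whose text genuinely contains a less-than sign. Parsing first is fine, while the blind replacement yields a string that is no longer well-formed.

```r
library(XML)

# Hypothetical document whose text really contains a less-than sign
xml_text <- "<Text>1 &lt; 2</Text>"

# Parsing first and then asking for the value is safe
safe_value <- xmlValue(xmlRoot(xmlParse(xml_text, asText = TRUE)))
print(safe_value)

# Blind replacement produces "<Text>1 < 2</Text>", which is not well-formed XML
broken_text <- gsub("&lt;", "<", xml_text, fixed = TRUE)
parse_ok <- tryCatch({
  xmlParse(broken_text, asText = TRUE)
  TRUE
}, error = function(e) FALSE)
print(parse_ok)
```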

juba
  • The above code converts the XML node into a list, but I need XML output so that I can read this node using the readHTMLTable() function. – Manish Jan 31 '13 at 08:49
  • 1
    Yes, I know you need xml output, that's the whole point of my post. And what is the code that converts xml into a list ? – juba Jan 31 '13 at 08:53
  • 1
    I have edited my question. I cannot use the readHTMLTable function with xmlNode. – Manish Jan 31 '13 at 08:58
  • 1
    Yes. And you've got to transform your string into a parsable xml structure, and that's what my answer is about. So please read it. – juba Jan 31 '13 at 09:01
2

Get the value of the node with xmlValue and replace. Here I'm going to replace the ABC with DEF:

> top<-xmlRoot(doc)
> top
<Text>
    &lt;p&gt;ABC &lt;/p&gt;
</Text> 
> xmlValue(top)=sub("ABC","DEF",xmlValue(top))
> top
<Text>
    &lt;p&gt;DEF &lt;/p&gt;
</Text> 

The reason I don't try to replace the < is because those character sequences are getting interpreted at some point by the XML code:

> substr(xmlValue(top),6,6)=="<"
[1] TRUE

although I've tried mucking around with some of the options to xmlTreeParse and other XML package functions but I can't seem to stop xmlValue interpreting them...

Spacedman
  • 1
    Abc is just an example, Actually i have one table inplace of ABC. I think we cannot use xmlValue for whole table. – Manish Jan 31 '13 at 08:30
  • 2
    Yes, well what I'm saying is that replacing simple text like that is not a problem, and here is the solution, but your entity-encoded HTML tag angle-brackets (`<` and `>`) aren't becoming what you think they are. – Spacedman Jan 31 '13 at 08:33
  • 1
    I need to replace > with corresponding tags. For that i m replacing it. – Manish Jan 31 '13 at 08:36
  • 2
    The XML package is replacing the encoded less-than and greater-than signs for you. Do you need any more help? – Spacedman Jan 31 '13 at 08:45
  • @Spacedman, I am trying to implement this [https://sharepoint.stackexchange.com/questions/73401/how-to-check-out-a-file-from-sharepoint-document-library-using-curl/75958] to check in a file that I upload to SharePoint using curl from a Shiny app running on an RShiny server. The file name will change, hence I am not able to use a stored XML. Is there a way you would advise me to achieve this? – RanonKahn Sep 06 '17 at 00:33