
First of all, I am sorry if this is a repeated question. I have been trying for several hours and have found solutions for PHP and other languages, but not for R.

I am retrieving data from the last.fm website using their API. You need an API key to retrieve the data I am after, but I will simplify the example here so that you can hopefully answer my question.

Here is my problem: at a certain point while retrieving the data, I hit an error that stops my request. I skipped it once, but it keeps coming back. The message is always the same: PCDATA invalid Char value #

Here is an example:

string = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<lfm status=\"ok\">\n<results for=\"a\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">\n<opensearch:Query role=\"request\" searchTerms=\"a\" startPage=\"1382\" />\n<opensearch:totalResults>212588</opensearch:totalResults>\n<opensearch:startIndex>1381</opensearch:startIndex>\n<opensearch:itemsPerPage>1</opensearch:itemsPerPage><artistmatches>\n<artist>\n    <name>!B0A \0348E09;&gt;2</name>\n                <listeners>1672</listeners>\n                <mbid></mbid>\n                        <url>http://www.last.fm/music/!B0A+%1C8E09;%3E2</url>\n    <streamable>0</streamable>\n            <image size=\"small\">http://userserve-ak.last.fm/serve/34/88015017.png</image>\n        <image size=\"medium\">http://userserve-ak.last.fm/serve/64/88015017.png</image>\n        <image size=\"large\">http://userserve-ak.last.fm/serve/126/88015017.png</image>\n        <image size=\"extralarge\">http://userserve-ak.last.fm/serve/252/88015017.png</image>\n        <image size=\"mega\">http://userserve-ak.last.fm/serve/_/88015017/B0A+8E092+15286997.png</image>\n    </artist></artistmatches>\n</results></lfm>\n"

When I try to parse this text I get the error:

doc = xmlParse(string, asText = TRUE)
PCDATA invalid Char value 28
Error: 1: PCDATA invalid Char value 28

I believe the error comes from this part of the string:

<name>!B0A \0348E09;&gt;2</name>\n 

But I can't be sure now.

What I am looking for is one of these solutions. The first would be ideal, but any of the others would make me happy:

1 - Allow R to receive these invalid characters

2 - Eliminate the invalid characters and continue with the parse without stopping.

3 - Skip the string with the invalid characters and continue with the parse

4 - Create a function to find the invalid characters so I can include that when retrieving the data from last.fm

I hope you can understand the question and help me with it. Thanks in advance

JMarchante
  • I don't know about R. But I do know about the Last.fm API (I maintain a C# SDK). Can you post the URL of your request so that I can try for myself? It looks like an encoding issue in the response (does R support UTF-8?), testing on a different platform will confirm. – rikkit Oct 29 '14 at 11:10

1 Answer


You are right. The artist name contains a character that is illegal in XML.

Try this out:

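    library(XML)

    # Matches any character outside the ranges treated here as legal XML text:
    # tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD.
    # Note: this also matches characters above U+FFFF, which XML 1.0 does allow.
    illegal <- "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"

    # Remove every character matched by `illegal`.
    utf8_for_xml <- function(x) {
        gsub(illegal, "", x)
    }

    string_formatted <- utf8_for_xml(string)

    xmlParse(string_formatted, asText = TRUE)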
<?xml version="1.0" encoding="utf-8"?>
<lfm status="ok">
  <results xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" for="a">
    <opensearch:Query role="request" searchTerms="a" startPage="1382"/>
    <opensearch:totalResults>212588</opensearch:totalResults>
    <opensearch:startIndex>1381</opensearch:startIndex>
    <opensearch:itemsPerPage>1</opensearch:itemsPerPage>
    <artistmatches>
      <artist>
        <name>!B0A 8E09;&gt;2</name>
        <listeners>1672</listeners>
        <mbid/>
        <url>http://www.last.fm/music/!B0A+%1C8E09;%3E2</url>
        <streamable>0</streamable>
        <image size="small">http://userserve-ak.last.fm/serve/34/88015017.png</image>
        <image size="medium">http://userserve-ak.last.fm/serve/64/88015017.png</image>
        <image size="large">http://userserve-ak.last.fm/serve/126/88015017.png</image>
        <image size="extralarge">http://userserve-ak.last.fm/serve/252/88015017.png</image>
        <image size="mega">http://userserve-ak.last.fm/serve/_/88015017/B0A+8E092+15286997.png</image>
      </artist>
    </artistmatches>
  </results>
</lfm>
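
If you would rather skip a bad response than clean it (option 3 in your question), one possibility is to wrap the parse in tryCatch. This is only a minimal sketch: `safe_parse` and its `parser` argument are names I made up for illustration, and it assumes the parse failure reaches R as an ordinary error, as in the message you posted.

```r
# Hypothetical helper: returns NULL instead of stopping when parsing fails.
# `parser` defaults to XML::xmlParse; it is an argument only so that the
# helper can be tried out with another function.
safe_parse <- function(x, parser = XML::xmlParse) {
  tryCatch(
    parser(x, asText = TRUE),
    error = function(e) {
      # Report the problem and carry on with the next response.
      message("Skipping unparseable response: ", conditionMessage(e))
      NULL
    }
  )
}
```

With a list of raw responses (say `responses`, also a made-up name), `Filter(Negate(is.null), lapply(responses, safe_parse))` would keep only the documents that parsed. The drawback is that the whole response is dropped, so stripping the offending characters is usually the less lossy choice.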

Extra:

Let's find out which character in your string object is illegal in XML.

The function gregexpr returns the position of the matching character:

 gregexpr(illegal, string)
[1] 403
attr(,"match.length")
[1] 1

Using the "Unicode" package, look up the code point at that position:

library(Unicode)
unicode_string <- as.u_char(utf8ToInt(string))
unicode_string[403]

[1] U+001C


U+001C is the control character "Information Separator Four" (decimal 28, which is the value reported in the error message). XML 1.0 does not allow it, so the parser rejects the document.
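
For option 4 in your question, the two steps above can be combined into one base R function, without the Unicode package. Treat this as a rough sketch: `find_illegal` is a name I invented, and it reuses the same `illegal` pattern, so it shares that pattern's limits.

```r
# Same pattern as above: anything outside tab, LF, CR,
# U+0020-U+D7FF and U+E000-U+FFFD is reported.
illegal <- "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"

# Hypothetical helper: lists the position and code point of every
# character in a single string `x` that matches `illegal`.
find_illegal <- function(x) {
  pos <- gregexpr(illegal, x)[[1]]
  if (pos[1] == -1) {
    # gregexpr returns -1 when nothing matches.
    return(data.frame(position = integer(0), codepoint = character(0),
                      stringsAsFactors = FALSE))
  }
  code_points <- utf8ToInt(x)[pos]
  data.frame(position = as.integer(pos),
             codepoint = sprintf("U+%04X", code_points),
             stringsAsFactors = FALSE)
}

find_illegal("ab\034cd")
```

An empty data frame means the string is clean under this pattern, so you could call it on each response before parsing and log the offending artists.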