Query

select distinct ?abstract where {
      [ rdfs:label "Rome"@en ;
        dbpedia-owl:abstract ?abstract ]
      filter langMatches(lang(?abstract),'en')
    }

Output:

Rome (/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma) is a city and special comune (named "Roma Capitale") in Italy…

How can I remove "(/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma)", which contains unreadable characters (i.e., the pronunciation guide)?

I got the query from the link.

sai
  • What do you mean unreadable? Those are pronunciation guides, and they are part of the abstract. Your query asks for the article abstract, after all... – Joshua Taylor Oct 07 '14 at 03:49
  • Thank you for responding. I want the abstract, but is there an option to remove those pronunciation guides? – sai Oct 07 '14 at 16:06
  • Well, if you want to do it programmatically, you'd need to determine how to detect the stuff you want to remove. Can you precisely specify what you want to remove? You could, for instance, remove any text in parentheses. That might grab more than what you want, but it would probably do the trick. – Joshua Taylor Oct 07 '14 at 16:09

1 Answer

You could use a query like this to remove text in parentheses:

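select ?abstract ?cleanAbstract where {
  values ?x { dbpedia:Rome }

  ?x dbpedia-owl:abstract ?abstract
  filter langMatches(lang(?abstract),'en')

  # Remove each "(...)" group: an opening parenthesis, any run of characters
  # other than ")", then the closing parenthesis.
  bind( replace( str(?abstract), '\\([^)]*\\)', "" ) as ?cleanAbstract )
}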

SPARQL results

?abstract: Rome (/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma) is a city and special comune (named "Roma Capitale") in Italy. Rome is the capital of Italy and also of the Province of Rome and of the region of Lazio. With 2.7 million residents in 1,285.3 km2 (496.3 sq mi), it is also the country's largest …

?cleanAbstract: Rome is a city and special comune in Italy. Rome is the capital of Italy and also of the Province of Rome and of the region of Lazio. With 2.7 million residents in 1,285.3 km2 , it is also the country's largest …

Of course, pronunciations are not the only thing found in parentheses. In the result above, the note (named "Roma Capitale") and the area in square miles (496.3 sq mi) were removed as well. However, if abstracts follow the general convention that text in parentheses can be removed without altering the essential content of the text, this might work for you. You can, of course, improve the regular expression to handle the spaces around the parentheses better (note the stray space left in "km2 ,"), or to remove only the parentheses that contain some "non-typical" characters, if you can define what those are.
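As a rough illustration of those two refinements, here is a minimal Python sketch that post-processes an abstract after it has been fetched, rather than doing the work inside the SPARQL query. It rests on assumptions that are not part of the original answer: "non-typical" is taken to mean "contains at least one non-ASCII character", nested parentheses are not handled, and the names `strip_all_parens` and `strip_non_ascii_parens` are hypothetical.

```python
import re

# Any parenthesised group, together with the whitespace just before it,
# so that no stray space is left behind (e.g. before a comma).
ANY_PARENS = re.compile(r"\s*\([^()]*\)")

# Only parenthesised groups containing at least one non-ASCII character.
# Assumption: non-ASCII is a usable proxy for "pronunciation guide".
NON_ASCII_PARENS = re.compile(r"\s*\([^()]*[^\x00-\x7F()][^()]*\)")


def strip_all_parens(text):
    """Remove every (non-nested) parenthesised group."""
    return ANY_PARENS.sub("", text)


def strip_non_ascii_parens(text):
    """Remove only the groups that contain a non-ASCII character."""
    return NON_ASCII_PARENS.sub("", text)


# Shortened excerpt of the abstract shown above.
abstract = (
    "Rome (/ˈroʊm/; Italian: Roma pronounced [ˈroːma] ; Latin: Rōma) is a city "
    'and special comune (named "Roma Capitale") in Italy. With 2.7 million '
    "residents in 1,285.3 km2 (496.3 sq mi), it is also the country's largest"
)

print(strip_all_parens(abstract))
print(strip_non_ascii_parens(abstract))
```

The second function keeps (named "Roma Capitale") and (496.3 sq mi) because they contain only ASCII characters, while the pronunciation guide is dropped. A parenthetical that merely contains an accented name would be removed too, so the character test may need tightening for real data.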

Joshua Taylor