I have a compressed Freebase data dump that contains all the entities. How can I use grep or another tool to trim the dump so that it contains only English entities?

Here is what I am trying to get the RDF dump to look like: http://play.golang.org/p/-WwSysL3y3

<card>
    <title></title>
    <image></image>
    <text></text>
    <facts>
        <fact></fact>
        <fact></fact>
        <fact></fact>
    </facts>
</card>

Where card is each entity, with content in all of the child elements. Title is the /type/object/name. Image is the image for the topic's mid, built with "https://usercontent.googleapis.com/freebase/v1/image%s\n", id. Text is the /common/document/text for the entity. Facts and its fact children hold the facts such as age, birth date, and height: the facts that show up in the knowledge panels in search.
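The card layout described above could be modelled in Go roughly as follows. This is only a sketch: the `Card` type, `imageURL`, and `renderCard` are names made up for illustration, the id `/m/0example` is a placeholder rather than a real topic, and the id is assumed to start with a slash so that it can be appended directly to the image endpoint.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Card mirrors the <card> layout from the question (hypothetical type name).
type Card struct {
	XMLName xml.Name `xml:"card"`
	Title   string   `xml:"title"`
	Image   string   `xml:"image"`
	Text    string   `xml:"text"`
	Facts   []string `xml:"facts>fact"` // each entry becomes a <fact> inside <facts>
}

const imageBase = "https://usercontent.googleapis.com/freebase/v1/image"

// imageURL appends an id to the image endpoint.
// Assumption: the id already has the slash form, e.g. "/m/...".
func imageURL(id string) string {
	return imageBase + id
}

// renderCard serializes one card as indented XML.
func renderCard(c Card) (string, error) {
	out, err := xml.MarshalIndent(c, "", "    ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// Placeholder values only; none of this is real Freebase data.
	c := Card{
		Title: "Example Topic",
		Image: imageURL("/m/0example"),
		Text:  "A short description of the topic.",
		Facts: []string{"height: 1.8 m", "born: 1970-01-01"},
	}
	s, err := renderCard(c)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(s)
}
```

Using `encoding/xml` rather than hand-written string concatenation means special characters in names and descriptions are escaped automatically.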

Here is my attempt to parse the RDF into XML like this in Go (Golang). I'd appreciate it if someone could help me get the RDF into this form.

Here is the algorithm or logic of what I am trying to do:

For every entity written in English:

    parse the `type/object/name` property and write its value to the XML file in the `<title></title>` element.

    parse the mid, append it to `https://usercontent.googleapis.com/freebase/v1/image`, and then write the result to the XML file in the <image></image> element.

    parse the common/document/text property and write its value to the <text></text> element.

    And lastly, for each fact about the entity, write them to the <fact></fact> elements in the XML file, which are all children of the <facts></facts> element.
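
The grouping part of the steps above might be sketched in Go as follows. It rests on several assumptions that are not confirmed by the dump itself: each line is one tab-separated triple, all triples for one subject are adjacent, the predicates contain `type.object.name` and `common.document.text`, and an entity counts as "English" when it has an English name. The names `entity`, `parseTriple`, `lang`, `literalValue`, `readEntities`, and `entitiesFromString` are invented for this example, escape sequences inside literals are not decoded, and the sample lines in `main` are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// entity collects what the algorithm needs for one subject.
type entity struct {
	ID    string   // subject term as written in the dump
	Title string   // English (or untagged) type.object.name
	Text  string   // English (or untagged) common.document.text
	Facts []string // every other kept triple, as "predicate = object"
}

// parseTriple splits a tab-separated line into subject, predicate, object.
func parseTriple(line string) (s, p, o string, ok bool) {
	f := strings.Split(line, "\t")
	if len(f) < 3 {
		return "", "", "", false
	}
	o = strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(f[2]), "."))
	return f[0], f[1], o, true
}

// lang returns the language tag of a literal such as "x"@fr, or "".
func lang(o string) string {
	i := strings.LastIndex(o, `"@`)
	if i < 0 || strings.ContainsAny(o[i+2:], "\" \t") {
		return ""
	}
	return o[i+2:]
}

// literalValue strips the quotes and any tag or datatype from a literal.
func literalValue(o string) string {
	if i := strings.LastIndex(o, `"`); strings.HasPrefix(o, `"`) && i > 0 {
		return o[1:i]
	}
	return o
}

// readEntities groups adjacent triples by subject and emits every entity
// that ended up with a title.
func readEntities(r io.Reader, emit func(entity)) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // allow long literals
	var cur entity
	flush := func() {
		if cur.Title != "" {
			emit(cur)
		}
	}
	for sc.Scan() {
		s, p, o, ok := parseTriple(sc.Text())
		if !ok {
			continue
		}
		if s != cur.ID {
			flush()
			cur = entity{ID: s}
		}
		if tag := lang(o); tag != "" && tag != "en" {
			continue // literal in another language
		}
		switch {
		case strings.Contains(p, "type.object.name"):
			cur.Title = literalValue(o)
		case strings.Contains(p, "common.document.text"):
			cur.Text = literalValue(o)
		default:
			cur.Facts = append(cur.Facts, p+" = "+o)
		}
	}
	flush()
	return sc.Err()
}

// entitiesFromString is a small helper for trying the reader on a string.
func entitiesFromString(s string) ([]entity, error) {
	var out []entity
	err := readEntities(strings.NewReader(s), func(e entity) { out = append(out, e) })
	return out, err
}

func main() {
	// Made-up sample lines, not real Freebase data.
	sample := "ns:m.0example1\tns:type.object.name\t\"Example Topic\"@en.\n" +
		"ns:m.0example1\tns:type.object.name\t\"Sujet d'exemple\"@fr.\n" +
		"ns:m.0example1\tns:people.person.height_meters\t1.8.\n" +
		"ns:m.0example2\tns:type.object.name\t\"Nur Deutsch\"@de.\n"
	es, err := entitiesFromString(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range es {
		fmt.Printf("%+v\n", e)
	}
}
```

Because only one entity is held in memory at a time, the same loop can in principle run over the full dump; each emitted entity would then be written out as one `<card>` element.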
wordSmith
  • What do you mean by "English" entities? RDF is based on triples, and nodes in an RDF graph are URIs, blank nodes, and literals. Of those, literals can have language tags, so a literal could be considered to be an "English entity". However, literals can't be subjects or predicates in RDF triples, so you can't have an RDF graph consisting of only literals. – Joshua Taylor Sep 16 '14 at 20:01
  • @JoshuaTaylor I mean only English entities, as in entities whose content is in English. It seems as if there are foreign languages when I parse the RDF to XML. – wordSmith Sep 17 '14 at 01:09
  • It's not clear at all what you mean by "entity". RDF is a graph-based data format in which there are triples (labeled edges) of the form [subject predicate object]. Each subject is a URI or a blank node. Each predicate is a URI. Each object is either a URI, a blank node, or a literal. Of all those, the only thing that has a language is a certain kind of literal. You could filter out triples that have objects that are literals with a language tag for a non-English language. Is that what you want to do? – Joshua Taylor Sep 17 '14 at 01:17
  • Really, the best thing to do here would be to show us a sample of the data that you have, and to show what you'd like to turn it into. Just like the close reasons mention, the question should "include the desired behavior, [and] a specific problem or error. Questions without a clear problem statement are not useful to other readers." – Joshua Taylor Sep 17 '14 at 01:19
  • @JoshuaTaylor Thank you for your help in advance and for what you have given me thus far. I've updated the question with what I am trying to make and my code attempting to do so in Go. – wordSmith Sep 17 '14 at 21:53
  • I don't really know what the update is supposed to show. You asked for a grep-based solution. Can you show the actual text input that you're getting, and the actual text output you'd want to get? Assume that no one else knows what you want to do, and be very explicit. You can only program things that you can describe precisely enough to do by hand. So, do a very small sample by hand. – Joshua Taylor Sep 18 '14 at 13:42
  • @JoshuaTaylor I thought I could do this with grep (just extract all the English entities, i.e. entities written in English), but my ultimate goal is to convert the RDF to XML. I'm attempting to do that in Go. I've added my logic to the end of the question. If you want to see the input and output I am currently getting, I can write that too. – wordSmith Sep 18 '14 at 14:16
  • Your question seems to be a moving target. You need to identify some particular technical question, show us the input along with the actual and desired output. – Joshua Taylor Sep 18 '14 at 14:45

1 Answer

I agree with Joshua Taylor that the question is difficult to decipher, because entity is usually a synonym for Freebase object, which may have labels in multiple languages (or no labels/text at all).

If we recast the question as something along the lines of "How do I filter all non-English text from the compressed Freebase dump?," it becomes something that we can actually answer.

In RDF, string literals can carry a language tag, and the text in the Freebase dump is labeled this way, so if we see something like

ns:award.award_winner   rdfs:label      "Lauréat"@fr.

we can tell that Lauréat is the French name for the Freebase type called Award Winner in English.

To filter out non-English labels, use zgrep to drop the lines that contain a language-tagged literal ("@ followed by a language code) unless the tag is "@en. This keeps all the types, properties, numbers, and English labels/descriptions, but it does not exclude objects that lack an English label altogether (another possible interpretation of your question). For that level of filtering, you'll probably need something more powerful than grep.
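
Since the question mentions Go, the same line filter could also be sketched there instead of with zgrep. This is only an illustration under stated assumptions: one triple per line with the object last, gzip compression, and the function names `languageTag`, `keepLine`, `filterEnglish`, and `run` are invented here. Regional variants such as `en-GB` are treated as English, which is a choice rather than something the dump dictates.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"strings"
)

// languageTag returns the language tag of the literal that ends a line
// (for example "fr" for `... "Lauréat"@fr.`), or "" if there is none.
func languageTag(line string) string {
	s := strings.TrimSpace(line)
	s = strings.TrimSpace(strings.TrimSuffix(s, ".")) // drop the terminator
	i := strings.LastIndex(s, `"@`)
	if i < 0 {
		return ""
	}
	tag := s[i+2:]
	for _, r := range tag {
		ok := r == '-' || (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
		if !ok {
			return "" // the `"@` was inside a string, not a tag
		}
	}
	return tag
}

// keepLine reports whether a line survives the filter: untagged lines and
// English-tagged literals are kept, other languages are dropped.
func keepLine(line string) bool {
	tag := strings.ToLower(languageTag(line))
	return tag == "" || tag == "en" || strings.HasPrefix(tag, "en-")
}

// filterEnglish copies the kept lines from r to w.
func filterEnglish(r io.Reader, w io.Writer) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // allow long literals
	bw := bufio.NewWriter(w)
	for sc.Scan() {
		if keepLine(sc.Text()) {
			if _, err := fmt.Fprintln(bw, sc.Text()); err != nil {
				return err
			}
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	return bw.Flush()
}

// run filters the gzipped file named by the first argument, or a tiny
// in-memory sample when no argument is given.
func run(args []string) error {
	if len(args) == 0 {
		sample := "ns:award.award_winner\trdfs:label\t\"Lauréat\"@fr.\n" +
			"ns:award.award_winner\trdfs:label\t\"Award Winner\"@en.\n"
		return filterEnglish(strings.NewReader(sample), os.Stdout)
	}
	f, err := os.Open(args[0])
	if err != nil {
		return err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer zr.Close()
	return filterEnglish(zr, os.Stdout)
}

func main() {
	if err := run(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Like the zgrep approach, this keeps every subject regardless of whether it has an English label; dropping whole objects requires looking at all of a subject's triples together.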

Tom Morris