How to construct my own dataset for the TransE algorithm from a specific knowledge graph

Question

I am building a knowledge graph of Chinese stocks and want to build a news recommendation system on top of it. I want to use the TransE algorithm for the entity and relation embeddings, but I do not have a dataset and do not know how to build one from my own knowledge graph.

score 0 · Answer 1 · answered Oct 01 '19 at 17:50

One starting point would be to use data from Wikidata. It has some information on Chinese companies (I suppose you are referring to companies listed on Chinese stock exchanges). For instance, https://www.wikidata.org/wiki/Q831445 displays information about Sinopec.

The data from Wikidata can be retrieved through the API, downloaded as large dump files from https://dumps.wikimedia.org/wikidatawiki/ or queried via the SPARQL endpoint at https://query.wikidata.org/.

You can get a list of companies listed on the Shenzhen Stock Exchange with the SPARQL query:

SELECT 
  ?company ?companyLabel
  ?industry ?industryLabel
{
  ?company wdt:P414 wd:Q517750 .
  OPTIONAL { ?company wdt:P452 ?industry }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,zh". }
}

The result is also available at https://w.wiki/9DM . It can be extended by modifying the query, and it can be downloaded in various formats. With the SPARQL DESCRIBE keyword you can get the data as triples, a format that may be useful for the TransE algorithm, e.g., DESCRIBE wd:Q831445 with the result at https://w.wiki/9DN .
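As a rough sketch of how the query result could be turned into a TransE-style dataset: the code below reads the standard SPARQL JSON result format, builds (head, relation, tail) triples from the ?company and ?industry columns, and writes them as tab-separated lines. The function names, the User-Agent string, the output file name and the tab-separated layout are my own assumptions; check which file format your TransE implementation expects.

```python
import csv
import json
import urllib.parse
import urllib.request

WDQS_URL = "https://query.wikidata.org/sparql"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

# Same pattern as the query above, without the label service.
QUERY = """
SELECT ?company ?industry {
  ?company wdt:P414 wd:Q517750 .
  OPTIONAL { ?company wdt:P452 ?industry }
}
"""


def fetch_bindings(query=QUERY):
    """Run the query against the public endpoint (needs network access)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    # Placeholder User-Agent (assumption): replace it with your own tool name/contact.
    request = urllib.request.Request(
        WDQS_URL + "?" + params,
        headers={"User-Agent": "transe-dataset-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def short_id(uri):
    """Strip the Wikidata entity prefix, e.g. '.../entity/Q831445' -> 'Q831445'."""
    return uri[len(ENTITY_PREFIX):] if uri.startswith(ENTITY_PREFIX) else uri


def bindings_to_triples(bindings):
    """Convert result rows into sorted, de-duplicated (head, relation, tail) triples."""
    triples = set()
    for row in bindings:
        company = short_id(row["company"]["value"])
        # Every row matched the 'listed on Q517750' pattern in the query.
        triples.add((company, "P414", "Q517750"))
        # ?industry is OPTIONAL, so it may be missing from a row.
        if "industry" in row:
            triples.add((company, "P452", short_id(row["industry"]["value"])))
    return sorted(triples)


def build_id_map(names):
    """Assign a consecutive integer id to each distinct entity or relation name."""
    return {name: index for index, name in enumerate(sorted(set(names)))}


def write_triples(triples, path):
    """Write one tab-separated 'head relation tail' line per triple."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t").writerows(triples)


if __name__ == "__main__":
    triples = bindings_to_triples(fetch_bindings())
    entity2id = build_id_map([t[0] for t in triples] + [t[2] for t in triples])
    relation2id = build_id_map([t[1] for t in triples])
    write_triples(triples, "triples.tsv")
    print(len(triples), "triples,", len(entity2id), "entities,", len(relation2id), "relations")
```

You would then split the triples into train/validation/test files yourself. Adding more properties to the query gives you more relation types, which TransE generally needs to learn anything useful.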

It is also possible to process the large dump files and build a knowledge graph embedding with Gensim's Word2Vec; see "Wembedder: Wikidata entity embedding web service" at https://arxiv.org/abs/1710.04099 . You can explore one result of this approach with the Wembedder webapp, e.g., https://tools.wmflabs.org/wembedder/most-similar/Q51747 displays the result of a "most similar" query for Air China in the knowledge graph embedding.
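For illustration only, here is a minimal sketch of feeding triples to Word2Vec. Treating each triple as a three-token "sentence" is one simple way to do it and is an assumption on my part, not a description of the exact Wembedder pipeline. The function names and hyperparameters are hypothetical, and the Word2Vec arguments follow Gensim 4.x.

```python
def triples_to_sentences(triples):
    """Turn each (head, relation, tail) triple into a three-token 'sentence'."""
    return [[head, relation, tail] for head, relation, tail in triples]


def train_embedding(sentences, dimensions=100):
    """Train a skip-gram Word2Vec model on the triple sentences."""
    # Imported here so the helper above can be used without Gensim installed.
    from gensim.models import Word2Vec

    # Gensim 4.x parameter names; Gensim 3.x used `size` instead of `vector_size`.
    return Word2Vec(
        sentences=sentences,
        vector_size=dimensions,
        window=2,      # covers all three tokens of a triple
        min_count=1,   # keep entities that occur only once
        sg=1,          # skip-gram
    )


if __name__ == "__main__":
    # Toy triples (hypothetical); in practice these come from the dump or a query.
    toy = [("Q1", "P414", "Q517750"), ("Q3", "P414", "Q517750")]
    model = train_embedding(triples_to_sentences(toy), dimensions=10)
    print(model.wv.most_similar("Q1"))
```

Note that this is not TransE: Word2Vec does not model the head + relation ≈ tail translation, so treat it as a separate, simpler way to get entity vectors.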
