I need to merge 1000+ .ttl files into a single file database. How can I merge them while filtering the data in the source files, so that only the data I need ends up in the target file?

Thanks

Aram
  • the phrase *"merge specific classes"* can't be answered given that an RDF dataset is a set of triples, so you first have to define which triples to keep in the merged dataset. And don't answer with "related" - that's also not a precise answer. You have to be as specific as possible, otherwise you can't create a procedure that does the filtering. – UninformedUser Mar 15 '19 at 12:15

1 Answer

There are a number of options, but the simplest is probably to use a Turtle parser to read all the files, and have that parser pass its output to a handler that does the filtering before in turn passing the data to a Turtle writer.

Something like this would probably work (using RDF4J):

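  // Needs imports from java.io, org.eclipse.rdf4j.model and org.eclipse.rdf4j.rio.
  // outFile must be an OutputStream or Writer, e.g. a FileOutputStream.
  RDFWriter writer = Rio.createWriter(RDFFormat.TURTLE, outFile);

  writer.startRDF();
  for (File file : inputFiles) { // inputFiles: placeholder for your 1000+ input files
      // close each input stream as soon as its file has been parsed
      try (InputStream in = new FileInputStream(file)) {
          Model data = Rio.parse(in, "", RDFFormat.TURTLE);
          for (Statement st : data) {
              if (keep(st)) { // keep(...): placeholder for your own filter condition
                  writer.handleStatement(st);
              }
          }
      }
  }
  writer.endRDF();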
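
The condition that decides which statements to keep is left open above. As a minimal, library-free sketch of one possible condition, assume the goal is simply "keep only triples whose predicate is on a wanted list". The `Triple` class, the `keep` and `filter` methods, and the `http://example.org/` IRIs below are all hypothetical stand-ins for illustration; with RDF4J the same idea would probably be a `Set<IRI>` checked against `st.getPredicate()`.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StatementFilterSketch {

    /** Hypothetical stand-in for an RDF statement (not an RDF4J class). */
    public static final class Triple {
        public final String subject;
        public final String predicate;
        public final String object;

        public Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Returns true if the triple's predicate is one of the wanted predicates. */
    public static boolean keep(Triple t, Set<String> wantedPredicates) {
        return wantedPredicates.contains(t.predicate);
    }

    /** Applies keep() to every triple, preserving the input order. */
    public static List<Triple> filter(List<Triple> input, Set<String> wantedPredicates) {
        return input.stream()
                .filter(t -> keep(t, wantedPredicates))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example IRIs are placeholders, not taken from any real vocabulary.
        List<Triple> input = List.of(
                new Triple("http://example.org/a", "http://example.org/p1", "x"),
                new Triple("http://example.org/b", "http://example.org/p2", "y"));
        Set<String> wanted = Set.of("http://example.org/p1");

        for (Triple t : filter(input, wanted)) {
            System.out.println(t.subject + " " + t.predicate + " " + t.object);
        }
    }
}
```

Filtering on the predicate alone is only one choice; a filter that depends on the type of the subject needs information from other triples, which is where the repository-based approach is likely to be easier.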

Alternatively, just load all the files into an RDF Repository, and use SPARQL queries to get the data out and save to an output file, or if you prefer: use SPARQL updates to remove the data you don't want before exporting the entire repository to a file.

Something along these lines (again using RDF4J):

 Repository rep = ... // your RDF repository, e.g. an in-memory store or native RDF database

 try (RepositoryConnection conn = rep.getConnection()) {

    // load all files into the database
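    for (File file : inputFiles) { // inputFiles: placeholder for your input files
        conn.add(file, "", RDFFormat.TURTLE);
    }

    // do a SPARQL update to remove all triples whose subject is an instance of ex:Foo
    // (the ex: namespace IRI is a placeholder; substitute your own)
    conn.prepareUpdate(
        "PREFIX ex: <http://example.org/> " +
        "DELETE WHERE { ?s a ex:Foo; ?p ?o }").execute();

    // export to file (outFile must be an OutputStream or Writer)
    conn.export(Rio.createWriter(RDFFormat.TURTLE, outFile));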
 } finally {
    rep.shutDown(); 
 } 

Depending on the amount of data / the size of your files, you may need to extend this basic setup a bit (for example by using transactions instead of just letting the connection auto-commit). But you get the general idea, hopefully.
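
For the "use SPARQL queries to get the data out" variant, the query has to produce triples, so a CONSTRUCT query is the natural fit. The following is a minimal sketch, using only the Java standard library, of building such a query for a list of predicates to keep; the class name, the method name and the `http://example.org/` IRIs are assumptions made for illustration. In RDF4J, a string like this could then be passed to `conn.prepareGraphQuery(...)` and evaluated with an RDF writer as the handler, rather than calling `conn.export(...)`, which writes the entire repository.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ConstructQuerySketch {

    /**
     * Builds a SPARQL CONSTRUCT query that copies only those triples
     * whose predicate is one of the given IRIs.
     */
    public static String buildConstructQuery(List<String> predicateIris) {
        if (predicateIris.isEmpty()) {
            throw new IllegalArgumentException("at least one predicate IRI is required");
        }
        // Wrap each IRI in angle brackets and join them for the IN (...) list.
        String inList = predicateIris.stream()
                .map(iri -> "<" + iri + ">")
                .collect(Collectors.joining(", "));
        return "CONSTRUCT { ?s ?p ?o }\n"
                + "WHERE {\n"
                + "  ?s ?p ?o .\n"
                + "  FILTER (?p IN (" + inList + "))\n"
                + "}";
    }

    public static void main(String[] args) {
        // Placeholder IRIs; replace with the predicates you actually want to keep.
        System.out.println(buildConstructQuery(
                List.of("http://example.org/p1", "http://example.org/p2")));
    }
}
```

Passing full IRIs avoids having to declare prefixes in the query; if you prefer prefixed names, the PREFIX declarations need to be prepended to the query string.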

Jeen Broekstra
  • I only just now noticed that this question was tagged with 'jena', apologies. I'm reasonably certain the above approaches can be done using Jena instead of RDF4J as well. – Jeen Broekstra Mar 15 '19 at 04:54
  • To be more specific about the problem: I need to merge all the .ttl files from this database https://github.com/DOREMUS-ANR/knowledge-base/blob/master/data/philharmonie/pp.works.tar.gz while keeping only the classes F22 and F28. So at the end I will have one .ttl file containing only the mentioned classes and no other information. How can I do it? The final result is what matters to me; whether it is done with Jena or RDF4J does not matter much. I prefer the simplest variant. – Aram Mar 15 '19 at 09:57
  • @Aram I think that's probably doable using one of the approaches I mentioned above. You should edit your original question to clarify what specific problem you're facing and where you're stuck. Have a look at the "How to Ask" help page, which gives useful tips on how to write a question that has a good chance of getting a useful answer. – Jeen Broekstra Mar 15 '19 at 11:10
  • @Aram The question is what *"with keeping only the classes F22 and F28"* means. RDF is made of triples, so it remains unclear which triples you want to have in the merged dataset ... – UninformedUser Mar 15 '19 at 12:13
  • I need to create a single file database by merging the data from different files, keeping only the data/triples I need. For the database I mentioned, I need the following data/triples merged into one file: efrbroo:F28_Expression_Creation ecrm:P9_consists_of ecrm:E7_Activity ecrm:P14_carried_out_by ecrm:E21_Person ecrm:P131_is_identified_by mus:U31_had_function efrbroo:F11_Corporate_Body . My problem is how to construct a query that merges all the .ttl files into one .ttl file while keeping only the mentioned data/triples. – Aram Mar 15 '19 at 19:51
  • Thank you very much for the replies and sorry for lot of questions. I am new to RDF and I need to understand how to manipulate them – Aram Mar 15 '19 at 20:10
  • @Aram I suggest your best way forward is to give it a try using one of the approaches above. Then, if you get stuck, you can ask a new, more specific question about the precise bit you're struggling with. – Jeen Broekstra Mar 17 '19 at 02:11
  • @JeenBroekstra Thanks a lot for the help. I have opened a new question with code compilation problems here: https://stackoverflow.com/questions/55207723/rdf4j-ttl-files-code-compilation-problems – Aram Mar 17 '19 at 13:47
  • @JeenBroekstra I am trying to run the repository version, but instead of the query result, the new file ends up containing the full content of the input files. I used a SELECT statement. Here is the query I am running: – Aram Mar 25 '19 at 15:21
  • conn.prepareQuery("PREFIX ecrm: \n" + "PREFIX foaf: <" + FOAF.NAMESPACE + "> \n" + "PREFIX mus: \n" + "SELECT ?artist ?name " + "WHERE { " + "?func mus:U31_had_function []; " + "ecrm:P14_carried_out_by ?artist . " + "?artist foaf:name ?name" + " }"); conn.export(Rio.createWriter(RDFFormat.TURTLE, outFile)); } – Aram Mar 25 '19 at 15:22
  • How can I get the result of my SELECT query with repo version? Thanks – Aram Mar 25 '19 at 15:23
  • 'try (RepositoryConnection conn = db.getConnection()) { File dir = new File("ttl/"); String[] fileNames = dir.list(); for (String file : fileNames) { File f = new File(dir, file); conn.add(f, "", RDFFormat.TURTLE); String queryString = "..."; TupleQuery query = conn.prepareTupleQuery(queryString); try (TupleQueryResult result = query.evaluate()) { while (result.hasNext()){ BindingSet solution = result.next(); SOP("?artist =" + solution.getValue("artist")); SOP("?name = " + solution.getValue("name"));} }finally{ db.shutDown();} ' – Aram Mar 25 '19 at 19:21
  • @Aram I suggest that you look at the rdf4j tutorials and the javadoc and try to understand how the code works. Go over it line by line. Hint: your use of the `export` method in combination with a query here is incorrect. – Jeen Broekstra Mar 25 '19 at 20:30