I have a very large text file around 43GB which I use to process them to generate another files in different forms. and i don't want to setup any databases or any indexing search engines
the data is in the .ttl format
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .
target is generating all combinations from all triples who share same subject:
for example for the subject Q1000 :
<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
the problem: the Dummy code to start with is iterating with complexity O(n^2) where n is the number of lines of the 45GB text file ,needless to say that it would take years to do so.
what i thought of to optimize :
loading a HashMap [String,IntArray] for indexing lines of appearance each key and using any library to access the file by line number for example:
Q1000 | 1,2,433
Q1001 | 2334,323,2124
drawbacks is that the index could be relatively large as well , considering that we will have another index for the access with specific line number , plus the overloaded i didnt try the performance of the
- making a text file for each key like
Q1000.txt
for all triples contains subjectQ1000
and iterating over them one by one and making combinations
drawbacks : this seems the fastest one and least memory consuming but certainly creating around 10 million files and accessing them will be a problem , is there and alternative for that ?
i'm using scala
scripts for the task