
I have a very large text file, around 43 GB, which I process to generate other files in different forms, and I don't want to set up any databases or indexing/search engines.

The data is in the .ttl (Turtle) format:

<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1000> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lb.dbpedia.org/resource/Mohandas_Karamchand_Gandhi> .

The target is to generate all combinations from all triples that share the same subject:

For example, for the subject Q1000:

<http://nl.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://en.dbpedia.org/resource/Gabon> .
<http://en.dbpedia.org/resource/Gabon> <http://www.w3.org/2002/07/owl#sameAs> <http://nl.dbpedia.org/resource/Gabon> .
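
Generating the pairs for a single subject is straightforward once its objects are collected; a minimal Scala sketch of just that step (names here are only illustrative):

    // All ordered pairs of objects sharing one subject, emitted as owl:sameAs triples.
    def pairsFor(objects: Seq[String]): Seq[String] =
      for {
        a <- objects
        b <- objects
        if a != b
      } yield s"$a <http://www.w3.org/2002/07/owl#sameAs> $b ."

    pairsFor(Seq(
      "<http://nl.dbpedia.org/resource/Gabon>",
      "<http://en.dbpedia.org/resource/Gabon>"
    )).foreach(println)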

The problem: the dummy code I started with iterates with O(n^2) complexity, where n is the number of lines of the 43 GB text file; needless to say, it would take years to finish.

What I thought of to optimize it:

  1. Loading a HashMap[String, Array[Int]] that indexes, for each key, the line numbers where it appears, and using some library to access the file by line number, for example:

    Q1000 | 1,2,433
    Q1001 | 2334,323,2124

Drawbacks: the index itself could be relatively large, considering that we would also need another index for accessing the file by a specific line number; on top of that overhead, I haven't tested how such access performs.

  2. Making a text file for each key, e.g. Q1000.txt containing all triples whose subject is Q1000, and then iterating over these files one by one to generate the combinations.

Drawbacks: this seems like the fastest and least memory-consuming option, but creating around 10 million files and accessing them will certainly be a problem. Is there an alternative to that?

I'm using Scala scripts for the task.

Hady Elsahar
    There might be a reason for not using a database, but actually you are going to *implement* one. Related question in this [context](http://stackoverflow.com/q/17739973/2390083) – Beryllium Jul 19 '13 at 07:36

2 Answers


Take the 43GB file in chunks that fit comfortably in memory and sort on the subject. Write the chunks separately.

Run a merge sort on the chunks (sorted by subject). It's really easy: you have as input iterators over two files, and you write out whichever input is less, then read from that one again (if there's any left).

Now you just need to make one pass through the sorted data to gather the groups of subjects.

Should take O(n) space and O(n log n) time, which for this sort of thing you should be able to afford.
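
Something like this rough Scala sketch (the chunk size, the file names, and the naive linear-scan merge are only placeholders to illustrate the idea, not tuned code):

    import java.io.{File, PrintWriter}
    import scala.io.Source

    val sameAs = "<http://www.w3.org/2002/07/owl#sameAs>"
    def subjectOf(line: String): String = line.takeWhile(_ != ' ')
    def objectOf(line: String): String  = line.split(" ")(2)

    // 1. Read the big file in chunks, sort each chunk by subject, spill to temp files.
    def writeSortedChunks(input: String, chunkSize: Int = 1000000): Seq[File] =
      Source.fromFile(input).getLines().grouped(chunkSize).map { chunk =>
        val f = File.createTempFile("chunk", ".ttl")
        val out = new PrintWriter(f)
        chunk.toArray.sortBy(subjectOf).foreach(line => out.println(line))
        out.close()
        f
      }.toList

    // 2. Merge the sorted chunks into one stream ordered by subject.
    def merged(chunks: Seq[File]): Iterator[String] = {
      val iters = chunks.map(f => Source.fromFile(f).getLines().buffered)
      Iterator.continually {
        val alive = iters.filter(_.hasNext)
        if (alive.isEmpty) None
        else Some(alive.minBy(it => subjectOf(it.head)).next())
      }.takeWhile(_.isDefined).map(_.get)
    }

    // 3. One pass over the sorted stream: collect consecutive lines with the same
    //    subject and write out all ordered pairs of their objects.
    def emitCombinations(sorted: Iterator[String], out: PrintWriter): Unit = {
      var current = ""
      var objs = List.empty[String]
      def flush(): Unit =
        for (a <- objs; b <- objs if a != b) out.println(s"$a $sameAs $b .")
      for (line <- sorted) {
        val s = subjectOf(line)
        if (s != current) { flush(); current = s; objs = Nil }
        objs ::= objectOf(line)
      }
      flush()
    }

    val out = new PrintWriter("combinations.nt")
    emitCombinations(merged(writeSortedChunks("input.ttl")), out)
    out.close()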

Rex Kerr
  • To do the merge sort at the minimum complexity of O(n log n), this would need a number of files equal to the number of subjects, ~10M files; the maximum number of files that can be created is something near 32K, which is very far from that. – Hady Elsahar Jul 19 '13 at 08:15
  • @HadyElsahar - `n` is the number of lines, not the number of subjects. This strategy is independent of the number of subjects. It's standard divide and conquer, O(n log n), regardless of the number of files. – Rex Kerr Jul 19 '13 at 14:48

A possible solution would be to use some existing map-reduce library. After all, your task is exactly what map-reduce is for. Even if you don't parallelize your computation on multiple machines, the main advantage is that it handles the management of splitting and merging for you.

There is an interesting library, Apache Crunch, with a Scala API. I haven't used it myself, but it looks like it could solve your problem well. Your lines would be grouped according to their subjects, and then the combinations would be generated within each group.
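
Regardless of the concrete library, the shape of the computation is the classic map / group-by-key / reduce pattern; a tiny in-memory sketch of that shape (illustrative only; on its own it would not scale to 43 GB, which is exactly the part the framework takes over):

    // Illustrative only: the map / group-by-key / reduce structure that a
    // map-reduce library would run out of core and possibly in parallel.
    val sameAs = "<http://www.w3.org/2002/07/owl#sameAs>"

    def toKeyValue(line: String): (String, String) = {
      val parts = line.split(" ")
      (parts(0), parts(2))              // map: line -> (subject, object)
    }

    def allCombinations(lines: Iterator[String]): Iterator[String] =
      lines.map(toKeyValue)
        .toSeq
        .groupBy(_._1)                  // group by subject
        .valuesIterator
        .flatMap { group =>             // "reduce": all ordered pairs per group
          val objs = group.map(_._2)
          for (a <- objs; b <- objs if a != b) yield s"$a $sameAs $b ."
        }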

Petr