
I've got a 279MB file that contains ~10 million key/value pairs, with ~500,000 unique keys. It's grouped by key (each key only needs to be written once), so all the values for a given key are together.

What I want to do is transpose the association: create a file where the pairs are grouped by value, with all the keys for a given value stored together.
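For example (format made up just for illustration), an input like

k1: v1, v2
k2: v1

should come out as

v1: k1, k2
v2: k1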

Currently, I'm using Parsec to read in the pairs as a list of tuples (K,[V]) (using lazy IO so I can process it as a stream while Parsec is processing the input file), where:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Applicative ((<$>))
import Data.Functor.Identity (runIdentity)
import Data.Hashable (Hashable)
import Data.Text.Lazy (Text)
import Data.Text.Lazy.IO (readFile)
import Prelude hiding (readFile)
import Text.Parsec
import Text.Parsec.Text.Lazy (Parser)

newtype K = K Text deriving (Show, Eq, Ord, Hashable)
newtype V = V Text deriving (Show, Eq, Ord, Hashable)

tupleParser :: Parser (K,[V])
tupleParser = ...

-- a lazy list of results that can terminate with a parse error
data ErrList e a = Cons a (ErrList e a) | End | Err e

parseAllFromFile :: Parser a -> FilePath -> IO (ErrList ParseError a)
parseAllFromFile parser inputFile = do
  contents <- readFile inputFile
  let Right initialState = parse getParserState inputFile contents
  return $ loop initialState
  where loop state = case unconsume $ runParsecT parser' state of
                        Error err             -> Err err
                        Ok Nothing _ _        -> End
                        Ok (Just a) state' _  -> a `Cons` loop state'
        -- peel off the Consumed and Identity wrappers to get at the Reply
        unconsume v = runIdentity $ case runIdentity v of
                                      Consumed ma -> ma
                                      Empty ma -> ma
        parser' = (Just <$> parser) <|> (const Nothing <$> eof)

I've tried to insert the tuples into a Data.HashMap.Map V [K] to transpose the association:

import qualified Data.HashMap as M   -- Map type from the hashmap package
import qualified Data.List as L (foldl')

transpose :: ErrList ParseError (K,[V]) -> Either ParseError [(V,[K])]
transpose = transpose' M.empty
  where transpose' _ (Err e)          = Left e
        transpose' m End              = Right $ M.assocs m
        transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs
        include k m v = M.insertWith (const (k:)) v [k] m
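
Tying the pieces together, the intended usage is something like this (file name made up):

main :: IO ()
main = do
  pairs <- parseAllFromFile tupleParser "input.txt"
  case transpose pairs of
    Left err  -> print err
    Right kvs -> mapM_ print kvs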

But when I tried it, I got the error:

memory allocation failed (requested 2097152 bytes)

I can think of a couple of things I might be doing wrong:

  1. 2MB seems a bit low (considerably less than the 2GB RAM my machine has installed), so maybe I need to tell GHC it's ok to use more?
  2. My problems could be algorithmic/data structure related. Maybe I'm using the wrong tools for the job?
  3. My attempt to use lazy IO could be coming back to bite me.

I'm leaning toward (1) for now, but I'm not sure by any means.
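
If it is (1), I assume the relevant knobs are GHC's RTS options, e.g. building with -rtsopts and then running with something like (executable name made up):

$ ghc -O2 -rtsopts transpose.hs
$ ./transpose +RTS -M1800m -sstderr -RTS input.txt

where -M sets the maximum heap size and -sstderr prints a GC summary on exit.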

rampion
  • What is the memory footprint of the application when this error occurs? – asm Dec 06 '12 at 02:04
  • Andrew Myers: just tried `+RTS -sstderr -RTS` - it works when I `C-c` to stop the executable early, but if I wait for the error to occur, it doesn't print the profiling info. Any suggestions on how to report the memory footprint at time of error? – rampion Dec 06 '12 at 02:34
  • 4
    I don't think (1) is the problem. Yes, it fails trying to allocate 2MB, but that's likely because this is a new chunk - i.e. it has already allocated all available memory and these 2MB are what makes the entire thing go belly up. – us2012 Dec 06 '12 at 07:52
  • 7
    (2) If you only need the transposed map, why not create _it_ as you read the file instead of afterwards? – AndrewC Dec 06 '12 at 08:54
  • (2) If it's the parser from your other question, and it does what you want, then you could [simplify it dramatically](http://stackoverflow.com/questions/13670340/is-this-idiomatic-use-of-text-parsec#comment18839628_13675224) or change it to do more thorough checking. – AndrewC Dec 06 '12 at 08:55
  • As Andrew said, unless you need both maps (but with 2GB RAM, that could be too much anyway, there's quite a bit of overhead), you should create the transposed map directly. And `transpose' m (Cons (k,vs) xs) = transpose' (L.foldl' (include k) m vs) xs` will build up huge thunks, since `transpose'` is not strict in the `Map` argument, so the `foldl'` will not be evaluated until the end. Making `transpose'` strict in the `Map` argument may suffice, but maybe more is needed. – Daniel Fischer Dec 06 '12 at 11:40
  • Daniel Fischer: I thought I was only creating one map. I'm attempting to just read one `(K,[V])` from the file at a time and insert it into my map, just with an `ErrList` acting as a stream in between so I can separate concerns. But maybe I'm failing at that. – rampion Dec 06 '12 at 12:26
  • @rampion I was just wondering what you see in whatever process monitor you have on your system. What's the RES column in top show for example? – asm Dec 06 '12 at 13:51
  • Daniel Fischer: Well, making `transpose'` strict in its map argument made it take much longer until I reached the memory failure. – rampion Dec 06 '12 at 20:15
  • Andrew Myers: time to failure is several hours, which is a bit more than I want to spend watching top :) – rampion Dec 06 '12 at 22:14
  • Ok, I added a progress bar so I could see how it was doing and let it run for a couple hours until it had processed ~25% of the data and had frozen up pretty well. [The code, profiling results, and run log are here](https://gist.github.com/4237357). I do seem to be using quite a bit of memory, but I don't know how to tell how much of that is in the `HashMap` and how much of that is wasted thunks. – rampion Dec 07 '12 at 23:11
  • 3
    If you could specify the file format I could take a stab at it. – tibbe Dec 08 '12 at 15:23
  • Need more information on the input file. – Don Stewart Feb 24 '13 at 12:32
  • @DonStewart: [I `tar`'ed and `gzip`'ed it, and uploaded it here](https://docs.google.com/file/d/0B6-U1uFbN8Ckck9EZXNQc25DeDA/edit?usp=sharing). It's `category` on a line, followed by zero or more lines of `rank. url`. I give a little more detail [in my code](https://gist.github.com/rampion/4237357). – rampion Feb 25 '13 at 01:07
  • 3
    You know that in order to rotate an object, there are 2 methods, either rotate the object, or rotate your perspective .. think about that – Khaled.K Apr 07 '13 at 17:01

2 Answers


Is there a possibility that the data will grow? If so, I'd suggest not reading the whole file into memory, and processing the data in another way instead.

One simple possibility is to use a relational database for this. It would be fairly easy: just load your data in, create a proper index, and read it back sorted as you need. The database will do all the work for you. I'd definitely recommend this.
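
For example, with SQLite through the sqlite-simple package, a rough sketch could look like this (database, table, and column names made up):

{-# LANGUAGE OverloadedStrings #-}

import Database.SQLite.Simple

main :: IO ()
main = do
  conn <- open "pairs.db"
  execute_ conn "CREATE TABLE IF NOT EXISTS pairs (k TEXT, v TEXT)"
  -- insert each (k,v) pair as it is parsed, e.g.
  --   execute conn "INSERT INTO pairs (k,v) VALUES (?,?)" (k, v)
  execute_ conn "CREATE INDEX IF NOT EXISTS by_value ON pairs (v)"
  -- read the pairs back ordered (hence grouped) by value
  rows <- query_ conn "SELECT v, k FROM pairs ORDER BY v" :: IO [(String, String)]
  mapM_ print rows
  close conn

(For 10 million rows you would stream the results with fold_ rather than materialising the whole list, but the idea is the same.)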


Another option would be to create your own file-based mechanism. For example:

  1. Choose some limit l that is reasonable to load into memory.
  2. Create n = d `div` l files, where d is the total amount of your data. (Hopefully this will not exceed your file descriptor limit. You could also close and reopen files after each operation, but this will make the process very slow.)
  3. Process the input sequentially and place each pair (k,v) into file number hash v `mod` n. This ensures that the pairs with the same value v will end up in the same file (see the sketch below).
  4. Process each file separately.
  5. Merge them together.

It is essentially a hash table with file buckets. This solution assumes that each value has roughly the same number of keys (otherwise some files could get exceptionally large).
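
A minimal sketch of step 3, assuming the pairs arrive as a flat stream of (key, value) strings and that n stays below the file-descriptor limit:

import Data.Hashable (hash)
import System.IO

partitionPairs :: Int -> [(String, String)] -> IO ()
partitionPairs n pairs = do
  handles <- mapM (\i -> openFile ("bucket-" ++ show i) WriteMode) [0 .. n-1]
  -- same value => same hash => same bucket, so each bucket can later
  -- be transposed independently in memory
  mapM_ (\(k, v) -> hPutStrLn (handles !! (hash v `mod` n)) (v ++ "\t" ++ k)) pairs
  mapM_ hClose handles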


You could also implement an external sort, which would allow you to sort basically any amount of data.
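
A sketch of the two phases (sort fixed-size chunks, then merge the sorted runs), with made-up chunk-file naming:

import Data.List (sort)
import System.IO

-- phase 1: split the input into chunks of at most chunkSize lines,
-- sort each chunk in memory, and write it out as a sorted run
splitSorted :: Int -> FilePath -> IO [FilePath]
splitSorted chunkSize input = do
  contents <- readFile input
  mapM writeChunk (zip [0 :: Int ..] (chunksOf chunkSize (lines contents)))
  where
    chunksOf _ [] = []
    chunksOf n xs = let (a, b) = splitAt n xs in a : chunksOf n b
    writeChunk (i, chunk) = do
      let name = "chunk-" ++ show i
      writeFile name (unlines (sort chunk))
      return name

-- phase 2: merge two sorted runs; repeated pairwise (or k-way)
-- merging yields the fully sorted output
mergeLines :: [String] -> [String] -> [String]
mergeLines [] ys = ys
mergeLines xs [] = xs
mergeLines (x:xs) (y:ys)
  | x <= y    = x : mergeLines xs (y:ys)
  | otherwise = y : mergeLines (x:xs) ys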

Petr

To allow for files that are larger than available memory, it's a good idea to process them one bite-sized chunk at a time.

Here is a solid algorithm to copy file A to a new file B:

  • Create file B and lock it to your machine
  • Begin loop
    • If there isn't a next line in file A then exit loop
    • Read in the next line of file A
    • Apply processing to the line
    • Check if File B contains the line already
    • If File B does not contain the line already then append the line to file B
    • Goto beginning of loop
  • Unlock file B

It can also be worthwhile to copy file A into a temp folder and lock it while you work with it, so that other people on the network aren't prevented from changing the original, while you still have a snapshot of the file as it was when the procedure began.

I intend to revisit this answer in the future and add code.
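
In the meantime, a minimal Haskell sketch of the loop above (locking omitted; the line-containment check is done against an in-memory Set of seen lines rather than by rescanning file B, which is an assumed simplification):

import qualified Data.Set as Set
import System.IO

copyUnique :: FilePath -> FilePath -> IO ()
copyUnique fileA fileB =
  withFile fileA ReadMode $ \hA ->
    withFile fileB WriteMode $ \hB ->
      let loop seen = do
            done <- hIsEOF hA
            if done
              then return ()                      -- exit loop at end of file A
              else do
                line <- process <$> hGetLine hA   -- read and process next line
                if Set.member line seen
                  then loop seen                  -- file B already has this line
                  else do
                    hPutStrLn hB line             -- append the line to file B
                    loop (Set.insert line seen)
      in loop Set.empty
  where
    process = id  -- stand-in for the "apply processing" step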

WonderWorker