
I have taken the liberty of creating a new question, as some parameters have changed dramatically compared to my first question about optimising my bash script (Optimising my script which lookups into a big compressed file).

In short: I want to look up and extract all the lines where the value in the first column of a file (1) (a BAM file) matches the first column of a text file (2). For bioinformaticians, it is essentially extracting the matching read IDs from the two files. File 1 is a binary compressed 130 GB file; file 2 is a TSV file of 1 billion lines.

Recently a user came up with a very elegant one-liner combining the decompression of the file and the lookup with awk, and it worked very well. With the current size of the files, however, the lookup has now been running for more than 200 hours (multithreaded).

  1. Does this "problem" have a name in algorithmics?
  2. What could be a good way to tackle this challenge? (If possible with simple tools such as sed, awk, or bash.)

Thank you very much.

Edit: Sorry for omitting the code; as it was in the linked question I thought it would be a duplicate. Here is the one-liner used:

#!/bin/bash

# Stream the BAM as text with samtools, then load the lookup file (passed as $1)
# into an awk array keyed on its first column; print every BAM record whose read
# ID ($1) is in the array, appending the array value and the tag "wh_genome".
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
H. Vilbert
  • Can you sort one or both files? Even sorting a copy would help reduce the number of comparisons needed. – rossum Dec 14 '20 at 22:21
  • Maybe look at benchmarking [seqkit](https://bioinf.shenwei.me/seqkit/usage/#bam) for its goroutines-style multithreading, e.g. `awk -F$'\t' '{print $1}' large_file_of_ids.tsv > ids.txt && seqkit bam -j 16 -g ids.txt large_bam_file.bam > regions_of_interest.bam`. Also, given the size of the files, it might be faster to identify regions that are 'not of interest' and use the `-G, --exclude-ids` flag (I've used this approach before and got a ~1000X speedup for this step in my pipeline) – jared_mamrot Dec 14 '20 at 22:34
  • @rossum: Yes, both are already sorted alphabetically – H. Vilbert Dec 14 '20 at 22:37
  • @jared_mamrot: thanks, I'll have a look at it. Excluding the regions of non-interest is an elegant approach, but I'm actually looking at these "weird" regions and all of them are of interest, which does not help me. – H. Vilbert Dec 14 '20 at 22:41
  • you haven't provided a copy of your current code so I'm going to assume that for each record found in the first file, you're (re)reading the second file (eg, 500 finds in the first file means (re)reading the second file, in its entirety, 500 times); generally speaking you want to limit yourself to as few reads of each file as possible, with the fastest method being a single read of each file; but also keep in mind the amount of memory you'll require if you plan to read the first file and save a **lot** of data in memory before processing the second file ... – markp-fuso Dec 14 '20 at 23:49
  • how much memory? **1)** how many matching lines are you expecting from the first file? **2)** do you need to keep the full matching line from both files or just from the 2nd file and regardless, do you have an idea of the number of characters per line in each file? if the memory requirements are too high you'll likely need to look at some sort of merge/join process whereby both files are read/processed at the same time (requires both files are sorted by the join key), thus allowing results to be printed asap as opposed to having intermediate results stored in memory – markp-fuso Dec 14 '20 at 23:49
  • assuming the memory requirements will be too high, I'm thinking the easiest method to try first would be a solution based on the `join` command (a rough sketch follows after these comments); after that I'd probably look at some custom code to emulate the `join` command, i.e., read from both files in parallel; and assuming you may need to run this operation multiple times against these same data sets then I'd probably look at some sort of database solution whereby the join columns can be indexed for fast lookup/joins – markp-fuso Dec 14 '20 at 23:49
  • If both files are sorted and on the same index, then you could effectively do something like a modified one stage merge-sort. Stepping through the files in order, catching any matches as you go. That way you only need a minimum one record from each file in memory at a time. – rossum Dec 14 '20 at 23:54
  • @markp-fuso thanks. Memory: 64 GB. 1) I'm expecting 2 matches at best in the biggest file (1) for each line of the smaller one (2) (sometimes more, but those are most likely artifacts, so out of my scope). 2) I need to keep 3 columns: $1, $3 and $4. Do you have an idea of where I could learn a bit about those database solutions for a "quick fix"? (This is a side project which, however important, can't keep me busy for a full week.) – H. Vilbert Dec 15 '20 at 00:02
  • @rossum Yes, I thought about the awk solution that exits the current loop when 2 matches are found. However, I'm not sure I could integrate it into my one-liner as shown in the edit of my question. Can I? – H. Vilbert Dec 15 '20 at 00:03
  • unless I'm misunderstanding your comment ... sounds like the amount of data you're looking for is quite small (ie, less than a few K, if that), so I'm thinking at `awk` solution (read from first file, store info in memory/array; read from second file and print results); as for the database solution ... if you're not familiar with (relational) databases then this isn't a short subject ... you'd be better off finding someone in your group/office who works with databases to help with creating the necessary tables and indexes (assuming you have access to an actual database system/product) – markp-fuso Dec 15 '20 at 00:07
  • If your process is running for 200 hrs (and multi-threaded (??) at that) ... based on just a few matches ... then we're either **a)** missing some crucial details or **b)** there's something (really) wrong with the code you're using; either way, don't limit yourself to a previous solution that doesn't quite do what you now want to do ... and keep in mind that the best/fastest solutions aren't always one-liners! :-) – markp-fuso Dec 15 '20 at 00:09
  • I might indeed not have explained myself well! File (1) contains several billion lines. I am, however, only interested in the first line of each group of four (I hadn't actually thought of that yet), but that is sequencing data whose aim was to cover the full human genome more than 30 times. On the other hand, file (2) has 1 billion matches to find in (1). Previously it worked fine because I had around 800 matches in a file like (2), and it already took ~8 h to complete. Indeed the database seems complicated, but I can ask tomorrow. – H. Vilbert Dec 15 '20 at 00:11
  • billions of lines of data ... looking for a relatively small number of rows from said data set ... will (likely) have to run this type of search (quite) often ... yeah, you definitely want to look at some sort of database/storage solution which allows for indexing of the data; the time it's taking you to run a single 'search' (200 hrs?) is time that could be spent on developing/building a db solution; get a DBA involved from the start ... they can help with finding room in a database, the best storage method (column store for lots of repeated data?), indexing approach, and making a backup – markp-fuso Dec 15 '20 at 00:18
  • Using text for this is not really a good idea; binary formats like databases would be far better. – phuclv Dec 15 '20 at 00:21
  • Before going for the "big stuff", I was thinking of another approach to refine my understanding of the lookup. It seems so far that I actually have very few matches. I tried randomly extracting 10% of the lines of file (2) and running the "normal" approach. However, I'm stuck at the limit of my awk knowledge, because I think that if I add a way to exit at the first match, since the lines from file (2) are all processed together, it will exit at the first match for the whole file, not at the first match for a given query line. Am I right? This will be a small optimisation anyway, as I have a very small number of matches. – H. Vilbert Dec 16 '20 at 11:48
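As a rough illustration of the join-based suggestion in the comments above, here is a minimal sketch (placeholder file names, not tested at this scale). It assumes the read IDs streamed out of samtools view and the TSV are sorted in the same lexicographic order; note that samtools name-sort order is not always plain lexicographic, so a re-sort under LC_ALL=C may be needed. It prints $1, $3 and $4 from the BAM plus column 2 of the TSV, streaming both inputs once and keeping almost nothing in memory:

# Sketch only: bamfile.bam and ids.tsv are placeholder names; both inputs must
# be sorted on column 1 under the same collation (e.g. LC_ALL=C).
LC_ALL=C join -t $'\t' -o 1.1,1.3,1.4,2.2 \
    <(samtools view -@ 2 bamfile.bam) \
    ids.tsv \
    > matches.tsv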

1 Answer


Think of this as a long comment rather than an answer. The 'merge sort' method can be summarised as: if two records don't match, advance one record in the file that has the smaller record; if they do match, record the match and advance one record in the big file.

In pseudocode, this looks something like:

currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
  if (currentLarge < currentSmall)
    currentLarge <- readNextRecord(largeFile)
  else if (currentLarge = currentSmall)
    //Bingo!
    saveMatchData(currentLarge, currentSmall)
    currentLarge <- readNextRecord(largeFile)
  else if (currentLarge > currentSmall)
    currentSmall <- readNextRecord(smallFile)
  endif

  if (largeFile.EOF or smallFile.EOF)
    searching <- false
  endif
endwhile

Quite how you translate that into awk or bash is beyond my meagre knowledge of either.
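For what it's worth, here is a minimal bash/awk sketch of that pseudocode, with the caveats that bamfile.bam and ids.tsv are placeholder names, both inputs are assumed to be sorted on column 1 in the same lexical order, and the printed columns follow the $1/$3/$4 requirement from the comments (plus column 2 of the TSV):

# Merge-join sketch: stream the BAM (large file) and step through the sorted
# TSV (small file) with getline, advancing whichever side has the smaller key.
samtools view -@ 2 bamfile.bam \
| awk -F'\t' -v OFS='\t' -v small="ids.tsv" '
    BEGIN {
        # prime the first record of the TSV
        more = (getline line < small) > 0
        split(line, s, "\t")
    }
    {
        # advance the TSV while its key is behind the current BAM read ID
        while (more && s[1] < $1) {
            more = (getline line < small) > 0
            split(line, s, "\t")
        }
        if (!more) exit            # TSV exhausted, nothing left to match
        if (s[1] == $1)            # keys match: keep the wanted columns
            print $1, $3, $4, s[2]
    }'

If the two sort orders do not agree, this will silently miss matches, so it is worth checking both streams (e.g. with sort -c -k1,1) before relying on the output.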

rossum
  • Thanks for this! I'll try to put together something close to it and give it a try. – H. Vilbert Dec 15 '20 at 10:24
  • while a google search on `bash concurrent read from files` brings up quite a few hits, [this answer](https://stackoverflow.com/a/29422025) seems to have the best, straightforward approach; one downside is the answer assumes both files have the same number of lines, so for files with different numbers of lines the coder will need a couple of follow-on `while` loops (one for each file) to drain whichever file still has lines – markp-fuso Dec 15 '20 at 18:02
  • Can't you just stop the search when one file reaches End-of-File? – rossum Dec 15 '20 at 19:06
  • @rossum; generally speaking it depends on what the coder wants to do ... stop once the first file is exhausted ... stop once the second file is exhausted ... process both files regardless of the number of lines ... – markp-fuso Dec 15 '20 at 21:55
  • That's indeed interesting, I'll try to adapt it as much as I can to my issue.. – H. Vilbert Dec 16 '20 at 09:30