I have a bash script that splits up a huge input file. At the moment the file is 400MB; later the script should split a 4GB file.

The core of this processing is the following awk script:

INPUTFILE="FA.txt"

awk -F $'\t' 'BEGIN{
    count = 1;
    vcount = 1;
    hcount = 1;
    tmp = 0;
    while (getline < "'"$INPUTFILE"'")
    {
        FAv[count] = $1;
        FAh[count] = FAv[count];
        BK[count] = $2;
        vBreak[count] = $3;
        Count++;
    }
    close("'"$INPUTFILE"'");
}

{
    str1 = sprintf("%s%s%s",FAv[vcount],"v",".txt");
    str2 = sprintf("%s%s%s",FAh[hcount],"h",".txt");
    if (NR > (vBreak[vcount+1]-1))
    {
        close(str1);
        vcount ++;
    }
    if (($22-tmp) > BK[hcount])
    {
        close(str2);
        tmp = BK[hcount];
        hcount++;
    }
    printf "...\n",(many columns) >> str1;
    printf "...\n",(many columns) >> str2;
}' Data.txt

Data.txt is a very big tab-separated table with about 40 columns and approximately 2.6 million lines; the file the script should handle later on would have about 30 million lines. The input file I am using right now should make about 300 files, the one the script is meant to process later should create about 4000 files.

The lines close(str1); and close(str2); don't change the error message I get, which is:

awk: (filename)h.txt makes too many open files
Input record number 157762, file Data.txt
source line number 7
awk: (filename)h.txt makes too many open files
Input record number 157762, file Data.txt
source line number 10

The source line numbers given above have been mapped to the snippet shown here; in my actual script they are at different positions.

The file "FA.txt", which is used to generate the splitting conditions, is 3KB in size and has 155 lines and 3 columns, so it shouldn't be a problem for awk at all. I am afraid I can't really give out dummy data, as the data comes from a company I am working for.

I do not see where the problem in the code is located. Any help would be greatly appreciated.

Brian Tompsett - 汤莱恩
Friedrich
  • How many files do you have open when it breaks? How many do you need to keep open simultaneously in order for this to succeed? How many open files does your Awk version support, and which Awk is that? Can you switch to a different version if that helps fix the problem? – tripleee Sep 28 '15 at 07:33
  • Closing after every write should circumvent the problem anyway. Obvious google hits: http://stackoverflow.com/questions/19643934/how-to-split-a-large-csv-file-in-unix-command-line; http://stackoverflow.com/questions/23508959/awk-error-makes-too-many-open-files – tripleee Sep 28 '15 at 07:38
  • It breaks after file 17 in that script. I use the awk version that comes preinstalled with Mac OS X El Capitan; as far as I know, I cannot switch the awk version. I do not know how many open files my awk version supports, but I actually need only two open at a time. If I understand the first post correctly, I can close a file line by line? – Friedrich Sep 28 '15 at 07:41
  • It looks like the problem is in the logic: in the `if (NR > (vBreak[vcount+1]-1))` block, the statement `close(str1)` closes the file, but since the filename does not change until the next iteration, the last `printf`, still redirected to the original `str1`, reopens the file – ewcz Sep 28 '15 at 07:42
  • just put the `close` statements directly after `printf` – ewcz Sep 28 '15 at 07:48
  • I moved the print commands up and put the if statements after them, which solved the error message problem. Now I have moved the prints and the close commands down; I'll see what works better for me. Thanks a lot! – Friedrich Sep 28 '15 at 07:54
  • @tripleee: I still need an answer to this question: can one close files line by line using awk? – Friedrich Sep 28 '15 at 08:00
  • Apart from the real problem, which has been pointed out by ewcz, `Count++;` should probably be `count++;`. Your `FAv` and `FAh` arrays are equal, so you can probably simplify a bit by using only one. If you want to use GNU awk, you can install MacPorts and then GNU awk. – Renaud Pacalet Sep 28 '15 at 08:03
  • The "Count++" is a gift from Internet Explorer, which I am using right now and which thinks it has to correct me all the time; in my code it is "count++". The arrays are the same; I kept both because of the different counters and for readability, and maybe I'm going to change that later on when the script works correctly. I have brew installed already but no Internet at my workplace, so I won't be able to try that out. Apart from that, I am happy with the Mac awk at the moment. Do you know of any drawbacks to it? – Friedrich Sep 28 '15 at 08:10
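The point raised in the comments (the file name must change before the next write, otherwise the `printf` reopens the file that was just closed) can be sketched as below. This is only a minimal illustration under assumptions, not the asker's actual script: the directory `awk_close_demo`, the files `breaks.txt` and `data.txt`, and the variables `name`, `start`, `cur` and `out` are hypothetical stand-ins, and only the line-number-based split is shown.

```shell
#!/bin/sh
# Hypothetical demo data in a scratch directory (names are assumptions).
mkdir -p awk_close_demo && cd awk_close_demo || exit 1

# breaks.txt stands in for FA.txt: chunk name <TAB> first line of that chunk
printf 'a\t1\nb\t3\nc\t6\n' > breaks.txt
# data.txt stands in for Data.txt: seven one-column records
printf 'r1\nr2\nr3\nr4\nr5\nr6\nr7\n' > data.txt

awk -F '\t' '
BEGIN {
    # Read the break table without touching the main input or NR.
    n = 0
    while ((getline line < "breaks.txt") > 0) {
        split(line, f, "\t")
        n++
        name[n]  = f[1]
        start[n] = f[2] + 0
    }
    close("breaks.txt")
    cur = 1
    out = name[cur] "v.txt"
}
{
    # Switch chunks BEFORE writing: close the old file, then build the
    # new file name, so the write below never reopens the closed file.
    while (cur < n && NR >= start[cur + 1]) {
        close(out)
        cur++
        out = name[cur] "v.txt"
    }
    # Each output file is opened once per run, so ">" truncates it once
    # and every later write in the same chunk appends to the open file.
    print $0 > out
}' data.txt
```

With this demo data, `av.txt` receives r1 and r2, `bv.txt` receives r3 to r5, and `cv.txt` receives r6 and r7, with at most one output file open at any time. The simpler alternative mentioned in the comments, writing with `>>` and calling `close` right after every `printf`, also keeps the number of open files bounded, at the cost of reopening a file for every line.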

0 Answers