-1

awk command to split an 8GB file into multiple files by number of rows, with a new filename and the header in each file

I have an 8GB file with 26 column headers. I have to split it into multiple files, each having 400000 lines (4 lakh) including the header, which means each file should have the header as well.

I have tried multiple commands, and although I am getting the desired output, there is one small but weird problem.

After the header on the 1st line, the header is inserted again after every 50000 lines. For example, after using the command below, I got the file FileName_28062021_1.txt. If I open this file I can see the header on the 1st, 50001st, 100001st and 150001st lines. I am not sure how to resolve it. Original command tried:

awk '
    NR==1{header=$0; count=1; print header > "FileName_28062021_" count ".txt"; next }
    !( (NR-1) % 399999){count++; print header > "FileName_28062021_" count ".txt";}
    {print $0 > "FileName_28062021_" count ".txt"}
' FileName_28062021-SourceFile.txt
    
SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021-NonSplit.txt
46646575 FileName_28062021-NonSplit.txt

Second AWK command tried

SERVERIF:/data1/tempCheckAWK $ vi tempAWK.sh
awk '
NR==1 { header = $0 }
(NR % 400000) == 1 {
close(out)
out = "FileName_28062021_" (++count) ".txt"
print header > out
}
NR>1 { print > out }
' FileName_28062021-NonSplit.txt

SERVERIF:/data1/tempCheckAWK $ sh tempAWK.sh
SERVERIF:/data1/tempCheckAWK $ ls -ltr
Jun 10 13:43 FileName_28062021-NonSplit.txt
Jun 28 23:56 tempAWK.sh
Jun 28 23:59 FileName_28062021_1.txt
Jun 28 23:59 FileName_28062021_2.txt

....

SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021_1.txt
400000 FileName_28062021_1.txt

SERVERIF:/data1/tempCheckAWK $ grep "Transactions Id" FileName_28062021_1.txt
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code

I have tried other solutions provided on Stack Overflow. Still no luck: the header repeats after every 50000 lines.

  • Most people living outside India will have no idea what a "lakh" (or a "lac") is, so please don't use Indian words here. – James Z Jun 28 '21 at 16:27
  • Also, do you really have 400 000 000 000 data? – James Z Jun 28 '21 at 16:27
  • Please [edit] your question to include a [mcve] with concise, testable sample input and expected output. – Ed Morton Jun 28 '21 at 16:38
  • There is absolutely nothing in your code that could cause the header to be printed every 50,000 lines. If that is happening then the posted code is not what you're executing. – Ed Morton Jun 28 '21 at 18:01
  • Totally agree, Ed. That's the reason I was wondering. I just executed the code that you shared; it ran without any errors, but the headers are still getting repeated. SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021-NonSplit.txt 46646575 FileName_28062021-NonSplit.txt SERVERIF:/data1/tempCheckAWK $ vi tempAWK.sh awk ' NR==1 { header = $0 } (NR % 400000) == 1 { close(out) out = "FileName_28062021_" (++count) ".txt" print header > out } NR>1 { print > out } ' FileName_28062021-NonSplit.txt :q! SERVERIF:/data1/tempCheckAWK $ sh tempAWK.sh – user16334809 Jun 28 '21 at 18:49
  • SERVERIF:/data1/tempCheckAWK $ ls -ltr Jun 10 13:43 FileName_28062021-NonSplit.txt Jun 28 23:56 tempAWK.sh Jun 28 23:59 FileName_28062021_1.txt Jun 28 23:59 FileName_28062021_2.txt Jun 28 23:59 FileName_28062021_3.txt ... SERVERIF:/data1/tempCheckAWK $wc -l FileName_28062021_1.txt 400000 FileName_28062021_1.txt – user16334809 Jun 28 '21 at 18:50
  • SERVERIF:/data1/tempCheckAWK $grep "Transactions Id" FileName_28062021_1.txt Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code Transactions Id|Transaction Date|Investment Id|External Code – user16334809 Jun 28 '21 at 18:50
  • Don't add information in comments where it can't be formatted legibly and can be easily missed. [edit] your question to provide all relevant information. – Ed Morton Jun 28 '21 at 19:09
  • FWIW since there is nothing in your code or mine that could generate a header every 50,000 output lines, my best guess is that every 50,000th line in your input file is a copy of your header line. What does `grep "Transactions Id" FileName_28062021-NonSplit.txt` output - 1 line or many lines? (again - add that to your question, not in a comment). Please create and post a [mcve] that we can test with. Chances are by going through the process of doing that you'll figure out the issue yourself. – Ed Morton Jun 28 '21 at 19:10
  • My bad! Thanks, Ed, for pointing out that the input file itself was the culprit. I will post the grep output as an answer. Hope that is fine. – user16334809 Jun 28 '21 at 19:22

2 Answers

1

Aside from the issue you noticed, your existing script will fail with a syntax error in some awks due to the unparenthesized expression on the right side of output redirection, and it'll fail with a "too many open files" error in some other awks due to not closing the output files as you go.

Do something like this, untested:

awk '
    NR==1 { header = $0 }
    (NR % 400000) == 1 {
        close(out)
        out = "FileName_28062021_" (++count) ".txt"
        print header > out
    }
    NR>1 { print > out }
' FileName_28062021-SourceFile.txt

If you don't want to hard-code parts of the output file name but instead generate it from the input file name, then change:

out = "FileName_28062021_" (++count) ".txt"

to

out = FILENAME
sub(/-[^-.]+/,"_"(++count),out)

or similar.
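
As suggested in the comments, the logic is easier to check on a tiny input before running it on 8GB. The following is only a minimal sketch under stated assumptions: the 10-line file Sample_28062021-NonSplit.txt and its contents are made up, the chunk size is passed in as a hypothetical awk variable n (3 here instead of 400000), and the output name is derived from FILENAME as described above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample: 1 header line plus 9 data lines.
{
    echo 'Transactions Id|Transaction Date'
    i=1
    while [ "$i" -le 9 ]; do
        echo "T$i|2021-06-28"
        i=$((i + 1))
    done
} > Sample_28062021-NonSplit.txt

# Same logic as the script above, with the chunk size in n (assumption).
awk -v n=3 '
    NR==1 { header = $0 }
    (NR % n) == 1 {
        close(out)                           # avoid "too many open files"
        out = FILENAME
        sub(/-[^-.]+/, "_" (++count), out)   # ...-NonSplit.txt -> ..._1.txt
        print header > out
    }
    NR>1 { print > out }
' Sample_28062021-NonSplit.txt

# Show how many lines each output file received.
wc -l Sample_28062021_*.txt
```

With this input the four output files should hold 3, 4, 4 and 2 lines. Note that with the NR-based test the first file gets n lines in total, while later full files get the header plus n data rows; if every file must have exactly n lines including the header, count data rows instead of testing NR.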

After more discussion with the OP, it turned out that the repeated header lines in the output were due to repeated header lines in the input.

Ed Morton
  • Thanks Ed for your prompt reply! I just tried the above snippet on my AIX server. Below is the error that I got: **Syntax Error The source line is 1.** The error context is NR==1 { header = $0 } (NR% 400000)==1 { close(out) >>> out= <<< "FileName_28062021_"(++count) ".txt" print header > out } NR > 1 { print > out } awk: 0602-502 The statement cannot be correctly parsed. The source line is 1. Could you help in resolving this? – user16334809 Jun 28 '21 at 17:21
  • No, sorry, there's no syntax error in my code so I assume you copy/pasted it incorrectly and I can't see what you're trying to run so idk how to fix it. – Ed Morton Jun 28 '21 at 17:24
  • If you simply run my code as written it 100% will not produce a syntax error and I'm about 99.999% sure it'll do exactly what you want, I just can't test it as you haven't provided a [mcve] we can test with. – Ed Morton Jun 28 '21 at 17:34
  • The script I wrote will work in any awk in any shell on every Unix box. Yes, you can include images in your question but a copy/paste of text is always much more appreciated. Ah, I see in the error message you posted it says `The source line is 1` - would I be right in assuming you're trying to cram the whole script into a single line? If so you may have not put `;`s in the right places but more importantly - why? In any case, edit your question to show a copy/paste from your screen of exactly what you're trying to run and exactly the error message you get and then we can help you. – Ed Morton Jun 28 '21 at 17:37
  • Oh yeah! I was trying to cram the whole script in a single line. Will try the way you have mentioned it. – user16334809 Jun 28 '21 at 17:39
  • Why are you trying to squeeze it all into a single line though? Isn't it much clearer multi-line as I wrote it? – Ed Morton Jun 28 '21 at 17:40
  • I can't copy/paste the entire program as I am running this in the PROD environment (no Internet access), where there is a large volume of data. In my development environment the issue is not reproducible because the repeating header occurs only after the 50000th line. I will try to capture a screenshot. – user16334809 Jun 28 '21 at 17:42
  • I still don't know why you're trying to squeeze it into a single line but assuming you do want to for some reason then either you're not adding `;`s in the right places (i.e. for simplicity after every line that doesn't end with `}` or `{`) or introducing some other syntax error by retyping the command instead of copy/pasting it so just look carefully at what you have typed... – Ed Morton Jun 28 '21 at 17:55
  • Regarding `this issue ( of repeating header) occurs only after 50000th line` - that's why you should create a [mcve] with, say, 10 lines of data and outputting 3 lines at a time instead of relying on 8G of data outputting 400K lines at a time. Code and test a solution for your [mcve] until you're satisfied it's working there and then just change the number from 3 to 400000 to test on your real data. – Ed Morton Jun 28 '21 at 17:59
  • Hi Ed. I have edited my question and added the minimal reproducible example. Have shown that in the split file the header is repeating. – user16334809 Jun 28 '21 at 19:05
  • It's not a [mcve] because a) you're still talking about doing something every 400000 lines instead of, say, every 3 lines, and b) you still haven't provided sample input that we can test with, e.g. a 10 line file you want split every 3 lines. For all we know `Transactions Id|Transaction Date|Investment Id|External Code` exists every 50,000 lines in your input file. – Ed Morton Jun 28 '21 at 19:07
0
When I ran the command below to check how many times the header occurs in the input file, it returned many matches, as shown. So the issue was not in the awk command but in the input file itself.

SERVERIF:/data1/tempCheckAWK $ grep -n "Transactions Id" FileName_28062021-NonSplit.txt
    1:Transactions Id|Transaction Date|Investment Id|External Code
    50001:Transactions Id|Transaction Date|Investment Id|External Code
    100001:Transactions Id|Transaction Date|Investment Id|External Code
    150001:Transactions Id|Transaction Date|Investment Id|External Code
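
If the source file cannot be cleaned up first, the repeated headers could also be dropped while splitting. This is only a sketch under some assumptions: the repeated lines are byte-for-byte identical to line 1, the chunk size n (a hypothetical awk variable) is at least 2, and the Demo-NonSplit.txt input and Demo_ output names are made up for the example. It counts data rows rather than using NR, so each full output file has exactly n lines including the header.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical input: header on line 1 and repeated again on line 4.
printf '%s\n' 'Id|Date' 'A|1' 'B|2' 'Id|Date' 'C|3' 'D|4' 'E|5' > Demo-NonSplit.txt

awk -v n=3 '
    NR==1 { header = $0; next }        # remember the header, do not count it
    $0 == header { next }              # skip repeated header lines in the input
    (rows++ % (n - 1)) == 0 {          # new file every n-1 data rows
        if (out != "") close(out)
        out = "Demo_" (++count) ".txt"
        print header > out
    }
    { print > out }
' Demo-NonSplit.txt

# Each output file should now contain the header exactly once.
grep -c -F 'Id|Date' Demo_*.txt
```

For the real file the same script would be run with n=400000 and the actual file names in place of the Demo ones.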