I'm reading a file in bash, taking the values out and saving them to another file. The file has ~100k lines, and it takes around 25 minutes to read and rewrite them all.

Is there a faster way to write to a file? Right now I'm just iterating through the lines, parsing some values, and saving them like this:

while read line; do
   zip="$(echo "$line" | cut -c 1-8)"
   echo $zip
done < file_one.txt

Everything works fine and the values are parsed correctly; I just want to know how I can optimize the process (if that's even possible).

Thanks

Luka

4 Answers


The bash loop only slows things down, especially the part where you invoke an external program (cut) once per iteration. You can do it all with a single cut:

cut -c 1-8 file_one.txt
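
Since the goal is to save the values to another file, the whole job can be one command plus a redirect (file_two.txt is a hypothetical placeholder for your output file):

cut -c 1-8 file_one.txt > file_two.txt
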
Petr Skocik
  • This will print the result for each line. I'm not sure what the point of the zip variable assignment in the original code was. – Petr Skocik Jun 11 '15 at 08:43
  • Forgot to mention: I'm using a loop because I will later have to check for some other values after this one. The other value will have to be cut depending on its contents (I will have to add an "if"). Great idea, but I'm not sure if I can really use it when I add that... – Luka Jun 11 '15 at 08:44
  • Pipe this into your while loop then. It will speed things up a lot. Bash loops are slow, but what's much, much slower is invoking an executable on each iteration, which is what you're doing right now. – Petr Skocik Jun 11 '15 at 08:48
  • Can you explain further, please? I'm fairly new at this. – Luka Jun 11 '15 at 08:54
  • Starting an external executable takes some overhead. Multiply it by 100K lines and you get a lot of overhead. Bash loops add overhead to each iteration too, but not as much. If you only used bash builtins to get the first 8 characters, it would be much faster (but still slow because of the bash loop). One `cut` is the way to go. – Petr Skocik Jun 11 '15 at 08:59
  • Oh, I understand now :) Thank you very much! – Luka Jun 11 '15 at 09:07
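
A minimal sketch of the piping Petr describes, with the extra per-line check Luka mentioned kept inside the loop (the 10* pattern is a hypothetical placeholder for the real condition):

cut -c 1-8 file_one.txt | while IFS= read -r zip; do
   if [[ $zip == 10* ]]; then   # hypothetical condition; replace with the real check
      echo "$zip"
   fi
done
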

Calling cut once for each line is a big bottleneck. Use substring expansion instead to grab the first 8 characters of each line.

while IFS= read -r line; do
   zip=${line:0:8}
   echo "$zip"
done < file_one.txt
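
If the parsed values are being written to another file, redirecting once at the loop level avoids reopening the output file on every iteration (file_two.txt is a hypothetical placeholder):

while IFS= read -r line; do
   echo "${line:0:8}"
done < file_one.txt > file_two.txt
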
chepner

If you wish to act on a line's substring when it meets some condition, Awk is built for manipulating text files:

awk '{zip=substr($0, 1, 8)} zip == "my match" {print zip}' file_one.txt

In this example, substr($0, 1, 8) extracts characters 1 through 8 of each record ($0, i.e. each line) of file_one.txt. The substring is assigned to the zip variable and printed only when it matches the text "my match".

If you're unfamiliar with Awk and routinely have large files that need manipulating, I recommend investing some time to learn it. Awk is loads faster and more efficient than bash read loops. The blog post Awk in 20 Minutes is a good, quick introduction.

To shave even more time off on large files, you can use a speed-optimized implementation of Awk called Mawk.
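
For instance, the same program runs unchanged under Mawk, assuming it's installed (file_two.txt is a hypothetical placeholder for the output file):

mawk '{ zip = substr($0, 1, 8); print zip }' file_one.txt > file_two.txt
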

John B

I would go with this, since it executes cut only once:

while IFS= read -r line; do
   echo "$line"
done < <(cut -c 1-8 file_one.txt)
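
One advantage of process substitution over piping cut into the loop is that the loop runs in the current shell, so variables set inside it survive past done. A sketch, where the counter is just an illustrative addition:

count=0
while IFS= read -r line; do
   echo "$line"
   count=$((count + 1))   # still visible after the loop, unlike in a pipeline
done < <(cut -c 1-8 file_one.txt)
echo "processed $count lines" >&2   # summary goes to stderr so stdout stays clean
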
Ethan A.