Hi: I made a script to combine different possibilities from two files, but my files have 1000 lines each, and with awk and echo it takes far too long to generate the output file. Is there any way to do the same thing faster?

Example:

fileA.txt is:
dog
cat
horse
fish

fileB.txt is:
good
bad
pretty
ugly

I need fileC to be like:
doggood
dogbad
dogpretty
dogugly
catgood
catbad
catpretty
catugly
etc

Here's the code:

#!/bin/bash
numA=1
while [ $numA -le 1000 ]; do
    numB=1
    while [ $numB -le 1000 ]; do
        string1=$(awk "NR==$numA" fileA.txt)
        string2=$(awk "NR==$numB" fileB.txt)
        string3="$string1$string2"
        echo "$string3" >> fileC.txt
        numB=$(($numB+1))
    done
    numA=$(($numA+1))
done

It would take weeks. I am new to bash scripting, so if someone has an idea, a code example would be great. Thanks


3 Answers


Don't use awk to get the current line of the file; it has to read the entire file from the beginning on each invocation. Just read the files in nested loops.

while read -r string1; do
    while read -r string2; do
        echo "$string1$string2"
    done < fileB.txt
done < fileA.txt > fileC.txt
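As a sanity check, here is a self-contained sketch of that loop using two short stand-in files (same names as in the question, shortened to two lines each):

```shell
# Self-contained sketch: tiny stand-in files with the names from the question.
printf 'dog\ncat\n' > fileA.txt
printf 'good\nbad\n' > fileB.txt

# Each file is read sequentially instead of being re-scanned with awk
# for every line number.
while read -r string1; do
    while read -r string2; do
        echo "$string1$string2"
    done < fileB.txt
done < fileA.txt > fileC.txt
```

With the real 1000-line files this still runs 1000 × 1000 echo calls, but it reads fileB.txt only 1000 times and fileA.txt once, instead of spawning two million awk processes that each scan a file.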
Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Thanks, I updated the question with the example. Where do I have to put your code in my script? – Gustavo Aug 09 '21 at 17:29
  • Replace all your code, Gustavo. I just ran both versions, and the outputs match. BTW, your original code appended to the output file instead of replacing it, and if there were fewer than 1000 lines, it never halted. – Andrew Aug 09 '21 at 17:46

If one of the files can fit in memory (the file listed first is the one loaded into the array):

awk 'NR==FNR {a[++n]=$0; next} {for (i=1; i<=n; ++i) print $0 a[i]}' fileB fileA

With the example input from the question:

#!/bin/sh -

awk '
  NR==FNR {
      a[++n]=$0
      next
  }

  {
      for (i=1; i<=n; ++i) {
          print $0 a[i]
      }
  }
' fileB.txt fileA.txt > fileC.txt
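For reference, a minimal run of the same awk program on a two-line version of the question's sample files (file names as in the question; fileB.txt is listed first because that is the file cached in the array):

```shell
printf 'dog\ncat\n' > fileA.txt
printf 'good\nbad\n' > fileB.txt

# NR==FNR is true only while awk reads the first file (fileB.txt),
# whose lines are stored in array a; each line of the second file
# (fileA.txt) is then printed with every stored line appended.
awk 'NR==FNR {a[++n]=$0; next} {for (i=1; i<=n; ++i) print $0 a[i]}' \
    fileB.txt fileA.txt > fileC.txt
```

A single awk process reads each file exactly once, which is why this scales so much better than per-line awk calls.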
  • for a pair of 100-line files this is 4x-5x faster than a pair of nested `bash/while` loops; for a pair of 300-line files it is 10x-15x faster; net result ... assuming enough memory to hold the first file ... the speedup grows as the number of lines increases – markp-fuso Aug 09 '21 at 16:36
  • thanks... so I replace the two awk lines in my script with this one? – Gustavo Aug 09 '21 at 17:30
  • @rowboat this thing is real fast hehe thanks! – Gustavo Aug 09 '21 at 18:21

Just for fun

A hacky way to build the Cartesian product of two files A and B:

xargs -a A -n1 -d\\n xargs -a B -n1 -d\\n printf %s%s\\n

This will be slow too, because xargs starts a new printf process for each line of output. You could drastically speed this up using ...

xargs -a A -d\\n -I{} xargs -a B -d\\n printf {}%s\\n

... but that would make the command unsafe, because printf would interpret % and \ inside lines of file A. To fix this, you can use

sed 's/[%\\]/&&/g' A | xargs -d\\n -I{} xargs -a B -d\\n printf {}%s\\n
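A quick way to convince yourself the escaping matters, using hypothetical demo files where one line of A contains a `%` (printf treats a doubled `%%` or `\\` as a literal character in its format string; requires GNU xargs for -a/-d):

```shell
# Hypothetical demo files; the line "50%" would confuse the naive -I{}
# variant, because printf would treat % as a conversion specifier.
printf '50%%\ndog\n' > A        # lines: 50%  dog
printf 'good\nbad\n' > B

# sed doubles % and \ so printf prints them literally.
sed 's/[%\\]/&&/g' A | xargs -d\\n -I{} xargs -a B -d\\n printf {}%s\\n > C
```

The inner xargs reads B via -a, so it does not consume the outer xargs' stdin; each line of A becomes a printf format that is reused once per line of B.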
  • @Gustavo Um... I'm really honored that you accepted this answer. However, this script isn't really intended to be actually used. Wouldn't [rowboat's answer](https://stackoverflow.com/a/68715535/6770384) be better? This answer takes 1.5s on files having 1000 lines each. Rowboat's answer does the same in 0.2s, is more readable and less hacky. I would be happy if you could accept that answer instead (a joke answer like this one might attract downvotes when it is in the spotlight). – Socowi Aug 09 '21 at 18:19