
I have two CSV files, both formatted like this:

blah@domain.com,Elon,Tusk

I want to output the lines from the first file whose email address also appears in the second file.

Underflow
  • Couldn't find an exact duplicate, but [this](https://stackoverflow.com/questions/35734197/how-to-use-awk-to-test-if-a-column-value-is-in-another-file) might be a good start if you know a little bit of awk – oguz ismail Aug 03 '20 at 06:49
  • @JamesBrown not the same size, random sizes – Underflow Aug 03 '20 at 16:49

1 Answer


Instead of awk, I use `join` for this type of task because it's simpler and easier for me to remember, e.g. `join -t',' -o 1.1,1.2,1.3 <(sort -t',' -k1,1 first.csv) <(sort -t',' -k1,1 second.csv)`. That said, I believe awk is the best tool for this type of task, e.g. `awk -F, 'FNR==NR {a[$1]; next}; $1 in a' second.csv first.csv`.
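To compare the two approaches side by side, here is a minimal sketch. The sample rows (other than the one from the question) are made up for illustration, and it writes the sorted files to a temporary directory rather than using process substitution so that it also runs in a plain POSIX shell. `LC_ALL=C` is an assumption added here to keep `sort` and `join` in the same collating sequence.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'zed@example.org,Zed,Smith' 'blah@domain.com,Elon,Tusk' 'carol@example.org,Carol,Jones' > "$tmp/first.csv"
printf '%s\n' 'blah@domain.com,Someone,Else' 'zed@example.org,Zed,Other' > "$tmp/second.csv"

# awk: store every email from second.csv as an array key, then print the
# lines of first.csv whose first field is one of those keys.
awk_out=$(awk -F, 'FNR==NR {a[$1]; next} $1 in a' "$tmp/second.csv" "$tmp/first.csv")

# join: both inputs must be sorted on the join field (field 1) only.
LC_ALL=C sort -t, -k1,1 "$tmp/first.csv" > "$tmp/first.sorted"
LC_ALL=C sort -t, -k1,1 "$tmp/second.csv" > "$tmp/second.sorted"
join_out=$(LC_ALL=C join -t, -o 1.1,1.2,1.3 "$tmp/first.sorted" "$tmp/second.sorted")

printf 'awk:\n%s\n\njoin:\n%s\n' "$awk_out" "$join_out"
```

Both commands select the same two lines from `first.csv`. The awk version keeps them in their original order, while the `join` version prints them in sorted order.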

jared_mamrot
  • interesting never used join before i will try this asap – Underflow Aug 03 '20 at 07:02
  • It's only "simpler/easier to remember" because it's what you use. For me an awk script is simpler and easier to remember. YMMV. I'm not sure your join will work since you're sorting on the full line but then joining on the 1st field, so the order join would consider sorted isn't necessarily the order output by sort, depending on how `,` compares to whatever chars are at the same position in each line. – Ed Morton Aug 03 '20 at 18:50
  • @EdMorton would you mind providing an awk solution? – Underflow Aug 03 '20 at 23:28
  • Absolutely agree @EdMorton that my `join` command reorders the output, but I wasn't able to create an example where join failed and awk provided the correct answer (e.g. `awk -F, 'FNR==NR {a[$1]; next}; $1 in a' second.csv first.csv` vs `join -t',' -o 1.1,1.2,1.3 <(sort first.csv) <(sort second.csv)`). I would very much appreciate if you could show me the problem with my approach – jared_mamrot Aug 03 '20 at 23:52
  • @jared join works when the input is sorted on the field you're joining on; anything else falls into the realm of undefined behavior or, as POSIX puts it, `The files file1 and file2 shall be ordered in the collating sequence of sort -b on the fields on which they shall be joined` and `If the input files are not in the appropriate collating sequence, the results are unspecified.` So when a failure occurs it will depend on your input, your locale and the version of join you're running. It's easy to solve the problem: sort on the first `,`-separated field rather than the whole line. – Ed Morton Aug 03 '20 at 23:59
  • Thank you! That makes sense - I'll edit the answer. I do think `awk` is the correct tool for the job, but I was hoping that OP would find an `awk` answer themselves based on @oguz ismail's first comment. Thanks for the reply – jared_mamrot Aug 04 '20 at 00:06
  • You're welcome. As far as a concrete example - idk which locale, etc. it'll fail in as-is, so imagine a file containing the lines `axc` and `abxd` where `x` is the separator (instead of `,` to sidestep the locale issue). If you do `sort file` then the output order will be `abxd` then `axc` because `b` comes before `x` and the whole line is being compared. If instead you do `sort -tx -k1,1 file` then the output order will be `axc` then `abxd` because only `a` vs `ab` is being compared now, and in that comparison `a` alone comes first because it's the shorter of two strings that share the common prefix (`a`). – Ed Morton Aug 04 '20 at 00:10
  • @Underflow if you update your question to provide a [mcve] containing concise, testable sample input (2 files, each with multiple lines that cover all of your requirements for common lines, lines unique to one file vs the other, partially matching substrings, regexp metachars, etc.) and the expected output given that input then I expect you'd get an awk answer, either from me or someone else. – Ed Morton Aug 04 '20 at 00:14
  • this awk answer is perfect thank you @jared_mamrot, marked you as solution – Underflow Aug 14 '20 at 11:28
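
The `axc` / `abxd` example from the comments can be reproduced with a short sketch. The temporary file is hypothetical, and `LC_ALL=C` is an assumption added here so the byte-wise comparison described in that comment applies regardless of the local locale settings.

```shell
# Two lines where `x` plays the role of the field separator.
tmp=$(mktemp -d)
printf '%s\n' 'axc' 'abxd' > "$tmp/file"

# Whole-line comparison: `b` sorts before `x`, so `abxd` comes first.
whole_line=$(LC_ALL=C sort "$tmp/file")

# First-field comparison: the keys are `a` and `ab`, so `axc` comes first.
first_field=$(LC_ALL=C sort -tx -k1,1 "$tmp/file")

printf 'sort file:\n%s\n\nsort -tx -k1,1 file:\n%s\n' "$whole_line" "$first_field"
```

The two orderings differ, which is why input sorted on the whole line is not guaranteed to be in the order `join` expects for the first field.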