2

I am trying to convert all ICD codes in a tab separated file to Phecodes (based on a ICD-Phecode conversion table tab separated file) for a biology bioinformatics project. I found a good starting point with the code from the below stackoverflow post:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

Replacing values in large table using conversion table

But I don't want "all values in the first column have been changed according to the conversion table" (above code) I want all values in all columns in 002.txt to be changed according to the conversion table ICD9toPhecode.txt and ICD10toPhecode.txt. So I changed the awk script to the below but it's not working it's not doing anything:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD9toPhecode.txt 002.txt
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD10toPhecode.txt 002.txt

The first column in ICD9toPhecode.txt and ICD10toPhecode.txt is the ICD9 or ICD10 code and the second column is the Phecode.

Every column in 002.txt is a ICD9 or ICD10 code.

EDIT: It's still not working How do I write to the file?

Here is the anonymized patient data 002.txt sample of ICD10 codes which is OLD_FILE

1   2   3   4   5   6   7   8
K40.9   K43.9   N20.0   N20.1   N23 N39.0   R69 Z88.1
B96.8   D12.6   E11.6   E87.6   I44.7   K40.9   K43.9   K52.9
NOT

Here is the conversion table (ICD10toPhecode.txt) or TABLE

icd10cm phecode
K40.9   550.1
K43.9   550.5
N20.0   594.1
N20.1   594.3
N23 594.8
N39.0   591
R69 1019
Z88.1   960.1
B96.8   041
D12.6   208
E11.6   250.2
E87.6   276.14
I44.7   426.32
K40.9   550.1
K43.9   550.5
K52.9   558
XNO    17

This is what I should get (ICD10 codes converted to Phecodes) (002_output.txt):

1   2   3   4   5   6   7   8
550.1   550.5   594.1   594.3   594.8   591 1019    960.1
041 208 250.2   276.14  426.32  550.1   550.5   558

But what I actually get in 002_output.txt is a repeat of 002.txt

What I need to know is how to change:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD9toPhecode.txt 002.txt
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD10toPhecode.txt 002.txt

Specifically change ICD10toPhecode.txt 002.txt

I need to write the output to 002_output.txt. It can't be as simple as

ICD10toPhecode.txt 002.txt > 002_output.txt

that outputs the same thing as 002.txt

TESTABLE TEST CASE (for tables see the code snippets I posted above with those names):

awk '
   # Ignore header
   NR==1{ next }
   # Load first file
   FNR==NR { a[$1]=$2; next }
   {
      # Foreach value
      for (i = 1; i <= $NR; ++i) {
          # if the value is in second file
          if ($i in a) {         
                # then replace it
                $i = a[$i]       # NOTE - $i __not__ $1 !
          }
      }
      # print it!
      print
   }
' ICD10toPhecode.txt 002.txt > 002_output.txt

BASED ON:

awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE

I'm pretty sure in my TESTABLE TEST CASE I messed up my for loop maybe FNR==NR { a[$1]=$2; next } I need to link $1 ICD codes with $2 Phecodes in ICD10toPhecode.txt and replace the ICD codes with Phecodes in all fields in 002.txt (MORE THAN ONE COLUMN)

  • 1
    OK I formatted it. – ProfHariSeldon Jun 07 '21 at 22:10
  • Is `002.txt` an example of a `TABLE` or of a `OLD_FILE`? Whichever it is - also show us the sample of the other input file(`ICD9toPhecode.txt`?) that would be used to produce the expected output you provided. We need you to give us something that clearly demonstrates your requirements and problem AND which we can copy/paste to test a potential solution against. – Ed Morton Jun 07 '21 at 22:14
  • 002.txt is OLD_FILE and ICD9toPhecode.txt is TABLE (conversion table) – ProfHariSeldon Jun 08 '21 at 06:03
  • OK then update your question to include the TABLE that when applied to 002.txt would produce the expected output. Right now you seem to be showing an OLD_FILE (`Here is 002.txt sample which is OLD_FILE`) but then saying it should turn into a TABLE (`Here is what it should turn into (ICD9toPhecode.txt) or TABLE`) rather than saying given an OLD_FILE and a TABLE the output should be something new. It's extremely unclear what you're trying to do though it sounds like it should be extremely simple. – Ed Morton Jun 08 '21 at 13:20
  • What I need to know is what to put after the `awk '` see code block #2 in my OP right now I do `awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { for (i = 1; i <= $NR; ++i) $i=a[$1] }1' ICD10toPhecode.txt 002_ICD.txt` but I need to write the output to 002_output.txt. It can't be as simple as `ICD10toPhecode.txt 002_ICD.txt > 002_output.txt` that outputs the same thing as 002_ICD.txt – ProfHariSeldon Jun 08 '21 at 14:44
  • Please don''t add information in comments. Make sure all relevant information is in your question and, most importantly at this point, show us a [mcve] that contains concise, testable sample input and expected output that we can copy/paste to test a potential solution with. See [ask] if that's not clear. – Ed Morton Jun 08 '21 at 14:45
  • OK I added the information in EDIT in the OP. The tables and contents (as code snippets) are listed in the OP and the code I am trying to get to work is in TESTABLE TEST CASE section. The problem is 002_output.txt is the same as 002.txt so the AWK does nothing – ProfHariSeldon Jun 08 '21 at 14:56
  • We're having a disconnect somehow. Is the block of text under `Here is what it should turn into (ICD10toPhecode.txt) or TABLE` sample input (i.e. sample contents of `ICD10toPhecode.txt`) or expected output (i.e. the desired result of running `awk 'script' ICD10toPhecode.txt 002_ICD.txt`)? – Ed Morton Jun 08 '21 at 15:01
  • OK I fixed the tables they should make sense now. I'm pretty sure I messed up my for loop maybe FNR==NR { a[$1]=$2; next } I need to link $1 ICD codes with $2 Phecodes in ICD10toPhecode.txt and replace the ICD codes with Phecodes in all fields in 002.txt (MORE THAN ONE COLUMN) – ProfHariSeldon Jun 08 '21 at 15:32
  • Ok, NOW it makes sense and we can help you. – Ed Morton Jun 08 '21 at 15:38
  • Thanks! FYI the real 002.txt has more than 2 rows. There is the header than row after row of ICD codes. – ProfHariSeldon Jun 08 '21 at 15:43
  • Then you should show more than 1 non-header row in your example, e.g. 2 of them. Also if some values in 002 might not be present in PHecode then you should include some of those in your example too so we can see how those should be handled (left alone or replaced with "N/A" or error message and exit, or something else?) – Ed Morton Jun 08 '21 at 15:46
  • Ok I added more than 1 non-header row in 002.tsv and added a ICD and phecode that do not appear in 002.tsv (`XNO 17`). – ProfHariSeldon Jun 08 '21 at 15:57
  • Having a value in ICD10toPhecode.txt that isn't in 002.txt is one useful test case, the other related one that's far more likely to uncover breakage in a potential solution and show us how to handle a requirement that hasn't been stated or shown yet is having a value in 002.tx that isn't in ICD10toPhecode.txt. – Ed Morton Jun 08 '21 at 17:09
  • Ok I added that value to 002.txt and now my ICD_test.txt output file is blank. I think we should start talking about how I did the for loop wrong. – ProfHariSeldon Jun 08 '21 at 18:38
  • I already told you in my answer what you did wrong in your for loop, please just read that. – Ed Morton Jun 08 '21 at 18:40
  • When you say you added that value - do you mean the word `NOT` on it's own at the end of `002.txt`? I was expecting some value in the middle of one of the existing lines so we could see if that value remains as-is or gets replaced with `N/A` or something. Please [edit] your question to show THAT (e.g. replace `E87.6` with `NOT` in your input and update the output accordingly) so it's obvious if we are supposed to just delete values like that or replace them or do something else. – Ed Morton Jun 08 '21 at 18:44
  • When you say `now my ICD_test.txt output file is blank` - are you referring to what happens when you run the script in my answer or the script in your question or something else? – Ed Morton Jun 08 '21 at 18:45

2 Answers2

1

The loop has to be outside of the condition. Ie. you want to check for each column, not only for $1 in a. Consider a more readable multiline format.

awk '
   # Ignore header
   NR==1{ next }
   # Load first file
   FNR==NR { a[$1]=$2; next }
   {
      # Foreach value
      for (i = 1; i <= $NR; ++i) {
          # if the value is in second file
          if ($i in a) {         
                # then replace it
                $i = a[$i]       # NOTE - $i __not__ $1 !
          }
      }
      # print it!
      print
   }
'
KamilCuk
  • 120,984
  • 8
  • 59
  • 111
  • 1
    It's still doing nothing. How do I write to the file? Here is 002.txt sample 1 2 3 4 5 6 7 8 K40.9 K43.9 N20.0 N20.1 N23 N39.0 R69 Z88.1 Here is what it should turn into 1 2 3 4 5 6 7 8 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 But what I get is no change: 1 2 3 4 5 6 7 8 K40.9 K43.9 N20.0 N20.1 N2.3 N39.0 R6.9 Z88.1 – ProfHariSeldon Jun 07 '21 at 15:43
  • 1
    edit your question and post there sample input and expected output. Comments do not preserve lines. – KamilCuk Jun 07 '21 at 16:13
  • OK I did that see EDIT: – ProfHariSeldon Jun 07 '21 at 16:45
0

The bugs I see in your code are using $NR instead of NF in your loop, skipping the first line of the 2nd file instead of printing it as-is, and not using tabs as the in/out separators. This is apparently what you need:

$ awk '
    BEGIN { FS=OFS="\t" }
    NR==FNR { map[$1]=$2; next }
    FNR>1 {
        for (i=1; i<=NF; i++) {
            if ($i in map) {
                $i = map[$i]
            }
        }
    }
    { print }
' ICD10toPhecode.txt 002_ICD.txt
1       2       3       4       5       6       7       8
550.1   550.5   594.1   594.3   594.8   591     1019    960.1
041     208     250.2   276.14  426.32  550.1   550.5   558
Ed Morton
  • 188,023
  • 17
  • 78
  • 185