I am preparing a phenotype file for a GWAS. I found this conversation helpful, but it is not quite what I need to do. I have a large txt file of 44k participants (containing all cohort participants): Column1=FID, Column2=IID, Column3=pseudoID. I want to create a 4th column with my phenotype of interest (1=case, 0=control, NA=all other participants). I have 2 separate txt files, one for my controls and another for my cases, each containing just a column of pseudoIDs.

(1) How do I create a header for the 4th column?

(2) How do I match the pseudoIDs from the separate control and case txt files to create a 0 or 1 as required in the 4th column?

(3) How do the remaining empty rows in the 4th column become NA?

I will be using Regenie for the GWAS. I am more familiar with Linux, less so with R. Any help would be appreciated. Thank you.


44k participant file txt

ppl <- data.frame(FID = 1, 
                  IID = c(150023532, 150023457, 150075826, 
                          150065943, 150034923),
                  Pseudo_ID = c("E78GJHI", "E96GH25", "E56HFT7", 
                                "EH87HN7", "ENM8H53"))
ppl
# FID       IID Pseudo_ID
# 1   1 150023532   E78GJHI
# 2   1 150023457   E96GH25
# 3   1 150075826   E56HFT7
# 4   1 150065943   EH87HN7
# 5   1 150034923   ENM8H53

Case txt

case <- c("E78GJHI", "ENM8H53")

Control txt

ctrl <- c("E96GH25", "EH87HN7")

The expected output


Phenotype File result

FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7 NA
1 150065943 EH87HN7 0
1 150034923 ENM8H53 1
Gowachin
Julia

2 Answers

You can directly construct a vector for that 4th column with the present information and add it to the previous data.frame.

The commented-out lines show how to read each file; to keep the example self-contained, I created the values directly instead.

# ppl <- read.csv("participants.txt", sep = " ")  # hypothetical name for the 44k participant file
ppl <- data.frame(FID = 1, 
                  IID = c(150023532, 150023457, 150075826, 
                          150065943, 150034923),
                  Pseudo_ID = c("E78GJHI", "E96GH25", "E56HFT7", 
                                "EH87HN7", "ENM8H53"))
ppl
# FID       IID Pseudo_ID
# 1   1 150023532   E78GJHI
# 2   1 150023457   E96GH25
# 3   1 150075826   E56HFT7
# 4   1 150065943   EH87HN7
# 5   1 150034923   ENM8H53

# case <- readLines("Case.txt")
case <- c("E78GJHI", "ENM8H53")
case
# [1] "E78GJHI" "ENM8H53"

# ctrl <- readLines("Control.txt")
ctrl <- c("E96GH25", "EH87HN7")
ctrl
# [1] "E96GH25" "EH87HN7"

I just add the column, and its values are defined by the presence of the Pseudo_ID values in the case and control vectors. I bet it could be more readable with other packages, but this version is easier to understand. ifelse returns a vector of the same length as its input, choosing between two values element by element. Here, if Pseudo_ID is in case it returns 1; otherwise the nested ifelse returns 0 if Pseudo_ID is in ctrl, or NA if it is in neither.

For a data.frame named df, df$name reads the column called name, and df$name <- ... overwrites that column or, if it does not exist, creates it.

ppl$ICD_10 <- ifelse(ppl$Pseudo_ID %in% case, 1, 
                     ifelse(ppl$Pseudo_ID %in% ctrl, 0, NA))
ppl
# FID       IID Pseudo_ID ICD_10
# 1   1 150023532   E78GJHI      1
# 2   1 150023457   E96GH25      0
# 3   1 150075826   E56HFT7     NA
# 4   1 150065943   EH87HN7      0
# 5   1 150034923   ENM8H53      1
Gowachin
  • thank you so much, this is really well explained and so helpful. Really appreciate your input. – Julia Jul 28 '22 at 10:25
  • I have tried to run the final command (I tried both manual input and reading directly from the file) but I get the following error: Error in '$<- ,data.frame'('tmp*, ICCD_10, value = logical (0)) : replacement has 0 rows, data has 5 – Julia Jul 28 '22 at 15:26
  • This indicates that there is an error with the `ifelse` command. Have you modified the names of the vectors I called `ctrl` and `case`? Can you test this command `ifelse(ppl$Pseudo_ID %in% case, 1, ifelse(ppl$Pseudo_ID %in% ctrl, 0, NA))` on 5 lines and show what it returns? – Gowachin Jul 29 '22 at 08:02
  • It says logical (0) – Julia Jul 29 '22 at 19:07
  • This means that the `ppl` data.frame is not formatted as intended, or that `case` and `ctrl` are empty. `logical(0)` is an empty vector. – Gowachin Aug 08 '22 at 08:48
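
Since the asker is more comfortable on the Linux command line, the two causes named in the last comment (a participant file that is not laid out as expected, or empty ID lists) can be checked there before going back to R. The following is only a sketch: the file names phenotype.txt, case.txt and control.txt are assumptions, and the sample data are recreated from the question so that the commands run on their own.

```shell
# Sample inputs mirroring the question (file names are assumptions).
printf '%s\n' 'FID IID Pseudo_ID' \
    '1 150023532 E78GJHI' '1 150023457 E96GH25' '1 150075826 E56HFT7' \
    '1 150065943 EH87HN7' '1 150034923 ENM8H53' > phenotype.txt
printf '%s\n' E78GJHI ENM8H53 > case.txt
printf '%s\n' E96GH25 EH87HN7 > control.txt

# 1. Are the ID lists empty? A line count of 0 would explain an empty match.
wc -l case.txt control.txt

# 2. Does every participant row have 3 whitespace-separated fields?
#    More than one distinct value here points to a layout problem.
awk '{ print NF }' phenotype.txt | sort | uniq -c

# 3. How many case IDs are actually found in column 3 of the participant file?
awk 'NR == FNR { ids[$1]; next }
     $3 in ids { n++ }
     END { print n + 0, "IDs matched" }' case.txt phenotype.txt
```

If step 3 reports 0 matches although the IDs look identical by eye, invisible characters such as trailing spaces in one of the files are worth ruling out.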
Is this what you are trying to do? It might not be the most efficient one but you could do the following.

case file (add header and create $2 with status ($2=1 for cases))

awk 'BEGIN{print "Pseudo_ID","ICD_10"}; { print $1,$2=1}' OFS=" " case.txt > case_1.txt

control file (do not add header but create $2 with status ($2=0 for controls))

awk '{ print $1,$2=0}' OFS=" " control.txt > control_1.txt 

Merge the two files together

cat case_1.txt control_1.txt > case_control.txt

Match case_control.txt with the phenotype file to get the desired output

awk 'BEGIN {FS=OFS=" "} NR==FNR {a[$1]=$2;next}{print $0, ($3 in a ? a[$3]:"NA")}' case_control.txt phenotype.txt 

FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7 NA
1 150065943 EH87HN7 0
1 150034923 ENM8H53 1
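
The same result can probably be reached in a single awk pass without the intermediate case_1.txt, control_1.txt and case_control.txt files. This is a sketch, not a tested replacement for the steps above: it assumes phenotype.txt has a header line, that case.txt and control.txt hold one Pseudo_ID per line with no header, and the output name phenotype_icd10.txt is made up for illustration. The sample files are recreated from the question so the block runs as is.

```shell
# Sample inputs mirroring the question (file names are assumptions).
printf '%s\n' 'FID IID Pseudo_ID' \
    '1 150023532 E78GJHI' '1 150023457 E96GH25' '1 150075826 E56HFT7' \
    '1 150065943 EH87HN7' '1 150034923 ENM8H53' > phenotype.txt
printf '%s\n' E78GJHI ENM8H53 > case.txt
printf '%s\n' E96GH25 EH87HN7 > control.txt

# Read the case list, then the control list, then annotate the participant file.
awk -v pheno_name="ICD_10" '
    FILENAME == ARGV[1] { status[$1] = 1; next }   # first file: case IDs
    FILENAME == ARGV[2] { status[$1] = 0; next }   # second file: control IDs
    FNR == 1 { print $0, pheno_name; next }        # participant header line
    { print $0, ($3 in status ? status[$3] : "NA") }
' case.txt control.txt phenotype.txt > phenotype_icd10.txt

cat phenotype_icd10.txt
```

Note that an ID listed in both case.txt and control.txt would end up as 0 here, because the control file is read last.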
DSTO
  • Final question, how can I ensure that all 4 columns are tab separated? I think it is not reading the 4th column when I am trying to run the GWAS. I think the 4th column may be space separated. Thank you. – Julia Jul 29 '22 at 21:27
  • `awk -F " " 'NR==FNR {a[$1]=$2;next}{print $0, ($3 in a ? a[$3]:"NA")}' OFS="\t" case_control.txt phenotype.txt` – DSTO Jul 29 '22 at 21:40
  • Thank you, that's great. Fixed it. It is now not recognising the headers. Any way to fix that? Thank you so much. – Julia Jul 29 '22 at 21:43
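
One possible caveat on the tab question, offered as a sketch rather than a confirmed diagnosis of the header problem: awk only rebuilds a record with OFS when a field is assigned, so `print $0, x` puts a tab before the new column but leaves the original columns separated as they were in the input. Assigning `$1 = $1` forces the whole line, header included, to be rewritten with tabs. The file names pheno_space.txt and pheno_tab.txt are hypothetical.

```shell
# Small space-separated sample (hypothetical file name).
printf '%s\n' 'FID IID Pseudo_ID ICD_10' \
    '1 150023532 E78GJHI 1' '1 150075826 E56HFT7 NA' > pheno_space.txt

# $1 = $1 makes awk rebuild each record using OFS, so every separator becomes a tab.
awk 'BEGIN { OFS = "\t" } { $1 = $1; print }' pheno_space.txt > pheno_tab.txt

# Check: every line, including the header, should report 4 tab-separated fields.
awk -F '\t' '{ print NF }' pheno_tab.txt | sort -u

# Inspect the header line that the GWAS tool will see.
head -n 1 pheno_tab.txt
```

If the field check prints anything other than a single 4, some lines still contain a mix of separators.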