Subset of Columns based on the same column names

Question

I have 2 data frames. In my dataset, a part of column name with leading zeros or trailing zeros should be ignored when matching columns, such as 009 = 9 or 22.0 == 22. And I want to subset the intersection of column names of two data frames.

As an example:

df1

No  009  237 038.1  22.0  010 
1     2    3     6     3    3
3     1    7     6     5    5
7     5   NA     9     0    6

df2

No    9  237  38.1   010  070  33.5
1     2    3     6     3    2     1
3     1    7     6     5    1     2
7     5   NA     9     0    9     6

The result should be like this:

result.df1

No  009  237 038.1  010
1     2    3     6    3
3     1    7     6    5
7     5   NA     9    6

result.df2

No    9  237  38.1   010
1     2    3     6     3
3     1    7     6     5
7     5   NA     9     0

How can I do it?

benson23 · Accepted Answer · 2023-04-03T03:52:15.303

2

We can use trimws() to remove leading/trailing zero and dot (with the argument whitespace = "[\\.0]"), then use %in% to compare the two colnames and use that to index ([]) the original dataframe.

Save the trimws() results to new variables (clean_col_df1 and clean_col_df2) to improve readability of codes.

clean_col_df1 <- trimws(colnames(df1), whitespace = "[\\.0]")
clean_col_df2 <- trimws(colnames(df2), whitespace = "[\\.0]")

df1[, clean_col_df1 %in% clean_col_df2]
  No 009 237 038.1 010
1  1   2   3     6   3
2  3   1   7     6   5
3  7   5  NA     9   6

df2[, clean_col_df2 %in% clean_col_df1]
  No 9 237 38.1 010
1  1 2   3    6   3
2  3 1   7    6   5
3  7 5  NA    9   0

edited Apr 03 '23 at 03:52

answered Apr 03 '23 at 03:46

benson23

16,369
9
19
38

Hi, Thanks for answer. I follow the code but get logi in vector,is any step I missing? – mashimena Apr 03 '23 at 05:58
1

@mashimena Make sure you copy-pasted everything from my codes. If you have unexpected result, you can edit your question to provide that information (e.g. what is the error, what is your output etc.) – benson23 Apr 03 '23 at 06:59

score 0 · Answer 2 · answered Apr 03 '23 at 04:11

This approach works, although it does produce a warning message due to the "No" column not being able to be coerced to a numeric value:

result.df1 <- df1[,which(as.numeric(colnames(df1)) %in% 
                           as.numeric(colnames(df2)))]
result.df1
#   No 009 237 038.1 010
# 1  1   2   3     6   3
# 2  3   1   7     6   5
# 3  7   5  NA     9   6

result.df2 <- df2[,which(as.numeric(colnames(df2)) %in% 
                           as.numeric(colnames(df1)))]
result.df2
#   No 9 237 38.1 010
# 1  1 2   3    6   3
# 2  3 1   7    6   5
# 3  7 5  NA    9   0

This is the warning message that will pop up:

# Warning messages:
# 1: In as.numeric(colnames(df2)) %in% as.numeric(colnames(df1)) :
#   NAs introduced by coercion
# 2: In as.numeric(colnames(df2)) %in% as.numeric(colnames(df1)) :
#   NAs introduced by coercion

Subset of Columns based on the same column names

2 Answers2