difference between as.data.frame and read.csv in R

Question

I would like to do propensity score matching with the R function matchit. If I read the data from a CSV file, everything looks fine and the result is what I want:

> csv <- read.csv("C:/Users/Lenovo/Desktop/ddd.csv", header=TRUE)
> df <- as.data.frame(csv)
> df
   PERSON_ID OUTCOME tnb gxy AGE1
1     166920       1   2   0   61
2     167350       1   2   0   65
3     167757       1   1   0   58
4     167812       1   1   0   63
5     168271       1   2   0   55
6     168426       0   2   0   47
7     168652       0   2   1   57
8     168983       0   1   0   51
9     169083       0   2   0   50
10    169172       0   2   1   53
> fm <- matchit(OUTCOME ~ tnb + AGE1, data = df, method = "nearest")
> result <- summary(fm)
> result

Call:
matchit(formula = OUTCOME ~ tnb + AGE1, data = df, method = "nearest")

Summary of balance for all data:
         Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max
distance        0.8334        0.1666     0.2575    0.6667   0.867   0.6667  0.8964
tnb             1.6000        1.8000     0.4472   -0.2000   0.000   0.2000  1.0000
AGE1           60.4000       51.6000     3.7148    8.8000   8.000   8.8000 10.0000


Summary of balance for matched data:
         Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max
distance        0.8334        0.1666     0.2575    0.6667   0.867   0.6667  0.8964
tnb             1.6000        1.8000     0.4472   -0.2000   0.000   0.2000  1.0000
AGE1           60.4000       51.6000     3.7148    8.8000   8.000   8.8000 10.0000

Percent Balance Improvement:
         Mean Diff. eQQ Med eQQ Mean eQQ Max
distance          0       0        0       0
tnb               0       0        0       0
AGE1              0       0        0       0

Sample sizes:
          Control Treated
All             5       5
Matched         5       5
Unmatched       0       0
Discarded       0       0

However, if I keep the input data in vectors and then convert them to a data.frame, the result has many rows whose names are not ones I defined:

> OUTCOME<-c("1", "1", "1", "1", "1", "0", "0", "0", "0", "0");
> PERSON_ID<-c("166920", "167350", "167757", "167812", "168271", "168426", "168652", "168983", "169083", "169172");
> tnb<-c("0", "0", "1", "0", "1", "0", "0", "1", "1", "0");
> gxy<-c("0", "0", "1", "0", "0", "1", "0", "0", "1", "0");
> AGE1<-c("61", "65", "58", "63", "55", "47", "57", "51", "50", "53");
> matrix <- cbind(PERSON_ID,OUTCOME,tnb,gxy,AGE1)
> data <- as.data.frame(matrix, stringsAsFactors= TRUE)
> data
   PERSON_ID OUTCOME tnb gxy AGE1
1     166920       1   0   0   61
2     167350       1   0   0   65
3     167757       1   1   1   58
4     167812       1   0   0   63
5     168271       1   1   0   55
6     168426       0   0   1   47
7     168652       0   0   0   57
8     168983       0   1   0   51
9     169083       0   1   1   50
10    169172       0   0   0   53
> fm <- matchit(OUTCOME ~ tnb + gxy + AGE1, data = data, method = "nearest", replace = TRUE, ratio = 1)
> summary(fm)

Call:
matchit(formula = OUTCOME ~ tnb + gxy + AGE1, data = data, method = "nearest", 
    replace = TRUE, ratio = 1)

Summary of balance for all data:
         Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max
distance           1.0           0.0     0.0000       1.0       1      1.0       1
tnb0               0.6           0.6     0.5477       0.0       0      0.0       0
tnb1               0.4           0.4     0.5477       0.0       0      0.0       0
gxy1               0.2           0.4     0.5477      -0.2       0      0.2       1
AGE150             0.0           0.2     0.4472      -0.2       0      0.2       1
AGE151             0.0           0.2     0.4472      -0.2       0      0.2       1
AGE153             0.0           0.2     0.4472      -0.2       0      0.2       1
AGE155             0.2           0.0     0.0000       0.2       0      0.2       1
AGE157             0.0           0.2     0.4472      -0.2       0      0.2       1
AGE158             0.2           0.0     0.0000       0.2       0      0.2       1
AGE161             0.2           0.0     0.0000       0.2       0      0.2       1
AGE163             0.2           0.0     0.0000       0.2       0      0.2       1
AGE165             0.2           0.0     0.0000       0.2       0      0.2       1


Summary of balance for matched data:
         Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean eQQ Max
distance           1.0           0.0     0.0000       1.0     1.0      1.0       1
tnb0               0.6           0.8     0.5657      -0.2     0.0      0.0       0
tnb1               0.4           0.2     0.5657       0.2     0.0      0.0       0
gxy1               0.2           0.8     0.5657      -0.6     0.0      0.0       0
AGE150             0.0           0.0     0.0000       0.0     0.0      0.0       0
AGE151             0.0           0.2     0.5657      -0.2     0.5      0.5       1
AGE153             0.0           0.0     0.0000       0.0     0.0      0.0       0
AGE155             0.2           0.0     0.0000       0.2     0.5      0.5       1
AGE157             0.0           0.0     0.0000       0.0     0.0      0.0       0
AGE158             0.2           0.0     0.0000       0.2     0.5      0.5       1
AGE161             0.2           0.0     0.0000       0.2     0.5      0.5       1
AGE163             0.2           0.0     0.0000       0.2     0.5      0.5       1
AGE165             0.2           0.0     0.0000       0.2     0.5      0.5       1

Percent Balance Improvement:
         Mean Diff. eQQ Med eQQ Mean eQQ Max
distance          0       0        0       0
tnb0           -Inf       0        0       0
tnb1           -Inf       0        0       0
gxy1           -200       0      100     100
AGE150          100       0      100     100
AGE151            0    -Inf     -150       0
AGE153          100       0      100     100
AGE155            0    -Inf     -150       0
AGE157          100       0      100     100
AGE158            0    -Inf     -150       0
AGE161            0    -Inf     -150       0
AGE163            0    -Inf     -150       0
AGE165            0    -Inf     -150       0

Sample sizes:
          Control Treated
All             5       5
Matched         2       5
Unmatched       3       0
Discarded       0       0

My question is: read.csv returns a data frame, as.data.frame(x) also returns a data frame, why are the results different in R's matchit output?

please format your csv to be displayed as table for easy viewing in your question — user93, Sep 25 '17 at 08:56

score 0 · Answer 1 · answered Sep 25 '17 at 12:57

"My question is: read.csv returns a data frame, as.data.frame(x) also returns a data frame, why are the results different in R's matchit output?"

When you use read.csv, your numeric columns are most likely read in as numbers, so matchit treats them as numeric. But when you declare your variables as character strings:

AGE1<-c("61", "65", "58", "63", "55", "47", "57", "51", "50", "53")

instead of as numbers:

AGE1<-c(61, 65, 58, 63, 55, 47, 57, 51, 50, 53)

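matchit will treat them as categorical. With stringsAsFactors = TRUE the character columns become factors, which is why your summary shows one row per level (AGE150, AGE151, ...) rather than a single AGE1 row.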
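To see where row names like AGE150 come from, here is a minimal sketch using base R's model.matrix(). It is only an illustration: that matchit expands factors in a similar way is an assumption on my part, and the names age_chr, age_num, mm_factor and mm_numeric are made up for this example.

```r
# Same ages as in the question, once as character strings and once as numbers.
age_chr <- c("61", "65", "58", "63", "55", "47", "57", "51", "50", "53")
age_num <- as.numeric(age_chr)

# Base R's model.matrix() is used purely for illustration here;
# it is an assumption that matchit handles factors in a comparable way.
mm_factor  <- model.matrix(~ AGE1, data.frame(AGE1 = factor(age_chr)))
mm_numeric <- model.matrix(~ AGE1, data.frame(AGE1 = age_num))

colnames(mm_factor)   # one dummy column per level except the baseline level
colnames(mm_numeric)  # a single AGE1 column next to the intercept
```

With ten distinct ages, the factor version produces a separate column for nearly every observation, which would also explain why the matching result looks so odd.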

Running str(data) and str(df) should show you this difference.
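If you would rather keep building the data frame from character vectors, one possible fix is to convert the columns back to numbers before calling matchit. This is a sketch with only two of your columns; the name fixed is hypothetical, and it assumes every column really does hold numeric values.

```r
# Two of the columns from the question, declared as character strings.
OUTCOME <- c("1", "1", "1", "1", "1", "0", "0", "0", "0", "0")
AGE1    <- c("61", "65", "58", "63", "55", "47", "57", "51", "50", "53")

# cbind() of character vectors gives a character matrix,
# and stringsAsFactors = TRUE then turns each column into a factor.
data <- as.data.frame(cbind(OUTCOME, AGE1), stringsAsFactors = TRUE)
str(data)   # both columns are reported as factors

# Go through as.character() first: as.numeric() applied directly to a
# factor would return the internal level codes, not the original values.
fixed <- data
fixed[] <- lapply(data, function(col) as.numeric(as.character(col)))
str(fixed)  # both columns are now numeric
```

Alternatively, declare the vectors as numbers from the start and build the data frame with data.frame(), which avoids the character matrix altogether.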
