
I have two datasets from these links: cmu: http://lib.stat.cmu.edu/S/Harrell/data/descriptions/titanic.html kaggle: https://www.kaggle.com/c/titanic-gettingStarted/data

When I try to merge them, the columns on the right repeat the same row over and over. Is there any way I can fix this? I am trying to match the "Fare" to each person. Mostly I am trying to learn `merge`.

cmu <- read.csv("titanic_cmu.txt")
kaggle <- read.csv("titanic_kaggle.csv")
tdata <- merge(cmu, kaggle)

output:

> head(tdata)
  row.names pclass survived                                            name     age    embarked                       home.dest room     ticket  boat    sex
1         1    1st        1                    Allen, Miss Elisabeth Walton 29.0000 Southampton                    St Louis, MO  B-5 24160 L221     2 female
2         2    1st        0                     Allison, Miss Helen Loraine  2.0000 Southampton Montreal, PQ / Chesterville, ON  C26                  female
3         3    1st        0             Allison, Mr Hudson Joshua Creighton 30.0000 Southampton Montreal, PQ / Chesterville, ON  C26            (135)   male
4         4    1st        0 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton Montreal, PQ / Chesterville, ON  C26                  female
5         5    1st        1                   Allison, Master Hudson Trevor  0.9167 Southampton Montreal, PQ / Chesterville, ON  C22               11   male
6         6    1st        1                              Anderson, Mr Harry 47.0000 Southampton                    New York, NY E-12                3   male
  PassengerId Survived Pclass                    Name  Sex Age SibSp Parch    Ticket Fare Cabin Embarked
1           1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171 7.25              S
2           1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171 7.25              S
3           1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171 7.25              S
4           1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171 7.25              S
5           1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171 7.25              S
6           1        0      3 Braund, Mr. Owen Harris male  22     1     0 A/5 21171 7.25              S
cchamberlain
Redspart
  • I'm unsure all.x and all.y will help (although the question seems very unclear). The problem might be that if there are `n` of a matching criterion in x and `m` in y then there will be `n * m` entries in the returned value from the merge. – IRTFM Feb 03 '15 at 20:49
  • @agstudy Hi there, I completely understand; I just realized it might not make sense. When I merged them, I got something like this: http://imgur.com/cMtCfAF. Basically the two data sets are stuck together, from what I can understand, but it repeats? I have read ?merge, and I have tried all.x and all.y, but that produces 0 observations. Note: I would have posted this in my question, but I do not have a reputation of 10. – Redspart Feb 03 '15 at 21:04
  • Well ..._that_ was unhelpful. Why can't you copy the text (both code and output) from your console and paste it into your question? – IRTFM Feb 03 '15 at 21:07
  • So it would seem likely that there would be multiple rows in both datasets where columns of the same name might have identical values. ... because the datasets have a lot of factors and columns of same name. Hence you get n * m copies of the other data for each of the common combinations. Step back and ask yourself "why am I merging?" and "what do I want to have as merge criteria?" – IRTFM Feb 03 '15 at 21:12
  • This would be clarified by posting `names(cmu)` and `names(kaggle)`. – IRTFM Feb 03 '15 at 21:15
  • You could use the join functions from the `dplyr` package. In particular, `left_join` and `inner_join` will help you manage instances where there are duplicate values in one of the data frames. Here is a [cheatsheet](https://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html) on the different kinds of joins. – Sam Firke Feb 03 '15 at 21:20
  • To merge on the passenger names (is this what you want to do?), you will need to do some more data cleaning based on just this small sample. You will also need to get very familiar with the `by` arguments in `merge`. – vpipkt Feb 03 '15 at 21:28
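To make the comments above concrete, here is a minimal sketch using two toy data frames. Judging from the `head()` output in the question, the two files spell their columns differently (`name` vs `Name`, `pclass` vs `Pclass`), and R treats names as case-sensitive, so `merge` finds no common columns and falls back to the Cartesian product. The rows, the `key` column, and the name-cleaning rule below are all made up for illustration; the real data will likely need more careful cleaning.

```r
# Toy stand-ins for the two data sets (all values are hypothetical).
cmu <- data.frame(
  name   = c("Doe, Mr John", "Roe, Miss Jane", "Poe, Mr Sam"),
  pclass = c("1st", "2nd", "3rd"),
  stringsAsFactors = FALSE
)
kaggle <- data.frame(
  Name = c("Doe, Mr. John", "Roe, Miss. Jane"),
  Fare = c(71.28, 7.25),
  stringsAsFactors = FALSE
)

# merge() defaults to by = intersect(names(x), names(y)).
# Column names are case-sensitive, so nothing is shared here.
common <- intersect(names(cmu), names(kaggle))

# With no common columns, merge() returns every pairing of rows (3 * 2).
crossed <- merge(cmu, kaggle)

# Crude normalisation (an assumption about how the names differ):
# drop periods and lower-case, so "Mr." and "Mr" compare equal.
cmu$key    <- tolower(gsub(".", "", cmu$name, fixed = TRUE))
kaggle$key <- tolower(gsub(".", "", kaggle$Name, fixed = TRUE))

# Merge on the explicit key; all.x = TRUE keeps cmu rows with no match,
# filling the kaggle columns with NA.
joined <- merge(cmu, kaggle, by = "key", all.x = TRUE)
print(joined)
```

If the key columns already matched exactly, `merge(cmu, kaggle, by.x = "name", by.y = "Name")` would avoid the helper column. Checking `intersect(names(cmu), names(kaggle))` before merging is a quick way to see what `merge` will use by default.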
