-2

The following piece of code is supposed to load and prepare datasets from a specified directory for further data analysis. The problem is that the code generates the following errors during attempts to merge data (one for each merging option). I'm confused about what is going on here. However, my gut feeling tells me that the errors might be due to the absence of column names in some data frames. I would appreciate a clarification. Also, please advise on the preferred merging option (between #1 and #2). Thank you!

UPDATE 2 (Reworked with minimal reproducible example, previous versions removed):

Current error (Merging Option 1 enabled):

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

Current error (Merging Option 2 enabled):

Error in `[.data.frame`(x, rep.int(NA_integer_, nyy), nm.x, drop = FALSE) : undefined columns selected

Required Packages: none, other than standard ones.

Source Code (includes reproducible data):

# load the datasets of transformed data

# real data
#dataSets <- loadDataSets(SRDA_DIR)

# reproducible example data
# (generated via `dput(lapply(dataSets, head))`, thanks to @MrFlick)

dataSets <- list(structure(list(`NA` = c("284", "284", "284", "284", "284", 
"284"), `NA` = c("490", "490", "490", "490", "490", "490")), .Names = c(NA_character_, 
NA_character_), SQL = structure("ClNFTEVDVCBnLmdyb3VwX2lkLCB1LnVzZXJfaWQKRlJPTSBzZjA1MTQuZ3JvdXBzIGcsIHNmMDUxNC51c2VycyB1LCBzZjA1MTQucHJvamVjdF9oaXN0b3J5IHBoLCBzZjA1MTQucHJvamVjdF90YXNrIHB0LCBzZjA1MTQucHJvamVjdF9ncm91cF9saXN0IHBnbApXSEVSRSBwaC5wcm9qZWN0X3Rhc2tfaWQgPSBwdC5wcm9qZWN0X3Rhc2tfaWQKQU5EIHB0Lmdyb3VwX3Byb2plY3RfaWQgPSBwZ2wuZ3JvdXBfcHJvamVjdF9pZApBTkQgZy5ncm91cF9pZCA9IHBnbC5ncm91cF9pZApBTkQgcGgubW9kX2J5ID0gdS51c2VyX2lk", class = "base64"), indicatorName = structure("Y29udHJpYlBlb3BsZQ==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("85684", 
"172552", "228484", "173865", "94140", "179097"), Enabled = c("1", 
"1", "1", "1", "1", "1"), `Repo URL` = c("http://svn.tr51.org/svn/variomat/trunk/", 
"http://svn.hyperic.org/?root=Hyperic+SIGAR", "http://code.google.com/p/ufolder/source/browse", 
"https://svn.canoo.com/trunk/webtestclipse/", "http://www.rasilon.net/svn/sptools/trunk/sptools", 
"http://trac.pocoo.org/repos/pygments"), `Repo Instructions` = c("Login is currently disabled.", 
"For anonymous access, simply issue the command 'svn co http://svn.hyperic.org/projects/sigar'  For developer access, send email to sourceforge user &quot;hyperic&quot;.", 
"Anonymous browsing", "https://svn.canoo.com/trunk/webtestclipse/", 
"SVN stuff to go here.  If you just want a copy of the source, run svn co http://www.rasilon.net/svn/sptools/trunk/sptools", 
"The Subversion repository is at http://trac.pocoo.org/repos/pygments."
)), .Names = c("Project ID", "Enabled", "Repo URL", "Repo Instructions"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZW5hYmxlZCwgdXJsX3ByaW1hcnksIGluc3RydWN0aW9uc19wdWJsaWMKRlJPTSBzZjA1MTQuZXh0ZXJuYWxfdG9vbF9saW5rcw==", class = "base64"), indicatorName = structure("ZGV2TGlua3M=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRW5hYmxlZCwgUmVwbyBVUkwsIFJlcG8gSW5zdHJ1Y3Rpb25z", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1343228", 
"230959", "1938195", "1883362", "404683", "650286"), `NA` = c("6", 
"6", "6", "6", "6", "6"), `NA` = c("http://sourceforge.net/p/aprpg/discussion", 
"http://sourceforge.net/project/memberlist.php?group_id=230959", 
"http://www.polishavenue.com", "http://sourceforge.net/p/wakemypc/tickets", 
"http://sourceforge.net/apps/trac/graphz/", "http://testando1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgcHJlZmVycmVkX3N1cHBvcnRfdHlwZSwgcHJlZmVycmVkX3N1cHBvcnRfcmVzb3VyY2UKRlJPTSBzZjA1MTQuZ3JvdXBzCldIRVJFIHByZWZlcnJlZF9zdXBwb3J0X3R5cGUgPSA2", class = "base64"), indicatorName = structure("ZGV2U3VwcG9ydA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("1692507", 
"1095949", "685064", "900864", "976917", "1949934"), `Development Team Size` = c(1, 
1, 1, 1, 1, 1)), .Names = c("Project ID", "Development Team Size"
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgQ09VTlQodXNlcl9pZCkKRlJPTSBzZjA1MTQudXNlcl9ncm91cApXSEVSRSBncmFudGN2cyA9IDEKR1JPVVAgQlkgZ3JvdXBfaWQ=", class = "base64"), indicatorName = structure("ZGV2VGVhbVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgRGV2ZWxvcG1lbnQgVGVhbSBTaXpl", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1844416", 
"1849571", "1850512", "1850521", "1854556", "1855148"), `NA` = c("0", 
"0", "0", "0", "0", "0"), `NA` = c("1", "1", "1", "1", "1", "1"
)), .Names = c(NA_character_, NA_character_, NA_character_), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2ksIHVzZV9mb3J1bQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("ZG1Qcm9jZXNz", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("2107960", 
"2068039", "2156229", "2068032", "2068046", "2081469"), `Project Age` = c(5, 
6.5, 4, 6.5, 6.5, 6)), .Names = c("Project ID", "Project Age"
), row.names = c(NA, 6L), class = "data.frame"), structure(list(
    `Project ID` = c("708994", "1586967", "581072", "738614", 
    "758081", "782990"), `Project License` = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("", "afl", "apache", "apache2", 
    "artistic", "boostlicense", "bsd", "cddl", "eclipselicense", 
    "educom", "fair", "gpl", "ibm", "ibmcpl", "iosl", "jabber", 
    "lgpl", "mit", "mpl", "mpl11", "ms-rl", "nasalicense", "ncsa", 
    "nethack", "none", "nposl3", "osl", "other", "php", "php-license", 
    "psfl", "public", "publicdomain", "python", "qpl", "sissl", 
    "sunpublic", "website", "wxwindows", "zlib", "zope"), class = "factor"), 
    `License Restrictiveness` = structure(c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("Highly Restrictive", 
    "Permissive", "Restrictive", "Unknown"), class = "factor")), .Names = c("Project ID", 
"Project License", "License Restrictiveness"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDUxNC5ncm91cHM=", class = "base64"), indicatorName = structure("cHJqTGljZW5zZQ==", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("2", 
"3", "7", "11", "12", "14"), `Latest Release` = c("Snapshots", 
"7.5", "gedit 0.9.5", "r2-00", "0.9.7", "dhiggen_merge-5.0"), 
    `Project Maturity` = structure(c(NA, 3L, 1L, 3L, 1L, 3L), .Label = c("Alpha/Beta", 
    "Stable", "Mature"), class = "factor")), .Names = c("Project ID", 
"Latest Release", "Project Maturity"), SQL = structure("ClNFTEVDVCBmcC5ncm91cF9pZCwgTUFYKGZyLm5hbWUpCkZST00gc2YwNTE0LmZyc19wYWNrYWdlIGZwLCBzZjA1MTQuZnJzX3JlbGVhc2UgZnIsIHNmMDUxNC5mcnNfc3RhdHVzIGZzCldIRVJFIGZwLnBhY2thZ2VfaWQgPSBmci5wYWNrYWdlX2lkCkdST1VQIEJZIGZwLmdyb3VwX2lk", class = "base64"), indicatorName = structure("cHJqTWF0dXJpdHk=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgTGF0ZXN0IFJlbGVhc2U=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`NA` = c("1660372", 
"1590394", "1590772", "85777", "1591062", "1591181"), `NA` = c("0", 
"0", "0", "0", "0", "0")), .Names = c(NA_character_, NA_character_
), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgdXNlX3dpa2kKRlJPTSBzZjA1MTQuZ3JvdXBz", class = "base64"), indicatorName = structure("cHViUm9hZG1hcA==", class = "base64"), resultNames = structure("TkE=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(ID = c("141", "66", 
"55", "45", "75", "80"), `Software Type` = c("Clustering", "Database", 
"Desktop", "Development", "Financial", "Games")), .Names = c("ID", 
"Software Type"), SQL = structure("ClNFTEVDVCB0cm92ZV9jYXRfaWQsIGRlc2NyaXB0aW9uCkZST00gc2YwNTE0LnRyb3ZlX2Zyb250cGFnZQ==", class = "base64"), indicatorName = structure("c29mdHdhcmVUeXBl", class = "base64"), resultNames = structure("SUQsIFNvZnR3YXJlIFR5cGU=", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"), structure(list(`Project ID` = c("142", 
"129", "120", "119", "107", "106"), `User Community Size` = c("153237", 
"3299", "135710", "16249", "6042", "2508")), .Names = c("Project ID", 
"User Community Size"), SQL = structure("ClNFTEVDVCBncm91cF9pZCwgZG93bmxvYWRzCkZST00gc2YwNTE0LnN0YXRzX3Byb2plY3RfYWxs", class = "base64"), indicatorName = structure("dXNlckNvbW11bml0eVNpemU=", class = "base64"), resultNames = structure("UHJvamVjdCBJRCwgVXNlciBDb21tdW5pdHkgU2l6ZQ==", class = "base64"), row.names = c(NA, 
6L), class = "data.frame"))

# Merging Option 1

flossData <- data.frame(dataSets[[1]][1])

# merge all loaded datasets by common column ("Project ID")
silent <- lapply(seq(2, length(dataSets) - 1),
                 function(i) {merge(flossData, dataSets[[1]][i],
                                    by = "Project ID",
                                    all.y = TRUE)})

# Merging Option 2

#flossData <- Reduce(function(...) 
#  merge(..., by.x = "row.names", by.y = "Project ID", all = TRUE),
#  dataSets)

# Additional Transformations

# convert presence of Repo URL to integer
flossData[["Repo URL"]] <- as.integer(flossData[["Repo URL"]] != "")

# convert License Restrictiveness' factor levels to integers
#flossData[["License Restrictiveness"]] <- 
#  as.integer(flossData[["License Restrictiveness"]])

# convert User Community Size from character to integer
flossData[["User Community Size"]] <- 
  as.integer(flossData[["User Community Size"]])

# remove NAs
#flossData <- flossData[complete.cases(flossData[,3]),]
rowsNA <- apply(flossData, 1, function(x) {any(is.na(x))})
flossData <- flossData[!rowsNA,]

Environment:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plspm_0.4.1   turner_0.1.7  tester_0.1.7  diagram_1.6.2 shape_1.4.1   amap_0.8-12  

loaded via a namespace (and not attached):
[1] tools_3.1.1

UPDATE 3:

An attempt to merge data frames, using reshape package (reshape::merge_all(dataSets)), resulted in the following error message: Error: cannot allocate vector of size 332.8 Gb. This is quite strange, considering that the total size of R objects stored in that directory and being merged is only 4.3 MB.

An attempt to merge data fames, using plyr package (plyr::join_all(dataSets)), resulted in the following error message: Error in ``[.data.frame``(x, by) : undefined columns selected. This seems to match the error message in the Merging Option 2.

Aleksandr Blekh
  • 2,462
  • 4
  • 32
  • 64
  • 2
    you're attempting to merge dataSets[i] with *nothing* – rawr Aug 03 '14 at 04:01
  • @rawr: I thought that `merge(, A)` results in `A`... – Aleksandr Blekh Aug 03 '14 at 04:13
  • Just tried `Reduce(function(...) merge(..., all=T), dataSets)` instead of `lapply/merge`, but, unfortunately, result is the same. – Aleksandr Blekh Aug 03 '14 at 04:30
  • @farnsy: I just tried your advice, but, unfortunately, with no effect... – Aleksandr Blekh Aug 03 '14 at 04:36
  • 1
    Yeah, so did I---probably should have done that first. Looks like `merge()` will not work when one of the data frames is empty. Seems like poor design by the authors to me. Better use an `if()` to isolate the initial case and just return the table instead of merging if `NROWS(flossData)==0`. – farnsy Aug 03 '14 at 04:41
  • 1
    Alternately, just stick something in flossData at initialization. `flossData<-data.frame(tmp=NA)` would be enough. Definitely mind the `all.*` parameters as well, of course. – farnsy Aug 03 '14 at 04:43
  • 2
    if you have a list of data sets, `Reduce(function(...) merge(..., by = 'Project ID', all = TRUE), dataSets)` should work, just dont use the empty one. Or use a merge function for multiple data sets, something like `join_all` from plyr (but you may need to use list2env to make each an object in your workspace) – rawr Aug 03 '14 at 10:55
  • @rawr: Thanks a lot! I will try your recommendations - the problem seems to be a combination of several different issues within the same code. Will keep you posted on the progress. – Aleksandr Blekh Aug 03 '14 at 11:53
  • Updated the question with fixed code, as well as error and debugging output. Advice is appreciated! – Aleksandr Blekh Aug 03 '14 at 14:35
  • @rawr: Updated the question with fixed code, as well as error and debugging output. Advice is appreciated! – Aleksandr Blekh Aug 03 '14 at 15:03
  • I'm wondering whether people, downvoting my question, are brave enough to publicly express their opinion on the reasons: http://meta.stackexchange.com/questions/135/encouraging-people-to-explain-downvotes. – Aleksandr Blekh Aug 03 '14 at 15:54
  • 3
    it's hard to tell without the data you are working with. The error effectively says you are `mtcars[ , NA]` trying to select an `NA` column in your data. Does the code break at Reduce if you run it line-by-line? Reduce works left to right, so the first data is merged (by row.names) with the second (by project ID), then that resulting data is merged (by row.names) with the third (by project ID), and so on. Would that cause errors? – rawr Aug 03 '14 at 15:54
  • 2
    youre probably getting downvotes because you haven't provided a reproducible example so that we can run your code, get your errors, and see what's going wrong. As of now, we can only speculate, and that's not an efficient way to debug code. – rawr Aug 03 '14 at 15:56
  • @rawr: Thank you for your comments. Re `Reduce`: I will check. Re "reproducible examples": I understand the desirability of REs and, when possible, I gladly provide them. In this case, however, I wasn't sure, whether `dput(head(A))` would be appropriate, if `A` is a list of data frames. – Aleksandr Blekh Aug 03 '14 at 16:03
  • 1
    @AleksandrBlekh I am was trying to reproduced your example and first thing that appears wrong (at least in your example) is that is something wrong with the data.frames you are trying to merge. So `for (i in seq(2, length(dataSets) - 1)){print(dataSets[[1]][i])` will break in i=3 because the data.frame you are calling is empty. Also, when working it is returning single column data.frames, which I find odd considering that you are trying to merge something. } – fridaymeetssunday Aug 05 '14 at 07:49
  • @fridaymeetssunday: Thank you for the feedback. Single column data frames are normal IMHO, as these data frames are result of SQL queries, some of which are designed to return a single column. In regard to the problem of an empty data frame at `i=3`, it might be caused by the way I generate reproducible data, applying the following to a list of data frames: `dput(lapply(dataSets, head))`. It seems to work, but I'm not 100% sure about this. I could check whether data frame is really empty in my real data, if I'd know its name. Could you tell me the name? – Aleksandr Blekh Aug 05 '14 at 08:22

2 Answers2

2
lapply(dataSets, names)
pids = which(sapply(dataSets, FUN=function(x) { 'Project ID' %in% names(x) }))

acc = dataSets[[pids[1]]]

for (id in pids[2:length(pids)]) {
  acc = merge(acc, dataSets[[id]], by='Project ID', all=T)
}

Assumptions I had to make:

  • not all data sets from dataSet have Project ID column, so assumed you need to join only those which have it. So first I find them and put their indexes to pids
  • since there are data sets that don't have any project ids in common, I assumed you want to perform outer join - all=T flag in merge

Regarding your question about what way of joining to use, I'd say it's not a good idea to do this in R. I'd use an RDBMS whenever possible to do joins and other things before loading data to R

Alexey Grigorev
  • 2,415
  • 28
  • 47
  • Thank you, Alexey! I have already fixed my import procedure to make sure most datasets have `Project ID` column. Regardless of that, your code is nice and useful (should I need to merge with datasets lacking that column). In regard to using a DB server to perform most operations, I understand that, but I do it that way for two reasons: 1) DB server for one of the main data sources is very slow, so any more than trivial operations make it crawl; 2) based on the nature of my research, I want to have a freedom to decide when and what data to analyze. Again, thank you for help. Upvoting. – Aleksandr Blekh Aug 06 '14 at 08:15
  • Hmm... Strange thing - when testing, your code generates an error message `Error in which(sapply(dataSets, FUN = function(x) {: argument to 'which' is not logical`. The code looks fine to me. Am I missing something? (dataSets is a list of data frames) – Aleksandr Blekh Aug 06 '14 at 09:19
  • With the provided data it generates no error. Could there, in `dataSets`, be something that is not a data frame? – Alexey Grigorev Aug 06 '14 at 15:50
  • Thanks, I will check that. – Aleksandr Blekh Aug 06 '14 at 20:44
  • Just tested. You're right, the code works fine with my minimal reproducible example. I will have to figure out the issues or specifics in my real data that generate the error. – Aleksandr Blekh Aug 06 '14 at 21:41
  • Finally, I have figured out the sources of the errors. There were two issues with my code: first was that I missed one variable, when was changing their names, second was the presence of extra code non-related to merging. Had to write a test client for that, but it was worth it. – Aleksandr Blekh Aug 07 '14 at 03:51
0

A possible error is that one column name of the dataframe is NA.

bixiou
  • 124
  • 2
  • 10