17

I am sure there is a simple solution to this, but i am going nuts trying to find it. Any help is very much appreciated.

I have a data frame with 2 columns; "pro" and "pep". pro is formatted as factors and contains entries in the form 220;300;4 sometimes more numbers (seperated by ";") and sometimes just a single number (and no ";"). The pep column is formatted as integers and contains single numbers, e.g. 20. What i would like to do is to "expand" e.g. the row pro: 220;300;4 and pep: 20 to three rows one with pro: 220 and pep: 20, one with pro: 300 and pep: 20 and one with pro: 4 and pep: 20.

I want to do this for the whole data frame and thus end up with a data frame with two character formatted columns where all the rows originally containing multiple ";" seperated numbers have been expanded.

I would prefer to avoid loops since the data frame is fairly large (>100000 rows)

I am sorry that i havent been able to post this in a more case-representative way...i am new here and got lost in the code format.

df <- data.frame(
  Protein.Group.IDs = factor(
    c("951", "132", "314;658", "268", "948", "194", "205", "787", "709", "904"),
    levels = c(
      "0", "1", "10", "100", "101", "102", "103", "104", "105", "106", "107", "108",
      "109", "11", "110", "111", "112", "113", "114", "114;680", "115", "116",
      "117", "118", "119", "12", "120", "121", "121;920;530", "121;920;530;589",
      "121;920;530;589;934", "121;920;589", "121;920;934", "122;351", "122;351;950",
      "122;351;950;224;904", "122;351;950;687", "122;901;224;904", "122;901;351",
      "122;901;351;950", "122;901;351;950;224", "122;901;351;950;224;890;904",
      "122;901;351;950;224;890;904;687", "122;901;351;950;890;687", "122;901;950",
      "122;901;950;904;687", "122;950", "123", "124", "125", "126", "127",
      "127;952", "128", "129", "13", "130", "131", "131;204", "132", "133", "134",
      "135", "136", "137", "138", "139", "14", "140", "140;259;436", "141", "142",
      "143", "144", "145", "146", "147", "148", "149", "15", "150", "151", "152",
      "153", "154", "155", "156", "157", "158", "159", "16", "16;331", "16;331;329",
      "16;331;329;62", "16;331;329;910", "16;331;329;910;62", "16;331;62",
      "16;331;910", "160", "161", "162", "163", "164", "165", "166", "166;743",
      "167", "167;595", "168", "169", "17", "170", "170;48", "171", "172", "173",
      "174", "175", "176", "177", "178", "179", "18", "180", "181", "182", "183",
      "184", "185", "186", "187", "188", "188;813", "188;813;852", "189", "19",
      "19;14", "19;6;9;14;11", "19;884;6;9;14;20;26;11;1", "19;9", "19;9;14", "190",
      "190;260", "191", "192", "193", "194", "195", "196", "197", "198", "199", "2",
      "20", "20;26", "200", "201", "202", "203", "204", "205", "206", "207", "208",
      "209", "21", "21;4", "210", "211", "212", "213", "214", "215", "216", "217",
      "218", "219", "22", "220", "221", "222", "223", "224", "224;890",
      "224;890;904", "225", "225;221", "225;221;308", "225;295", "226", "227",
      "228", "228;396", "228;396;73", "228;73", "229", "23", "23;137", "23;17;137",
      "230", "231", "232", "233", "234", "235", "236", "237", "238", "239", "24",
      "240", "241", "242", "242;171", "243", "244", "245", "246", "247", "248",
      "249", "25", "250", "251", "252", "253", "254", "255", "256", "257", "258",
      "259", "26", "260", "261", "262", "263", "264", "265", "266", "267", "268",
      "269", "27", "270", "271", "272", "273", "273;541;905", "273;905", "274",
      "275", "276", "277", "278", "279", "28", "280", "281", "281;192", "282",
      "283", "284", "285", "286", "287", "288", "289", "29", "290", "291", "292",
      "293", "294", "295", "296", "297", "298", "299", "3", "30", "300", "301",
      "302", "303", "304", "304;770", "305", "306", "307", "308", "309", "31",
      "310", "311", "312", "313;293", "314", "314;658", "315", "316", "317", "318",
      "319", "32", "320", "321", "322", "323", "324", "324;34;564;637;282;229;565",
      "324;564;282", "324;637;229;565", "325", "326", "327", "328", "328;586",
      "329", "33", "330", "331", "332", "333", "334", "335", "336", "337", "338",
      "339", "34", "340", "341", "342", "343", "344", "345", "346", "346;523",
      "347", "348", "349", "35", "350", "351", "351;890", "352", "353", "353;277",
      "354", "355", "356", "357", "358", "359", "36", "360", "361", "362", "363",
      "364", "365", "366", "367", "368", "369", "37", "370", "371", "372", "373",
      "374", "375", "376", "377", "377;938", "378", "379", "38", "380", "381",
      "382", "382;147", "383", "384", "385", "386", "387", "388", "389", "39",
      "39;417", "390", "391", "392", "393", "394", "395", "396", "397", "398",
      "399", "399;955", "4", "40", "400", "401", "402", "403", "404", "405", "406",
      "407", "408", "409", "41", "410", "411", "412", "413", "414", "415", "416",
      "417", "418", "419", "42", "420", "421", "422", "423", "424", "424;640",
      "425", "426", "427", "427;930", "428", "429", "43", "430", "431", "432",
      "433", "434", "435", "436", "437", "438", "438;178", "439", "44", "440",
      "441", "442", "443", "444", "445", "446", "447", "448", "449", "45", "450",
      "451", "452", "453", "454", "455", "456", "457", "458", "459", "46", "460",
      "461", "462", "463", "464", "465", "466", "467", "468", "469", "47", "470",
      "471", "472", "473", "474", "475", "476", "477", "478", "479", "48", "480",
      "481", "482", "483", "484", "485", "486", "487", "488", "488;648", "489",
      "49", "490", "491", "492", "493", "494", "495", "496", "497", "498", "499",
      "5", "50", "500", "501", "502", "503", "504", "505", "506", "507", "508",
      "509", "51", "510", "511", "512", "513", "514", "515", "516", "516;603;845",
      "516;603;845;837", "517", "518", "519", "52", "520", "521", "522", "523",
      "524", "525", "526", "527", "527;509", "528", "529", "53", "530", "531",
      "532", "533", "534", "535", "536", "537", "538", "539", "54", "540", "540;67",
      "541", "542", "543", "544", "545", "546", "547", "548", "549", "55", "550",
      "550;549", "551", "552", "553", "554", "555", "556", "557", "558", "559",
      "56", "560", "561", "562", "563", "564", "564;282", "564;637", "565", "566",
      "567", "568", "568;569", "568;569;286", "568;569;574", "568;569;574;286",
      "568;574", "569", "57", "570", "571", "572", "573", "574", "575", "576",
      "577", "578", "579", "579;577;578", "579;577;580", "579;577;580;578", "58",
      "580", "581", "582", "583", "584", "585", "585;609", "586", "587", "587;167",
      "587;167;595", "587;167;595;557", "588", "589", "59", "590", "591", "592",
      "593", "594", "595", "596", "597", "598", "599", "6", "60", "600", "601",
      "601;10", "602", "603", "604", "605", "606", "607", "608", "609", "61", "610",
      "611", "612", "613", "614", "615", "615;269", "615;926;269", "616", "617",
      "618", "619", "62", "620", "621", "622", "623", "624", "625", "626", "627",
      "628", "629", "63", "63;397", "630", "631", "632", "633", "634", "635", "636",
      "637", "638", "639", "64", "64;72", "640", "641", "642", "643", "643;529",
      "644", "645", "646", "647", "648", "649", "65", "650", "651", "652", "653",
      "654", "655", "656", "657", "658", "659", "66", "660", "661", "662", "663",
      "663;819", "664", "665", "666", "667", "668", "669", "67", "670", "671",
      "672", "673", "674", "675", "676", "677", "678", "679", "68", "680", "681",
      "681;97", "682", "683", "684", "685", "686", "687", "688", "689", "69", "690",
      "691", "692", "693", "694", "695", "696", "697", "698", "699", "7", "7;25;5",
      "7;752", "7;752;24", "7;752;25;24;8", "70", "700", "701", "702", "703", "704",
      "705", "706", "707", "708", "709", "71", "710", "711", "712", "713", "714",
      "715", "716", "717", "718", "719", "72", "72;746;944", "72;746;944;772",
      "72;772", "72;927", "720", "721", "722", "723", "724", "725", "726", "727",
      "728", "729", "73", "730", "731", "732", "733", "734", "735", "735;522",
      "735;665", "735;665;522", "735;665;876", "735;876", "735;876;522", "736",
      "737", "738", "739", "74", "740", "741", "742", "743", "744", "745", "746",
      "746;944", "746;944;772", "747", "748", "749", "75", "750", "751", "752",
      "752;24", "753", "754", "755", "756", "757", "758", "759", "76", "76;313",
      "76;313;293", "760", "761", "762", "763", "764", "765", "766", "767", "768",
      "769", "77", "770", "771", "772", "773", "774", "775", "776", "777", "778",
      "779", "78", "780", "781", "782", "783", "784", "785", "786", "787", "788",
      "789", "79", "790", "790;552", "791", "792", "793", "793;863", "794", "795",
      "796", "797", "798", "799", "8", "80", "800", "801", "802", "803", "804",
      "805", "806", "807", "808", "808;21", "809", "81", "810", "811", "812", "813",
      "814", "815", "815;413", "815;777", "815;777;339", "815;777;838", "815;838",
      "816", "817", "818", "818;7;752", "818;7;752;23;25;17;8", "819", "82", "820",
      "821", "822", "823", "824", "824;957", "825", "826", "827", "828", "829",
      "83", "830", "831", "832", "833", "834", "835", "836", "837", "838", "839",
      "84", "840", "841", "842", "843", "844", "845", "846", "847", "847;560;590",
      "848", "849", "85", "850", "850;817", "851", "852", "853", "853;420", "854",
      "855", "856", "857", "858", "858;638", "858;638;409", "859", "86", "860",
      "861", "861;593", "862", "863", "864", "865", "866", "867", "868", "869",
      "869;614", "87", "870", "871", "872", "873", "874", "875", "876", "877",
      "878", "879", "88", "880", "881", "882", "883", "884", "884;6", "884;6;9",
      "885", "886", "887", "888", "888;189", "889", "89", "890", "890;904", "891",
      "891;953", "892", "892;941", "893", "894", "895", "896", "897", "898", "899",
      "9", "90", "900", "901", "901;224", "902", "903", "904", "905", "906", "907",
      "908", "909", "91", "910", "911", "912", "913", "914", "915", "916", "917",
      "918", "918;947", "919", "92", "920;530;589", "920;530;589;934", "921", "922",
      "923", "924", "924;576", "925", "926", "927", "928", "929", "93", "930",
      "931", "932", "933", "934", "935", "936", "937", "938", "939", "94", "940",
      "941", "942", "943", "944", "945", "946", "947", "948", "949", "95", "950",
      "951", "952", "953", "954", "955", "956", "957", "958", "959", "96", "960",
      "961", "962", "963", "964", "965", "966", "967", "97", "98", "99", "99;392"
    )
  ),
  Mod..Peptide.ID = c(23L, 24L, 25L, 26L, 27L, 29L, 30L, 31L, 32L, 33L),
  row.names = c(318L, 344L, 380L, 406L, 409L, 417L, 436L, 462L, 494L, 505L)
)
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
user2938867
  • 187
  • 1
  • 2
  • 9
  • Mads, edit your question by pasting in the output from running the command `dput( head( mydataframe , 10 ) )` (where `mydataframe` is whatever you data frame is called). It may look like gibberish, but this will give us the code to recreate the first 10 rows of your data.frame in our R sessions directly. – Simon O'Hanlon Oct 30 '13 at 20:45
  • 1
    Best practice should now be `tidyr::unnest`, as per Rory's answer below – geotheory Apr 19 '18 at 14:11

4 Answers4

17

I've grown to really love data.table for this kind of task. It is so very simple. But first, let's make some sample data (which you should provide idealy!)

#  Sample data
set.seed(1)
df = data.frame( pep = replicate( 3 , paste( sample(999,3) , collapse=";") ) , pro = sample(3) , stringsAsFactors = FALSE )

Now we use the data.table package to do the reshaping in a couple of lines...

#  Load data.table package
require(data.table)

#  Turn data.frame into data.table, which looks like..
dt <- data.table(df)
#           pep pro
#1: 266;372;572   1
#2: 908;202;896   3
#3: 944;660;628   2

# Transform it in one line like this...
dt[ , list( pep = unlist( strsplit( pep , ";" ) ) ) , by = pro ]
#   pro pep
#1:   1 266
#2:   1 372
#3:   1 572
#4:   3 908
#5:   3 202
#6:   3 896
#7:   2 944
#8:   2 660
#9:   2 628
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • 2
    very nice and great inspiration for a non-`data.table` user (but soon, very soon...). +1! – Henrik Oct 30 '13 at 21:13
  • 1
    @Henrik thanks. Mads I am glad to hear it works! :-) You you may want to *consider* marking this as accepted to show the question has been answered, by ticking the little green check mark next to the answer. You are **not** obliged to do this, but it helps keep the site clean of unanswered questions. See the [**about**](http://stackoverflow.com/about) and [**FAQ**](http://stackoverflow.com/faq) for more info. – Simon O'Hanlon Oct 30 '13 at 21:25
16

I think tidyr's unnest() is what you're looking for.

df <- tibble::tibble(x = 1:2, y = list(c("a", "b", "c"), c("alpha", "beta")))
df
#> # A tibble: 2 x 2
#>       x y        
#>   <int> <list>   
#> 1     1 <chr [3]>
#> 2     2 <chr [2]>
tidyr::unnest(df, cols = y)
#> # A tibble: 5 x 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     1 b    
#> 3     1 c    
#> 4     2 alpha
#> 5     2 beta

Created on 2019-08-10 by the reprex package (v0.3.0)

Rory Nolan
  • 972
  • 10
  • 15
  • as of Aug. 2019, use tidyr::unnest(df) because it seems that unnest is removed from tidyverse. – Renhuai Aug 09 '19 at 14:30
  • 1
    @Renhuai I just tried it again and didn't have a problem but I changed it anyway just in case. The new answer should work for you now. – Rory Nolan Aug 10 '19 at 14:17
3

You have already obtained a nice answer, but it may be useful to dig around in the R toolbox. Here's an example using a function from the splitstackshape package, concat.split.multiple. As the name suggests it "allows the user to split multiple columns at once". Although there is only one concatenated column to split in the current example, the function is convenient because it allows us to reshape the data to a long format in the same call. Using the minimal data set provided by @SimonO101:

library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = "pep", seps = ";", direction = "long")
df2
#   pro time pep
# 1   1    1 236
# 2   3    1 465
# 3   2    1 641
# 4   1    2  16
# 5   3    2 721
# 6   2    2 323
# 7   1    3 912
# 8   3    3 459
# 9   2    3 283

An id variable ('time') is added to differentiate the multiple items ('pep') that is generated for each group ('pro'). If you wish to remove it, just run subset(df2, select = -time)

Henrik
  • 65,555
  • 14
  • 143
  • 159
  • Hi Henrik. Thank you very much for your answer. I will keep the splitstackshape in mind next time i bump into something like this. – user2938867 Oct 31 '13 at 09:19
2

If the row lists are concatenated strings initially, tidyr::separate_rows is another very convenient method:

tibble::tibble(x = 1:2, y = list("a,b,c", "alpha,beta")) %>% 
    separate_rows(y, sep=",")

# A tibble: 5 x 2
      x y
  <int> <chr>
1     1 a
2     1 b
3     1 c
4     2 alpha
5     2 beta
>
Holger Brandl
  • 10,634
  • 3
  • 64
  • 63