
There is this 8GB CSV file on my disk. It has one "match" per row.

A "match" consists of some data like id, date, and winner. But it also has 10 players in it with all their data. Those are stored in participants.0.stats.visionScore, participants.1.stats.visionScore, ..., participants.0.stats.assists, ..., participants.9.stats.assists, ... i think you get the pattern. It's just participants.{number}.stats.{variable_name}. Every participant has literally hundreds of stats; that's why I have about 4,000 columns total.
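To make the scheme concrete, here is a small sketch that generates column names of that shape (the stat names are invented examples; the real file has hundreds per participant):

```r
# Sketch of the naming scheme only -- "visionScore" and "assists" stand in
# for the hundreds of real stat names.
stats <- c("visionScore", "assists")
cols <- as.vector(outer(0:9, stats,
                        function(i, s) paste0("participants.", i, ".stats.", s)))
length(cols)   # 10 participants x 2 stats = 20 columns
```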

I read the data in like this:

> d <- fread("Matches.csv")
> head(d)
   participants.1.stats.totalDamageDealt
1:                                118504
2:                                 20934
3:                                 76639
4:                                123932
5:                                160561
6:                                237046
   participants.8.stats.totalDamageTaken participants.9.stats.totalPlayerScore
1:                                 18218                                     0
2:                                 12378                                     0
3:                                 46182                                     0
4:                                 19340                                     0
5:                                 30808                                     0
6:                                 36194                                     0
... [there are thousands of lines I omit here] ...

Of course, I now want a representation of the data, where one row corresponds to one participant. I imagine a result like this:

> [magic]
> head(d)
   participant             stats.totalDamageDealt
1:           1                             118504
2:           2                             190143
3:           3                              46700
4:           4                              60787
5:           5                              78108
6:           6                             124761
                  stats.totalDamageTaken                stats.totalPlayerScore
1:                                 18218                                     0
2:                                 15794                                     0
3:                                 34578                                     0
4:                                 78771                                     0
5:                                 16749                                     0
6:                                 11540                                     0
...

But all the methods out there, like melt, cast, and reshape, seem to require naming all the columns by hand. Even with `patterns` for melt I end up naming all of the hundreds of columns per participant. Is there really no way to reshape this data into long format in R?

Matteo B.
  • Something like `library(data.table); setDT(d); melt(d, measure = patterns("^participants"), value.name = "participant")`. Then you can extract participant number and stat into their own columns with Regex. Then `dcast` using `stat` and `value` as your key-value pair, to get to a 1 row per ID result. – Mako212 Jan 17 '19 at 19:39
  • If you provide a sample of your data with `dput(head(d[,1:10],10))` it would be a lot easier to create a working example. – Mako212 Jan 17 '19 at 19:40
  • I would approach this by first gathering all the headers with `tidyr::gather(d, stat, value)`. Then parse out `stat` into two columns, one for the participant and one for the generic stat title. Then `tidyr::spread` based on the stat title. – Jon Spring Jan 17 '19 at 19:43
  • @Mako212 This is the exact output when I execute your command: https://pastebin.com/PkMeTaAX – Matteo B. Jan 17 '19 at 20:22
  • Here are the complete first 2 rows of my data in CSV format: https://pastebin.com/0rsuATXP – Matteo B. Jan 17 '19 at 20:32
  • @JonSpring the output for `tidyr::gather(d, stat, value)` is `Fehler: kann Vektor der Größe 21.1 GB nicht allozieren` which is German for `Error: cannot allocate vector of size 21.1 GB`. – Matteo B. Jan 17 '19 at 20:36

3 Answers


I'm not 100% sure I understand how the data is laid out, but I think I have it. From the example data it looks like participant 1 has multiple rows of data for totalDamageDealt from the raw data and that the result doesn't require aggregation. If that's not true there may be different steps required. I had to make my own sample data to try to run this. If you want to post a minimal set of data that covers all possibilities it would be helpful.

Otherwise, here are some methods for making the data completely long, extracting the participant information, and then making it wide again in the format you want. If you need any aggregation when widening the data, that would happen in the `dcast` step.

library(data.table)
library(stringr)

# Create example data
dt <- data.table(participant.1.stats.visionScore = c(1,1.1,1.2,1.3,1.4,1.5),
           participant.1.stats.totalDamageDealt = c(7.1,8.1,9.1,10.1,11.1,12.1),
           participant.2.stats.visionScore = c(2,2.1,2.2,2.3,2.4,2.5),
           participant.2.stats.totalDamageDealt = c(7.2,8.2,9.2,10.2,11.2,12.2))

# Make data totally long (not wide at all)
dt <- melt(dt,measure.vars = names(dt))

# Separate participant and stat details into columns
dt[,participant := variable %>% str_extract("(?<=^participant\\.)\\d+")]
dt[,stat := variable %>% str_extract("(?<=\\.stats\\.).+")]

# Remove variable for cleanup
dt[,variable := NULL]

# Create an index to form a unique key, so we can dcast without aggregating
dt[,index := 1:.N, by = list(participant,stat)]

# dcast to make the data wide again
dt <- dcast(dt,index + participant ~ stat, value.var = "value")

# Sort to make it easier for a human to view the table
dt <- dt[order(participant)]

#     index participant totalDamageDealt visionScore
# 1:      1           1              7.1         1.0
# 2:      2           1              8.1         1.1
# 3:      3           1              9.1         1.2
# 4:      4           1             10.1         1.3
# 5:      5           1             11.1         1.4
# 6:      6           1             12.1         1.5
# 7:      1           2              7.2         2.0
# 8:      2           2              8.2         2.1
# 9:      3           2              9.2         2.2
# 10:     4           2             10.2         2.3
# 11:     5           2             11.2         2.4
# 12:     6           2             12.2         2.5
Adam Sampson

Okay, using your provided sample of data:

library(data.table)

setDT(d) 

d <- melt(d, measure = patterns("^participants"), value.name = "value")

d <- d[,  `:=` (ID = gsub(".*?\\.(\\d+)\\..*","\\1", variable),
                stats = gsub(".*?(stats\\..*)$","\\1", variable))
  ][, .(variable, value, ID, stats)]
d <- dcast(d, ID ~ stats, value.var= "value", fun.aggregate = sum)

Edit: re-wrote this as a data.table only solution for speed

Note that you have some additional columns in your source data like participantIdentities.6.player.accountId that you don't address, so I've simply excluded them. If they need to be included, you'd either add those to patterns, or id.vars in melt.

One note: all the values you cast must be numeric, otherwise dcast will fail. I believe this will be a problem with your full data set. This means you need to either properly identify columns like participants.1.highestAchievedSeasonTier as id.vars in melt, or otherwise exclude them from dcast.
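To sidestep that, one option is to keep only the numeric columns before melting; a sketch on a toy table (the column names here are invented, and `d` is assumed to already be a data.table):

```r
library(data.table)

# Toy stand-in for the real data: one character column of the kind
# that would break dcast with a numeric aggregate
d <- data.table(gameId = 1:2,
                participants.0.stats.visionScore = c(10, 20),
                participants.0.highestAchievedSeasonTier = c("GOLD", "SILVER"))

num_cols <- names(d)[sapply(d, is.numeric)]
d_num <- d[, ..num_cols]   # select by a character vector, data.table style
names(d_num)
# "gameId" "participants.0.stats.visionScore"
```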

Resulting in (I'm just pasting the first 4 columns of many)

  ID participants.4.timeline.xpPerMinDeltas.20-30 stats.goldEarned stats.perk3Var1
1  1                                            0                0               0
2  4                                           NA                0            3475
3  7                                            0                0               0
4  8                                            0                0               0
5  9                                            0           105872               0
Mako212
  • On paper your solution looks cool, but even with only 50k of my 1,000k rows, the line `d <- melt(d, measure = patterns("^participants"), value.name = "value")` needs more than 30GB RAM! R just shuts down. Maybe this is because it tries to convert all columns to type 'character' (which is necessary, as some of them really are of that type). – Matteo B. Jan 17 '19 at 22:15
  • Hmm, try dropping the non-numeric columns, just to see if that fixes the memory issue: `l1 <- unlist(lapply(d, is.numeric)); d <- d[, l1]` – Mako212 Jan 17 '19 at 22:24
  • Yea, good idea, but it still keeps allocating that much RAM. Of course it doesn't say anymore that it converts to type 'character', so I guess that wasn't the problem in the first place. – Matteo B. Jan 17 '19 at 22:44
  • There's some good existing answers on dealing with large data sets, using either `data.table` or other libraries. I'd recommend taking a look at those, given that it seems like the main issue is memory, rather than the actual data manipulation process. You could also consider spinning up an AWS EC2 instance to run a `data.table` solution on a machine with plenty of RAM to allocate the vector – Mako212 Jan 17 '19 at 23:02
  • @Matmarbon also, check that `memory.limit()` is close to your total installed RAM, if it's low for some reason, that could be causing the issue. It should return a value in MB – Mako212 Jan 17 '19 at 23:33
  • You are right... with (in comparison) extremely little data and few columns, this solution works – and that is all I can expect for now. The other part of the problem is solvable by some Google research. So thank you for solving this part :) – Matteo B. Jan 17 '19 at 23:41

I have found an answer that works pretty efficiently even with this huge amount of data. In fact, I suspect it is about as efficient as it gets in R for this scenario:

cn <- names(d)
# All columns belonging to the ten participants (escape the dot, otherwise
# it is a regex wildcard and could match unintended columns)
pc <- grep("^participants\\.", cn, value = TRUE)
# Stat names of one participant, with the "participants.N." prefix stripped
ppcn <- substring(pc[1:(length(pc)/10)], 16)
d_long <- reshape(d, direction='long', varying=pc, timevar='participant',
                  times=paste0('participants.', 0:9), v.names=ppcn)

The idea behind it is to craft the arguments to the reshape function with a few additional lines of code, so that R knows which columns I'm actually talking about.

With this solution my long d (no pun intended) is created (1) in a single step, without temporary and potentially large intermediate tables, and (2) without type conversions, so columns of all types are preserved.
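On a toy table (two participants and two stats instead of ten and hundreds) the same argument-crafting idea can be sketched as follows; passing `varying` as a list, one element per stat, makes the column grouping explicit:

```r
# Toy wide table in the same layout as the question
d <- data.frame(id = 1:2,
                participants.0.stats.assists     = c(5, 7),
                participants.0.stats.visionScore = c(10, 20),
                participants.1.stats.assists     = c(3, 9),
                participants.1.stats.visionScore = c(15, 25))

stats <- c("stats.assists", "stats.visionScore")
# One list element per stat, holding that stat's column for every participant
varying <- lapply(stats, function(s) paste0("participants.", 0:1, ".", s))

d_long <- reshape(d, direction = "long", varying = varying,
                  v.names = stats, timevar = "participant", times = 0:1)
# 4 rows: one per (id, participant) pair, each with its own
# stats.assists / stats.visionScore value
```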

Matteo B.