0

I have a large dataset with ~1 mil rows and 8 cols (variables). One of those variables, ORDER, has categories from 1 to 90. I want to create a new data.frame with a reduced number of categories for the variable ORDER (4) 1, 2, 3+ and ALL, where ALL is the sum FREQUENCY for all categories (1-90) and 3+ is the sum of FREQUENCY for categories >=3 (so 3 to 90).

YEAR  PROVINCE  ZONA91OK AGE5 ORDER NATIONALITY_MOTHER NATIONALITY_FATHER FREQUENCY
 1979        1      101   15     1      No computable      No computable        10
 1989        3      102   20     1      No computable      No computable        50

I am very new in data management with R so any help on the issue is very much appreciated!

Here is a sample of the data.frame

mydata<-structure(list(YEAR = c(1981, 1981, 1981, 1981, 1981, 1981, 1981, 
1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 1981, 
1981, 1981, 1981), PROVINCE = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), ZONA91OK = c(101, 101, 101, 
101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 
101, 101, 101, 101, 101), AGE5 = c(15, 20, 20, 25, 25, 25, 25, 
30, 30, 30, 30, 30, 35, 35, 35, 35, 35, 35, 40, 40, 40), ORDER = c(1, 
1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 12, 1, 3, 5), 
NATIONALITY_MOTHER = structure(c(9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L
), .Label = c("España", "UE-15 y PD", "Resto Europa", "Magreb", 
"África Sub-sahariana", "Latinoamérica", "Asia", "Resto del Mundo", 
"No computable"), class = "factor"), NATIONALITY_FATHER = structure(c(9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L), .Label = c("España", "UE-15 y PD", "Resto Europa", 
"Magreb", "África Sub-sahariana", "Latinoamérica", "Asia", 
"Resto del Mundo", "No computable"), class = "factor"), FREQUENCY = c(10, 
40, 20, 50, 30, 10, 1, 10, 15, 10, 1, 1, 5, 5, 5, 1, 1, 1, 
1, 1, 1)), .Names = c("YEAR", "PROVINCE", "ZONA91OK", "AGE5", 
"ORDER", "NATIONALITY_MOTHER", "NATIONALITY_FATHER", "FREQUENCY"
 ), row.names = 60175:60195, class = "data.frame")
Ale
  • 303
  • 1
  • 9

1 Answers1

0

If your data has 1M rows, you're likely going to want to use data.table

library(data.table)
myDT <- data.table(mydata, key="ORDER")

specialCats <- c(1, 2, 3)

rbind(
    myDT[, list(SUM_FOR="ALL", FREQ_SUM=sum(FREQUENCY))]
  , myDT[!.(specialCats), list(SUM_FOR="3+", FREQ_SUM=sum(FREQUENCY))]
)

## RESULTS: 
       SUM_FOR FREQ_SUM
1:     ALL      219
2:      3+        7

Update re: comment

To change the ORDER column to your requirements, use:

myDT[, order := ifelse(ORDER %in% specialCats, as.character(ORDER), "3+")]

Note 1: that in order to have 3+ be a value, you need to convert to string.
Note 2: Adding a row for "ALL" does not make much sense, as what would you put for AGE, PROVINCE etc ?

Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • If I understand correctly the **specialCats** are used to exclude the ORDER values 1 2 and 3 from the ALL sum, right? I have run the code, but when I **table(myDT)** there is no **3+** or **ALL** category – Ale Oct 11 '13 at 09:48
  • @Ale RE specialCase, correct. RE table(myDT). If you copy & pasted the code above, observe that the output from rbind is not getting assigned to anything (and that line of code does not alter myData). You need to use something like: `newDT <- rbind(<..>); newDT)` – Ricardo Saporta Oct 11 '13 at 11:36
  • I see... but what I want is completely different. My goal is to obtain a new data.frame with the exact same number of variables, where ORDER does not take values **1 to 90** but only **1, 2, 3+ or ALL** – Ale Oct 11 '13 at 13:45
  • @Ale, what would you expect the rows for `ALL` to contain in the other columns? – Ricardo Saporta Oct 11 '13 at 13:59
  • ALL would be a new category for _ORDER_ containing the number of births for all _ORDER_ (from _FREQUENCY_), while 3plus only those >=3. So far I have tried to _dcast_ the columns from ORDER to work with the columns to produce new variables but it did not work out well. – Ale Oct 15 '13 at 08:12
  • @Ale, My question was what would you expect in _the **other** columns_? Please try the following: Open up excel and create by hand what you would expect as an output. As you start putting that together, it should start to become clear that there are other columns that require information, for which it does not make much sense to simply take a random row as its value. Either way, when you have that Excel file, pleas feel free to post a copy of it in your question – Ricardo Saporta Oct 15 '13 at 14:46
  • not sure how to attach an excel file, here is the gist link to the txt equivalent output I obtained using SUMIFS to the csv file [https://gist.github.com/anonymous/0988aa70da95c7d60485] – Ale Oct 15 '13 at 16:53
  • link is dead :/ I didnt mean to literally attach the excel file, but rather copy and paste the text into your question in a clean output – Ricardo Saporta Oct 15 '13 at 16:59
  • so sorry for my clumsiness. – Ale Oct 16 '13 at 09:27
  • >`YEAR PROV Z91OK AGE5 ORDER N_M N_F FREQ > 2009 18 180901 15 ALL SPAIN SPAIN 9 > 2009 18 180901 15 ALL SPAIN UE15DC 1 > 2009 18 180901 20 ALL SPAIN SPAIN 7 > 2009 18 180901 20 ALL SPAIN UE15DC 1 > 2009 18 180901 20 ALL LA SPAIN 2 > 2009 18 180901 25 ALL SPAIN SPAIN 27 > 2009 18 180910 30 3plus SPAIN SPAIN 2 > 2009 18 180911 20 3plus SPAIN SPAIN 3 > 2009 18 180911 25 3plus SPAIN SPAIN 1 > 2009 18 180911 25 3plus UE15DC UE15DC 1` I hope this is going to work out. I pasted some rows of the output file. Tried to paste the dput but was too long. – Ale Oct 16 '13 at 09:37
  • `structure(list(YEAR=c(2009L,2009L,2009L,2009L,2009L,2009L ),PROV=c(18L,18L,18L,18L,18L,18L),Z91OK=c(180901L, 180901L,180901L,180911L,180911L,180911L),AGE5=c(15L,15L, 20L,25L,25L,25L),ORDER=structure(c(2L,2L,2L,1L,1L, 1L),.Label=c("3plus","ALL"),class="factor"),N_M=structure(c(4L, 4L,4L,5L,1L,1L),.Label=c("LA","M","ROE","SP", "UE15"),class="factor"),N_F=structure(c(5L,6L,5L,6L, 6L,5L),.Label=c("LA","Magreb","NoC","ROE","SPAIN", "UE15"),class="factor"),FREQ=c(9L,1L,7L,1L,1L,1L )),.Names=c("Y","P","Z91OK","A5","ORDER","N_M", "N_F","FREQ"),row.names=c(1L,2L,3L,150L,151L,152L),class="data.frame")` – Ale Oct 16 '13 at 09:54
  • @ale, I am very sorry, but you are going to need to open a new question – Ricardo Saporta Oct 16 '13 at 13:34