R hanging up when using %in%

Question

I have two moderate-size datasets that I am using in R. I want to check whether the reference numbers in one dataset match the reference numbers in the other and, where they do, fill a column in the second dataset with the value from the corresponding column in the first.

ghi2$state=ifelse(b1$accntnumber %in% ghi2$referencenumber,b1$address,0)

Every time I run this code, RStudio hangs and is unresponsive for a long time. Is it because it's taking time to process the command, or is my command wrong? I am using a system with 2 GB of RAM, so I think that is why R hangs. Should I use the == operator instead of %in%? Would I get the same result?

How big is your dataset? You cannot use `==` if there is more than one element in the `referencenumber` column of `ghi2`, as it does an elementwise comparison — akrun, Jun 30 '16 at 02:48
Are these objects characters or factors? Try `class(ghi2$referencenumber)` — milan, Jun 30 '16 at 02:51
Something seems odd here. I can run this code on a million cases against a million cases in 0.3 seconds without making a dent in my RAM. The issue might be in copying your `ghi2` dataset, if it is monumentally sized. — thelatemail, Jun 30 '16 at 03:04
There are faster ways. See [this post](http://stackoverflow.com/q/33453141/4408538) — Joseph Wood, Jun 30 '16 at 05:32

Hack-R · Answer 1 · 2016-06-30T03:22:38.367

1. Should I use the == operator instead of %in%?

No (!). See #2.

2. Would I get the same result?

No. With ==, the order and position have to match, because it compares the two vectors element by element. Also, see @Akrun's comment.
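To make the difference concrete, here is a minimal sketch with made-up vectors (`x` and `y` are illustrative, not the asker's data). It shows that `==` compares position by position and recycles the shorter vector, while `%in%` tests membership regardless of position.

```r
# Made-up vectors holding the same values in a different order.
x <- c(3, 1, 2)
y <- c(1, 2, 3)

# == compares position by position: 3 vs 1, 1 vs 2, 2 vs 3.
eq <- x == y

# %in% asks whether each element of x occurs anywhere in y.
member <- x %in% y

# With unequal lengths, == recycles the shorter vector (1, 2, 1, 2 here),
# which is rarely what an ID lookup needs.
recycled <- c(1, 2, 3, 4) == c(1, 2)

print(eq)        # FALSE FALSE FALSE
print(member)    # TRUE TRUE TRUE
print(recycled)  # TRUE TRUE FALSE FALSE
```

So swapping `%in%` for `==` would not speed up the original command in any useful way; it would answer a different question.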

3. How to make it faster and/or deal with RStudio freezing

If RStudio freezes, you can save your log files and send them to the RStudio team, who will quickly respond. You could also bring the log files here for help.

Beyond that, general Big Data rules apply. Here are some tips:

- Try data.table.
- Try it on the command line instead of in RStudio.
- Watch your Resource Monitor (or whatever you use to monitor resources) and observe the memory and CPU usage.
- If it's a RAM issue, you can:

  a. use a cloud account to get more RAM

  b. buy some more RAM (just sayin')

  c. use 64-bit R and increase the RAM available to R to its max if it's not already
- If it's a CPU issue, you can consider parallelization.
- If any of these IDs are repeated (and this makes sense in the context of your specific use case), you can use `unique` to avoid redundant comparisons.
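As a rough illustration of the `unique` tip, the sketch below does the lookup in base R with `match()`. This is a different technique from the `ifelse(... %in% ...)` call in the question, which returns one value per row of `b1` rather than per row of `ghi2`. The data frames are small made-up stand-ins that reuse the question's column names, and it assumes unmatched rows should get `NA` instead of 0.

```r
# Made-up stand-ins for the asker's data frames (values are illustrative).
b1 <- data.frame(accntnumber = c(101, 102, 103, 101),
                 address = c("NY", "CA", "TX", "NY"),
                 stringsAsFactors = FALSE)
ghi2 <- data.frame(referencenumber = c(103, 999, 101))

# Drop repeated IDs first so the lookup table is as small as possible.
lookup <- unique(b1[, c("accntnumber", "address")])

# For each referencenumber, match() gives the row of lookup that holds it,
# or NA when the ID is absent.
idx <- match(ghi2$referencenumber, lookup$accntnumber)

# Indexing by idx yields one address per row of ghi2, NA where nothing matched.
ghi2$state <- lookup$address[idx]

print(ghi2)
```

If 0 is really wanted for unmatched rows, those `NA` entries can be replaced afterwards, though that mixes numbers into a character column.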

There are lots of other tips you can find in pre-existing Big Data Q&As on SO as well.
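For the data.table tip above, one possible shape is an update join, sketched here under the assumption that the data.table package is installed. The tables are made-up stand-ins that reuse the question's column names, and this is only one way to do it, not necessarily what the answer had in mind.

```r
# Assumes the data.table package is installed; the data are made up.
library(data.table)

b1 <- data.table(accntnumber = c(101, 102, 103),
                 address = c("NY", "CA", "TX"))
ghi2 <- data.table(referencenumber = c(103, 999, 101))

# Update join: for rows of ghi2 whose referencenumber matches an
# accntnumber in b1, set state to that row's address. The column is
# added by reference, so ghi2 is not copied. Unmatched rows stay NA.
ghi2[b1, on = .(referencenumber = accntnumber), state := i.address]

print(ghi2)
```

Because `:=` modifies `ghi2` in place, this also sidesteps the copying cost raised in the comments.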
