
I have two moderate-size datasets that I am using in R. I want to check whether each reference number in one dataset matches a reference number in the other and, if so, fill a column in the second dataset with the value from the corresponding column in the first.

ghi2$state=ifelse(b1$accntnumber %in% ghi2$referencenumber,b1$address,0)

Every time I run this code, RStudio hangs and is unresponsive for a long time. Is that because the command takes a long time to process, or is my command wrong? I am on a system with 2 GB of RAM, which I think is why R hangs. Should I use the == operator instead of %in%? Would I get the same result?

1 Answer


1. Should I use the == operator instead of %in%?

No (!). See #2.

2. Would I get the same result?

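No. With ==, the order and position have to match: it compares the two vectors element by element, whereas %in% checks whether each value appears anywhere in the other vector. Also, see @Akrun's comment.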
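To make the difference concrete, here is a minimal sketch in base R. The vectors acct and ref are made-up toy data, not the asker's datasets. It also shows match(), which is a separate base R function (not something used in the question) that returns the position of each match and is one way to copy a value across once a match is found.

```r
# Hypothetical toy vectors standing in for the two ID columns
acct <- c(101, 102, 103)
ref  <- c(103, 101, 999)

# == compares element by element, so only same-position pairs are tested
by_position <- acct == ref

# %in% asks whether each value of acct appears anywhere in ref
by_membership <- acct %in% ref

# match() returns WHERE each value of ref is found in acct (NA if absent),
# which is what a lookup needs in order to pull the corresponding value
pos <- match(ref, acct)

print(by_position)
print(by_membership)
print(pos)
```

With these toy vectors, == finds no matches at all even though two of the IDs are shared, because none of them sit in the same position.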

3. How to make it faster and/or deal with RStudio freezing

If RStudio freezes, you can save your log files and send them to the RStudio team, who will respond quickly. You could also bring the log files here for help.

Beyond that, general Big Data rules apply. Here are some tips:

  1. Try data.table
  2. Try it on the command line instead of RStudio
  3. Watch your Resource Monitor (or whatever you use to monitor resources) and observe the memory and CPU usage
  4. If it's a RAM issue, you can:

    a. use a cloud account to get more RAM

    b. buy some more RAM (just sayin')

    c. use 64-bit R and increase the RAM available to R to its max if it's not already

  5. If it's a CPU issue you can consider parallelization

  6. If any of these IDs are repeated (and this makes sense in the context of your specific use case), you can use unique to avoid redundant comparisons
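Tips 1 and 6 can be sketched together as follows. This is only an illustration under assumptions: the data.table package must be installed, the tables are made-up toy data that reuse the column names from the question, and the intent is assumed to be "copy b1's address into ghi2 wherever the numbers match". Unmatched rows come out as NA here rather than the 0 used in the question.

```r
library(data.table)  # assumes the data.table package is installed

# Hypothetical toy data reusing the question's column names
b1 <- data.table(accntnumber = c(101, 102, 103, 103),
                 address     = c("NY", "CA", "TX", "TX"))
ghi2 <- data.table(referencenumber = c(103, 999, 101, 103))

# Tip 6: keep one row per account number so each lookup is unambiguous
b1u <- unique(b1, by = "accntnumber")

# Tip 1: update join. For every ghi2 row whose referencenumber equals an
# accntnumber in b1u, write that row's address into a new column, state.
# ghi2 is modified by reference, so no full copy of the table is made.
ghi2[b1u, on = .(referencenumber = accntnumber), state := i.address]

print(ghi2)
```

Because the join assigns by reference, it avoids building the intermediate vectors that ifelse creates, which may matter on a machine with little RAM.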

There are lots of other tips in existing Big Data Q&As on SO as well.

Hack-R