
I have a large data set: around 600,000 rows and 3 columns. After pivoting it will become about 600 rows and 1,000 columns.
A subset of the input data:

Date        Id      Price
2014/12/03  Id1     100
2014/12/03  Id2     120
2014/12/03  Id3     110
2014/12/03  Id4     105
2014/12/02  Id1     150
2014/12/02  Id2     115
2014/12/02  Id3     140
2014/12/02  Id4     135
2014/12/01  Id1     165
2014/12/01  Id2     155
2014/12/01  Id3     185
2014/12/01  Id4     195
2014/11/30  Id1     160
2014/11/30  Id2     170
2014/11/30  Id3     180
2014/11/30  Id4     190

Output data after pivoting:

Date        Id1     Id2     Id3     Id4
2014/12/03  100     120     110     105
2014/12/02  150     115     140     135
2014/12/01  165     155     185     195
2014/11/30  160     170     180     190

As the data set will be large, which approach will be more efficient?
1. Pivot in SQL and return the wide result to R
2. Return the original (long) data set to R and perform the pivot in R

I am using RODBC to fetch the data from an MS SQL Server. Any suggestions?
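For option 2, the pivot itself can be done in base R with `stats::reshape` — a minimal sketch on the sample rows above (`dcast` from reshape2/data.table, mentioned in the comments, is a faster alternative for the full 600,000 rows):

```r
# Long-to-wide pivot of the sample data using base R's reshape()
long <- data.frame(
  Date  = rep(c("2014/12/03", "2014/12/02"), each = 4),
  Id    = rep(c("Id1", "Id2", "Id3", "Id4"), times = 2),
  Price = c(100, 120, 110, 105, 150, 115, 140, 135),
  stringsAsFactors = FALSE
)

wide <- reshape(long, idvar = "Date", timevar = "Id", direction = "wide")
names(wide) <- sub("^Price\\.", "", names(wide))  # Price.Id1 -> Id1, etc.

wide  # a 2 x 5 data frame: Date, Id1, Id2, Id3, Id4
```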

Chaturvedi Dewashish
    Why don't you benchmark the operations yourself? – Roman Luštrik Dec 26 '14 at 08:01
  • You can improve performance in DataTable by setting a column for indexing, using Sort with DataView; that will improve the performance of data retrieval – Sarath Subramanian Dec 26 '14 at 08:07
  • I've just replicated your complete data set (600000 rows and 3 columns) and `dcast(df, Date ~ Id)` completed in about a quarter of a second – David Arenburg Dec 26 '14 at 08:19
  • What about fetching the data from SQL Server? If the pivot is done in SQL, RODBC will fetch about 600000 values, while in the other case it will be 1800000 values. Will that have any impact? – Chaturvedi Dewashish Dec 26 '14 at 08:26
  • That is actually a good question. My guess is that in SQL it won't take much longer than in R, but the transferred data set will be 3 times smaller. Considering that RODBC isn't very efficient, I would definitely recommend doing this in SQL – David Arenburg Dec 26 '14 at 08:32
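The transfer-size argument in the last two comments can be checked with quick arithmetic, using the row and column counts from the question:

```r
# Values transferred over RODBC in each case (from the question's figures)
long_values <- 600000 * 3   # long form: 600,000 rows x 3 columns
wide_values <- 600 * 1000   # pivoted:   600 rows x 1,000 Id columns (plus Date)

long_values / wide_values   # the long form moves 3x as many values
```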

1 Answer


I think that "performance efficient" is not well defined in your case. Without knowing the specs of your machine and more details about the environment that MSSQL is running on, it is impossible to say which would be faster.

It's quite likely that your SQL environment is set up to handle this operation more efficiently than R, assuming that whoever configured the original database knew what they were doing.

So, assuming everything else is relatively equal, do the pivot in SQL before you bring the data into R.
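For the SQL side, a T-SQL `PIVOT` can produce the wide result before it crosses the wire. A sketch of the query built in R for use with RODBC's `sqlQuery` — the table name `Prices`, the DSN, and the fixed four-Id list are placeholders for illustration, not names from the question:

```r
# T-SQL PIVOT query; "Prices" is a hypothetical table holding Date, Id, Price
sql <- "
  SELECT [Date], [Id1], [Id2], [Id3], [Id4]
  FROM (SELECT [Date], Id, Price FROM Prices) AS src
  PIVOT (MAX(Price) FOR Id IN ([Id1], [Id2], [Id3], [Id4])) AS p
  ORDER BY [Date] DESC"

# library(RODBC)
# con  <- odbcConnect("myDSN")   # hypothetical DSN
# wide <- sqlQuery(con, sql)     # one row per Date, one column per Id
# close(con)
```

With ~1,000 Ids the `IN (...)` column list would have to be built dynamically, e.g. by first fetching `SELECT DISTINCT Id` and pasting the bracketed names together in R before constructing the query.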

hdizzle