
I find myself having to do this very often -- compare specific columns from 2 different files. The column layout and formats are the same, but the columns that need comparison contain floating-point/exponential data, e.g. 0.0058104642437413175, -3.459017050577087E-4, etc.

I'm currently using the below R code:

library(sqldf)   # needed for the sqldf() calls below

test <- read.csv("C:/VBG_TEST/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")
prod <- read.csv("C:/VBG_PROD/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote = "\"", dec = ".")

sqldf("select sum(V10), sum(V15) from test")
sqldf("select sum(V10), sum(V15) from prod")

I read in the files and sum the specific columns -- V10 and V15 -- and then compare the totals. This way I can ignore very small per-row differences in the floating-point data.

However, going forward I would like to set a tolerance percentage, i.e. flag a row when abs( (prod.V10 - test.V10)/prod.V10 ) > 0.01% (0.0001 as a fraction), and print only those row numbers that exceed this tolerance limit.

Also, if the data is not in the same order, how can I do the comparison by specifying columns that act like a composite primary key?

For example, if I were doing this in Sybase, I'd write something like:

select A.*, B.* 
from tableA A, tableB B
where abs((A.Col15 - B.Col15)/A.Col15) > 0.0001  -- i.e. 0.01%
  and A.Col1 = B.Col1
  and A.Col4 = B.Col4
  and A.Col6 = B.Col6

If I try running that kind of query through sqldf in R, it does NOT finish in any reasonable time, as the files contain 500K+ rows of data.

Can anyone point me to how I can do the above in R?

Many thanks, Chapax.

    can you elaborate on "the same thing" and "it does NOT work" please? – mdsumner Jul 02 '10 at 12:38
  • trying to run the Sybase type SQL in sqldf causes R to keep running for a very, very long time ... and that's why I said "it does NOT work". – Chapax Jul 02 '10 at 13:54
  • you seem to have several questions here; maybe break them out? I'm not sure if you are having trouble reading the values into a data.frame or if you're additionally looking for an answer assuming the data.frame's fully realized in memory. – ayman Jul 30 '10 at 16:00

2 Answers


Ouch, this sqldf hurts my mind -- better to use plain R capabilities than torture yourself with SQL:

which(abs((prod$V10 - test$V10) / prod$V10) > 0.0001)

In a more general version:

which(abs((prod[, colTest] - test[, colTest]) / prod[, colTest]) > tolerance)

where colTest is the index of the column you want to test and tolerance is the relative tolerance (0.0001 for 0.01%).
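If the rows are not in the same order, first match them on your key columns with merge and then apply the same test. A sketch only -- V1, V4 and V6 here stand in for whatever columns actually form your composite key:

# equi-join on the composite key, then flag relative differences > 0.01%
both <- merge(test, prod, by = c("V1", "V4", "V6"), suffixes = c(".test", ".prod"))
both[abs((both$V15.prod - both$V15.test) / both$V15.prod) > 0.0001, ]

merge does the equi-join part of your Sybase query; the tolerance filter is then just vectorised arithmetic on the merged frame.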

many thanks mbq ... can you also let me know how I can sort by col1, col2, etc? That will give me a complete solution. Also, as an aside, can you suggest good books/tutorials for learning R? I seem to have lots of data issues like this that I end up handling manually in Excel, and I think R will help tremendously in automating them. Thx a ton. – Chapax Jul 02 '10 at 13:23
  • Check http://cran.r-project.org/manuals.html , especially R-introduction. You may also find something shorter in Contributed documentation. Sorting a vector is just `sort`, `order` returns the index vector of sorted vector, so it helps sorting data frames. R has a built-in documentation, issue `?sort` or `?order` to get more info and examples. I would be grateful if you accept this answer. – mbq Jul 02 '10 at 14:39

I don't know R, but I'm offering this as general advice: paginate your table and then run your query on each page. In general, I think it is not wise to execute this kind of comparison over a table that big in one go.
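In R terms that might look roughly like the sketch below (untested -- it assumes the two files are row-aligned, and the 50,000-row page size is arbitrary):

# rough sketch: compare the files one 50,000-row page at a time
tcon <- file("C:/VBG_TEST/testing/FILE_2010-06-16.txt", open = "r")
pcon <- file("C:/VBG_PROD/testing/FILE_2010-06-16.txt", open = "r")
offset <- 0
repeat {
  tst <- tryCatch(read.csv(tcon, header = FALSE, sep = "|", nrows = 50000),
                  error = function(e) NULL)   # NULL once no lines are left
  prd <- tryCatch(read.csv(pcon, header = FALSE, sep = "|", nrows = 50000),
                  error = function(e) NULL)
  if (is.null(tst) || is.null(prd)) break
  bad <- which(abs((prd$V10 - tst$V10) / prd$V10) > 0.0001)
  if (length(bad) > 0) print(offset + bad)   # absolute row numbers
  offset <- offset + nrow(tst)
}
close(tcon); close(pcon)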
