Your first question is "why the + 1?"
Let's look at how these functions work:
library(lsa)
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )
# LSA
data(stopwords_en)
myMatrix = textmatrix(td, stopwords=stopwords_en)
myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
as.textmatrix(myLSAspace)
D1 D2 D3
cat 0.3616693 0.6075489 0.3848429
dog 0.4577219 0.2722711 1.2710784
mouse 0.5942734 1.3128719 0.1357196
ham 0.6075489 1.5336529 -0.1634938
sushi 0.6075489 1.5336529 -0.1634938
pet 0.6099616 -0.2591316 2.6757285
So, lsa() gets its dimensions from dimcalc_share(), based on the input matrix and a given share (.5 by default), and runs a Singular Value Decomposition to map the original TDM to a new LSAspace.
Those dimensions are the number of singular values for the dimensionality reduction in LSA.
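To make the role of the singular values concrete, here is a minimal base-R sketch (not the lsa package's own code) that decomposes a small made-up matrix with svd() and rebuilds it from only the first k singular values. The matrix m and the choice k = 2 are assumptions for illustration only.

```r
# Hypothetical 4 x 3 term-document counts (made up for illustration)
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 2,
              0, 1, 0),
            nrow = 4, byrow = TRUE)

sv <- svd(m)   # sv$d holds the singular values in descending order

k <- 2         # number of dimensions to keep (an arbitrary choice here)

# Rank-k reconstruction: U_k %*% diag(d_k) %*% t(V_k)
m_k <- sv$u[, 1:k] %*% diag(sv$d[1:k], nrow = k) %*% t(sv$v[, 1:k])

sv$d           # singular values, largest first
round(m_k, 3)  # same shape as m, built from k dimensions only
```

Note that svd() returns at most min(nrow, ncol) singular values, so k cannot exceed that number.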
dimcalc_share() finds the first position in the descending sequence of singular values s where their cumulative sum (divided by the sum of all values) meets or exceeds the specified share.
The function is written such that d is one more than the max() position whose cumulative share is still <= share:
> # Break it apart
> s <- myMatrix
> share <- .5
>
> any(which(cumsum(s/sum(s)) <= share)) #TRUE
[1] TRUE
> cumsum(s/sum(s)) <= share
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> d = max(which(cumsum(s/sum(s)) <= share)) + 1
> d
[1] 10
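The same rule can be checked by hand on a short vector. The sketch below re-implements the logic from the session above in base R; share_dims and the example values are hypothetical, chosen for illustration, and s is assumed to be a descending vector of singular values.

```r
# Hypothetical helper mirroring the rule shown above (not the lsa source itself)
share_dims <- function(s, share = 0.5) {
  # positions whose cumulative share is still at or under the share
  below <- which(cumsum(s / sum(s)) <= share)
  if (length(below) > 0) {   # same intent as the any(which(...)) check
    max(below) + 1           # step one past the last such position
  } else {
    length(s)                # no position is at or under the share
  }
}

s <- c(4, 3, 2, 1)        # made-up singular values, descending
cumsum(s / sum(s))        # 0.4 0.7 0.9 1.0
d <- share_dims(s)        # 2: first position whose cumulative share reaches 0.5
d_without_plus1 <- d - 1  # 1: the cumulative share there is only 0.4
```

Without the + 1, the result points at the last position that is still under the share rather than the first one that reaches it.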
If you used d - 1 instead, which would give you 9 rather than 10, then you'd have a position where the cumsum is still <= share. That wouldn't work:
> myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
> myLSAspace2 = lsa(myMatrix, dims=d-1)
Error in SVD$u[, 1:dims] : subscript out of bounds
Equivalently:
> dims = 9
> myLSAspace = lsa(myMatrix, dims)
Error in SVD$u[, 1:dims] : subscript out of bounds
So the function dimcalc_share() is correct in using + 1.
Your 2nd question, modified for this example, is "would dimcalc_share() = 18 instead of = 1 if the first value was > share?"
If the first value were > share, then the first if condition would be false and the function would, as you hypothesized, instead use length(s), which is 18.
You might follow up with a question on Cross Validated to confirm your intuition that it should = 1 (though that makes sense to me). If so, it would be simple to re-write the function with d = 1 as the else.
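As a rough sketch of that rewrite, the variant below keeps the rule as is and changes only the fallback. dimcalc_share_alt is a hypothetical name, not part of the lsa package, and the closure form (a function of share that returns a function of s) is an assumption about how lsa() consumes its dims argument.

```r
# Hypothetical variant: same rule, but fall back to 1 instead of length(s)
dimcalc_share_alt <- function(share = 0.5) {
  function(s) {
    below <- which(cumsum(s / sum(s)) <= share)
    if (length(below) > 0) {
      max(below) + 1
    } else {
      1   # the first value alone already exceeds the share
    }
  }
}

s <- c(6, 3, 1)          # made-up values: the first share is 0.6 > 0.5
dimcalc_share_alt()(s)   # 1, where the length(s) fallback would give 3
```

Whether 1 is the right fallback is still the open question; this only shows how small the change would be.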