This is related to my other question, "The dimensions in hist for numpy.histogram with density = True",

but I was too generic there, so now I'll go straight to the point:

I have a 633x34 matrix, where every row is a numeric vector like this one:

    > dput(head(A,1))
    structure(c(0.00198789974070879, -0.00172860847018153, -0.00527225583405355, 
    0.00639585133967147, -0.00242005185825411, -0.00717372515125336, 
    0.0037165082108902, 0.00164217804667233, 0.00034572169403646, 
    -0.00864304235090751, -0.00639585133967158, 0.0068280034572169, 
    0.00354364736387214, 0.000432152117545437, -0.00440795159896279, 
    0.00544511668107173, 0.0031979256698359, 0.00164217804667233, 
    0.000259291270527373, -0.00155574762316346, 0.00129645635263609, 
    0.00259291270527229, -0.00397579948141746, 0.00328435609334476, 
    0.00207433016421787, 0.00112359550561814, 0.00440795159896257, 
    0.00164217804667266, -0.00319792566983579, 0.00233362143474514, 
    0.00025929127052704, 0.000172860847018175, 0.000864304235090874, 
    0.003630077787381), .Dim = c(1L, 34L))

I'm trying to build a matrix B with nrow = nrow(A) and ncol = 10, where each row is the element-wise product of diff(hist$breaks) and hist$density for the corresponding row of A.

The problem is that hist() does not accept a fixed number of bins (in my case 10): per the documentation, a single integer passed to breaks is only a suggestion. So this loop of mine:

    B <- matrix(NA_real_, nrow = nrow(A), ncol = 10)
    for(i in 1:nrow(A)){
        # histogram of row i of A (not of B, which is still empty)
        B[i,] <- diff(hist(A[i,], breaks = 10, freq = TRUE)$breaks) * hist(A[i,], breaks = 10, freq = TRUE)$density
    }

obviously fails with:

    Error in distribution_rep[i, ] <- diff(hist(dS[i, ], breaks = 10, freq = TRUE)$breaks) *  : 
      number of items to replace is not a multiple of replacement length

because the number of bins differs from row to row.
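
For reference, a minimal sketch of why the bin count varies: when breaks is a single number, hist() treats it as a suggestion and picks "pretty" round breakpoints over the range of the data. The vectors x1 and x2 below are made-up illustration data, not rows of A.

```r
# Made-up vectors with different ranges (not taken from A).
x1 <- c(0.2, 1.5, 3.1, 4.8, 9.7)
x2 <- c(-0.004, 0.001, 0.002, 0.0065)

# plot = FALSE returns the histogram object without drawing it.
h1 <- hist(x1, breaks = 10, plot = FALSE)
h2 <- hist(x2, breaks = 10, plot = FALSE)

# The breakpoints are round values derived from each range,
# so the number of bins is not guaranteed to be 10.
print(h1$breaks)
print(h2$breaks)
print(c(length(h1$counts), length(h2$counts)))
```

Any row whose "pretty" breakpoints do not produce exactly 10 bins cannot be assigned into a 10-column row of B.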

The best fix would be to compute the vector of breakpoints myself. I've tried seq(min(A[i, ]), max(A[i, ]), by = length(A[i,]/3.4)), but it doesn't work.

Do you guys know what expression I could feed into breaks = to reach my goal or another way to fix this issue? Thanks for your time.

EDIT: As asked, I'll elaborate on the goals of this question; some details are already in the other question I linked above. I'm porting some code from Python to R and I'm stuck at a line where numpy.histogram is used. This is the line that causes me trouble:

    hist, bin_edges = np.histogram(A, bins=10, density=True)

Then I have to use the output of that line in this way:

    B = hist*np.diff(bin_edges)

building a matrix B with dimensions (nrow(A), bins) as a representation of the distribution. My desired first row of the matrix B is:

    array([ 0.05882353,  0.02941176,  0.05882353,  0.05882353,  0.08823529,
        0.14705882,  0.23529412,  0.20588235,  0.02941176,  0.08823529])

The two main problems I'm facing now are: a) understanding the output of hist in Python with density=True (addressed in my other question), and b) finding a way to get an equal number of bins from hist() in R for different vectors.
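
On point a), a small sketch of what the product computes, assuming the usual definition of a density histogram (density = count / (n * bin width), which is what both numpy.histogram with density=True and R's hist() return). Multiplying the density by the bin widths therefore gives the fraction of observations in each bin. The vector x and the 4-bin edges below are made up for illustration.

```r
# Made-up data; 4 equal-width bins spanning its range.
x <- c(0.12, 0.35, 0.50, 0.55, 0.81, 1.00)
edges <- seq(min(x), max(x), length.out = 5)

h <- hist(x, breaks = edges, plot = FALSE)

# density * bin width is the share of observations per bin.
mass <- h$density * diff(h$breaks)
print(mass)
print(h$counts / length(x))  # same values
print(sum(mass))             # the shares add up to 1
```

This is consistent with the desired first row above, whose entries are multiples of 1/34 (for example 0.05882353 = 2/34 and 0.02941176 = 1/34) for a row of 34 values.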

goingdeep
  • "where each line is the result of the product between diff(hist$breaks) and hist$density." What is the purpose of this? – Roland Mar 20 '18 at 17:55
  • to build the representation of the distribution – goingdeep Mar 20 '18 at 18:17
  • See help("ecdf") for a better approach. – Roland Mar 20 '18 at 20:27
  • I've tried ecdf() but I don't think it is the right function for me, maybe you can elaborate on your advice? P.S. I'll edit my question to provide more details as asked. – goingdeep Mar 21 '18 at 06:48
  • Since hist is designed for plotting, it has the pretty-axis restriction. You may have to divide your data manually. If this works for you, I can post it as an answer: `tapply(A, cut(A, 10, include.lowest = TRUE), length)/length(A)` – Dave2e Mar 21 '18 at 15:38

1 Answer

The expression I was looking for was

    breaks = seq(min(data), max(data), length.out = number_of_bins + 1)

Easier than I thought. Thanks, everybody, anyway.
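
A minimal sketch of how this could be applied row by row. The helper name bin_mass is hypothetical (not from the original code), a1 is the first row of A copied from the dput() output in the question, and the two-row matrix A_demo is made up only to show the matrix case.

```r
# Hypothetical helper: share of observations in each of n_bins
# equal-width bins spanning the range of x.
bin_mass <- function(x, n_bins = 10) {
  edges <- seq(min(x), max(x), length.out = n_bins + 1)
  h <- hist(x, breaks = edges, include.lowest = TRUE, plot = FALSE)
  h$density * diff(h$breaks)
}

# First row of A, from the dput() output in the question.
a1 <- c(0.00198789974070879, -0.00172860847018153, -0.00527225583405355,
        0.00639585133967147, -0.00242005185825411, -0.00717372515125336,
        0.0037165082108902, 0.00164217804667233, 0.00034572169403646,
        -0.00864304235090751, -0.00639585133967158, 0.0068280034572169,
        0.00354364736387214, 0.000432152117545437, -0.00440795159896279,
        0.00544511668107173, 0.0031979256698359, 0.00164217804667233,
        0.000259291270527373, -0.00155574762316346, 0.00129645635263609,
        0.00259291270527229, -0.00397579948141746, 0.00328435609334476,
        0.00207433016421787, 0.00112359550561814, 0.00440795159896257,
        0.00164217804667266, -0.00319792566983579, 0.00233362143474514,
        0.00025929127052704, 0.000172860847018175, 0.000864304235090874,
        0.003630077787381)

b1 <- bin_mass(a1)
print(round(b1, 8))

# Matrix case: apply() returns one column per row, hence the t().
A_demo <- rbind(a1, rev(a1) * 2)  # made-up second row
B <- t(apply(A_demo, 1, bin_mass, n_bins = 10))
print(dim(B))
```

Two caveats: hist() uses right-closed bins by default while numpy.histogram uses left-closed ones (right = FALSE switches this), so a value lying exactly on an interior edge can land in a different bin; and a row whose values are all equal gives a degenerate set of breaks and needs separate handling.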

goingdeep