I am currently working on the transformation between object, camera, and world coordinates in an inverse image projection task. I have the following information available:

- the image coordinates of an object in homogeneous form (u, v, 1)
- Euler angles converted to a rotation matrix (R)
- a translation vector (t) representing the distances between the camera and the GNSS receiver
- the camera intrinsic matrix (K) and the GNSS position
To go from image coordinates to world coordinates, I have followed the steps outlined in the literature. First, I computed the camera coordinates (Xc, Yc, Zc), up to a scale factor λ, by inverting the intrinsic matrix K:
λ * [Xc, Yc, Zc] = K^(-1) * [u, v, 1]
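For reference, here is a small numpy sketch of how I am doing this step; the intrinsics, pixel coordinates, and depth value are placeholders I made up for illustration:

```python
import numpy as np

# Placeholder intrinsics (focal lengths and principal point are made-up values)
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

u, v = 700.0, 400.0                 # detected pixel coordinates
pixel_h = np.array([u, v, 1.0])     # homogeneous image coordinates

# Back-projection gives only a ray direction in the camera frame;
# the absolute depth along that ray (the scale λ) is still unknown.
ray_cam = np.linalg.inv(K) @ pixel_h

# If the depth Zc were known, the camera coordinates would follow as:
Zc = 10.0                           # placeholder depth in metres
Xc, Yc, Zc = Zc * ray_cam           # ray_cam = (Xc/Zc, Yc/Zc, 1)
print(Xc, Yc, Zc)
```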
Next, I want to transform the camera coordinates (Xc, Yc, Zc) to world coordinates (X, Y, Z) by inverting the extrinsic matrix [R | t]. Is this equation correct?
[X, Y, Z] = R^(-1) * ( [Xc, Yc, Zc] * t )
Here is where my confusion arises. In my specific case, the desired world coordinates correspond to the objects detected by the camera. Therefore, I believe that I should add the translation vector (t) to my camera coordinates as follows:
[X, Y, Z] = R^(-1) * ( [Xc, Yc, Zc] + t )
However, from my understanding of the documentation, it seems that t is typically subtracted instead:
[X, Y, Z] = R^(-1) * ( [Xc, Yc, Zc] - t )
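If I have understood the usual convention correctly, the extrinsics map world coordinates to camera coordinates, X_c = R * X_w + t, and inverting that relation gives the subtraction form X_w = R^(-1) * (X_c - t). Here is a small numpy sketch of that interpretation; the Euler angles, translation, and camera point are placeholders:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Placeholder extrinsics: R from made-up Euler angles, t as my hand-measured offset
R = Rotation.from_euler('xyz', [10.0, 5.0, -3.0], degrees=True).as_matrix()
t = np.array([0.5, 0.1, 1.2])            # camera-to-GNSS offset (placeholder values)

cam_point = np.array([2.0, -1.0, 10.0])  # a point expressed in camera coordinates

# If the extrinsics are defined world -> camera, i.e. X_c = R @ X_w + t,
# then the inverse mapping camera -> world subtracts t before rotating back:
world_point = np.linalg.inv(R) @ (cam_point - t)

# Equivalent shortcut: R is orthonormal, so its inverse is its transpose
world_point_alt = R.T @ (cam_point - t)
print(world_point, world_point_alt)
```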
I would appreciate clarification on whether my understanding is correct, and whether I should add, subtract, or multiply by the translation vector in my case.
It is important to note that the translation vector (t) I have was obtained by manually measuring the distances between the camera and the GNSS receiver; I have not obtained a t vector through solvePnP. Do I need to use the solvePnP-generated t vector, or is the manually measured t vector sufficient for my purposes?
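For completeness, this is roughly how I understand the solvePnP route would look if I had 3D-2D correspondences; the object and image points below are made-up placeholders, not my real data:

```python
import numpy as np
import cv2

# Placeholder 3D-2D correspondences (solvePnP needs at least 4 point pairs)
object_points = np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [1.0, 1.0, 0.0],
                          [0.0, 1.0, 0.0]], dtype=np.float64)
image_points = np.array([[320.0, 240.0],
                         [420.0, 238.0],
                         [425.0, 340.0],
                         [318.0, 342.0]], dtype=np.float64)

K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
dist_coeffs = np.zeros(5)                 # assuming no lens distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist_coeffs)
R_pnp, _ = cv2.Rodrigues(rvec)            # rotation vector -> rotation matrix

# My understanding: this tvec is the world -> camera translation, so a camera
# point would map back to the world frame as X_w = R_pnp.T @ (X_c - tvec.ravel())
```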