Can you please clarify how exactly you are transforming those points?
The pin-hole camera model looks like this:
w*[x,y,1] = [X,Y,Z,1]*[R;t]*K
[X,Y,Z] are the world coordinates in world units (e.g. millimeters), and [x,y] are the image coordinates in pixels. K is the matrix of camera intrinsics, and R and t are the camera extrinsics. w is an arbitrary scale factor.
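To make the equation concrete, here is a minimal sketch in plain Python that projects one world point using the row-vector convention above. All numbers (fx, fy, cx, cy, R, t, and the world point) are made-up values for illustration, and the layout K = [fx 0 0; 0 fy 0; cx cy 1] assumes zero skew.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical intrinsics in pixels (row-vector convention, zero skew).
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = [[fx, 0.0, 0.0],
     [0.0, fy, 0.0],
     [cx, cy, 1.0]]

# Hypothetical extrinsics: no rotation, world origin 1000 mm in front of the camera.
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1000.0]
Rt = R + [t]  # the 4x3 matrix [R; t]

world_point = [[100.0, 50.0, 0.0, 1.0]]  # [X, Y, Z, 1] in millimeters

# w*[x, y, 1] = [X, Y, Z, 1] * [R; t] * K
wx, wy, w = matmul(matmul(world_point, Rt), K)[0]
x, y = wx / w, wy / w  # divide out the scale factor to get pixels
print(x, y)  # 400.0 280.0
```

In this example w comes out equal to the point's depth along the camera's Z axis (1000 mm), which is why dividing by it produces pixel coordinates.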
If you take a world point [X,Y,Z,1] and multiply it by [R;t], then you get a point in the camera's coordinate system, where the origin is at the focal point, and the units are the same as in your world coordinates (e.g. millimeters).
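As a rough illustration of that step alone, the sketch below applies [R;t] to a world point without involving K. The helper name world_to_camera, the 90-degree rotation about Z, and the translation are all hypothetical values chosen for the example.

```python
def world_to_camera(point, R, t):
    """Row-vector convention: [X, Y, Z, 1] * [R; t] = [X, Y, Z] * R + t."""
    X, Y, Z = point
    return [X * R[0][j] + Y * R[1][j] + Z * R[2][j] + t[j] for j in range(3)]

# Hypothetical extrinsics: 90-degree rotation about Z, then a 1000 mm shift along Z.
R = [[0.0, 1.0, 0.0],
     [-1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1000.0]

camera_point = world_to_camera([100.0, 0.0, 0.0], R, t)
print(camera_point)  # [0.0, 100.0, 1000.0], still in millimeters
```

The result is still a 3-D point in millimeters; no pixels appear until K is applied.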
If you take a point in the image [x,y,1] and multiply it by the inverse of K, then you get a point in "normalized image coordinates", where the origin is at the optical center, and the axes have no units. This happens because, after the optical center is subtracted, you are dividing x and y in pixels by the focal lengths fx and fy, which are also in pixels. So the pixels cancel out.
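For a K with zero skew, multiplying [x,y,1] by the inverse of K reduces to a subtraction and a division per axis. The sketch below shows that arithmetic; the function name normalize and all the numbers are assumptions for illustration, not values from your calibration.

```python
def normalize(x, y, fx, fy, cx, cy):
    """Pixel coordinates -> unitless normalized image coordinates (zero skew assumed)."""
    return (x - cx) / fx, (y - cy) / fy

# Hypothetical intrinsics and pixel location.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
xn, yn = normalize(400.0, 280.0, fx, fy, cx, cy)
print(xn, yn)  # approximately 0.1 and 0.05

# A point at (100, 50, 1000) mm in the camera's coordinate system has the same ratios:
Xc, Yc, Zc = 100.0, 50.0, 1000.0
print(Xc / Zc, Yc / Zc)  # approximately 0.1 and 0.05
```

The second print shows why normalized coordinates are useful: they equal X/Z and Y/Z in the camera's coordinate system, so they can be compared against camera-frame points regardless of the world units.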