There is no common standard for the spatial axis in image sampling. A 20 megapixel sensor or camera will produce images at a completely different spatial resolution, in pixels per mm or pixels per degree of view angle, than a 2 megapixel sensor or camera. These images will typically be rescaled to yet another non-standard resolution for viewing (72 ppi, 300 ppi, "Retina", SD/HDTV, CCIR-601, "4k", etc.).
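For example, here is a rough sketch (in Python, with a made-up 36 mm sensor width and a made-up 60 degree field of view, both hypothetical) of how differently two sensors of those pixel counts sample the same scene:

```python
def pixels_per_mm(pixels_wide, sensor_width_mm):
    # Spatial sampling density across the sensor itself.
    return pixels_wide / sensor_width_mm

def pixels_per_degree(pixels_wide, horizontal_fov_deg):
    # Angular sampling density for a given lens field of view.
    return pixels_wide / horizontal_fov_deg

# Hypothetical cameras: a 20 MP sensor at 5472 pixels wide and a 2 MP
# sensor at 1920 pixels wide, both assumed 36 mm wide with a 60 degree
# horizontal field of view.
for name, width_px in [("20 MP", 5472), ("2 MP", 1920)]:
    print(name,
          f"{pixels_per_mm(width_px, 36.0):.0f} px/mm,",
          f"{pixels_per_degree(width_px, 60.0):.0f} px/deg")
```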
For audio, 48 ksps is starting to become more common than 44.1 ksps (on iPhones, etc.).
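Converting between those two rates is a rational resampling by 160/147 (48000/44100 reduced). A minimal sketch using SciPy's polyphase resampler, with a hypothetical 1 kHz test tone:

```python
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 44100, 48000          # 48000/44100 reduces to 160/147
t = np.arange(fs_in) / fs_in          # one second of audio
x = np.sin(2 * np.pi * 1000 * t)      # hypothetical 1 kHz test tone

# resample_poly applies a polyphase low-pass filter internally,
# which keeps the rate conversion (mostly) free of aliasing.
y = resample_poly(x, up=160, down=147)
print(len(x), len(y))                 # 44100 -> 48000 samples
```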
("a nice thing about standards is that there are so many of them")
Amplitude scaling in raw format also has no single standard. When converted or requantized to a storage format (JPEG, PNG, etc.), 8-bit, 10-bit, and 12-bit quantizations are the most common for RGB color separations.
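A minimal sketch of such a requantization, assuming hypothetical 12-bit raw values rescaled to 8-bit and 10-bit storage (real pipelines also apply gamma/tone curves, white balance, etc. before this step):

```python
import numpy as np

# Hypothetical 12-bit raw sensor values in [0, 4095].
raw12 = np.array([0, 512, 2048, 4095], dtype=np.uint16)

# Requantize to 8-bit storage [0, 255] by rescaling the amplitude axis.
img8 = np.round(raw12 * (255.0 / 4095.0)).astype(np.uint8)

# 10-bit storage [0, 1023] uses the same idea.
img10 = np.round(raw12 * (1023.0 / 4095.0)).astype(np.uint16)

print(img8)    # [  0  32 128 255]
print(img10)   # [   0  128  512 1023]
```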
Channel formats also differ between audio and images. An X, Y format, where X is time and Y is amplitude, is only good for mono audio; stereo usually needs T, L, R for time, left, and right channels. Images are often in X, Y, R, G, B form (five values per sample, typically stored as a 3-dimensional array), where X, Y are the spatial location coordinates and R, G, B are the color intensities at that location. The image intensities can be somewhat related (depending on gamma correction, etc.) to the number of incident photons per shutter duration, in certain visible EM frequency ranges, per solid angle incident on the lens.
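A quick sketch of how those layouts commonly end up as array shapes (NumPy, with made-up sizes):

```python
import numpy as np

fs = 48000
mono   = np.zeros(fs)          # shape (time,): one amplitude per sample
stereo = np.zeros((fs, 2))     # shape (time, channel): left/right per sample

h, w = 1080, 1920
rgb = np.zeros((h, w, 3), dtype=np.uint8)   # (Y, X, color): R, G, B per pixel

# Each image sample is described by five numbers: two spatial coordinates
# (implicit in the array indices) plus three color intensities.
print(mono.shape, stereo.shape, rgb.shape)
```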
An anti-aliasing low-pass filter for audio, and an optical low-pass filter in front of the image sensor (sitting over the Bayer color filter array), are commonly used to make the signal closer to bandlimited so it can be sampled with less aliasing noise/artifacts.
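On the audio side, that usually means low-pass filtering before reducing the sample rate. A minimal sketch with a hypothetical test signal, filtering out content above the new Nyquist frequency before decimating by 2:

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 48000
t = np.arange(fs) / fs
# Hypothetical signal: a 1 kHz tone we want to keep plus a 15 kHz tone
# that would alias down to 9 kHz if we decimated to 24 kHz unfiltered.
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 15000 * t)

# FIR low-pass (anti-aliasing) filter with a cutoff below the new
# Nyquist frequency of 12 kHz.
taps = firwin(numtaps=101, cutoff=10000, fs=fs)
x_filtered = lfilter(taps, 1.0, x)

# Decimate by 2: far less aliasing than taking x[::2] directly.
y = x_filtered[::2]
print(len(x), len(y))
```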