Let's start with the eyes and perception [1]:
The human eye and its brain interface, the human visual system, can
process 10 to 12 separate images per second, perceiving them
individually. The visual cortex holds onto one image for about
one-fifteenth of a second, so if another image is received during that
period an illusion of continuity is created, allowing a sequence of
still images to give the impression of motion.
From this standpoint, it's certainly possible to detect a movement lasting as little as 1/10 of a second (perhaps a twitch).
The other part is determining whether the movement was involuntary, which is more or less a trained art.
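As a rough illustration of the arithmetic in the quoted passage, the sketch below compares the time between frames with the one-fifteenth-of-a-second window. It treats that window as a hard threshold, which is a simplifying assumption of mine, and the function name `appears_continuous` is hypothetical.

```python
from fractions import Fraction

# Persistence window from the quoted passage: about one-fifteenth of a second.
# Treating it as a hard cutoff is a simplification for illustration only.
PERSISTENCE_S = Fraction(1, 15)


def appears_continuous(frames_per_second):
    """Return True if each new frame arrives within the persistence window."""
    frame_interval = Fraction(1, frames_per_second)
    return frame_interval <= PERSISTENCE_S


for fps in (10, 12, 15, 24, 30):
    label = "continuous motion" if appears_continuous(fps) else "separate images"
    print(f"{fps:>2} fps -> {1000 / fps:5.1f} ms per frame -> {label}")
```

Under this simplified model, 10 and 12 images per second fall outside the window and are seen individually, which matches the figures in the quote. A twitch lasting 1/10 of a second (100 ms) is likewise longer than the window of roughly 67 ms.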
Another type is the auditory microexpression: an involuntary inflection of pitch in somebody's voice. First, the biology of the ear [2]:
The inner ear, called the cochlea, has a remarkable thin compliant
membrane stretched out along inside it. This membrane is called the
basilar membrane, and it has two really interesting features. First,
it is resonant at different frequencies at different points along its
surface, and second, it is infused with thousands of nerve endings
(called hair cells) that are attached to the auditory nerve going to
the brain.
The net effect of these two features is that different
frequencies excite different hair cells, so that in a general (if
incomplete) way we can think of the mechanism as causing each hair
cell to be excited by a very specific range of frequencies. This leads
us to a concept called the "place" theory of pitch detection.
Second, how we perceive sound:
The shortest sound we can hear is an impulse about 20
microseconds long. This is related, of course, to the upper limit of
frequencies, about 20,000 Hz. We can theoretically hear sounds that
are indefinitely long. However, from a perceptual standpoint they
cease to be perceived as events and instead become “a continuum.”
...
Meanwhile, the sound events
(like musical notes and spoken words) that we work with in audio are
generally between 50 milliseconds (1/20th of a second) and five
seconds long. There are two other primary things to know about the
time dimension. The first is that there is a fundamental neurological
boundary for humans at about 50 milliseconds. Events occurring more
quickly than that are perceived as a single or continuous event, not
as a series of events. Events occurring less quickly are perceived as
separate individual events. This holds true for vision as well (hence
our 30 fps rate for film). So as I noted earlier, a 5 Hz square wave
will be heard as a series of clicks, while a 50 Hz square wave will be
heard as a continuous tone. The other time phenomenon worth keeping
in mind is our integration time. When we perceive “a sound,” we
integrate all of the versions of that sound that occur within the
first 50 milliseconds of the sound into a single holistic perception.
So we sort of average all versions of a sound that occur during that
period to produce our conscious perception of the sound source. We
will discuss this at considerable length in later articles. Finally,
we have to keep in mind that frequencies are a subset of the time
dimension. Individual pitched (periodic) sounds consist of an array of
frequencies, as I mentioned above. The lowest such frequencies
(fundamentals) usually fall within a four octave range from about 60
Hz to 1 kHz. All frequencies above 1 kHz can be regarded as harmonics
that enable us to determine timbre and differentiate sounds from each
other.
This latter part about the perception of sound is rather interesting when combined with visual microexpressions. The two senses seem to enhance one another when experienced simultaneously (i.e., in face-to-face conversation).
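To make the 50-millisecond boundary from the quoted passage concrete, here is a minimal sketch that converts a repetition rate into a period and compares it with that boundary. The hard cutoff and the names `period_ms` and `perceived_as` are my own illustrative assumptions, not part of the source.

```python
# Boundary from the quoted passage: events spaced more closely than about
# 50 ms fuse into one continuous event. A sharp cutoff is an assumption here.
FUSION_BOUNDARY_MS = 50.0


def period_ms(frequency_hz):
    """Time between successive events, in milliseconds."""
    return 1000.0 / frequency_hz


def perceived_as(frequency_hz):
    """Classify a repetition rate using the simplified 50 ms boundary."""
    if period_ms(frequency_hz) < FUSION_BOUNDARY_MS:
        return "continuous"
    return "separate events"


for label, rate_hz in [("5 Hz square wave", 5), ("50 Hz square wave", 50), ("30 fps film", 30)]:
    print(f"{label}: {period_ms(rate_hz):.1f} ms apart -> {perceived_as(rate_hz)}")
```

The 5 Hz wave has edges 200 ms apart and so lands on the "separate events" side, while the 50 Hz wave (20 ms) and 30 fps film (about 33 ms) land on the "continuous" side, in line with the examples in the quote.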
There is certainly potential for detection, but it's up to the perceiver to decide whether a given movement or inflection is indeed a microexpression. So what's the success rate of a trained eye? From the research reported in the book "The Philosophy of Deception" [3], we find:
Measurements we have made of facial movements, voice, and speech show
that high levels of accuracy are possible - over 80 percent correct
classifications of who is lying and who is telling the truth. While
making those measurements required slow-motion replays, we also know
that accurate judgments are possible just by viewing the videotapes. A
small percent of those we have studied have reached 80 percent or
better accuracy, and they have done so in more than one scenario, so it
is unlikely that their accuracy was a fluke (O'Sullivan and Ekman 2005).
So, more often than not, a trained human eye CAN catch microexpressions in real time.
[1] http://en.wikipedia.org/wiki/Frame_rate
[2] http://www.recordingmag.com/resources/resourceDetail/194.html
[3] The Philosophy of Deception, p. 123