The list of Poly1305's features (https://cr.yp.to/mac.html) includes the following:
(Parallelizability and incrementality) Poly1305-AES can take advantage of additional hardware to reduce the latency for long messages, and can be recomputed at low cost for a small modification of a long message.
Looking through the mbedTLS implementation on GitHub, for example, I could not find whether or how this feature is used. Is there a well-known implementation that exploits this parallelizability, and if so, roughly how does it do it?
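For context, here is how I understand the claim mathematically, as a minimal Python sketch. It is not taken from mbedTLS or any other library: the function names (`poly_serial`, `poly_lanes`, `update_block`) and the lane-splitting scheme are my own assumptions. The idea is that the Poly1305 polynomial c_1*r^n + ... + c_n*r mod 2^130 - 5 can be split into several independent Horner chains that step by r^k instead of r, and that changing one block only shifts the result by a single term.

```python
P = (1 << 130) - 5  # the Poly1305 prime


def clamp(r):
    # Clear the bits of r that Poly1305 requires to be zero.
    return r & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF


def blocks(msg):
    # Split into 16-byte chunks, append a 0x01 byte to each, read little-endian.
    return [int.from_bytes(msg[i:i + 16] + b"\x01", "little")
            for i in range(0, len(msg), 16)]


def poly_serial(cs, r):
    # Usual Horner evaluation: every step depends on the previous one.
    h = 0
    for c in cs:
        h = (h + c) * r % P
    return h


def poly_lanes(cs, r, lanes=2):
    # Hypothetical parallel variant: lane j handles blocks j, j+lanes, ...
    # Each lane is an independent Horner chain in r**lanes, so the lanes
    # could run on separate SIMD lanes or cores (here they run in a loop).
    n = len(cs)
    r_k = pow(r, lanes, P)
    total = 0
    for j in range(lanes):
        idx = range(j, n, lanes)
        if not idx:
            continue
        acc = 0
        for i in idx:
            acc = (acc * r_k + cs[i]) % P
        # Block i (0-based) carries r**(n - i); fix up the lane's last exponent.
        total = (total + acc * pow(r, n - idx[-1], P)) % P
    return total


def update_block(h, r, n, i, old_c, new_c):
    # Incrementality: replacing block i only changes the term c_i * r**(n - i).
    return (h + (new_c - old_c) * pow(r, n - i, P)) % P
```

If this understanding is right, a vectorized implementation would precompute a few powers of r and combine the lanes once at the end, and that is the part I cannot locate in the code. Note that the sketch covers only the polynomial part, not the final addition of the key-derived value.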