I was reasoning that if tcmalloc maintains a per-thread free list from which dynamic allocations are satisfied, then in the average case tcmalloc should perform very close to stack allocation, because the cost of resizing (refilling) the pool is amortized over many operations.
Does this hold in practice? Are there degenerate cases I'm not thinking of?
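For concreteness, here is a minimal sketch of the model the reasoning above assumes. It is a hypothetical single-size-class allocator, not tcmalloc's actual code: the names `sketch::allocate`, `sketch::deallocate`, `kObjectSize` and `kBatch` are made up for illustration. Each thread pops from its own linked free list (the fast path) and only calls `malloc` when that list is empty, grabbing a whole batch at once so the slow path is amortized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical sketch: one size class, one thread-local free list,
// refilled in batches. Not tcmalloc's implementation.
namespace sketch {

constexpr std::size_t kObjectSize = 64;  // assumed single size class (bytes)
constexpr std::size_t kBatch = 32;       // assumed objects fetched per refill

struct FreeNode { FreeNode* next; };

thread_local FreeNode* free_list = nullptr;  // this thread's cache
thread_local std::size_t refills = 0;        // slow-path trips, for illustration

// Slow path: fetch a batch from the system allocator and thread every
// object onto this thread's free list. Memory is never returned here.
void refill() {
    char* block = static_cast<char*>(std::malloc(kObjectSize * kBatch));
    if (block == nullptr) throw std::bad_alloc();
    for (std::size_t i = 0; i < kBatch; ++i) {
        free_list = new (block + i * kObjectSize) FreeNode{free_list};
    }
    ++refills;
}

// Fast path: pop the list head, a couple of loads and stores,
// comparable in spirit to bumping a stack pointer.
void* allocate() {
    if (free_list == nullptr) refill();
    FreeNode* node = free_list;
    free_list = node->next;
    return node;
}

// Push the object onto the *calling* thread's list.
void deallocate(void* p) {
    assert(p != nullptr && "deallocate(nullptr) is not supported in this sketch");
    free_list = new (p) FreeNode{free_list};
}

}  // namespace sketch
```

In this sketch an object freed by a different thread lands on the freeing thread's list rather than the allocating thread's, so the fast path only stays fast when allocation and deallocation happen on the same thread. tcmalloc's real thread cache is considerably more involved than this.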