The 32U4's 16-bit Timer 1 works like the one on the 328P, on which I tested your problem. I used Timer 1 because it offers the best resolution. It can be run in CTC mode, and channel A can be bound to a fixed output pin in toggle-on-compare-match mode. This makes the setup extremely simple and requires no interrupt logic. The frequency is controlled simply by writing to OCR1A (this register is double buffered so that changes to frequency should be glitch-free)*.
In CTC mode Timer 1 has an output frequency of:
f(n, x) = f_cpu / (2 * n * (1 + x))
where n is the prescaler value and x is the value in the compare register OCR1A. Exploring the possible frequency ranges with a 16 MHz clock gives:
| N    | f-min  | f-max     | r-min     | r-max     | x-100 | x-25k |
+------+--------+-----------+-----------+-----------+-------+-------+
| 1    | 122.1  | 8,000,000 | 4,000,000 | 0.0019    | n/a   | 319   |
| 8    | 15.3   | 1,000,000 | 500,000   | 0.00023   | 9,999 | 39    |
| 64   | 1.91   | 125,000   | 62,500    | 0.000029  | 1,249 | 4     |
| 256  | 0.48   | 31,250    | 15,625    | 0.0000073 | 311   | n/a   |
| 1024 | 0.12   | 7,812     | 3,906     | 0.0000018 | 77    | n/a   |
where N is the prescaler setting, f-min and f-max are the minimum and maximum achievable frequencies in Hz, r-min and r-max are the coarsest and finest frequency steps between adjacent OCR1A values in Hz (found at the top and bottom of the range respectively), and x-100 and x-25k are the OCR1A values required for 100 Hz and 25 kHz output.
For a complete worked example, here is a program that cycles through the frequencies 1 Hz, 2 Hz, 5 Hz, 10 Hz, ... up to 500 kHz in 2-second steps, which is slow enough to observe the workings on a scope:
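#ifndef F_CPU
#define F_CPU 16000000UL // assumed 16 MHz clock; normally passed as -DF_CPU=...
#endif
#include <math.h> // round(), ceil(), pow()
#include <avr/io.h>
#include <util/delay.h>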
struct CTC1
{
    static void setup()
    {
        // CTC mode with TOP = OCR1A
        TCCR1A = 0;
        TCCR1B = _BV(WGM12);
        // toggle channel A on compare match
        TCCR1A = (TCCR1A & ~(_BV(COM1A1) | _BV(COM1A0))) | _BV(COM1A0);
        // set channel A bound pin to output mode
        DDRB |= _BV(1); // PB1 on 328p, use _BV(5) for PB5 on 32U4
    }
    static void set_freq(float f)
    {
        // lowest whole-number frequency each prescaler can reach with a 16-bit OCR1A
        const float f1 = min_freq(1), f8 = min_freq(8), f64 = min_freq(64), f256 = min_freq(256);
        uint16_t n;
        // pick the smallest prescaler (best resolution) that still reaches f
        if (f >= f1) n = 1;
        else if (f >= f8) n = 8;
        else if (f >= f64) n = 64;
        else if (f >= f256) n = 256;
        else n = 1024;
        prescale(n);
        OCR1A = static_cast<uint16_t>(round(F_CPU / (2 * n * f) - 1));
    }
    static void prescale(uint16_t n)
    {
        uint8_t bits = 0;
        switch (n)
        {
            case 1: bits = _BV(CS10); break;
            case 8: bits = _BV(CS11); break;
            case 64: bits = _BV(CS11) | _BV(CS10); break;
            case 256: bits = _BV(CS12); break;
            case 1024: bits = _BV(CS12) | _BV(CS10); break;
            default: bits = 0; // no clock source: timer stopped
        }
        TCCR1B = (TCCR1B & ~(_BV(CS12) | _BV(CS11) | _BV(CS10))) | bits;
    }
    static inline float min_freq(uint16_t n)
    {
        // floating-point division: integer division would truncate before ceil()
        // and let OCR1A overflow for low frequencies such as 1 Hz
        return ceil(F_CPU / (2.0f * n * 65536.0f));
    }
};
void setup()
{
    CTC1::setup();
}
void loop()
{
    for (uint8_t x = 0; x < 6; ++x)
        for (uint8_t y = 0; y < 3; ++y)
        {
            float k = y > 0 ? (y > 1 ? 5 : 2) : 1;
            CTC1::set_freq(k * pow(10, x));
            _delay_ms(2000);
        }
}
int main()
{
    setup();
    for (;;)
        loop();
}
The signal is observable on PB1 (digital pin 9 on an Arduino Uno). Note that on 32U4 channel-A is bound to PB5.
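* As Aleksander Z. kindly commented, the OCR1A register is not double-buffered in CTC mode, so a new value takes effect immediately. If the new value is below the current count in TCNT1, the compare match is missed and the counter runs on to 0xFFFF and wraps before the next toggle. When switching frequencies this can lead to severe glitches, e.g.:
[image: scope capture of the glitch]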
Depending on the application, this may be easily fixed by busy-looping until the counter is at or below the new compare value x (though this may not work well at very high frequencies and may cause unacceptable delays at very low frequencies):
while (TCNT1 > x)
    ;
OCR1A = x;
Producing:
[image: scope capture of the output with the busy-loop in place]