As the timer code became more generic, coping with initialization on
demand, and variable width and speed us_ticker_api implementations,
wait_us has gradually gotten slower and slower.
Some platforms have reportedly seen overhead of wait_us() increase from
10µs to 30µs. These changes should fully reverse that drop, and even
make it better than ever.
Add fast paths for platforms that provide compile-time information about
us_ticker. Speed and code size is improved further if:
* Timer has >= 2^32 microsecond range, or better still is 32-bit 1MHz.
* Platform implements us_ticker_read() as a macro
* Timer is initialised at boot, rather than first use
The latter initialisation option is the default for STM, as this has
always been the case.