A "sharder" that simply round-robins requests over the set of T within
it.
This uses thread-local counters to eliminate write contention of the
"index" variable, so that throughput scales linearly with cores. (As
opposed to an atomic which would cause the variable to ping-pong between
core caches with a 6000% reduction in throughput, or slower yet, a
mutex-wrapped counter.)
This doesn't really need to be fallible but forces propagation of a ton
of error handling - no shards is always a sign of something being very
wrong, and can be caught in the caller if it's for some reason an
acceptable state / can be recovered from.