Because `bitset.and()` allocates a new bitset regardless of the resulting
cardinality, we were allocating more bitsets than necessary. This
change checks whether the allocation is actually needed before making it.
It improves `read_group` performance by ~2x.
```
segment_read_group_pre_computed_groups_no_predicates_cardinality/2000
time: [57.917 ms 58.286 ms 58.700 ms]
thrpt: [34.072 Kelem/s 34.313 Kelem/s 34.532 Kelem/s]
change:
time: [-59.703% -59.357% -59.057%] (p = 0.00 < 0.05)
thrpt: [+144.24% +146.05% +148.16%]
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
```
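The idea can be sketched with a toy bitset. The real code operates on roaring bitmaps; here a bitset is just a `Vec<u64>` of words, and `and_cardinality`/`and_if_non_empty` are illustrative names, not the actual API. The point is the pattern: compute the intersection cardinality first, and only allocate a new bitset when the result is non-empty.

```rust
// Toy bitset: each u64 word holds 64 row ids. Illustrative only; the real
// implementation uses roaring bitmaps.

// Count the rows in the intersection without allocating a new bitset.
fn and_cardinality(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x & y).count_ones()).sum()
}

// Materialise the intersection (this is the allocation we want to avoid).
fn and(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}

/// Returns the intersection only when it contains at least one row,
/// skipping the allocation for the (common) empty case.
fn and_if_non_empty(a: &[u64], b: &[u64]) -> Option<Vec<u64>> {
    if and_cardinality(a, b) == 0 {
        return None; // nothing matched; no bitset allocated
    }
    Some(and(a, b))
}

fn main() {
    // Disjoint sets: no allocation takes place.
    let a = vec![0b1010u64];
    let b = vec![0b0101u64];
    assert_eq!(and_if_non_empty(&a, &b), None);

    // Overlapping sets: the intersection is allocated as before.
    let c = vec![0b1100u64];
    assert_eq!(and_if_non_empty(&a, &c), Some(vec![0b1000u64]));
}
```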
This commit adds benchmarks to track the performance of `read_group`
when aggregating across columns that support pre-computed bit-sets of
row_ids for each distinct column value. Currently this is limited to the
RLE columns, and only makes sense when grouping by low-cardinality
columns.
The benchmarks are in three groups:
* one group fixes the number of rows in the segment but varies the
cardinality (that is, how many groups the query produces).
* another group fixes the cardinality and the number of rows but varies
the number of columns needed to be grouped to produce the fixed
cardinality.
* a final group fixes the number of columns being grouped, the
cardinality, and instead varies the number of rows in the segment.
Some initial results from my development box are as follows:
```
time: [51.099 ms 51.119 ms 51.140 ms]
thrpt: [39.108 Kelem/s 39.125 Kelem/s 39.140 Kelem/s]
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
segment_read_group_pre_computed_groups_no_predicates_group_cols/1
time: [93.162 us 93.219 us 93.280 us]
thrpt: [10.720 Kelem/s 10.727 Kelem/s 10.734 Kelem/s]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
segment_read_group_pre_computed_groups_no_predicates_group_cols/2
time: [571.72 us 572.31 us 572.98 us]
thrpt: [3.4905 Kelem/s 3.4946 Kelem/s 3.4982 Kelem/s]
Found 12 outliers among 100 measurements (12.00%)
5 (5.00%) high mild
7 (7.00%) high severe
Benchmarking segment_read_group_pre_computed_groups_no_predicates_group_cols/3: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to
increase target time to 8.9s, enable flat sampling, or reduce sample
count to 50.
segment_read_group_pre_computed_groups_no_predicates_group_cols/3
time: [1.7292 ms 1.7313 ms 1.7340 ms]
thrpt: [1.7301 Kelem/s 1.7328 Kelem/s 1.7349 Kelem/s]
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
6 (6.00%) high mild
1 (1.00%) high severe
segment_read_group_pre_computed_groups_no_predicates_rows/250000
time: [562.29 us 565.19 us 568.80 us]
thrpt: [439.52 Melem/s 442.33 Melem/s 444.61 Melem/s]
Found 18 outliers among 100 measurements (18.00%)
6 (6.00%) high mild
12 (12.00%) high severe
segment_read_group_pre_computed_groups_no_predicates_rows/500000
time: [561.32 us 561.85 us 562.47 us]
thrpt: [888.93 Melem/s 889.92 Melem/s 890.76 Melem/s]
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
segment_read_group_pre_computed_groups_no_predicates_rows/750000
time: [573.75 us 574.27 us 574.85 us]
thrpt: [1.3047 Gelem/s 1.3060 Gelem/s 1.3072 Gelem/s]
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
segment_read_group_pre_computed_groups_no_predicates_rows/1000000
time: [586.36 us 586.74 us 587.19 us]
thrpt: [1.7030 Gelem/s 1.7043 Gelem/s 1.7054 Gelem/s]
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe
```
The `ReadFilterResults` type encapsulates results from multiple
segments. It implements `Display` to allow visualisation of results from
segments in a `select` call.
This commit also adds `Display` and `Debug` implementations for
`ReadFilterResult`. These can be used for visualising the contents of
the result of a `read_filter` call on a segment.
The `Display` implementation elides the column names.
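The shape of these implementations can be sketched as follows. `ReadFilterResult` here is a hypothetical miniature (a list of column names plus row-major string values), not the real type; the sketch only shows the split between the two traits, with `Debug` printing the column names and `Display` eliding them.

```rust
use std::fmt;

// Hypothetical miniature of a read_filter result: column names plus
// row-major values. Illustrative only.
struct ReadFilterResult {
    columns: Vec<String>,
    rows: Vec<Vec<String>>,
}

// `Debug` prints the column names, then delegates to `Display` for rows.
impl fmt::Debug for ReadFilterResult {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "{}", self.columns.join(","))?;
        fmt::Display::fmt(self, f)
    }
}

// `Display` elides the column names and prints only the row values.
impl fmt::Display for ReadFilterResult {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for row in &self.rows {
            writeln!(f, "{}", row.join(","))?;
        }
        Ok(())
    }
}

fn main() {
    let result = ReadFilterResult {
        columns: vec!["region".into(), "count".into()],
        rows: vec![
            vec!["east".into(), "10".into()],
            vec!["west".into(), "3".into()],
        ],
    };
    assert_eq!(format!("{}", result), "east,10\nwest,3\n");
    assert_eq!(format!("{:?}", result), "region,count\neast,10\nwest,3\n");
}
```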
This commit adds an alternative implementation of `row_ids_equal` for
the `Plain` dictionary encoding, which uses SIMD intrinsics to improve
the performance of identifying all rows in the column containing a
specified `u32` integer.
The approach is as follows. First, the integer constant of interest is
packed into a 256 bit SIMD register. Then the column is iterated over
in chunks of size 8 (thus, 256 bits at a time). The expectation is that
for a column using this encoding it is likely most values will not match
an equality predicate, so the happy path is to compare the packed
register against each chunked register. This is done using the
`_mm256_cmpeq_epi32`[1] intrinsic, which returns a mask where each 32
bits is `0xFFFFFFFF` if the two values at that location in the register
are equal, or `0x00000000` otherwise.
Because the expectation is that most values don't match the id we want,
we check if all 32-bit values in this 256-bit mask register are `0`. If
the register's values are not all 0 then the register is inspected to
determine the locations where values match. The offsets of these values
are used to determine the row id to add to the result set.
On my laptop, benchmarking indicates that the SIMD implementation
increases throughput performance (finding all matching rows) by
~100%-390%.
This SIMD implementation is used automatically if the CPU supports
AVX2 instructions; otherwise it falls back to a non-SIMD
implementation.
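A minimal sketch of the approach, under some assumptions: the function names (`row_ids_equal_avx2`, `row_ids_equal_scalar`) are illustrative rather than the actual crate API, and the all-zero check is done here via `_mm256_movemask_epi8` as one way to take the happy path cheaply. The packed compare itself uses `_mm256_cmpeq_epi32` exactly as described above.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Scalar fallback: compare each value in turn.
fn row_ids_equal_scalar(col: &[u32], value: u32) -> Vec<u32> {
    col.iter()
        .enumerate()
        .filter(|(_, &v)| v == value)
        .map(|(i, _)| i as u32)
        .collect()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn row_ids_equal_avx2(col: &[u32], value: u32) -> Vec<u32> {
    let mut row_ids = Vec::new();
    // Pack the comparison value into all eight 32-bit lanes.
    let needle = _mm256_set1_epi32(value as i32);

    let chunks = col.chunks_exact(8);
    let remainder = chunks.remainder();
    let mut i = 0u32;
    for chunk in chunks {
        let vals = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        // Each lane of `mask` is 0xFFFFFFFF on equality, 0x00000000 otherwise.
        let mask = _mm256_cmpeq_epi32(vals, needle);
        // Happy path: a zero byte-mask means no lane matched in this chunk.
        let bits = _mm256_movemask_epi8(mask) as u32;
        if bits != 0 {
            // A matching 32-bit lane contributes four set bits in the
            // byte mask; map each matching lane back to a row id.
            for lane in 0..8 {
                if bits & (0xF << (lane * 4)) != 0 {
                    row_ids.push(i + lane);
                }
            }
        }
        i += 8;
    }
    // Trailing values that don't fill a whole 256-bit register.
    for (j, &v) in remainder.iter().enumerate() {
        if v == value {
            row_ids.push(i + j as u32);
        }
    }
    row_ids
}

// Runtime dispatch: use the AVX2 path when available, else the scalar one.
fn row_ids_equal(col: &[u32], value: u32) -> Vec<u32> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { row_ids_equal_avx2(col, value) };
        }
    }
    row_ids_equal_scalar(col, value)
}

fn main() {
    let col: Vec<u32> = (0..20).map(|i| i % 3).collect();
    assert_eq!(row_ids_equal(&col, 2), vec![2, 5, 8, 11, 14, 17]);
}
```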
[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_cmpeq_epi32&expand=774