* use the new packed reader and writer api to be compatible with current
etcd meta
* For the new packed writer API: column groups and paths are explicitly
defined by users and won't split column groups by memory in storage v2.
Packed writer follows the user-defined column groups to split arrow
record and write into the corresponding file path.
* For the new packed reader API: read paths are explicitly defined by
users.
related: #39173
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
issue: #34357
Go Parquet uses dictionary encoding by default, and it will fall back to
plain encoding if the dictionary size exceeds the dictionary size page
limit. Users can specify custom fallback encoding by using
`parquet.WithEncoding(ENCODING_METHOD)` in writer properties. However,
Go Parquet [fallbacks to plain
encoding](e65c1e295d/go/parquet/file/column_writer_types.gen.go.tmpl (L238))
rather than custom encoding method users provide. Therefore, this patch
only turns off dictionary encoding for the primary key.
With a 5 million auto ID primary key benchmark, the parquet file size
improves from 13.93 MB to 8.36 MB when dictionary encoding is turned
off, reducing primary key storage space by 40%.
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
See also #34483
Some lint issues are introduced due to lack of static check run. This PR
fixes these problems.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #34123
Benchmark case: The benchmark run the go benchmark function
`BenchmarkDeltalogFormat` which is put in the Files changed. It tests
the performance of serializing and deserializing from two different data
formats under a 10 million delete log dataset.
Metrics: The benchmarks measure the average time taken per operation
(ns/op), memory allocated per operation (MB/op), and the number of
memory allocations per operation (allocs/op).
| Test Name | Avg Time (ns/op) | Time Comparison | Memory Allocation
(MB/op) | Memory Comparison | Allocation Count (allocs/op) | Allocation
Comparison |
|---------------------------------|------------------|-----------------|---------------------------|-------------------|------------------------------|------------------------|
| one_string_format_reader | 2,781,990,000 | Baseline | 2,422 | Baseline
| 20,336,539 | Baseline |
| pk_ts_separate_format_reader | 480,682,639 | -82.72% | 1,765 | -27.14%
| 20,396,958 | +0.30% |
| one_string_format_writer | 5,483,436,041 | Baseline | 13,900 |
Baseline | 70,057,473 | Baseline |
| pk_and_ts_separate_format_writer| 798,591,584 | -85.43% | 2,178 |
-84.34% | 30,270,488 | -56.78% |
Both read and write operations show significant improvements in both
speed and memory allocation.
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
See also #33787
The parsing delete log is distributed in lots of places, which is not
recommended and hard to maintain.
This PR abstract common parsing logic into `DeleteLog.Parse` method to
unify implementation and make it easier to replace json parsing lib.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>