Commit Graph

176 Commits (17532517c611cafe5ec7a79bda47c9f296e82682)

Author SHA1 Message Date
Zhen Ye 48ba5fbfcd
fix: use sliding window for old version message lastConfirmedMessageID to prevent long catchup (#48390)
issue: #48389

Previously, all old version (v0) WAL messages shared the same
lastConfirmedMessageID pointing to the very first v0 message. When a
tailing scanner fell back to catchup mode (e.g., due to WAL ownership
change), it would restart from this extremely old position, causing
catchup times of 14+ minutes during which tsafe could not advance and
all search requests would time out.

This change replaces the fixed first-message ID with a configurable
sliding window (default size 30). The lastConfirmedMessageID now points
to the message N positions back, bounding the WAL replay distance on
fallback to at most N messages.
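
A minimal Go sketch of the sliding-window idea (the ring-buffer shape and
names are illustrative, not the actual WAL implementation): the last N
appended message IDs are kept in a ring, and the oldest one is reported
as the lastConfirmedMessageID, bounding replay on fallback to N messages.

```go
package main

import "fmt"

type MessageID int64 // stand-in for the real WAL message ID type

type lastConfirmedWindow struct {
	ring []MessageID
	size int
	next int // next slot to overwrite
	full bool
}

func newLastConfirmedWindow(size int) *lastConfirmedWindow {
	return &lastConfirmedWindow{ring: make([]MessageID, size), size: size}
}

// Observe records a newly appended message ID in the ring.
func (w *lastConfirmedWindow) Observe(id MessageID) {
	w.ring[w.next] = id
	w.next = (w.next + 1) % w.size
	if w.next == 0 {
		w.full = true
	}
}

// LastConfirmed returns the ID N positions back (the oldest in the window),
// so a catchup fallback replays at most N messages instead of the whole WAL.
func (w *lastConfirmedWindow) LastConfirmed() MessageID {
	if !w.full {
		return w.ring[0] // window not filled yet: oldest observed ID
	}
	return w.ring[w.next] // the next overwrite slot holds the oldest ID
}

func main() {
	w := newLastConfirmedWindow(30) // default window size from the commit
	for id := MessageID(1); id <= 100; id++ {
		w.Observe(id)
	}
	fmt.Println(w.LastConfirmed()) // 71: at most 30 messages behind
}
```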

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 11:20:34 +08:00
Zhen Ye 446f06eb02
enhance: Implement rate limiting in WAL append operations (#47179)
issue: #47178

This commit introduces a rate limiting mechanism for Write-Ahead Logging
(WAL) operations to prevent overload during high traffic. Key changes
include:

- Added `RateLimitObserver` to monitor and control the rate of DML
operations.
- Added an adaptive `RateLimitController` to apply the rate limiting
strategy (sketched below).
- The WAL slows down if the recovery storage is in catchup mode or node
memory usage is high.
- Updated `WAL` and related components to handle rate limit states,
including rejection and slowdown.
- Introduced new error codes for rate limit rejection in the streaming
error handling.
- Enhanced tests to cover the new rate limiting functionality.

These changes aim to improve the stability and performance of the
streaming service under load.
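
A hypothetical sketch of the observe-then-decide shape (state names and
thresholds are assumptions, not the actual Milvus policy): an observer
samples node signals, and the adaptive controller maps them to a
rate-limit state that the WAL append path consults before admitting DML.

```go
package main

import "fmt"

// RateLimitState is what the WAL append path checks before admitting DML.
type RateLimitState int

const (
	StateNormal   RateLimitState = iota // admit at full speed
	StateSlowdown                       // throttle appends
	StateReject                         // fail fast with a rate-limit error code
)

// signals are the observations fed into the controller; both inputs are
// modeled on the commit description, the exact fields are assumptions.
type signals struct {
	recoveryCatchup bool    // recovery storage is still catching up
	memoryUsage     float64 // fraction of node memory in use
}

// decide is the adaptive policy: slow down while recovery catches up or
// memory is elevated, and reject outright when memory is critically high.
func decide(s signals) RateLimitState {
	switch {
	case s.memoryUsage > 0.95:
		return StateReject
	case s.recoveryCatchup || s.memoryUsage > 0.80:
		return StateSlowdown
	default:
		return StateNormal
	}
}

func main() {
	fmt.Println(decide(signals{recoveryCatchup: true, memoryUsage: 0.50})) // 1 (slowdown)
	fmt.Println(decide(signals{memoryUsage: 0.97}))                        // 2 (reject)
}
```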

---------

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 16:19:26 +08:00
sijie-ni-0214 b0a6a75f2b
enhance: optimize qn load speed (#47423)
issue: https://github.com/milvus-io/milvus/issues/47422

---------

Signed-off-by: sijie-ni-0214 <sijie.ni@zilliz.com>
2026-03-05 16:55:21 +08:00
wei liu 220c691500
feat: [ExternalTable Part4] Support data mapping for external collections (#47730)
design doc:
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md

issue: https://github.com/milvus-io/milvus/issues/45881
## Summary
- Pre-allocate segment IDs in DataCoord, pass to DataNode for direct
final-path manifest writes (eliminating two-phase ID workflow)
- Add FFI bridges for file exploration (`ExploreFiles`, `GetFileInfo`)
and manifest creation (`CreateManifestForSegment`,
`ReadFragmentsFromManifest`)
- Implement fragment-to-segment balancing with configurable target rows
per segment (see the sketch after this list)
- Add `ExternalSpec` parser for external data format configuration
- Extend `UpdateExternalCollectionRequest` proto with schema, storage
config, and pre-allocated segment ID fields
- Add E2E test for external collection refresh with data verification
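
A hypothetical sketch of the fragment-to-segment balancing bullet above
(types and the greedy policy are illustrative, not the actual mapping
code): external-table fragments are packed into segments until each
segment reaches the configurable target row count.

```go
package main

import "fmt"

type Fragment struct {
	Path string
	Rows int64
}

// balance greedily groups fragments into segments of roughly targetRows
// each; the real flow also assigns pre-allocated segment IDs and writes
// per-segment manifests.
func balance(fragments []Fragment, targetRows int64) [][]Fragment {
	var segments [][]Fragment
	var current []Fragment
	var rows int64
	for _, f := range fragments {
		current = append(current, f)
		rows += f.Rows
		if rows >= targetRows {
			segments = append(segments, current)
			current, rows = nil, 0
		}
	}
	if len(current) > 0 {
		segments = append(segments, current) // tail segment under target
	}
	return segments
}

func main() {
	frags := []Fragment{{"a.parquet", 600}, {"b.parquet", 500}, {"c.parquet", 900}}
	fmt.Println(len(balance(frags, 1000))) // 2 segments
}
```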

> **Note**: This PR includes Part3 changes (PR #47303). After Part3 is
merged, this PR will be rebased to only contain Part4-specific changes.

## Test plan
- [x] Unit tests for `task_refresh_external_collection.go` (28 tests)
- [x] Unit tests for `task_update.go` and fragment utilities (40 tests)
- [x] Unit tests for FFI bridges (`exttable_test.go`, 9 tests)
- [x] Unit tests for `ExternalSpec` parser
- [x] Unit tests for paramtable config
- [x] Integration test with real Parquet files
- [x] `make lint-fix` passes
- [ ] E2E test with MinIO backend

---------

Signed-off-by: Jiquan Long <jiquan.long@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2026-03-04 18:09:21 +08:00
Zhen Ye 46a43fc3a5
fix: fast-fail ServerIDMismatch for node connections and increase walBalancer operationTimeout (#47981)
For node connections (isNode=true), ServerIDMismatch now returns
needRetry=false immediately instead of retrying 10 times with
exponential backoff (~52.6s). Retrying is futile because the NodeID
injected via the interceptor at connection time never changes during
retry. Coord connections keep existing retry behavior.

Also increase streaming.walBalancer.operationTimeout default from 30s to
30m.
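
A minimal sketch of the fast-fail rule (the predicate shape and error
value are illustrative, not the actual Milvus API):

```go
package main

import (
	"errors"
	"fmt"
)

var errServerIDMismatch = errors.New("server id mismatch")

// needRetry reports whether the connection error is worth retrying. For
// node connections the NodeID injected at connect time is fixed, so a
// ServerIDMismatch can never heal through retries and fails immediately.
func needRetry(err error, isNode bool) bool {
	if errors.Is(err, errServerIDMismatch) && isNode {
		return false // fast fail instead of ~52.6s of exponential backoff
	}
	return true // coord connections keep the existing retry behavior
}

func main() {
	fmt.Println(needRetry(errServerIDMismatch, true))  // false: fast fail
	fmt.Println(needRetry(errServerIDMismatch, false)) // true: coord retries
}
```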

issue: #46182

Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 22:41:20 +08:00
congqixia 1bd65fc1ce
enhance: remove deprecated lazy load code (#47590)
Related to #44452

Remove the deprecated lazy load feature which has been superseded by
warmup-related parameters. This cleanup includes:

- Remove AddFieldDataInfoForSealed from C++ segcore layer
- Remove IsLazyLoad() method and isLazyLoad field from segment
- Remove lazy load checks in proxy alterCollectionTask
- Remove DiskCache lazy load handling in search/retrieve paths
- Remove LazyLoadEnableKey constant and related helper functions
- Update mock files to reflect interface changes

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2026-02-10 14:14:44 +08:00
Zhen Ye ae16d43061
fix: use a smaller empty time tick filtering interval (#47470)
issue: #46540

Signed-off-by: chyezh <chyezh@outlook.com>
2026-02-03 15:38:09 +08:00
Zhen Ye 670f2cc5e8
enhance: streaming service is enabled only after the streaming node number is reached (#46981)
issue: #46980

Signed-off-by: chyezh <chyezh@outlook.com>
2026-01-12 14:47:26 +08:00
yihao.dai 9d9fe2273a
enhance: Always retry writing binlogs (#46850)
issue: https://github.com/milvus-io/milvus/issues/46848

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2026-01-07 16:07:24 +08:00
cai.zhang 0c200ff781
enhance: Limit the number of concurrent vector index builds per worker (#46773)
issue: #46772

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2026-01-07 15:47:25 +08:00
Zhen Ye c7b5c23ff6
enhance: filter the empty timetick from the consuming side (#46541)
issue: #46540

The empty timetick is just used to sync up the clock between the
different components in Milvus, so empty timeticks can be ignored once
we achieve LSN/MVCC semantics for the timetick. Currently, some
components still need the empty timetick to trigger operations such as
flush/tsafe, so we only slow the empty time tick down to roughly one
every 5 seconds.
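
A simplified sketch of that filtering rule, assuming a getter for the
latest MVCC timestamp a delegator still requires (names are
illustrative): timeticks below the required MVCC always pass, the first
timetick at or above it passes to notify pending waits, and further
empty timeticks are suppressed until the ~5s clock-sync interval elapses.

```go
package main

import (
	"fmt"
	"time"
)

type slowdowner struct {
	latestRequiredMVCC func() uint64 // e.g. GetLatestRequiredMVCCTimeTick
	lastRequired       uint64
	notified           bool // first tick >= required MVCC already passed
	lastEmitted        time.Time
	interval           time.Duration // ~5s clock-sync budget
}

func (s *slowdowner) ShouldPass(timetick uint64, now time.Time) bool {
	required := s.latestRequiredMVCC()
	if required != s.lastRequired {
		s.lastRequired, s.notified = required, false // new pending MVCC wait
	}
	switch {
	case timetick < required:
		return true // never filtered: keeps tsafe/flush waits unblocked
	case !s.notified:
		s.notified = true // first tick >= required notifies pending waits
		s.lastEmitted = now
		return true
	case now.Sub(s.lastEmitted) >= s.interval:
		s.lastEmitted = now // periodic clock sync, at most one per ~5s
		return true
	default:
		return false // frequent empty timetick: suppressed
	}
}

func main() {
	s := &slowdowner{latestRequiredMVCC: func() uint64 { return 100 }, interval: 5 * time.Second}
	now := time.Now()
	fmt.Println(s.ShouldPass(101, now))                    // true: notifies MVCC wait
	fmt.Println(s.ShouldPass(102, now.Add(time.Second)))   // false: suppressed
	fmt.Println(s.ShouldPass(103, now.Add(6*time.Second))) // true: 5s elapsed
}
```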

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: with LSN/MVCC semantics consumers only need (a) the
first timetick that advances the latest-required-MVCC to unblock
MVCC-dependent waits and (b) occasional periodic timeticks (~≤5s) for
clock synchronization—therefore frequent non-persisted empty timeticks
can be suppressed without breaking MVCC correctness.
- Logic removed/simplified: per-message dispatch/consumption of frequent
non-persisted empty timeticks is suppressed — an MVCC-aware filter
emptyTimeTickSlowdowner (internal/util/pipeline/consuming_slowdown.go)
short-circuits frequent empty timeticks in the stream pipeline
(internal/util/pipeline/stream_pipeline.go), and the WAL flusher
rate-limits non-persisted timetick dispatch to one emission per ~5s
(internal/streamingnode/server/flusher/flusherimpl/wal_flusher.go); the
delegator exposes GetLatestRequiredMVCCTimeTick to drive the filter
(internal/querynodev2/delegator/delegator.go).
- Why this does NOT introduce data loss or regressions: the slowdowner
always refreshes latestRequiredMVCCTimeTick via
GetLatestRequiredMVCCTimeTick and (1) never filters timeticks <
latestRequiredMVCCTimeTick (so existing tsafe/flush waits stay
unblocked) and (2) always lets the first timetick ≥
latestRequiredMVCCTimeTick pass to notify pending MVCC waits;
separately, WAL flusher suppression applies only to non-persisted
timeticks and still emits when the 5s threshold elapses, preserving
periodic clock-sync messages used by flush/tsafe.
- Enhancement summary (where it takes effect): adds
GetLatestRequiredMVCCTimeTick on ShardDelegator and
LastestMVCCTimeTickGetter, wires emptyTimeTickSlowdowner into
NewPipelineWithStream (internal/util/pipeline), and adds WAL flusher
rate-limiting + metrics
(internal/streamingnode/server/flusher/flusherimpl/wal_flusher.go,
pkg/metrics) to reduce CPU/dispatch overhead while keeping MVCC
correctness and periodic synchronization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2026-01-06 20:53:24 +08:00
wei liu 5f2e430941
enhance: Add channel-based node blacklist for LB policy retry (#46091)
issue: #46090
This change introduces a global node blacklist mechanism to immediately
cut off query traffic to failed delegators across all concurrent
requests.

Key features:
- Introduce ChannelBlacklist to track failed delegator nodes per channel
- When a query fails, the node is immediately blacklisted and excluded
from ALL subsequent requests (not just retries within the same request)
- Blacklisted nodes are automatically excluded during node selection
- Entries expire after configurable duration (default 30s) to allow
automatic recovery when nodes become healthy again
- Background cleanup loop removes expired entries periodically
- Add proxy.replicaBlacklistDuration and
proxy.replicaBlacklistCleanupInterval configuration parameters
- Blacklist can be disabled by setting duration to 0

Before this change:
- Failed nodes were only excluded within the same request's retry loop
- Concurrent requests would still attempt to query the failed node
- Each request had to experience its own failure before avoiding the
node

After this change:
- Once a node fails, it is immediately excluded from all requests
- New requests arriving during the blacklist period will skip the failed
node without experiencing any failure
- This significantly reduces latency spikes during node failures
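
A minimal sketch of the blacklist data structure (names are hypothetical,
not the exported Milvus types): entries are added on query failure,
consulted during node selection, and expire after a TTL; the real
implementation also runs a background cleanup loop to sweep expired
entries.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type channelBlacklist struct {
	mu       sync.RWMutex
	expireAt map[string]map[int64]time.Time // channel -> nodeID -> expiry
	ttl      time.Duration
}

func newChannelBlacklist(ttl time.Duration) *channelBlacklist {
	return &channelBlacklist{expireAt: make(map[string]map[int64]time.Time), ttl: ttl}
}

// Add blacklists a node for a channel immediately after a query failure,
// so ALL concurrent requests skip it, not just retries of the same request.
func (b *channelBlacklist) Add(channel string, nodeID int64) {
	if b.ttl == 0 {
		return // duration 0 disables the blacklist
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.expireAt[channel] == nil {
		b.expireAt[channel] = make(map[int64]time.Time)
	}
	b.expireAt[channel][nodeID] = time.Now().Add(b.ttl)
}

// Excluded reports whether the node is still blacklisted for the channel;
// expired entries allow automatic recovery once the node is healthy again.
func (b *channelBlacklist) Excluded(channel string, nodeID int64) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	exp, ok := b.expireAt[channel][nodeID]
	return ok && time.Now().Before(exp)
}

func main() {
	bl := newChannelBlacklist(30 * time.Second) // default duration
	bl.Add("by-dev-rootcoord-dml_0", 7)
	fmt.Println(bl.Excluded("by-dev-rootcoord-dml_0", 7)) // true until expiry
}
```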

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2026-01-06 11:01:29 +08:00
wei liu 975c91df16
feat: Add comprehensive snapshot functionality for collections (#44361)
issue: #44358

Implement complete snapshot management system including creation,
deletion, listing, description, and restoration capabilities across all
system components.

Key features:
- Create snapshots for entire collections
- Drop snapshots by name with proper cleanup
- List snapshots with collection filtering
- Describe snapshot details and metadata

Components added/modified:
- Client SDK with full snapshot API support and options
- DataCoord snapshot service with metadata management
- Proxy layer with task-based snapshot operations
- Protocol buffer definitions for snapshot RPCs
- Comprehensive unit tests with mockey framework
- Integration tests for end-to-end validation

Technical implementation:
- Snapshot metadata storage in etcd with proper indexing
- File-based snapshot data persistence in object storage
- Garbage collection integration for snapshot cleanup
- Error handling and validation across all operations
- Thread-safe operations with proper locking mechanisms
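
A hypothetical sketch of the snapshot lifecycle implied by the metadata
design (state names follow the auto-generated summary below; the
transition table is an assumption): a snapshot becomes visible only once
committed, and deletion is an explicit state so GC never reclaims files
still referenced by a live snapshot.

```go
package main

import "fmt"

type snapshotState int

const (
	statePending   snapshotState = iota // manifests being written, not yet visible
	stateCommitted                      // metadata committed in etcd; snapshot usable
	stateDeleting                       // tombstoned; GC may reclaim unreferenced files
)

// transitions encodes the two-phase lifecycle: create commits PENDING ->
// COMMITTED, drop moves COMMITTED -> DELETING, and an aborted create can
// go straight to DELETING for cleanup.
var transitions = map[snapshotState][]snapshotState{
	statePending:   {stateCommitted, stateDeleting},
	stateCommitted: {stateDeleting},
	stateDeleting:  {},
}

func canTransition(from, to snapshotState) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(statePending, stateCommitted))  // true
	fmt.Println(canTransition(stateDeleting, stateCommitted)) // false: never resurrect
}
```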

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant/assumption: snapshots are immutable point‑in‑time
captures identified by (collection, snapshot name/ID); etcd snapshot
metadata is authoritative for lifecycle (PENDING → COMMITTED → DELETING)
and per‑segment manifests live in object storage (Avro / StorageV2). GC
and restore logic must see snapshotRefIndex loaded
(snapshotMeta.IsRefIndexLoaded) before reclaiming or relying on
segment/index files.

- New capability added: full end‑to‑end snapshot subsystem — client SDK
APIs (Create/Drop/List/Describe/Restore + restore job queries),
DataCoord SnapshotWriter/Reader (Avro + StorageV2 manifests),
snapshotMeta in meta, SnapshotManager orchestration
(create/drop/describe/list/restore), copy‑segment restore
tasks/inspector/checker, proxy & RPC surface, GC integration, and
docs/tests — enabling point‑in‑time collection snapshots persisted to
object storage and restorations orchestrated across components.

- Logic removed/simplified and why: duplicated recursive
compaction/delta‑log traversal and ad‑hoc lookup code were consolidated
behind two focused APIs/owners (Handler.GetDeltaLogFromCompactTo for
delta traversal and SnapshotManager/SnapshotReader for snapshot I/O).
MixCoord/coordinator broker paths were converted to thin RPC proxies.
This eliminates multiple implementations of the same traversal/lookup,
reducing divergence and simplifying responsibility boundaries.

- Why this does NOT introduce data loss or regressions: snapshot
create/drop use explicit two‑phase semantics (PENDING → COMMIT/DELETING)
with SnapshotWriter writing manifests and metadata before commit; GC
uses snapshotRefIndex guards and
IsRefIndexLoaded/GetSnapshotBySegment/GetSnapshotByIndex checks to avoid
removing referenced files; restore flow pre‑allocates job IDs, validates
resources (partitions/indexes), performs rollback on failure
(rollbackRestoreSnapshot), and converts/updates segment/index metadata
only after successful copy tasks. Extensive unit and integration tests
exercise pending/deleting/GC/restore/error paths to ensure idempotence
and protection against premature deletion.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2026-01-06 10:15:24 +08:00
cai.zhang a16d04f5d1
feat: Support ttl field for entity level expiration (#46342)
issue: #46033

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Pull Request Summary: Entity-Level TTL Field Support

### Core Invariant and Design
This PR introduces **per-entity TTL (time-to-live) expiration** via a
dedicated TIMESTAMPTZ field as a fine-grained alternative to
collection-level TTL. The key invariant is **mutual exclusivity**:
collection-level TTL and entity-level TTL field cannot coexist on the
same collection. Validation is enforced at the proxy layer during
collection creation/alteration (`validateTTL()` prevents both being set
simultaneously).

### What Is Removed and Why
- **Global `EntityExpirationTTL` parameter** removed from config
(`configs/milvus.yaml`, `pkg/util/paramtable/component_param.go`). This
was the only mechanism for collection-level expiration. The removal is
safe because:
- The collection-level TTL path (`isEntityExpired(ts)` check) remains
intact in the codebase for backward compatibility
- TTL field check (`isEntityExpiredByTTLField()`) is a secondary path
invoked only when a TTL field is configured
- Existing deployments using collection TTL can continue without
modification
  
The global parameter was removed specifically because entity-level TTL
makes per-entity control redundant with a collection-wide setting, and
the PR chooses one mechanism per collection rather than layering both.

### No Data Loss or Behavior Regression
**TTL filtering logic is additive and safe:**
1. **Collection-level TTL unaffected**: The `isEntityExpired(ts)` check
still applies when no TTL field is configured; callers of
`EntityFilter.Filtered()` pass `-1` as the TTL expiration timestamp when
no field exists, causing `isEntityExpiredByTTLField()` to return false
immediately
2. **Null/invalid TTL values treated safely**: Rows with null TTL or TTL
≤ 0 are marked as "never expire" (using sentinel value `int64(^uint64(0)
>> 1)`) and are preserved across compactions; percentile calculations
only include positive TTL values
3. **Query-time filtering automatic**: TTL filtering is transparently
added to expression compilation via `AddTTLFieldFilterExpressions()`,
which appends `(ttl_field IS NULL OR ttl_field > current_time)` to the
filter pipeline. Entities with null TTL always pass the filter
4. **Compaction triggering granular**: Percentile-based expiration (20%,
40%, 60%, 80%, 100%) allows configurable compaction thresholds via
`SingleCompactionRatioThreshold`, preventing premature data deletion

### Capability Added: Per-Entity Expiration with Data Distribution
Awareness
Users can now specify a TIMESTAMPTZ collection property `ttl_field`
naming a schema field. During data writes, TTL values are collected per
segment and percentile quantiles (5-value array) are computed and stored
in segment metadata. At query time, the TTL field is automatically
filtered. At compaction time, segment-level percentiles drive
expiration-based compaction decisions, enabling intelligent compaction
of segments where a configurable fraction of data has expired (e.g.,
compact when 40% of rows are expired, controlled by threshold ratio).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
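
A small sketch of the TTL-value handling described above (function names
are illustrative): null or non-positive TTLs map to the never-expire
sentinel `int64(^uint64(0) >> 1)` (i.e. math.MaxInt64), and an entity is
expired once its TTL timestamp is at or before the current time.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const neverExpire = int64(math.MaxInt64) // sentinel: row never expires

// normalizeTTL mirrors the "null/invalid treated safely" rule: rows with
// a null TTL (valid=false) or TTL <= 0 are preserved across compactions.
func normalizeTTL(ttl int64, valid bool) int64 {
	if !valid || ttl <= 0 {
		return neverExpire
	}
	return ttl
}

// expiredByTTLField is the per-entity check; query-time filtering applies
// the equivalent predicate (ttl_field IS NULL OR ttl_field > current_time).
func expiredByTTLField(ttl int64, now int64) bool {
	return ttl != neverExpire && ttl <= now
}

func main() {
	now := time.Now().UnixMicro()
	fmt.Println(expiredByTTLField(normalizeTTL(0, false), now))       // false: null TTL never expires
	fmt.Println(expiredByTTLField(normalizeTTL(now-1000, true), now)) // true: already expired
}
```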

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2026-01-05 10:27:24 +08:00
wei liu 293838bb67
enhance: add delegator catching up streaming data state tracking (#46551)
issue: #46550
- Add CatchUpStreamingDataTsLag parameter to control tolerable lag
  threshold for delegator to be considered caught up
- Add catchingUpStreamingData field in delegator to track whether
  delegator has caught up with streaming data
- Add catching_up_streaming_data field in LeaderViewStatus proto
- Check catching up status in CheckDelegatorDataReady, return not
  ready when delegator is still catching up streaming data
- Add unit tests for the new functionality

When tsafe lag exceeds the threshold, the distribution will not be
considered serviceable, preventing queries from timing out in waitTSafe.
This is useful when streaming message queue consumption is slow.
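
A minimal sketch of the caught-up condition (the function name is
illustrative; the tolerance corresponds to CatchUpStreamingDataTsLag,
default 1s):

```go
package main

import (
	"fmt"
	"time"
)

// caughtUp mirrors (latestTs - delegatorTsafe) < tolerance, with the
// timestamps expressed here as plain time.Time for readability.
func caughtUp(latestTs, delegatorTsafe time.Time, tolerance time.Duration) bool {
	return latestTs.Sub(delegatorTsafe) < tolerance
}

func main() {
	now := time.Now()
	fmt.Println(caughtUp(now, now.Add(-500*time.Millisecond), time.Second)) // true: serviceable
	fmt.Println(caughtUp(now, now.Add(-3*time.Second), time.Second))        // false: not ready
}
```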

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: a delegator must not be considered serviceable while
its tsafe lags behind the latest committed timestamp beyond a
configurable tolerance; a delegator is "caught-up" only when
(latestTsafe - delegator.GetTSafe()) < CatchUpStreamingDataTsLag
(configured by queryNode.delegator.catchUpStreamingDataTsLag, default
1s).
- New capability and where it takes effect: adds streaming-catchup
tracking to QueryNode/QueryCoord — an atomic catchingUpStreamingData
flag on shardDelegator (internal/querynodev2/delegator/delegator.go), a
new param CatchUpStreamingDataTsLag
(pkg/util/paramtable/component_param.go), and a
LeaderViewStatus.catching_up_streaming_data field in the proto
(pkg/proto/query_coord.proto). The flag is exposed in
GetDataDistribution (internal/querynodev2/services.go) and used by
QueryCoord readiness checks
(internal/querycoordv2/utils/util.go::CheckDelegatorDataReady) to reject
leaders that are still catching up.
- What logic is simplified/added (not removed): instead of relying
solely on segment distribution/worker heartbeats, the PR adds an
explicit readiness gate that returns "not available" when the delegator
reports catching-up-streaming-data. This is strictly additive — no
existing checks are removed; the new precondition runs before segment
availability validation to prevent premature routing to slow-consuming
delegators.
- Why this does NOT cause data loss or regress behavior: the change only
controls serviceability visibility and routing — it never drops or
mutates data. Concretely: shardDelegator starts with
catchingUpStreamingData=true and flips to false in UpdateTSafe once the
sampled lag falls below the configured threshold
(internal/querynodev2/delegator/delegator.go::UpdateTSafe). QueryCoord
will short-circuit in CheckDelegatorDataReady when
leader.Status.GetCatchingUpStreamingData() is true
(internal/querycoordv2/utils/util.go), returning a channel-not-available
error before any segment checks; when the flag clears, existing
segment-distribution checks (same code paths) resume. Tests added cover
both catching-up and caught-up paths
(internal/querynodev2/delegator/delegator_test.go,
internal/querycoordv2/utils/util_test.go,
internal/querynodev2/services_test.go), demonstrating convergence
without changed data flows or deletion of data.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-29 17:15:21 +08:00
yihao.dai 5b97cb70a0
enhance: Support delaying scanner startup (#46369)
Introduce a ScannerStartupDelay configuration to enable WAL write-only
recovery, allowing fence messages to be persisted during
primary–secondary switchover when the StreamingNode is trapped in crash
loops.
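
A hypothetical sketch of write-only recovery (types and the delay wiring
are illustrative): the WAL append path is usable immediately, so fence
messages can be persisted, while scanner (read-path) startup is deferred
by the configured delay.

```go
package main

import (
	"fmt"
	"time"
)

type wal struct{ scannerStarted chan struct{} }

// openWAL makes the append path usable right away; only consumption is
// delayed, which is what lets a crash-looping node still persist a fence.
func openWAL(scannerStartupDelay time.Duration) *wal {
	w := &wal{scannerStarted: make(chan struct{})}
	time.AfterFunc(scannerStartupDelay, func() { close(w.scannerStarted) })
	return w
}

func (w *wal) Append(msg string) { fmt.Println("persisted:", msg) }

func main() {
	w := openWAL(100 * time.Millisecond)
	w.Append("fence") // write-only recovery: fence persists before scanning
	<-w.scannerStarted
	fmt.Println("scanner started")
}
```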

issue: https://github.com/milvus-io/milvus/issues/46368

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a configurable WAL scanner pause/resume and a consumer request
flag to optionally ignore pause signals.

* **Metrics**
* Added a scanner pause gauge and pause-duration tracking for WAL
scanning.

* **Tests**
* Added coverage for pause-consumption behavior and cleanup in stream
client tests.

* **Chores**
* Consolidated flush-all logging into a single field and added a helper
for bulk message conversion.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-12-24 11:53:19 +08:00
Zhen Ye 15f8dfc7ad
enhance: introduce a tolerance duration to delay the drop operation (#46251)
issue: #46214

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-10 19:57:13 +08:00
Zhen Ye 73fdaafb2d
fix: interleave the Go and C++ logs (#46004)
issue: #45640

Signed-off-by: chyezh <chyezh@outlook.com>
2025-12-03 14:25:11 +08:00
wei liu e70c01362d
enhance: Add resource exhaustion querynode penalty policy (#45808)
issue: #40513
For a QueryNode that returns a resource-exhausted error, add a penalty
duration to it and suspend loading new resources until the penalty
duration expires.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-12-02 16:59:11 +08:00
Zhen Ye c3fe6473b8
enhance: support async write syncer for milvus logging (#45805)
issue: #45640

- Logs may be dropped if the underlying file system is busy.
- Use an async write syncer so that log operations cannot block the
major Milvus flows.
- Remove some log dependencies from the util functions to avoid a
dependency loop.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-28 17:43:11 +08:00
tinswzy 1427825133
enhance: improve WAL retention strategy (#45350)
issue: #44369
woodpecker related issue:
[zilliztech/woodpecker#59](https://github.com/zilliztech/woodpecker/issues/59)

Refactor the WAL retention logic in Milvus StreamingNode:
- Remove the simple sampling-based truncation mechanism.
- After flush, WAL data is directly truncated.
- The retention control is now delegated to the underlying message queue
(MQ) implementation.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2025-11-23 21:41:05 +08:00
Zhen Ye c8073eb90b
fix: panic on double close of the ack broadcast channel (#45661)
issue: #45635

Signed-off-by: chyezh <chyezh@outlook.com>
2025-11-19 14:25:05 +08:00
Zhen Ye 309d564796
enhance: support collection and index with WAL-based DDL framework (#45033)
issue: #43897

- Part of the collection/index related DDL is now implemented by the
WAL-based DDL framework.
- Support the following message types in the WAL: CreateCollection,
DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex,
DropIndex.
- Part of the collection/index related DDL can now be synced by the new
CDC.
- Refactor some UTs for collection/index DDL.
- Add a Tombstone scheduler to manage the tombstone GC for collection
and partition meta.
- Move the vchannel allocation into the streaming pchannel manager.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-10-30 14:24:08 +08:00
yihao.dai f61952adfc
fix: Fix compaction task blocking due to executor loop exit (#44543)
1. Use a goroutine pool instead of a semaphore (see the sketch below).
2. Remove the compaction executor from the pipeline, since in streaming
mode the pipeline should be decoupled from compaction.

issue: https://github.com/milvus-io/milvus/issues/44541
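
A minimal sketch of the pool shape from item 1 (names are illustrative):
a fixed set of workers drains a task channel, so a single failing task
can no longer make the executor loop exit and block all subsequent
compactions.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool starts a fixed number of workers that consume tasks until the
// channel is closed; a panicking task is recovered so the worker survives.
func runPool(workers int, tasks <-chan func()) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				func() {
					defer func() { _ = recover() }() // one bad task never kills a worker
					task()
				}()
			}
		}()
	}
	return &wg
}

func main() {
	tasks := make(chan func(), 4)
	wg := runPool(2, tasks)
	for i := 0; i < 4; i++ {
		i := i
		tasks <- func() { fmt.Println("compaction task", i) }
	}
	close(tasks)
	wg.Wait()
}
```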

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-09-28 11:03:04 +08:00
Zhen Ye 19e5e9f910
enhance: broadcaster will lock resource until message acked (#44508)
issue: #43897

- Return the LastConfirmedMessageID from the WAL append operation.
- Add a resource-key-based locker for the broadcast-ack operation to
protect the coord state when executing DDL (sketched below).
- The resource-key-based locker is held until the broadcast operation is
acked.
- ResourceKey supports shared and exclusive locks.
- Add FastAck to execute the ack right after the broadcast is done,
speeding up DDL.
- The ack callback now supports the broadcast message result.
- Add a tombstone to the broadcaster to avoid repeatedly committing DDL
and the ABA issue.
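
A hypothetical sketch of the resource-key locker (names are illustrative,
not the actual broadcaster API): each resource key maps to a
reader-writer lock taken in shared or exclusive mode at broadcast time
and released only when the broadcast is acked.

```go
package main

import (
	"fmt"
	"sync"
)

type keyLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func newKeyLocker() *keyLocker {
	return &keyLocker{locks: make(map[string]*sync.RWMutex)}
}

func (l *keyLocker) get(key string) *sync.RWMutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks[key] == nil {
		l.locks[key] = &sync.RWMutex{}
	}
	return l.locks[key]
}

// Lock acquires the key until the matching ack: exclusive for DDL that
// mutates the resource, shared for DDL that only depends on it.
func (l *keyLocker) Lock(key string, exclusive bool) (unlockOnAck func()) {
	rw := l.get(key)
	if exclusive {
		rw.Lock()
		return rw.Unlock
	}
	rw.RLock()
	return rw.RUnlock
}

func main() {
	l := newKeyLocker()
	ack := l.Lock("collection:100", true) // held across the broadcast
	// ... broadcast the DDL message, wait for the ack ...
	ack()
	fmt.Println("acked and released")
}
```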

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-09-24 20:58:05 +08:00
jiaqizho 338ed2fed4
enhance: Introduce sparse filter in query (#44347)
issue: #44373

The current commit implements sparse filtering in query tasks using the
statistical information (Bloom filter/MinMax) of the Primary Key (PK).

The statistical information of the PK is bound to the segment during the
segment loading phase. A new filter has been added to the segment filter
to enable the sparse filtering functionality.
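
A simplified sketch of the PK-based pruning (types are illustrative): a
segment is skipped when the queried PK falls outside its MinMax range or
when the Bloom filter definitely excludes it. Bloom filters have false
positives but no false negatives, so a negative probe is a safe skip.

```go
package main

import "fmt"

type pkStats struct {
	min, max   int64
	mayContain func(pk int64) bool // stand-in for the Bloom filter probe
}

// skipSegment returns true when the segment provably cannot hold the PK.
func skipSegment(s pkStats, pk int64) bool {
	if pk < s.min || pk > s.max {
		return true // outside the MinMax range
	}
	return !s.mayContain(pk) // bloom says definitely absent
}

func main() {
	seg := pkStats{min: 100, max: 200, mayContain: func(pk int64) bool { return pk%2 == 0 }}
	fmt.Println(skipSegment(seg, 50))  // true: outside MinMax
	fmt.Println(skipSegment(seg, 151)) // true: bloom excludes it
	fmt.Println(skipSegment(seg, 150)) // false: segment must be scanned
}
```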

Signed-off-by: jiaqizho <jiaqi.zhou@zilliz.com>
2025-09-23 09:58:09 +08:00
Bingyi Sun 94d53a5ac6
feat: encode cluster id in auto id (#44471)
https://github.com/milvus-io/milvus/issues/44326
prev:
`[physical_ts][logical_ts]`
after:
`[sign_bit][cluster_id][physical_ts][logical_ts]`
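
A worked sketch of the packing (the field widths are assumptions; the
commit does not state them): packing and unpacking are pure bit
operations over one int64.

```go
package main

import "fmt"

// Assumed widths: 1 sign bit + 4 cluster bits + 41 physical (ms) bits +
// 18 logical bits = 64 bits total.
const (
	logicalBits  = 18
	physicalBits = 41
	clusterBits  = 4
)

func pack(clusterID, physical, logical int64) int64 {
	return clusterID<<(physicalBits+logicalBits) |
		physical<<logicalBits |
		logical // sign bit stays 0, so IDs remain positive int64s
}

func unpack(id int64) (clusterID, physical, logical int64) {
	logical = id & (1<<logicalBits - 1)
	physical = (id >> logicalBits) & (1<<physicalBits - 1)
	clusterID = id >> (physicalBits + logicalBits)
	return
}

func main() {
	id := pack(3, 1_700_000_000_000, 42) // cluster 3, a ms timestamp, logical 42
	c, p, l := unpack(id)
	fmt.Println(c, p, l) // 3 1700000000000 42
}
```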

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-09-22 10:40:02 +08:00
wei liu 6d4961b978
enhance: Refactor balance checker with priority queue (#43992)
issue: #43858
Refactor the balance checker implementation to use priority queues for
managing collection balance operations, improving processing efficiency
and order control.

Changes include:
- Export priority queue interfaces (Item, BaseItem, PriorityQueue)
- Replace collection round-robin with priority-based queue system
- Add BalanceCheckCollectionMaxCount configuration parameter
- Optimize balance task generation with batch processing limits
- Refactor processBalanceQueue method for different strategies
- Enhance test coverage with comprehensive unit tests

The new priority queue system processes collections based on row count
or collection ID order, providing better control over balance operation
priorities and resource utilization.
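
A compact sketch of the priority-queue approach using Go's
container/heap (field names are illustrative, not the exported
interfaces): collections with more rows are balanced first, with a
per-round batch limit like BalanceCheckCollectionMaxCount.

```go
package main

import (
	"container/heap"
	"fmt"
)

type collItem struct {
	collectionID int64
	rowCount     int64
}

type balanceQueue []collItem

func (q balanceQueue) Len() int           { return len(q) }
func (q balanceQueue) Less(i, j int) bool { return q[i].rowCount > q[j].rowCount } // max-heap by rows
func (q balanceQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *balanceQueue) Push(x any)        { *q = append(*q, x.(collItem)) }
func (q *balanceQueue) Pop() any {
	old := *q
	item := old[len(old)-1]
	*q = old[:len(old)-1]
	return item
}

func main() {
	q := &balanceQueue{{100, 10}, {101, 500}, {102, 50}}
	heap.Init(q)
	maxPerRound := 2 // batch processing limit per check round
	for i := 0; i < maxPerRound && q.Len() > 0; i++ {
		item := heap.Pop(q).(collItem)
		fmt.Println("balance collection", item.collectionID) // 101, then 102
	}
}
```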

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-09-19 17:46:01 +08:00
Bingyi Sun 5cd2d99799
enhance: Revert "feat: encode cluster id in auto id (#44324)" (#44426)
This reverts commit 7af1594103
2025-09-17 17:56:01 +08:00
Bingyi Sun 7af1594103
feat: encode cluster id in auto id (#44324)
https://github.com/milvus-io/milvus/issues/44326
prev:
`[physical_ts][logical_ts]`
after:
`[sign_bit][cluster_id][physical_ts][logical_ts]`

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-09-17 16:56:01 +08:00
Zhen Ye a86b6f2a54
enhance: extend the stats manage at streaming shard manager for L0 (#43371)
issue: #42416

- Rename InsertMetric to ModifiedMetric.
- Add the L0 control configuration.
- Collect some of the current L0 state.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-08-18 20:41:46 +08:00
yihao.dai 50f621abf2
fix: Fix compaction failure due to ID exhaustion (#43699)
Change default `compaction.preAllocateIDExpansionFactor` to 10000.

issue: https://github.com/milvus-io/milvus/issues/43673

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-08-01 19:17:37 +08:00
yihao.dai a29b3272b0
fix: Improve import memory management to prevent OOM (#43568)
1. Use blocking memory allocation to wait until memory becomes available
(sketched below)
2. Perform memory allocation at the file level instead of per task
3. Limit Parquet file reader batch size to prevent excessive memory
consumption
4. Limit import buffer size from 20% to 10% of total memory
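
A minimal sketch of the blocking, file-level allocation from item 1,
using a weighted semaphore (the budget and names are illustrative): an
import file waits until its estimated memory can be reserved instead of
failing or overcommitting.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

func main() {
	budget := int64(1 << 30) // e.g. the slice of total memory reserved for import
	mem := semaphore.NewWeighted(budget)

	importFile := func(name string, estBytes int64) {
		// Acquire blocks until estBytes can be reserved from the budget.
		if err := mem.Acquire(context.Background(), estBytes); err != nil {
			return
		}
		defer mem.Release(estBytes)
		fmt.Println("importing", name)
	}

	importFile("part-0.parquet", 512<<20)
	importFile("part-1.parquet", 768<<20)
}
```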

issue: https://github.com/milvus-io/milvus/issues/43387,
https://github.com/milvus-io/milvus/issues/43131

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-07-28 21:25:35 +08:00
yihao.dai 9fbd41a97d
fix: Adjust binlog and parquet reader buffer size for import (#43495)
1. Modify the binlog reader to stop reading a fixed 4096 rows and
instead use the calculated bufferSize to avoid generating small binlogs.
2. Use a fixed bufferSize (32MB) for the Parquet reader to prevent OOM.

issue: https://github.com/milvus-io/milvus/issues/43387

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-07-23 21:28:54 +08:00
Zhen Ye 07fa2cbdd3
enhance: wal balance considers the wal status on the streamingnode (#43265)
issue: #42995

- Don't balance the WAL if the producing-consuming lag is too long.
- Don't balance if rebalance is set to false.
- Don't balance if the WAL was balanced recently.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-18 11:10:51 +08:00
cai.zhang 6989e18599
enhance: Move sort stats task to sort compaction (#42562)
issue: #42560

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-07-08 20:22:47 +08:00
Zhen Ye ed9aa1d4db
fix: limit GC concurrency to the CPU count (#43165)
issue: #42833

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-08 10:46:46 +08:00
yihao.dai 9cbd194c6b
fix: Prevent import from generating small binlogs (#43132)
- Introduce dynamic buffer sizing to avoid generating small binlogs
during import
- Refactor import slot calculation based on CPU and memory constraints
- Implement dynamic pool sizing for sync manager and import tasks
according to CPU core count

issue: https://github.com/milvus-io/milvus/issues/43131

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-07-07 21:32:47 +08:00
Zhen Ye e97e44d56e
enhance: limit the GC concurrency when CPU usage is high (#43059)
issue: #42833

Signed-off-by: chyezh <chyezh@outlook.com>
2025-07-04 09:22:43 +08:00
Zhen Ye 8367e4ec6a
fix: set 72h for wal retention (#42910)
issue: #42706

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-27 17:36:43 +08:00
Zhen Ye a081906fb4
enhance: smaller backoff configuration for wal balancer to make faster recovery (#42869)
issue: #42835

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-23 10:32:40 +08:00
cai.zhang 8f8ffe9989
fix: Reduce task slot for standalone to 1/4 of normal datanode (#42808)
issue: #42129

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2025-06-20 16:38:46 +08:00
Zhen Ye 1f66b650e9
fix: pulsar cannot work properly if the backlog is exceeded (#42653)
issue: #42649

- The sync operations of different pchannels are now concurrent.
- Add an option to automatically notify when the backlog is cleared.
- Make the Pulsar walimpls recoverable when the backlog is exceeded.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-13 14:28:37 +08:00
yihao.dai 86876682da
enhance: Enhance import integration tests and logs (#42612)
1. Optimize the import process: skip subsequent steps and mark the task
as complete if the number of imported rows is 0.
2. Improve import integration tests:
 a. Add a test to verify that autoIDs are not duplicated
 b. Add a test for the corner case where all data is deleted
 c. Shorten test execution time
3. Enhance import logging:
 a. Print imported segment information upon completion
 b. Include file name in failure logs

issue: https://github.com/milvus-io/milvus/issues/42488,
https://github.com/milvus-io/milvus/issues/42518

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-06-12 20:02:35 +08:00
wei liu e7c0a6ffbb
enhance: Refine QueryNode task parallelism based on CPU core count (#42166)
issue: #42165
Implement dynamic task execution capacity calculation based on QueryNode
CPU core count instead of static configuration for better resource
utilization.

Changes include:
- Add CpuCoreNum() method and WithCpuCoreNum() option to NodeInfo
- Implement GetTaskExecutionCap() for dynamic capacity calculation
- Add QueryNodeTaskParallelismFactor parameter for tuning
- Update proto definition to include cpu_core_num field
- Add unit tests for new functionality

This allows QueryCoord to automatically adjust task parallelism based on
actual hardware resources.
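
A small sketch of the dynamic capacity rule (the names mirror the
bullets above, but the exact formula is an assumption):

```go
package main

import "fmt"

// getTaskExecutionCap scales per-node task parallelism with the CPU core
// count reported by the QueryNode, tuned by a parallelism factor, instead
// of a static configuration value.
func getTaskExecutionCap(cpuCoreNum int, parallelismFactor float64) int {
	if cpuCoreNum <= 0 {
		return 1 // node did not report cores: fall back to the minimum
	}
	capacity := int(float64(cpuCoreNum) * parallelismFactor)
	if capacity < 1 {
		capacity = 1
	}
	return capacity
}

func main() {
	fmt.Println(getTaskExecutionCap(16, 1.5)) // 24 concurrent tasks
	fmt.Println(getTaskExecutionCap(0, 1.5))  // 1: safe fallback
}
```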

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-11 13:20:35 +08:00
Zhen Ye 43f0c56ce7
fix: limit the concurrency of zstd compression and decrease the memory usage of binlog generation (#42630)
issue: #42028

- Limit the concurrency of zstd compression.
- zstd.go is modified from
`github.com/apache/arrow/go/v17/parquet/compress/zstd.go`.
- May be related to #42129.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-11 09:06:34 +08:00
yihao.dai 837349dead
enhance: Adjust default import buffer size (#42541)
Increase insert buffer size from 16MB to 64MB, while keeping delete
buffer size at 16MB.

issue: https://github.com/milvus-io/milvus/issues/42518

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-06-09 13:02:33 +08:00
wei liu 8511881d3f
enhance: Increase search/query retry times on proxy before timeout (#40438)
issue: #39379

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-06 18:12:32 +08:00
Zhen Ye 0567f512b3
fix: streamingnode gets stuck when stopping (#42501)
issue: #42498

- fix: sealed segments cannot be flushed after upgrading
- fix: panic when getting mvcc while upgrading
- ignore the L0 segments during graceful stop of the querynode.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-05 12:22:31 +08:00
Zhen Ye b94cee2413
fix: growing segment from old arch is not flushed after upgrading (#42164)
issue: #42162

- enhance: add a read-ahead buffer size (issue #42129)
- fix: the rocksmq consumer's close operation may get stuck
- fix: growing segments from the old arch are not flushed after upgrading

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-05-29 23:00:28 +08:00