Merge pull request #6324 from influxdata/dom/drop-wal-segments

feat(ingester2): drop WAL segments after persist
2022-12-02 17:42:09 +00:00 · 2022-12-02 17:42:09 +00:00 · 533a6581be
parent 464fcba98f 26bf54d041
commit 533a6581be
1 changed files with 53 additions and 0 deletions
--- a/ingester2/src/wal/rotate_task.rs
+++ b/ingester2/src/wal/rotate_task.rs
@@ -29,6 +29,49 @@ pub(crate) async fn periodic_rotation(
            "rotated wal"
        );

+        // TEMPORARY HACK: wait 5 seconds for in-flight writes to the old WAL
+        // segment to complete before draining the partitions.
+        //
+        // This can occur because writes to the WAL & buffer tree are not atomic
+        // (avoiding a serialising mutex in the write path).
+        //
+        // A flawed solution would be to have this code read the current
+        // SequenceNumber after rotation, and then wait until at least that
+        // sequence number has been buffered in the BufferTree. This may work in
+        // most cases, but is racy / not deterministic - writes are not ordered,
+        // so sequence number 5 might be buffered before sequence number 1.
+        //
+        // As a temporary hack, wait 5 seconds for in-flight writes to complete
+        // (which should be more than enough time) before proceeding under the
+        // assumption that they have indeed completed, and all writes from the
+        // previous WAL segment are now buffered. Because they're buffered, the
+        // persist operation performed next will persist all the writes that
+        // were in the previous WAL segment, and therefore at the end of the
+        // persist operation the WAL segment can be dropped.
+        //
+        // The potential downside of this hack is that, in the very unlikely
+        // situation that an in-flight write has not completed before the
+        // persist operation starts (after the 5 second sleep), the WAL entry
+        // for it is dropped anyway. That write then has reduced durability
+        // until it is persisted next time, and is lost if the ingester
+        // crashes before the next rotation.
+        //
+        // In the future, a proper fix will be to keep the set of sequence
+        // numbers written to each partition buffer, and each WAL segment, as a
+        // bitmap, and after persistence submit the partition's bitmap to the
+        // WAL for it to do a set difference to derive the remaining sequence
+        // IDs, and therefore number of references to the WAL segment. Once the
+        // set of remaining IDs is empty (all data is persisted), the segment is
+        // safe to delete. This content-addressed reference counting technique
+        // has the added advantage of working even with parallel / out-of-order
+        // / hot partition persists that span WAL segments, and means there's no
+        // special code path between "hot partition persist" and "wal rotation
+        // persist" - it all works the same way!
+        //
+        // TODO: do this properly as described above.
+
+        tokio::time::sleep(Duration::from_secs(5)).await;
+
        // Drain the BufferTree of partition data and persist each one.
        //
        // Writes that landed into the partition buffer after the rotation but
@@ -86,6 +129,16 @@ pub(crate) async fn periodic_rotation(
            closed_id = %stats.id(),
            "partitions persisted"
        );
+
+        handle
+            .delete(stats.id())
+            .await
+            .expect("failed to drop wal segment");
+
+        info!(
+            closed_id = %stats.id(),
+            "dropped persisted wal segment"
+        );
    }
 }
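
Note: the "proper fix" described in the code comment (tracking sequence numbers per WAL segment and taking a set difference after each persist) could look roughly like the sketch below. It is an illustration only, not code from this PR: SegmentRefs, record_write and mark_persisted are hypothetical names, the ID types are plain u64 stand-ins for the real ingester2 types, and a BTreeSet is used in place of a real bitmap.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins; the real ingester2 ID types are not shown in the diff.
type SegmentId = u64;
type SequenceNumber = u64;

/// For each WAL segment, the sequence numbers written to it that have not
/// yet been persisted. A BTreeSet stands in for the bitmap in the comment.
#[derive(Debug, Default)]
struct SegmentRefs {
    outstanding: HashMap<SegmentId, BTreeSet<SequenceNumber>>,
}

impl SegmentRefs {
    /// Record that `seq` was appended to WAL segment `segment`.
    fn record_write(&mut self, segment: SegmentId, seq: SequenceNumber) {
        self.outstanding.entry(segment).or_default().insert(seq);
    }

    /// Subtract the sequence numbers of a persisted partition from every
    /// segment (set difference). Returns the IDs of segments whose remaining
    /// set became empty, in ascending order; those are safe to delete.
    fn mark_persisted(&mut self, persisted: &BTreeSet<SequenceNumber>) -> Vec<SegmentId> {
        let mut droppable = Vec::new();
        self.outstanding.retain(|id, remaining| {
            remaining.retain(|seq| !persisted.contains(seq));
            if remaining.is_empty() {
                droppable.push(*id);
                false // forget the segment, nothing references it any more
            } else {
                true
            }
        });
        droppable.sort_unstable();
        droppable
    }
}

fn main() {
    let mut refs = SegmentRefs::default();

    // Segment 1 holds writes 1..=3, segment 2 holds write 4.
    for seq in 1..=3 {
        refs.record_write(1, seq);
    }
    refs.record_write(2, 4);

    // A persist covering writes 2 and 4 spans both segments, out of order.
    // Only segment 2 is fully persisted afterwards.
    assert_eq!(refs.mark_persisted(&BTreeSet::from([4, 2])), vec![2]);

    // Persisting the remaining writes releases segment 1 as well.
    assert_eq!(refs.mark_persisted(&BTreeSet::from([1, 3])), vec![1]);
}
```

Because a segment is released only when its own set is empty, the order in which writes are buffered or persisted does not matter, which is the property the 5 second sleep cannot guarantee.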