Merge pull request #6324 from influxdata/dom/drop-wal-segments

feat(ingester2): drop WAL segments after persist
Dom 2022-12-02 17:42:09 +00:00 committed by GitHub
commit 533a6581be
GPG Key ID: 4AEE18F83AFDEB23
1 changed file with 53 additions and 0 deletions

@@ -29,6 +29,49 @@ pub(crate) async fn periodic_rotation(
"rotated wal"
);
// TEMPORARY HACK: wait 5 seconds for in-flight writes to the old WAL
// segment to complete before draining the partitions.
//
// This can occur because writes to the WAL & buffer tree are not atomic
// (avoiding a serialising mutex in the write path).
//
// A flawed solution would be to have this code read the current
// SequenceNumber after rotation, and then wait until at least that
// sequence number has been buffered in the BufferTree. This may work in
// most cases, but is racy / not deterministic - writes are not ordered,
// so sequence number 5 might be buffered before sequence number 1.
//
// As a temporary hack, wait 5 seconds for in-flight writes to complete
// (which should be more than enough time) before proceeding under the
// assumption that they have indeed completed, and all writes from the
// previous WAL segment are now buffered. Because they're buffered, the
// persist operation performed next will persist all the writes that
// were in the previous WAL segment, and therefore at the end of the
// persist operation the WAL segment can be dropped.
//
// The potential downside of this hack is that, in the very unlikely
// situation that an in-flight write has not completed before the
// persist operation starts (after the 5 second sleep), the WAL entry
// for it is dropped anyway. The durability of that write is then
// reduced until it is next persisted, and it is lost if the ingester
// crashes before the next rotation.
//
// In the future, a proper fix will be to keep the set of sequence
// numbers written to each partition buffer, and to each WAL segment, as
// a bitmap. After persistence, the partition's bitmap is submitted to
// the WAL, which does a set difference to derive the remaining sequence
// IDs, and therefore the number of references to the WAL segment. Once
// the set of remaining IDs is empty (all data is persisted), the segment
// is safe to delete. This content-addressed reference counting technique
// has the added advantage of working even with parallel / out-of-order
// / hot partition persists that span WAL segments, and means there's no
// special code path between "hot partition persist" and "wal rotation
// persist" - it all works the same way!
//
// TODO: do this properly as described above.
tokio::time::sleep(Duration::from_secs(5)).await;
// Drain the BufferTree of partition data and persist each one.
//
// Writes that landed into the partition buffer after the rotation but
@@ -86,6 +129,16 @@ pub(crate) async fn periodic_rotation(
closed_id = %stats.id(),
"partitions persisted"
);
handle
.delete(stats.id())
.await
.expect("failed to drop wal segment");
info!(
closed_id = %stats.id(),
"dropped persisted wal segment"
);
}
}
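
The "proper fix" described in the code comment (tracking sequence numbers per WAL segment and subtracting the persisted set) could look roughly like the sketch below. This is not part of the PR and not the ingester's actual API: `SegmentRefs`, `observe_write` and `mark_persisted` are hypothetical names, plain `u64` values stand in for segment IDs and sequence numbers, and a `BTreeSet<u64>` stands in for the bitmap. It also assumes the caller only acts on segments that have already been rotated out.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical tracker (not from the PR): for each WAL segment ID, the set
/// of sequence numbers written to that segment that are not yet persisted.
#[derive(Debug, Default)]
struct SegmentRefs {
    outstanding: HashMap<u64, BTreeSet<u64>>,
}

impl SegmentRefs {
    /// Record that sequence number `seq` was written to segment `segment_id`.
    fn observe_write(&mut self, segment_id: u64, seq: u64) {
        self.outstanding.entry(segment_id).or_default().insert(seq);
    }

    /// Subtract a persisted set of sequence numbers from every tracked
    /// segment (the "set difference" step). Segments whose remaining set
    /// becomes empty are removed from the tracker and their IDs returned,
    /// sorted ascending, as candidates for deletion.
    ///
    /// Assumption: the caller filters out the segment that is still open
    /// for writes before deleting anything.
    fn mark_persisted(&mut self, persisted: &BTreeSet<u64>) -> Vec<u64> {
        let mut droppable = Vec::new();
        self.outstanding.retain(|segment_id, remaining| {
            remaining.retain(|seq| !persisted.contains(seq));
            if remaining.is_empty() {
                droppable.push(*segment_id);
                false // no outstanding references: stop tracking
            } else {
                true
            }
        });
        droppable.sort_unstable();
        droppable
    }
}
```

Because the decision depends only on which sequence numbers remain, not on their order, a persist that covers sequence number 5 before sequence number 1 simply leaves the segment referenced until 1 is persisted too. Under these assumptions the fixed 5 second sleep would no longer be needed.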