
Hello! I am beginning one of my brain dump/minimal review/minimal edits posts with this one! I’m going to call this series “roundup” where I just give a quick few notes on where I’m at with any personal projects. The focus of this roundup is the Raft sans-i/o implementation.

Where were we?

With Raft, I had successfully gotten to the point where I could replicate log entries across the cluster (in a sans-i/o way). This worked great and really could have been a ‘stopping’ point to start integrating a higher-level event loop and then an application like a key-value store on top of it. The one problem: unbounded log growth. I didn’t support log compaction and snapshotting in that implementation. That’s fine for an academic project focused on learning the internals of Raft, but anything that even wants a chance at production use needs log compaction. So, I started down that road.

What happened?

My initial thoughts were to put the work on the application and event loop. To be fair, I’m still doing that, but I had to change RaftCore (the type that represents a Raft node). The idea with this library is that RaftCore simply owns the Raft protocol itself. It accepts minimal input from the event loop (like proposing a new command) and emits Actions for the event loop to take, such as SendMessage, ApplyToStateMachine, and Persist*. The paper even calls out that the snapshot semantics for Raft are up to the application itself.
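To make the boundary concrete, here’s a minimal sketch of that input/output shape. The names and fields are my assumptions based on the post (the real RaftCore carries far more state and protocol logic than this stub):

```rust
// Hypothetical sketch of the sans-i/o boundary: the event loop feeds
// inputs in and drains Actions out. Names are illustrative only.
#[derive(Debug)]
pub enum Action {
    SendMessage { to: u64, bytes: Vec<u8> },              // hand off to the network layer
    ApplyToStateMachine { index: u64, command: Vec<u8> }, // a committed entry to apply
    PersistTermAndVote { term: u64, voted_for: Option<u64> }, // durable state to fsync
}

pub struct RaftCore {
    log: Vec<Vec<u8>>,    // commands stored as opaque serialized blobs
    pending: Vec<Action>, // actions queued for the event loop to perform
}

impl RaftCore {
    pub fn new() -> Self {
        RaftCore { log: Vec::new(), pending: Vec::new() }
    }

    // Input from the event loop. This stub just appends the entry and
    // queues a replication message; the real protocol does far more
    // (term checks, commit tracking, etc.).
    pub fn propose(&mut self, command: Vec<u8>) {
        self.log.push(command.clone());
        self.pending.push(Action::SendMessage { to: 2, bytes: command });
    }

    // Output to the event loop: drain whatever RaftCore wants done.
    pub fn take_actions(&mut self) -> Vec<Action> {
        std::mem::take(&mut self.pending)
    }
}
```

The point of this shape is that RaftCore never touches a socket or a disk itself; the event loop performs each Action and feeds results back in.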

That makes sense to me because Raft can be used as the consensus layer for a number of applications like key value stores, databases, or even a simple counter you want replicated. Raft doesn’t need to know what the application cares about (which is also why Raft stores a command as a Vec<u8>; just a serialized blob of whatever the application wants to store).

This means it will be on the application to tell RaftCore “hey, it’s time to snapshot”. The application can decide when that is the case (maybe a certain number of new entries in a key value store). I plan to also provide some helper functions for the event loop and application to peek at how large the RaftCore log is at any moment. So with this design, most of the snapshotting responsibility lands on the application.
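That policy split might look something like the sketch below. The helper name `log_len`, the `should_snapshot` function, and the threshold are all my assumptions for illustration, not the crate’s actual API:

```rust
// Illustrative snapshot-trigger policy owned by the application.
const SNAPSHOT_THRESHOLD: usize = 1_000; // arbitrary example threshold

struct RaftCore {
    log: Vec<Vec<u8>>, // commands as opaque blobs
}

impl RaftCore {
    // The kind of helper the event loop/application could use to
    // peek at the in-memory log size.
    fn log_len(&self) -> usize {
        self.log.len()
    }
}

// The application owns the policy: here, snapshot once the
// in-memory log crosses a fixed entry count.
fn should_snapshot(core: &RaftCore) -> bool {
    core.log_len() >= SNAPSHOT_THRESHOLD
}
```

The nice property is that RaftCore stays policy-free: a key-value store and a replicated counter can pick wildly different thresholds without the consensus layer caring.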

Yeah, I totally am punting the problem down the road.

What RaftCore did need to change, though, was tracking whether or not a snapshot exists. The idea is that if a leader has a follower that is sufficiently behind, the leader may need to go all the way back to the snapshot for the follower to catch up.

This also suggests to me that more frequent snapshotting is better: the log in RaftCore stays sufficiently bounded, and the amount of “catchup” time can be minimized in one fell swoop with a snapshot.

With that, RaftCore now needed to track the last index included in the snapshot. That’s fine, but the nodes still work in terms of the monotonically increasing “index” of the Raft log, not the index within RaftCore.log. Now we’re working with “virtual” or “logical” indices instead of “physical” indices into the RaftCore log.

Should be easy, right? Just subtract the snapshot’s last included index from the index provided and voilà, you have the index into RaftCore.log. Well, it wasn’t that simple, because I had chosen to add a sentinel or ‘dummy’ entry at term 0 to the start of every node’s log. That made indexing really easy within RaftCore because I could just directly use the index passed in (i.e. index 1 in the logical Raft log was actually at index 1 in RaftCore.log). The issue I found was that once a node had a snapshot, all the index math was off. An example may help…

Say I have a Raft log that has 102 entries, and I snapshotted up to and including index 100. That means the last included index in the snapshot would be set to 100 and RaftCore’s log would be something like: [{index: 101, term: 3}, {index: 102, term: 3}] (note, the index isn’t actually stored in the entries of the log; it’s just shown for demonstration). Say I wanted index 101 in the Raft log to provide to a follower. The simple math described above would have been index - last_included_index, which in this case is 101 - 100 = 1, but then I couldn’t go to RaftCore.log[1] because that’s index 102 in the Raft log! All the index math I was doing was fine until snapshots came up.
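Worked through in code, the fix is to subtract one more than the snapshot’s last included index. A minimal sketch, with hypothetical helper names and a term-only log for brevity:

```rust
// Logical→physical index math once the sentinel entry is gone.
struct RaftLog {
    last_included_index: u64, // last logical index covered by the snapshot (0 = none)
    entries: Vec<u64>,        // just the terms, for illustration
}

impl RaftLog {
    // Translate a logical Raft index into a physical index into `entries`.
    fn physical(&self, logical: u64) -> Option<usize> {
        if logical == 0 || logical <= self.last_included_index {
            return None; // compacted away: the leader must fall back to the snapshot
        }
        let phys = (logical - self.last_included_index - 1) as usize;
        if phys < self.entries.len() { Some(phys) } else { None }
    }

    // Logical index of the last entry, accounting for the snapshot —
    // this replaces naive uses of `log.len()`.
    fn last_index(&self) -> u64 {
        self.last_included_index + self.entries.len() as u64
    }
}
```

With last_included_index = 100 and two remaining entries, logical index 101 lands at entries[0] and 102 at entries[1], matching the example above, while logical index 100 returns None because it now lives in the snapshot.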

So, with great sadness, I added some helpers to do the math and removed the sentinel entry so that the math would be consistent regardless of whether there was a snapshot. Then I changed any direct indexing into the log to use the helpers, and made sure I wasn’t using RaftCore.log.len() to compute the “last” index in the log, instead using the helper that took the snapshot into account. This created a lot of problems at first (I actually had over half of the tests failing). Guess what the bulk of the problems were? You probably guessed it: off-by-one errors. I had a lot of places where a < should have been a <= or I was missing a -1 in a computation. I ended up adding tests for the helpers too, to sanity-check the math (Claude made the suggestion after watching the struggle). Eventually, I cleared out those issues and hopefully tested all of the edge cases and have this ready to go.

The funny part is that this change, which took a good chunk of time, didn’t really do anything beyond switching to the helpers. Snapshotting and log compaction are still NOT supported in RaftCore, but all of the scaffolding is there to make it ready!

Where next?

Next will be to actually do log compaction and snapshotting between the nodes. I’ll iron out the API between the application, event loop, and RaftCore, and hopefully write some solid tests that ensure the Raft protocol is still upheld once snapshots are added in.

After that, I think the last piece I’ll tackle is cluster membership changes. I haven’t gone super deep into that part of the paper, but from what I’ve read and guessed, it’s one of the harder problems within Raft. I may also pause on that for the moment and work on the higher levels (getting through to a key value store). However, I won’t consider this project ready for “production” until cluster membership changes are supported. Raft is great right up until you realize you need more replication to tolerate more failures and can’t change the cluster.

I’m using the term “production” loosely because I have no idea if this would be considered for production use. I’ll use it maybe to test on real AWS nodes, but I doubt anyone will want to use this.

That’s all for now! The next roundup should have log compaction done!

Roundup Series

In this section of each Roundup, I’ll post the “series” so you can cycle through!

Previous Roundup: NONE!
Next Roundup: Coming Soon!