Roundup - April 9th
Hello! I am beginning one of my brain dump/minimal review/minimal edits posts with this one! I’m going to call this series “roundup” where I just give a quick few notes on where I’m at with any personal projects. The focus of this roundup is the Raft sans-i/o implementation.
Where were we?
With Raft, I had successfully gotten to the point where I could replicate log entries across the cluster (in a sans-i/o way). This worked great and really could have been a ‘stopping’ point before moving on to integrating a higher-level event loop and then an application like a key-value store on top of it. The one problem: unbounded log growth. I didn’t support log compaction and snapshotting in that implementation. That’s fine for an academic project about learning the internals of Raft, but anything that wants even a chance of being offered up for production use needs log compaction. So, I started down that road.
What happened?
My initial thoughts were to put the work on the application and event loop. To
be fair, I’m still doing that, but I had to change RaftCore (the type that
represents a Raft node). The idea with this library is that RaftCore simply owns
the Raft protocol itself. It accepts minimal input from the event loop (like
propose a new command) and emits Actions for the event loop to take, such as
SendMessage, ApplyToStateMachine, Persist*. The paper even calls out that
snapshot semantics are up to the application itself.
That makes sense to me because Raft can be used as the consensus layer for a
number of applications like key value stores, databases, or even a simple
counter you want replicated. Raft doesn’t need to know what the application
cares about (which is also why Raft stores a command as a Vec<u8>; just a
serialized blob of whatever the application wants to store).
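To make that sans-i/o shape concrete, here’s a minimal sketch of an action-emitting core. The variant and type names here (`Action`, `Core`, `PersistTerm`, etc.) are illustrative stand-ins for the idea described above, not the project’s actual API:

```rust
// Hypothetical sketch of a sans-i/o core surface. The core never performs
// I/O itself: it only queues up Actions that the surrounding event loop
// is responsible for executing (sending messages, persisting, applying).
#[derive(Debug, PartialEq)]
pub enum Action {
    SendMessage { to: u64, payload: Vec<u8> },
    ApplyToStateMachine { index: u64, command: Vec<u8> },
    PersistTerm { term: u64 },
}

pub struct Core {
    pending: Vec<Action>,
}

impl Core {
    pub fn new() -> Self {
        Core { pending: Vec::new() }
    }

    // Input from the event loop: the application proposes a command,
    // which is just an opaque serialized blob (Vec<u8>) to the core.
    pub fn propose(&mut self, index: u64, command: Vec<u8>) {
        self.pending.push(Action::ApplyToStateMachine { index, command });
    }

    // The event loop drains queued actions and does the actual I/O.
    pub fn take_actions(&mut self) -> Vec<Action> {
        std::mem::take(&mut self.pending)
    }
}

fn main() {
    let mut core = Core::new();
    core.propose(1, b"set x=1".to_vec());
    let actions = core.take_actions();
    assert_eq!(actions.len(), 1);
    assert!(core.take_actions().is_empty()); // queue is drained after take
    println!("actions emitted: {}", actions.len());
}
```

The payoff of this shape is testability: you can drive the core with inputs and assert on the emitted actions without any sockets or disks involved.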
This means it will be on the application to tell RaftCore “hey, it’s
time to snapshot”. The application can decide when that is the case (maybe
after a certain number of new entries in a key-value store). I plan to also
provide some helper functions for the event loop and application to peek at how
large the RaftCore log is at any moment. So, with this design, most of the
snapshotting burden lands on the application.
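A sketch of what that split might look like from the application’s side, assuming a made-up `should_snapshot` policy and threshold (the real helper names and trigger conditions would be up to the application):

```rust
// Illustrative snapshot policy living outside RaftCore: the application
// (or event loop) peeks at the log size via a helper and decides when
// it's time to snapshot. Both names here are hypothetical.
const SNAPSHOT_THRESHOLD: usize = 1_000;

fn should_snapshot(log_len: usize) -> bool {
    log_len >= SNAPSHOT_THRESHOLD
}

fn main() {
    assert!(!should_snapshot(999));
    assert!(should_snapshot(1_000));
    println!("snapshot once the log reaches {} entries", SNAPSHOT_THRESHOLD);
}
```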
Yeah, I totally am punting the problem down the road.
What RaftCore did need to change, though, was tracking whether or not a snapshot
exists. The idea is that if a leader has a follower that is sufficiently behind,
the leader may need to go all the way back to the snapshot for the follower to
catch up.
This also suggests to me that more frequent snapshotting is better: the log in
RaftCore stays sufficiently bounded, and the amount of “catchup” time for a lagging follower can be minimized in one fell swoop with a snapshot.
With that, now RaftCore needed to track the last index within the snapshot.
That’s fine, but the nodes are still going to work on the monotonically
increasing “index” of the log and not necessarily worry about the index
within RaftCore.log. Now, we’re working on “virtual” or “logical” indices
instead of “physical” indices in the RaftCore log.
Should be easy right? Just subtract the last index in the snapshot from the
index provided and voila, you have the index in RaftCore.log. Well, it wasn’t that
simple because I had chosen to add a sentinel or ‘dummy’ entry at the start of
every node’s log that was on term 0. That made indexing really easy within
RaftCore because I could just directly reference the index passed (i.e. index 1
in the logical Raft log was actually at index 1 in RaftCore.log). The issue I
found was that once the node had a snapshot, all the index math was off. An
example may help…
Say I have a Raft log that has 102 entries. And I snapshotted up to and
including index 100 in the log. That means the last included index in the
snapshot would be set to 100 and RaftCore’s log would be something like:
[{index: 101, term: 3}, {index: 102, term: 3}] (note, the index isn’t actually
stored in the entries of the log, but just for demonstrative purposes here). Say
I wanted index 101 in the Raft log to provide to a follower. The simple
subtraction described above would have been index - last_included_index, which
in this case is 101 - 100 = 1. But I can’t go to RaftCore.log[1],
because that’s index 102 in the Raft log! All the index math I was doing was
fine until snapshots came up.
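The worked example above, in code. This is a sketch, not the project’s actual helper: `physical` maps a logical Raft index to a position in `RaftCore.log` once a snapshot covers everything up to `last_included_index` (and the sentinel entry is gone):

```rust
// Map a logical Raft index to a physical position in RaftCore.log.
// Illustrative helper name; the project's real helpers may differ.
fn physical(logical: u64, last_included_index: u64) -> Option<usize> {
    // Anything at or below the snapshot boundary is no longer in the log.
    logical.checked_sub(last_included_index + 1).map(|i| i as usize)
}

fn main() {
    // Log holds logical entries 101 and 102; the snapshot covers 1..=100.
    let log = [(101u64, 3u64), (102, 3)]; // (index, term) — index shown for demo only
    let i = physical(101, 100).expect("101 is past the snapshot");
    assert_eq!(log[i].0, 101); // the naive `101 - 100 = 1` would land on 102
    assert_eq!(physical(100, 100), None); // covered by the snapshot, not in the log
}
```

Returning `Option` also makes the “this entry only exists in the snapshot now” case explicit, which is exactly the case where the leader has to fall back to sending the snapshot itself.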
So, with great sadness, I added some helpers to do the math and removed the
sentinel entry so the math would be consistent regardless of whether a
snapshot existed. Then, I changed any direct indexing into the log to use
the helpers and made sure I wasn’t using RaftCore.log.len() to compute
the “last” index in the log, using instead the helper that took the
snapshot into account. This created a lot of problems at first (I actually had
over half of the tests failing). Guess what the bulk of the problems were? You
probably guessed it: off-by-one errors. I had a lot of places where a < should
have been a <= or where I missed a -1 in a computation. I ended up adding
tests for the helpers too, to sanity-check the math (Claude made the
suggestion after watching the struggle). Eventually, I cleared out those issues
and hopefully tested all of the edge cases and have this ready to go.
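A sketch of the “last index” helper mentioned above, under the same assumptions as before (sentinel removed, illustrative names): with a snapshot in play, the last logical index is the snapshot boundary plus however many entries remain in `RaftCore.log`, which is why `log.len()` alone stopped being the answer:

```rust
// Snapshot-aware "last index" helper. Illustrative name; the real helper
// in the project may differ.
fn last_index(last_included_index: u64, log_len: usize) -> u64 {
    last_included_index + log_len as u64
}

fn main() {
    assert_eq!(last_index(100, 2), 102); // snapshot through 100, two live entries
    assert_eq!(last_index(0, 0), 0);     // fresh node: no snapshot, empty log
    println!("ok");
}
```

Small helpers like this are exactly where the `<` vs `<=` and missing `-1` bugs hid, so unit-testing them directly (rather than only through full Raft scenarios) pays off quickly.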
The funny part is that this change, which took a good chunk of time, didn’t
really do anything beyond switching to the helpers. Snapshotting and log
compaction are still NOT supported in RaftCore, but all of the scaffolding is
there to make it ready!
Where next?
Next will be to actually do log compaction and snapshotting between the nodes.
I’ll iron out the API between the application, event loop, and RaftCore and
hopefully write some solid tests to ensure the Raft protocol is still
upheld once snapshots are in.
After that, I think the last piece I’ll tackle is cluster membership changes. I haven’t dug super deep into that part of the paper, but from what I’ve gathered it’s one of the harder problems within Raft. I may also pause on that for the moment and work on the higher levels (getting through to a key-value store). However, I won’t consider this project ready for “production” until cluster membership changes are supported. Raft is great right up until you realize you need more replication to tolerate more failures and can’t change the cluster to get it.
I’m using the term “production” loosely because I have no idea if this would ever be considered for production use. I’ll maybe use it to test on real AWS nodes, but I doubt anyone else will want to use this.
That’s all for now! The next roundup should have log compaction done!
Roundup Series
In this section of the Roundups, I’ll post the “series” so you can cycle through!
| Previous Roundup | Next Roundup |
|---|---|
| NONE! | Coming Soon! |