Modified Consistent Hashing Rings in OpenStack Swift

Now let’s talk about how Swift takes a slightly different approach to the consistent hashing algorithm, and about the importance of rings in Swift.

Please note: after reading a few Swift articles, I believe the terms drive and device are used interchangeably, so I’ll be doing the same here.

In the last entry, I mentioned that some objects have to move to another drive when I add or remove a drive from the cluster. What happens to an object while it is being moved? It isn’t available for use, and we don’t want to wait for it to finish moving to another drive; we want it to be available at all times. Swift builds the ring (and rebuilds it by executing the rebalance command when necessary) to ensure the data is distributed evenly throughout the cluster.

As far as I have read, there are three types of rings: account rings, container rings, and object rings. To read more about what I found on the relations among accounts, containers, and objects, see Basic Architecture and Data Model of OpenStack Swift.

1. Partitions and Partition Power

Swift divides the hash space into partitions, each with a fixed width. Those partitions are then assigned to drives using a placement algorithm.

When Swift creates a cluster, it picks an integer called the partition power. It uses that value to calculate the total number of partitions to assign to the drives in the cluster. Let N = total number of partitions, and p = partition power:

N = 2^p

Let’s say that Swift chose 7 for the partition power. In that case, the cluster will have 2^7 = 128 partitions, which will then be mapped to the available drives in the cluster. What’s good about this is that the number of partitions in the cluster stays the same at all times, even though the number of drives may change as drives are added or removed.
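The partition math above can be sketched in a few lines of Python. The bit-shift trick (keep only the top p bits of a 32-bit slice of the MD5 digest) is, as far as I can tell, close to what Swift’s ring code itself does, but the names and the example path here are my own:

```python
import hashlib
import struct

PART_POWER = 7                  # the partition power from the example
PART_COUNT = 2 ** PART_POWER    # 2^7 = 128 partitions, fixed for the cluster
PART_SHIFT = 32 - PART_POWER    # how many low bits of a 32-bit hash to drop

def object_to_partition(obj_path):
    """Keep only the top PART_POWER bits of the object's MD5 hash."""
    digest = hashlib.md5(obj_path.encode()).digest()
    # Read the first 4 digest bytes as a big-endian unsigned 32-bit integer,
    # then shift so only PART_POWER bits remain: a value in [0, PART_COUNT).
    top_32_bits = struct.unpack_from('>I', digest)[0]
    return top_32_bits >> PART_SHIFT

print(PART_COUNT)                                      # 128
print(object_to_partition('/account/container/object'))
```

No matter how many drives come and go, every object always lands in one of the same 128 partitions.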

But that’s not all.

2. Replica Count and Replica Locks

Replica count is the number of copies of each partition; this is what makes Swift redundant, since it keeps multiple copies of each partition placed across the cluster. Say that I have a replica count of three: each partition will have 3 copies, and each copy will be placed on a different device in the cluster, which helps with redundancy.

A higher replica count helps us; it keeps us better protected against losing data, or data becoming unavailable. Should a device be added or removed, and partitions start moving to different devices, we still have the other replicas available.

Let’s say that I’m moving a partition, call it partition_1, to another device. While one copy of partition_1 is being moved, the remaining replicas of partition_1 should not move, so Swift uses replica locks to lock those replicas. They cannot move to another device in the meantime, which keeps them available.
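A toy illustration of the replica-lock idea; this is my own simplification, not Swift’s actual implementation:

```python
# `moving` records partitions that currently have a replica in flight.
moving = set()

def try_move_replica(partition):
    """Allow a move only if no other replica of `partition` is moving."""
    if partition in moving:
        return False            # a sibling replica is mid-move: stay put
    moving.add(partition)       # "lock" the partition's other replicas
    return True

def finish_move(partition):
    moving.discard(partition)   # release the lock once the copy has landed

print(try_move_replica('partition_1'))   # True: first mover takes the lock
print(try_move_replica('partition_1'))   # False: siblings must wait
finish_move('partition_1')
print(try_move_replica('partition_1'))   # True again after the release
```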

3. Data Distribution

Swift uses two other mechanisms to evenly distribute the data across the cluster.

3.1. Weight

Remember that I mentioned data needs to be evenly distributed to help with load balancing when I talked about the consistent hashing algorithm? (See the Multiple Markers in Consistent Hashing Algorithm section.) It appears the first mechanism, weight, helps the cluster calculate how many partitions need to be assigned to each drive. It is a user-defined value set during cluster creation and also used when rebalancing the rings: that is, Swift rebuilds the rings to evenly distribute data among the drives. The higher a drive’s weight, the more partitions need to be assigned to it.
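Here is a rough sketch, with made-up device names and weights, of how weights could translate into per-drive partition counts; Swift’s real placement logic is more involved:

```python
def partitions_per_drive(weights, total_partitions):
    """Split a fixed partition count among drives in proportion to weight."""
    total_weight = sum(weights.values())
    shares = {d: total_partitions * w / total_weight for d, w in weights.items()}
    alloc = {d: int(s) for d, s in shares.items()}        # round down first
    leftover = total_partitions - sum(alloc.values())
    # Hand any remaining partitions to the largest fractional shares.
    for d in sorted(shares, key=lambda k: shares[k] - alloc[k], reverse=True)[:leftover]:
        alloc[d] += 1
    return alloc

# A drive with double the weight receives double the partitions.
print(partitions_per_drive({'sdb1': 100, 'sdb2': 100, 'sdb3': 200}, 128))
# {'sdb1': 32, 'sdb2': 32, 'sdb3': 64}
```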

3.2. Unique-as-possible

The second mechanism Swift uses is more of a placement algorithm, called unique-as-possible. Simply put, this algorithm finds the region ⇒ zone ⇒ server ( <ip-address>:<port-number> ) that is least used compared to the other regions, zones, and servers, in that order. If necessary, it will also find the least-used drive. Once found, Swift places the partitions there.
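A greedy sketch of the unique-as-possible idea, using hypothetical devices described as (region, zone, server) tuples: pick the least-used region first, then zone, then server:

```python
from collections import Counter

# Hypothetical devices as (region, zone, server) tuples.
devices = [
    (1, 1, '10.0.0.1:6200'),
    (1, 2, '10.0.0.2:6200'),
    (2, 1, '10.0.1.1:6200'),
]

def place_replica(already_placed):
    """Choose the least-used region, then zone, then server, in that order."""
    region_use = Counter(d[0] for d in already_placed)
    zone_use = Counter(d[:2] for d in already_placed)
    server_use = Counter(d[2] for d in already_placed)
    # Tuple comparison checks region load first, then zone, then server.
    return min(devices, key=lambda d: (region_use[d[0]],
                                       zone_use[d[:2]],
                                       server_use[d[2]]))

placed = []
for _ in range(3):           # place three replicas of one partition
    placed.append(place_replica(placed))
print(placed)                # three distinct devices, spread across regions
```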

4. Ring Data Structures

Every time the ring is built (or rebuilt), two important data structures are created along with it: the device list and the device look-up table. Knowing that the proxy server handles REST calls from clients, it is my belief that the proxy server relies on these two data structures to route incoming and outgoing objects accordingly.

4.1. Device Look-up Table

The device look-up table contains the information the proxy server process looks up to find which devices a given piece of data is located on. Say a client sends a GET request to download an object. The process calculates the hash value of the object named in the GET request and maps it to a specific partition. Remember also that partitions are replicated and mapped to different drives, so the process is directed to each of the devices containing the object.

Each row in the device look-up table represents a replica (replica 0 being the first, 1 the second, and so on). Each column represents a partition, and each cell in the table holds a drive ID. Given that, the process finds the column for the object’s partition, looks at the drive holding the first replica, and then reads the next n − 1 rows, n being the replica count of the cluster.

Example of device look-up table:

[Image: device look-up table]

In the table above, the data was mapped to partition 2. Replicas 0, 1, and 2 of partition 2 are mapped to drives 5, 2, and 15, respectively.

4.2. Device List

The device list contains a list of active devices in the cluster. If we look more into the Swift architecture and its data models, this will make a lot more sense. Each device (which holds partitions) belongs to a storage node, which in turn belongs to a zone. Each zone belongs to a region. A region is typically a geographical location, defined by the values users provide (country, city, and some others) while creating a Swift cluster. Above all, those regions together make up one Swift cluster (see Basic Architecture and Data Models of OpenStack Swift for more details).

So the point of the device list is to hold information about each device: which region and zone it falls in, its weight, its device ID, and so on. I believe the proxy server uses this information to handle REST calls and to fetch objects from, or send them to, the correct devices.

Example of device list:

Index     0             1             2             3
Devices   device id=0   device id=1   device id=2   device id=3

Going back to the earlier device look-up table example: when the proxy server process mapped the data to partition 2, it found that one replica of the data is on drive 2. It then refers to the device list, looks at device 2, and sees that it is located in region 1, zone 3, and so forth.
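Putting the two structures together, here is a small end-to-end sketch of what I believe the proxy server’s lookup looks like. The table contents and device entries are made up, and the names are only loosely modeled on Swift’s:

```python
import hashlib
import struct

PART_POWER = 2                      # 4 partitions, to keep the example tiny
PART_SHIFT = 32 - PART_POWER

# Device list: one entry per device, indexed by device id (made-up values).
device_list = [
    {'id': 0, 'region': 1, 'zone': 1, 'ip': '10.0.0.1', 'weight': 100},
    {'id': 1, 'region': 1, 'zone': 2, 'ip': '10.0.0.2', 'weight': 100},
    {'id': 2, 'region': 1, 'zone': 3, 'ip': '10.0.0.3', 'weight': 100},
    {'id': 3, 'region': 2, 'zone': 1, 'ip': '10.0.1.1', 'weight': 100},
]

# Device look-up table: one row per replica, one column per partition,
# each cell a device id (assignments invented for illustration).
lookup_table = [
    [0, 1, 2, 3],   # replica 0
    [1, 2, 3, 0],   # replica 1
    [2, 3, 0, 1],   # replica 2
]

def devices_for_object(obj_path):
    """Hash the object to a partition, then read that column in every row."""
    digest = hashlib.md5(obj_path.encode()).digest()
    partition = struct.unpack_from('>I', digest)[0] >> PART_SHIFT
    return [device_list[row[partition]] for row in lookup_table]

for dev in devices_for_object('/account/container/object'):
    print(dev['id'], 'region', dev['region'], 'zone', dev['zone'])
```

One hash gives the column; the three rows give three distinct devices, and the device list tells the proxy where each one lives.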

Consistent Hashing Algorithm

Today, I’ll talk about how the Consistent Hashing Algorithm works, which will be followed by how it is utilized in OpenStack Swift Rings in my next blog entry. To start off, let’s talk about the Basic Hash Function.

1. Basic Hash Function

A basic hash function maps objects to different drives based on each object’s hash, so they can be fetched later. It is probably best to think of a basic hashing scheme as a typical encyclopedia: no matter how many times I look up information on computers, I will always check the CAL – EDU volume (as long as I am looking through the same edition). Think of the encyclopedia volume as a drive, and the volume number as a mapping value.

Let’s say I have a series of objects I want to store, and I have 4 drives in my storage server, labeled Drive 0, 1, 2, and 3. A basic hash function takes the object data and hashes it using the MD5 algorithm, producing a short, fixed-length digest: a hexadecimal value of length 32, which can be converted to a decimal value. Finally, it divides that hash value by the number of drives (4 in our example, because I have 4 drives) and stores the object based on the remainder of the division, which will be a value from 0 to 3. Let’s call this value the mapping value.
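In code, the basic hash function amounts to a hash followed by a modulo; the object names here are just the examples from the table below:

```python
import hashlib

NUM_DRIVES = 4

def mapping_value(obj_name):
    """MD5 the object name, read the digest as an integer, take the remainder."""
    hex_digest = hashlib.md5(obj_name.encode()).hexdigest()  # 32 hex characters
    return int(hex_digest, 16) % NUM_DRIVES                  # a value in 0..3

for name in ('Image 1', 'Music 1', 'Movie 1'):
    print(name, '-> Drive', mapping_value(name))
```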

Here are the example objects I want to store and their hash values:

Table 1.1. Mapping of Objects to Different Drives using Basic Hash Function

Object    Hash Value (Hexadecimal)            Mapping Value          Drive Mapped To
Image 1 b5e7d988cfdb78bc3be1a9c221a8f744 hash(Image 1) % 4 = 2 Drive 2
Image 2 943359f44dc87f6a16973c79827a038c hash(Image 2) % 4 = 3 Drive 3
Image 3 1213f717f7f754f050d0246fb7d6c43b hash(Image 3) % 4 = 3 Drive 3
Music 1 4b46f1381a53605fc0f93a93d55bf8be hash(Music 1) % 4 = 1 Drive 1
Music 2 ecb27b466c32a56730298e55bcace257 hash(Music 2) % 4 = 0 Drive 0
Music 3 508259dfec6b1544f4ad6e4d52964f59 hash(Music 3) % 4 = 0 Drive 0
Movie 1 69db47ace5f026310ab170b02ac8bc58 hash(Movie 1) % 4 = 2 Drive 2
Movie 2 c4abbd49974ba44c169c220dadbdac71 hash(Movie 2) % 4 = 1 Drive 1

[Image: object mapping with the basic hash function]

But what if we have to add or remove drives? The hash values of all objects stay the same, but we need to recompute the mapping value for every object and re-map them all to different drives.

That’s too much work for our servers.
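A quick experiment makes the point: change the drive count from 4 to 5 and count how many of 1,000 hypothetical objects get a new mapping value under the hash-and-modulo scheme:

```python
import hashlib

def mapping(names, num_drives):
    """Return {object name: drive} under the basic hash-and-modulo scheme."""
    return {n: int(hashlib.md5(n.encode()).hexdigest(), 16) % num_drives
            for n in names}

names = [f'object-{i}' for i in range(1000)]
before = mapping(names, 4)   # 4 drives
after = mapping(names, 5)    # one drive added
moved = sum(1 for n in names if before[n] != after[n])
print(moved, 'of', len(names), 'objects changed drives')  # roughly 4 out of 5
```

Most of the cluster’s data would have to move for a single added drive.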

2. Consistent Hashing Algorithm

The consistent hashing algorithm achieves a similar goal but does things differently. It still hashes the object data, but instead of computing a mapping value for each object, each drive is assigned a range of hash values to store. Again, think of this as an encyclopedia: each volume is a drive, and the range of first letters each volume covers is like the range of hash values whose objects are mapped to a drive.

Table 2.1. Range of Hash Values for Each Drive

Drive     Range of Hash Values
Drive 0   0000… ~ 3fff…
Drive 1   4000… ~ 7fff…
Drive 2   8000… ~ bfff…
Drive 3   c000… ~ ffff…

Note: This is just an example. Hash values are much longer.

Table 2.2. Mapping of Objects to Different Drives using Consistent Hashing

Object    Hash Value (Hexadecimal)            Drive Mapped To
Image 1 b5e7d988cfdb78bc3be1a9c221a8f744 Drive 2
Image 2 943359f44dc87f6a16973c79827a038c Drive 2
Image 3 1213f717f7f754f050d0246fb7d6c43b Drive 0
Music 1 4b46f1381a53605fc0f93a93d55bf8be Drive 1
Music 2 ecb27b466c32a56730298e55bcace257 Drive 3
Music 3 508259dfec6b1544f4ad6e4d52964f59 Drive 1
Movie 1 69db47ace5f026310ab170b02ac8bc58 Drive 1
Movie 2 c4abbd49974ba44c169c220dadbdac71 Drive 3

[Image: object mapping with simple consistent hashing]
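A sketch of the range-based mapping, assuming the 4 equal ranges from Table 2.1: with the 128-bit MD5 space split into quarters, the drive is determined by the hash’s top hex digit alone:

```python
import hashlib

NUM_DRIVES = 4

def drive_for(obj_name):
    """The top hex digit (0..f) picks the quarter of the hash space."""
    top_digit = int(hashlib.md5(obj_name.encode()).hexdigest()[0], 16)
    return top_digit // 4   # 0-3 -> Drive 0, 4-7 -> 1, 8-b -> 2, c-f -> 3

for name in ('Image 1', 'Music 2', 'Movie 1'):
    print(name, '-> Drive', drive_for(name))
```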

Now if I add additional drives, the only thing that changes is that each drive gets a new range of hash values to store. Each object’s hash value remains the same. Any object whose hash value is within the range of its current drive stays put; any object whose hash value falls outside its current drive’s range is mapped to another drive. But that number of objects is very small under the consistent hashing algorithm, compared to the basic hash function.

I’ll add another drive and re-illustrate my point in the picture below:

[Image: object mapping with simple consistent hashing, after adding a drive]

Notice how only the Movie 2 and Music 2 objects were mapped to my new drive (drive 4), and Image 1 had to move to drive 3. If we used a basic hash function, we would most likely have to recalculate the mapping values for all objects and re-map them accordingly. Imagine that workload for thousands, or even millions, of objects.

But there’s more to it once it’s modified.

3. Multiple Markers in Consistent Hashing Algorithm

First, let’s look at what the multiple markers do for us.

Remember that in the consistent hashing algorithm, each drive has one big range of hash values for mapping objects. Multiple markers help distribute the objects evenly across the drives, thus helping with load balancing. But how?

Instead of giving each drive one big hash range, multiple markers split those large hash ranges into smaller chunks, and those smaller hash ranges are assigned to different drives in the server. How does that help?

Let’s say I have 20 objects to store, and I still have 4 drives, each with a different hash range of equal length. What if, out of those 20 objects, 14 are mapped to drive 0 and the rest are equally distributed across drives 1, 2, and 3? This leaves the ring unbalanced, because drive 0 holds far more objects than the rest of the drives. This is where the smaller hash ranges help a lot with load balancing.

As mentioned earlier, the consistent hashing algorithm uses multiple markers for the drives, so each drive maps several smaller ranges of hash values instead of one big range. This has two positive effects. First, if a new drive is added, it gains a few objects from all of the existing drives in the server (the ring now has more, smaller hash ranges), instead of a large chunk from a single neighboring drive. Likewise, if an existing drive is removed, the objects it was holding are distributed evenly across the remaining drives (leaving fewer, larger hash ranges). Second, the overall distribution of objects stays fairly even, meaning the load among the different drives remains close to balanced, which helps with load balancing.
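The multiple-marker idea can be sketched with a sorted list of markers and a binary search. This is a generic consistent-hash ring with virtual markers, not Swift’s actual code; the drive names and marker count are made up:

```python
import bisect
import hashlib

def h(key):
    """The full 128-bit MD5 value as an integer."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(drives, markers_per_drive=8):
    """Give every drive several markers scattered around the hash circle."""
    return sorted((h(f'{drive}#{i}'), drive)
                  for drive in drives
                  for i in range(markers_per_drive))

def drive_for(ring, obj_name):
    """An object belongs to the first marker past its hash value."""
    keys = [marker for marker, _ in ring]
    idx = bisect.bisect(keys, h(obj_name)) % len(ring)   # wrap past the end
    return ring[idx][1]

ring = build_ring(['drive0', 'drive1', 'drive2', 'drive3'])

# Adding a fifth drive only pulls in the objects that land in that drive's
# new markers: a little from everywhere, not a large chunk from one neighbor.
objects = [f'obj-{i}' for i in range(1000)]
old = {o: drive_for(ring, o) for o in objects}
bigger_ring = build_ring(['drive0', 'drive1', 'drive2', 'drive3', 'drive4'])
moved = sum(1 for o in objects if drive_for(bigger_ring, o) != old[o])
print(moved, 'of 1000 objects moved')  # only a fraction of the objects
```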

[Image: object mapping with modified consistent hashing]

The picture above shows that objects close to each other in hash value are distributed among segments of different drives. Multiple markers split the 4 big hash ranges into several smaller ranges and assign them across all the drives.

On my next entry, I will talk about how Swift utilizes this algorithm and how it takes a different approach.