Modified Consistent Hashing Rings in OpenStack Swift

Now let’s talk about how Swift takes a slightly different approach in consistent hashing algorithm, and talk about the importance of rings in Swift.

Please note: After referring to Swift articles few times, it is my belief that the terms drive and devices are used interchangeably, so I’ll be doing the same here.

On the last entry, I mentioned how some objects have to move to another drive when I add or remove a drive from the cluster. What happens to that object being transported? It won’t be available for use, and we wouldn’t want to wait for that object to move to another drive; we want it to be available at all times. Swift builds the ring (and rebuilds it by executing the rebalance command when necessary) to ensure the data is distributed throughout the ring evenly.

As far as I know, there are 3 types of rings that I have read so far: account rings, container rings, and object rings. To read more about what I found about the relations among account, container, and object, see Basic Architecture and Data Model of OpenStack Swift.

1. Partitions and Partition Power

Swift uses a partition, each with a fixed width. Those partitions are then assigned to drives by using a placement algorithm.

While Swift creates a cluster, it picks an integer – a partition power. It uses the value of partition power to calculate the total number of partitions to assign to the drives in the cluster. Let N = total number of partitions, and p = partition power:

N = 2p

Let’s say that Swift chose 7 for the partition power. In that case, the cluster will have 27 = 128 partitions, which will then be mapped to the available drives in the cluster. What’s good about this is that in the cluster, the number of partitions will stay the same at all times, although the number of drives may change, whether it is added or removed.

But that’s not all.

2. Replica Count and Replica Locks

That’s what Replica count is the number of partition copies – this is what makes Swift redundant; it keeps multiple copies of each partition to place across the cluster. Say that I have a replica count of three: Each partition will have 3 copies, and each copy of those partitions will be distributed among different devices in the cluster – helps with redundancy.

It helps us to have higher replica count of partitions; it keeps us more protected against losing data, or data not being available. Should the device be added or removed, and the partitions are moving to different devices, we still have other replica available.

Let’s say that I’m moving a partition, let’s call it partition_1, to another device. While one copy of partition_1 is being moved, replicas of that partition_1 should not be moved, so Swift uses replica locks to lock those replicas, so they won’t be able to move to another device to ensure availability of those replicas.

3. Data Distribution

Swift uses two other mechanisms to evenly distribute the data across the cluster.

3.1. Weight

Remember that I mentioned that data needs to be evenly distributed to help with the load balance when I talked about consistent hashing algorithm? (See Multiple Markers in Consistent Hashing Algorithm section) It appears the first mechanism, weight, helps the cluster to decide which partition power to choose, and to calculate a specific number of partitions needs to be assigned to each drive. This is a user-defined value during the cluster creation and also used when re-balancing the rings: that is, Swift will re-build the rings to evenly distribute data among different drives. Higher weight means higher number of partitions needs to be created, for one, so higher number of partitions need to be assigned to the drives.

3.2. Unique-as-possible

Second mechanism that Swift uses is more like a placement algorithm, called unique-as-possible. Simply put, this is an algorithm that finds the region ⇒ zone ⇒ ( <ip-address>:<port-number> ) that are not used as much compared to other regions, zones, and servers, in that order. If necessary, it will also find the drive that is not used as much. Once found, Swift places the partitions in them.

4. Ring Data Structures

Every time the ring is built (and rebuilt), it seems that two important data structures are also created as well: device list and device look-up table. Knowing that proxy server handles the REST calls from client, it is my belief that the proxy server relies on these two data structures to deal with the incoming/outgoing objects accordingly.

4.1. Device Look-up Table

Device Look-up Table contains an information that proxy server process looks up to find which device a certain data is located in. Say the client sends a GET request to download an object. It would calculate the hash value of that object sent with the GET request to map to the specific partition value. Also remember, the partitions are replicated and then mapped to different drives, so the process would be directed to the correct devices containing the object.

Each row in the Device Look-up Table represents the replicas (replica 0 being the first, 1 being second, and so on). Each column in the device look-up table represents the partition. Each data in the table represents the drive ID. Given that, the process looks at the device where the first replica is located, and then the next n – 1 rows, n being the number of replicas present in the cluster.

Example of device look-up table:

device lookup table

In the table above, we can see that the data was found in partition 2. Replicas 0, 1, and 2 are located in partition 2, which are mapped to the drives 5, 2, and 15.

4.2. Device List

Device List contains a list of active devices in the cluster. If we look more into the Swift architecture and its data models, this will probably make a lot more sense. Each device (which maps the partitions) belongs to the storage node, which in turn belongs to a zone. Each individual zone belongs to a region. That region is a typical geographical location that are user-defined values, when they are prompted to provide values for country, city, (and some others) while creating a Swift cluster. Above all, those regions all fall into one Swift cluster (see Basic Architecture and Data Models of OpenStack Swift for more details)

So the point of a device list is to contain information about each device: what region/zone they fall in, its weight, device ID, etc. I believe that proxy server uses this information to handle the REST calls, and refer the objects from/to the correct devices.

Example of device list:

0 1 2 3
Devices device id=0
device id=1
device id=2
device id=3

Going back to the earlier example with device look-up table when the proxy server process found the data in partition 2, it also found the first replica of the data was found in partition 2 of the drive 2. So then it will refer to the device list and look at the device 2, see that it is located in region 1, zone 3, and so forth.


Consistent Hashing Algorithm

Today, I’ll talk about how the Consistent Hashing Algorithm works, which will be followed by how it is utilized in OpenStack Swift Rings in my next blog entry. To start off, let’s talk about the Basic Hash Function.

1. Basic Hash Function

Basic hash function maps the objects into different drives based on the hash of an object so it can be fetched later. It is probably best to think of the basic hashing algorithm as a typical encyclopedia; it doesn’t matter how many times I look up the information on computers, I will always be checking for a CAL – EDU volume (as long as I am looking through the same edition). Think of the encyclopedia volume as a drive, and the encyclopedia volume number as a mapping value.

Let’s say I have a series of objects I want to store, and I have 4 drives (or partitions) in my storage server, which are labeled Drive 0, 1, 2 and 3. In basic hash function, it will take the object data and hash it using the MD5 hashing algorithm, to produce a shorter, fixed length. It generates a hexadecimal value of length 32, which can be converted into decimal value. Finally, it will divide that hash value by the number of drives; in our example, it is 4, because I have 4 drives. It then stores that object based on the remainder of the division, which will be any value from 0 to 3 – let’s call this value a mapping value.

Here are the example objects I want to store and their hash values:

Table 1.1. Mapping of Objects to Different Drives using Basic Hash Function

Mapping of Objects to Different Drives
Object Hash Value (Hexadecimal) Mapping Value Drive Mapped To
Image 1 b5e7d988cfdb78bc3be1a9c221a8f744 hash(Image 1) % 4 = 2 Drive 2
Image 2 943359f44dc87f6a16973c79827a038c hash(Image 2) % 4 = 3 Drive 3
Image 3 1213f717f7f754f050d0246fb7d6c43b hash(Image 3) % 4 = 3 Drive 3
Music 1 4b46f1381a53605fc0f93a93d55bf8be hash(Music 1) % 4 = 1 Drive 1
Music 2 ecb27b466c32a56730298e55bcace257 hash(Music 2) % 4 = 0 Drive 0
Music 3 508259dfec6b1544f4ad6e4d52964f59 hash(Music 3) % 4 = 0 Drive 0
Movie 1 69db47ace5f026310ab170b02ac8bc58 hash(Movie 1) % 4 = 2 Drive 2
Movie 2 c4abbd49974ba44c169c220dadbdac71 hash(Movie 2) % 4 = 1 Drive 1

object mapping basic hash

But what if we have to add/remove drives? The hash values of all objects will stay the same, but we need to re-compute the mapping value for all objects, then re-map them to the different drives.

That’s too much work for our servers.

2. Consistent Hashing Algorithm

Consistent hashing algorithm achieves a similar goal but does things differently. It will still hash the object data, but instead of getting the mapping value of each object, each drive will be assigned a range of hash values to store the objects. Again, think of this as an encyclopedia; each volume will be the drive, except that the range of first 3 letters of information each volume contains is like the hash value of each object mapped to a drive accordingly.

Table 2.1. Range of Hash Values for Each Drive

Range of Hash Values for Each Drive
Drive Range of Hash Values
Drive 0 0000… ~ 3fff…
Drive 1 3fff… ~ 7ffe…
Drive 2 7fff… ~ bffd…
Drive 3 bffd… ~ ffff…

Note: This is just an example. Hash values are much longer.

Table 2.2. Range of Hash Values for Each Drive

Mapping of Objects to Different Drives
Object Hash Value (Hexadecimal) Drive Mapped To
Image 1 b5e7d988cfdb78bc3be1a9c221a8f744 Drive 2
Image 2 943359f44dc87f6a16973c79827a038c Drive 2
Image 3 1213f717f7f754f050d0246fb7d6c43b Drive 0
Music 1 4b46f1381a53605fc0f93a93d55bf8be Drive 1
Music 2 ecb27b466c32a56730298e55bcace257 Drive 3
Music 3 508259dfec6b1544f4ad6e4d52964f59 Drive 1
Movie 1 69db47ace5f026310ab170b02ac8bc58 Drive 1
Movie 2 c4abbd49974ba44c169c220dadbdac71 Drive 3

object mapping simple consistent hashing

Now if I added additional drives, only thing that changes is each drive will get a new range of hash values it is going to store. Each object’s hash value will still remain the same. Any objects whose hash value is within range of its current drive will remain. For any other objects whose hash value is not within range of its current drive will be mapped to another drive; but that number of objects is very few using consistent hashing algorithm, compared to the basic hash function.

I’ll add another drive and re-illustrate my point on the picture below:

object mapping simple consistent hashing 2

Notice how only Movie 2 and Music 2 objects were mapped to my new drive (drive 4), and Image 1 had to be mapped to drive 3. If we used basic hash function, we would most likely have to re-calculate the mapping values for all objects, and re-map them accordingly. Imagine how much workload that is for thousands, or even millions of objects.

But there’s more to it once it’s modified.

3. Multiple Markers in Consistent Hashing Algorithm

First, let’s look at what the multiple markers do for us.

Remember in consistent hashing algorithm, each drive has one big range of hash values to map the objects. Multiple markers helps to evenly distribute the objects into drives, thus helping with the load balancing, but how?

Instead of having one big hash range for each drive, multiple markers serve to split those large hash range into smaller chunks, and those smaller hash ranges will be assigned to different drives in the server. How does that help?

Let’s say I have 20 objects I want to store, and I still have 4 drives, each with different range of hash values of equal length. But what if out of those 20 objects, maybe 14 are mapped to drive 0, and the rest are equally distributed to drives 1, 2, and 3? This causes the ring to be unbalanced in weight, because drive 0 holds much more hash values than the rest of the drives. This is where the smaller hash ranges can help a lot with load balancing.

As mentioned earlier, consistent hashing algorithm uses multiple markers for the drives to map several smaller ranges of hash values instead of one big range. This has two positive effects: First, if the new drive was to be added, that new drive will gain more objects from all other existing drives in the server, instead of just a few objects from a neighboring drive – this results in more and smaller hash ranges. Likewise, if one of the existing drive was to be removed, all objects that drive was holding onto will be evenly distributed to the other existing drives – results in less and larger hash ranges. Second, by doing this, the overall distribution of objects will be fairly even, meaning the weight among different drives will be very close to evenly distributed – helps with load balancing.

object mapping modified consistent hashing

Picture above shows several objects close to each other in terms of its hash value are distributed among different segments of the different drives. Multiple markers splits 4 big hash ranges into several smaller hash ranges and assigns them into all other drives.

On my next entry, I will talk about how Swift utilizes this algorithm and how it takes a different approach.

[ Tutorial ] Apache ZooKeeper – Setting ACL in ZooKeeper Client

Now let’s talk about setting the ACL of a znode in ZooKeeper. Before getting into the details, let’s talk more about the scheme and ID.

1. Scheme and ID

ID, as name suggests, is an identifier comprised of a username and password. By default, when the znode has an ACL set accessible by a specific group of users or an individual, the <username>:<password> is first hashed using an SHA-1 hashing algorithm, and then it (hex-string) is base-64 encoded.

As mentioned in earlier blog entries, scheme is like a group of users that are authorized to access a certain znode with a scheme-and-id-specific ACL set.

1.1. World Scheme

World scheme has one ID (anyone). This represents any user in the world.

For example, we type in the following command to set the znode accessible by anyone.

setAcl /newznode world:anyone:crdwa

By doing it correctly, You should get something like this in return:


1.2. Auth Scheme

Auth scheme represents a "manually" set group of authenticated users. According to the ZooKeeper documentation (, auth does not utilize any ID. Unless I am mistaken, this seems not to be the case. Because if you try to set ACL on a znode using auth scheme and not provide any ID, it tells you that is not a valid ID, or some form of ID is needed. Below is a (bad) example:

setAcl /newznode auth:crdwa

as seen above, I did not provide any form of ID. This is what I get:


A correct way to use this scheme would be as follows:

setAcl /newznode auth:username:password:crdwa

Using auth scheme allows us to have multiple authorized users to access a single znode with the different username and password combination. Say we have 3 users:

username : password
user_123 : pwd_123
user_456 : pwd_456
user_789 : pwd_789

We can use the same syntax above by replacing username with user_123, user_456, or user_789 and password with pwd_123, pwd_456, or pwd_789 respectively.

1.2.1 addauth Command

One important thing to note is you must use the addauth command before proceeding to set the ACL of a znode using the auth scheme. If you try to set the ACL before executing the addauth command, you will get an error as below:


Correct way to do is to execute addauth command first, and then execute the setAcl command. Below is the syntax of command execution for addauth:

addauth /<node-name> digest <username>:<password>

By adding the authenticator and setting ACL accordingly, you can ensure that you set the ACL correctly.


Repeat the steps for additional username and password combo, and the ACL for that newznode looks like this:


1.3. Digest Scheme

Digest scheme represents an individual user with authentication. This uses username:password string that is hashed using the SHA-1 hashing algorithm, and that hashed string is in turn base64 encoded. According to the ZooKeeper website, it is stated that the MD5 hash of <username>:<password> is used as an ACL ID identity. Unless I am mistaken, that seems not to be the case. Instead, what I found was that <username>:<Base64 encoded SHA-1 hash of username:password> is used as an ACL ID (Please see above pictures under the Auth section).

What’s really funny is that if I authenticate an individual user on a znode using digest scheme on ZooKeeper client, instead of storing the username and encoded hash string of <username:password> like it should, it stores a clear, human-readable text of <username:password> as an ID. Executing the addauth command before setting the ACL with digest scheme does not work either. Below is the picture that illustrates my point:


Unless it is easy to work backwards – decoding the user_abc:pwd_abc, and then take that decoded string and undo the SHA-1 hashing part, it turns out setting ACL using digest scheme on a znode in ZooKeeper client is pointless.

Good thing is that if you setAcl a znode using digest scheme via client, you can delete it.

1.4. Host Scheme

Host scheme represents anyone within the same hosting server. I have not done enough with the host scheme yet, but I will come back to this with more details.

1.5. IP Scheme

IP scheme represents any user within the same IP address. Easiest example to use in this case would be, which represents the user of that any local machine, since any local machine will have point to the localhost. Below is the syntax of setAcl using IP scheme:

setAcl /<node-name> ip:<IPv4-address>:<permission-set>

Using the syntax above, below is an example using the IP address:

setAcl /newnode ip:

If done correctly, you should get the znode stat like the picture below:


That is it for now. On my next blog post, I will briefly talk about how to access them in Java; furthermore, I will talk more in detail about how username and password are stored. Thanks for reading as usual, and happy zookeeping!

[ Tutorial ] Apache ZooKeeper ACL (Access Control List) Getting Permission Sets

Today, I will talk about the basics of ACL in ZooKeeper and getting the permission sets of ACL.

1. What is ACL?

ACL (Access Control List) is basically an authentication mechanism implemented in ZooKeeper. It makes znodes accessible to users, depending on how it is set. For example, if its scheme is set to world and ID set to anyone, then it is accessible by anyone in the world, thus the world scheme and anyone ID. However, if the scheme is anything other than the world, then it’s a different story. Let’s talk about the basics and its attributes first.

A typical permission set of a znode looks something like this: crdwa. This is actually an acronym (can be in any order) that stands for: Create, Read, Delete, Write, and Admin.

1.1. Getting the ACL of a Znode

To get the ACL of a particular znode, we execute the getAcl command in the ZooKeeper client.

It will return that znode’s ACL in this format:

'[ scheme ],'[ id ]
: [ permission-set ]

Syntax of the getAcl command is: getAcl Path

Example: getAcl /getmyacl

Think of the scheme as more like a specific group of users. The world scheme would represent everyone in the world, literally. There are also different schemes in ZooKeeper, which are digest (individual user with unique username and password), ip, which is an individual or group of users within the same IP address, and host, which is a group of users within the same host.

ID I believe is self-explanatory; should the scheme be world, then ID always has to be anyone. There is no point to restrict specific users if it is meant to be viewed by anyone.

Here, I have an example of getting the ACL of the getmyacl znode. By typing in the command getAcl /getmyacl, you will get something like this:


1.2. More About Permission Set

Notice how the permission set says crdwa. If you were trying to get the permission set of a znode in Java, you would get an integer value in return.

First off, you would call getPerms method to get the permission set of a znode in Java. As mentioned earlier, it returns an integer value. In this case, with this znode having a permission set of crdwa, in Java it returns 31, meaning that the user is authorized to create a child znode, read data of that znode, delete that znode, overwrite (or set data) the znode, and has administrative rights of that znode.

Each permission (create, read, delete, write, admin) is actually a bit, either 0 or 1, where 0 represents not allowed, and 1 represents allowed. So, if you convert that 31 into a binary number, you would get 11111. Refer to the following bullet points:

  • Read – 2^0
  • Write – 2^1
  • Create – 2^2
  • Delete – 2^3
  • Admin – 2^4

Say we have a getmyaccl znode. Create, read, and admin are allowed, but delete and write are not. According to my little bullet points above, in Java it would return 21 for the permission set. Convert that to binary, we get 10101 ( (2^4 = 16) + (2^2 = 4) + (2^0 = 1) ) = 21

Let’s try to change its permission set to cwa (create, write, admin) and see what integer value is returned in Java.

This time it returned 22, or 10110 ( (2^4 = 16) + (2^2 = 4) + (2^1 = 2) ) = 22

To get the permission set of a znode, we need to import ACL class (from ZooKeeper package) and ArrayList. First, we need to create an instance of ArrayList that can store ACL object, and create a new instance of ACL object, assign that to the first element of the ArrayList. What’s interesting is that ArrayList contains only one element. Following is the code snippet on how to get the permission set of a znode:

List acl = new ArrayList(); // create new instance of ArrayList to store ACL object
acl = zk.getACL("/getmyacl", stat); 
ACL aclElement = acl.get(0);
System.out.println(aclElement.getPerms()); // for printing the permission set on the screen.  
                                    // this is also how I get 21 and 22 earlier for the permission set.  

When creating a znode the simplest way, any user is authorized the full crdwa permission set. I will talk more about setting the permission set in 2 different blog entries: First one will be the easy; where any user can access the znode. Second one will be tricky (also to talk about and explain), which involves an individual’s username and password, group of users within the same host or the IP address.

This sums up how to get the permission set of a znode. As usual, thanks for reading, and happy zookeeping!

[ Tutorial ] How to Install And Setup Apache ZooKeeper Standalone (Windows)

Today, I will talk about how to install Apache ZooKeeper and run the instance of it.

Prerequisites: JRE 1.6 or higher required (for development purposes, I would install JDK instead 1.6 or higher) At the time of writing this blog, the current stable version of Java is 1.8 and that should work perfectly fine (I have 1.7.0_51)

NOTE: I noticed that some of my peers tend to forget to set the environment variables, so please remember to set them before proceeding.

1. Installing Apache ZooKeeper

1. Download Apache ZooKeeper. You can choose from any given mirror –
2. Extract it to where you want to install ZooKeeper. I prefer to save it in the C:\dev\tools directory. Unless you prefer this way as well, you will have to create that directory yourself.
3. Set up the environment variable.

  • To do this, first go to Computer, then click on the System Properties button.
  • image-003

  • Click on the Advanced System Settings link to the left.
  • image-004

  • On a new window pop-up, click on the Environment Variables... button.
  • image-005

  • Under the System Variables section, click New...
  • For the Variable Name, type in ZOOKEEPER_HOME. Variable Value will be the directory of where you installed the ZooKeeper. Taking mine for example, it would be C:\dev\tools\zookeeper-3.x.x.
  • zook env var 1

  • Now we have to edit the PATH variable. Select Path from the list and click Edit...
  • It is VERY important that you DO NOT erase the pre-existing value of the Path variable. At the very end of the variable value, add the following: %ZOOKEEPER_HOME%\bin; Also, each value needs to be separated by semicolon.
  • zook env var

  • Once that’s done, click OK and exit out of them all.

That takes care of the ZooKeeper installation part. Now we have to configure it so the instance of ZooKeeper will run properly.

2. Configuring ZooKeeper Server

If you look at the <zookeeper-install-directory> there should be a conf folder. Open that up, and then you’ll see a zoo-sample.cfg file. Copy and paste it in the same directory, it should produce a zoo-sample - Copy.cfg file. Open that with your favorite text editor (Microsoft Notepad should work as well).

Edit the file as follows:


NOTE: you really don’t need lines 2 (initLimit=5), 3 (syncLimit=5), and 6 (server.1=localhost:2888:3888). They’re just there for a good practice purposes, and especially for setting up a multi-server cluster, which we are not going to do here.

Save it as zoo.cfg. Also the original zoo-sample.cfg file, go ahead and delete it, as it is not needed.

Next step is to create a myid file. If you noticed earlier in the zoo.cfg file, we wrote dataDir=/usr/zookeeper/data. This is actually a directory you’re going to have to create in the C drive. Simply put, this is the directory that ZooKeeper is going to look at to identify that instance of ZooKeeper. We’re going to write 1 in that file.

So go ahead and create that usr/zookeeper/data directory, and then open up your favorite text editor.

Just type in 1, and save it as myid, set the file type as All files. This may not be insignificant, but we are going to not provide it any file extension, this is just for the convention.


Don’t worry about the version-2 directory from the picture. That is automatically generated once you start the instance of ZooKeeper server.

At this point, you should be done configuring ZooKeeper. Now close out of everything, click the Start button, and open up a command prompt.

3. Test an Instance of Running ZooKeeper Server

Type in the following command: zkServer.cmd and hit enter. You should get some junk like this that don’t mean much to us.

zookeeper fire up

Now open up another command prompt in a new window. Type in the following command: zkCli.cmd and hit enter. Assuming you did everything correctly, you should get [zk: localhost:2181<CONNECTED> 0] at the very last line. See picture below:

zookeeper client

If you are getting the same result, then you setup ZooKeeper server correctly. Thanks for reading, and happy zookeeping!

[ Tutorial ] ZNode Types and How to Create, Read, Delete, and Write in ZooKeeper (Java)

Please click here to read about Create, Read, Delete, and Write znodes in ZooKeeper Client.

This is more like a continuation of my blog entry about creating, deleting, reading, and writing a znode in ZooKeeper Client. It will be mostly a few snippets of code. Although there are multiple ways, I will only present the simplest and easiest ways to create, read, delete and write znodes.

1. Creating a Znode

Java’s ZooKeeper create method takes in 4 parameters: Path [String], Data [byte array], Access Control List [ArrayList] and Mode [CreateMode]. Enclosed in the square brackets is the object type each parameter is.

As mentioned few times in my earlier blog entries, path always needs to start with the root znode ( / ). Remember, the path consists of the root znode and the name of the znode you want to create. For example, to create a znode called new_znode, you would write /new_znode for its path.

Data of a znode is stored in byte array. For the learning purposes, we will use the getBytes() method, which is a String method that converts String into byte array. Size of each znode is 1MB.

Access Control List (ACL) is stored in an ArrayList. We will use the final variable in the Ids class, which we will need to import. Furthermore, I will talk about the ACL more in detail. But for now, we will just stick with the Ids class.

Mode is the types of znodes I mentioned on my last blog entry: persistent, ephemeral, and sequential. We will be importing the CreateMode class to do this. There are four modes: Persistent (default), Persistent-Sequential, Ephemeral, and Ephemeral-Sequential. For now, we will use the default mode.

Example create method:

public void createZnode(String path, byte[] data) throws KeeperException, InterruptedException {
    zk.create(path, data, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

create method throws KeeperException and InterruptedException. To anticipate the possibility of throwing either exceptions, surround the create method in a try...catch block.

Another example create method:

try {
    zk.create("/new_znode", "new znode".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
} catch(InterruptedException intrEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" already exists!");
} catch(KeeperException kpEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" already exists!");

2. Reading a Znode

To read a znode, we call the getData method. It takes 3 parameters: Path [String], watch [boolean], stat [Stat].

Again, it takes in the path to specify which znode you want to read. Watcher is relatively a simple concept, but I will talk more about this later. Stat contains a more in-depth information about the znode, such as number of children, when it was created, etc. Like the create method, this also throws InterruptedException and KeeperException.

Example getData method:

public byte[] readZnode(String path) throws KeeperException, InterruptedException {
    return zk.getData(path, true, zk.exists(path, true));

Another example getData method using try...catch block:

try {
    zk.getData("/new_znode", true, zk.exists("/new_znode", true));
} catch(InterruptedException intrEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" does not exist!");
} catch(KeeperException kpEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" does not exist!");

As you may have noticed, getData takes in a Watcher as part of the parameter. Watcher in a nutshell is a mechanism implemented in the ZooKeeper that keeps watch of the znode you specified; it serves to notify the client whether it has been deleted, its data has changed, or if there was any changes to that znode, it will notify the client. One thing to note is that once the Watcher notified the client of a watch_me znode (for example), a new Watcher needs to be set on that same znode again, or else it will not notify the client for the second time.

3. Writing (or Re-Writing) to a Znode

We use setData for writing to a znode. This method takes in 3 parameters: Path [String] as always, data [byte Array] that will overwrite the pre-existing data, and the version [int].

We have a new parameter (but fairly self-explanatory), which is version. Every time the znode gets updated, it makes sense to update its version. Because of this, if you try to pass in the integer value that is not a current version, it will throw a BadVersionException. Here is an example:


I have a znode named newznode and its dataVersion is 5. It would be a major hassle to go back to the code to update its version manually every time the client tries to update its data. Instead, we utilize the getVersion method (from Stat class) and pass that as an argument as follows:

public void writeZnode(String path, byte[] data) throws KeeperException, InterruptedException {
    Stat stat = zk.exists(path, true);
    zk.setData(path, data, stat.getVersion());

As usual, another example of setData method using the try...catch block:

try {
    Stat stat = zk.exists("/new_znode", true);
    zk.setData("/new_znode", "new data".getBytes(), stat.getVersion());
} catch(InterruptedException intrEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" does not exist!");
} catch(KeeperException kpEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" does not exist!");

Now, let’s try to pass in a random integer other than its version. Same scenario (except its version is now 6).

error version

Instead of passing in the stat.getVersion() like I should be, I’ll try to pass in a non-6 value, and you should expect to see this:

exception error

4. Deleting a Znode

To delete a znode, first we use the delete method (I’m sure you could have guessed that much), and it takes in 2 parameters: Path [String] and Version [int]. Again, because I have already went over these two parameters, I will move straight into the code. Version parameter in this method is exactly the same as the version from the setData method.

public void deleteZnode(String path) throws KeeperException, InterruptedException {
    Stat stat = zk.exists(path, true);
    zk.delete(path, stat.getVersion());

And finally, another (and the last) example using the try...catch block:

try {
    Stat stat = zk.exists("/new_znode", true);
    zk.delete("/new_znode", stat.getVersion());
} catch(InterruptedException intrEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" does not exist!");
} catch(KeeperException kpEx) {
    // do something to prevent the program from crashing
    System.out.println("\"new_znode\" does not exist!");

Thanks for reading. Happy zookeeping

OpenStack Swift Introduction

I heard about the OpenStack technology while working as an intern. My former supervisor, Kevin Stoll, introduced the interns and myself the OpenStack, with him being a huge fan of cloud architecture, automation, and other web service related technologies (He is also big on RESTful).

OpenStack technology has multiple open-source software, most of them which can be found in their own Github repository. Out of them, I will talk about Swift today.

What is OpenStack Swift?

OpenStack Swift (or just Swift, also known as StackSwift) is a powerful object storage system that is designed to store objects, such as media, documents, data, backups, projects, etc. Swift is a highly scalable software, exceptionally high that they claim it is able to “scale to thousands of machines providing hundreds of Petabytes of storage distributed in geographically distant regions.” (from With a claim like this, it is seemingly guaranteed that Swift is able to scale horizontally with zero failures.

Swift utilizes the RESTful API (Representational State Transfer Application Programming Interface), which enables users to create, write (or overwrite), delete and read objects, as well as metadata, primarily for indexing and searching objects.

Now let’s talk about some of the characteristics that makes Swift attractive to big companies:

  • Swift is open-source software and freely available (
  • Ideally runs on Linux OS and other x86 servers
  • Comparable to Amazon S3 (Amazon Simple Storage Service) in terms of scalability and eventual consistency
  • All objects have its own URL for access via browser
  • All objects are accessed and administered via RESTful HTTP API
  • Swift stores multiple copies of objects (to prevent loss should server crash)

How is/What makes Swift Strongly Consistent?

No one cannot stress enough how important it is for the object storage to maintain strong consistency. Swift is a prime example; it will want to prevent the loss of objects and/or data, and keep them from being corrupt as long as possible. Otherwise, it would defeat the purpose of Swift being an object storage software.

Swift protects its objects by keeping multiple copies of object throughout multiple nodes. This allows clients to access objects, should a node, or even multiple nodes fail. The way Swift is designed ensures all copies of stored objects to be most up-to-date. This way, should the node(s) fail, users will still have access to a good copy of objects. As the number and volume of objects grow, more distribution across multiple regions occur; this in turn allows Swift to maintain its already strong consistency.

What Are My Plans Regarding Swift

First things first: learn more about Swift, study the code, and probably make my own object storage server. As a developer, I would like to make some contributions, no matter how small it may be. Before working as an intern and being introduced to these concepts, I have always wanted to create my own home cloud server. Back then, I wanted to make my own tutorial videos and host them in my own server for other visitors to stream. It may be a small, but centralized start, I believe that Swift is an optimal solution for it. Being a developer, I am also planning to make contributions to this software and utilize it in my own home server later in the future. In addition to that, I will write more blog entries to talk more about Swift.