IBM InfoSphere DataStage Introduction

I recently started a new job where I was introduced to the IBM InfoSphere DataStage software, and I did some research on it as well.

IBM InfoSphere DataStage (DataStage for short from here on) is an ETL (Extract, Transform, Load) tool used for data integration, regardless of the type of data. It uses a high-performance parallel framework to process data simultaneously. Thanks to its highly scalable platform, it provides flexible data integration, including for big data both at rest and in motion on distributed file systems.

DataStage uses a client-server design: the client communicates with the server to create jobs (so far I have mostly been creating Parallel Jobs, which I will cover in more detail later), and those jobs are administered against a central repository located on the server.

Repository In a Nutshell

Every time a client accesses the server, it is presented with the repository panel, on the left side by default. It is basically a set of directories, much like a file system (very similar to that of ZooKeeper), where clients administer batches of jobs. Besides the batches of jobs, this is also where the data schemas are stored, as well as the actual data that the jobs read from. I believe there is a lot more to it, but that is as much as I have gathered from my experience so far.

Key Strengths, Benefits and Features

DataStage is apparently an ideal tool for data integration, whether for so-called big data or for system migration. For any job the client creates, it can import, export, or create the metadata. Somewhat like an operating system, it can administer jobs: scheduling individual jobs, monitoring the progress of running jobs through logs, and of course executing (running) the jobs. Also, thanks to its graphical notation for data integration, development and job execution can be handled in a single environment.

Additional benefits and features of DataStage include, but are not limited to:

  • Thanks to its high scalability, it can integrate and transform large volumes of data.
  • It can directly access big data on a distributed file system, and it provides JSON support and a new JDBC connector.
  • It speeds up building, deploying, updating, and managing data integration, making that work more flexible and effective.

Relationship With Apache Hadoop and ZooKeeper?

DataStage's relationship to Apache Hadoop (to which ZooKeeper is also linked, having started out as a Hadoop sub-project) caught my attention. After further research, my understanding is that DataStage's ability to integrate big data using parallelism ties in closely with Hadoop. In addition, ZooKeeper is well known for making distributed systems easier to build. DataStage is able to integrate big data at rest or in motion on both distributed and mainframe platforms.


[ Tutorial ] ZNode Types and How to Create, Read, Delete, and Write in ZooKeeper (via zkClient)

Please click here to read about Create, Read, Delete, and Write znodes in Java.

About a month ago, I wrote a blog entry about how to connect to ZooKeeper.  Now I will talk about how to create, read, delete, and write znodes; these operations are also part of what I will refer to later as the permission sets.  In this entry I do it from the command prompt (in Windows) as a client, and in my next blog entry I will talk about how to do the same in Java.

Each znode has its own permission set.  The permissions are Create, Read, Delete, Write, and Admin (abbreviated CRDWA).  I will talk in more detail about permission sets, how to set them, and the access control list (ACL) in later blog posts.
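To make the CRDWA idea a little more concrete, here is a minimal sketch of a permission set as a bitmask. The numeric values mirror the constants in org.apache.zookeeper.ZooDefs.Perms, but the class itself (PermissionSketch) and its parse and allows helpers are made up for illustration and do not use the ZooKeeper library.

```java
import java.util.Locale;

// Illustrative only: a permission set modeled as a bitmask.
// The bit values mirror ZooDefs.Perms; the helpers are hypothetical.
final class PermissionSketch {
    static final int READ = 1;    // 1 << 0
    static final int WRITE = 2;   // 1 << 1
    static final int CREATE = 4;  // 1 << 2
    static final int DELETE = 8;  // 1 << 3
    static final int ADMIN = 16;  // 1 << 4

    // Converts letters such as "CRDWA" into a combined bitmask.
    static int parse(String letters) {
        int mask = 0;
        for (char c : letters.toLowerCase(Locale.ROOT).toCharArray()) {
            switch (c) {
                case 'c': mask |= CREATE; break;
                case 'r': mask |= READ; break;
                case 'd': mask |= DELETE; break;
                case 'w': mask |= WRITE; break;
                case 'a': mask |= ADMIN; break;
                default: throw new IllegalArgumentException("Unknown permission: " + c);
            }
        }
        return mask;
    }

    // True if the given permission bit is present in the mask.
    static boolean allows(int mask, int permission) {
        return (mask & permission) != 0;
    }

    public static void main(String[] args) {
        System.out.println(parse("CRDWA"));                 // 31
        System.out.println(allows(parse("RDWA"), CREATE));  // false
    }
}
```

A set such as RDWA simply leaves the create bit unset, which is why a znode with that set cannot have children created under it.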

1. Types of Znodes

Before I get into creating them, let’s briefly talk about the types of znodes: persistent, ephemeral, and sequential.

1.1. Persistent Znodes

These are the default znodes in ZooKeeper.  They stay on the ZooKeeper server permanently, until a client (the creator or any other) explicitly deletes them.

1.2. Ephemeral Znodes

Ephemeral znodes (also referred to as session znodes) are temporary znodes.  Unlike persistent znodes, they are destroyed as soon as the creator client's session ends, that is, when it logs out of the ZooKeeper server. For example, let’s say client1 created eznode1.  Once client1 logs out of the ZooKeeper server, eznode1 gets destroyed.

1.3. Sequential Znodes

A sequential znode is given a zero-padded 10-digit number, in increasing order, at the end of its name.  Let’s say client1 creates a sequential znode named sznode.  On the ZooKeeper server, it will be named something like this:

sznode0000000001

If client1 creates another sequential znode under the same parent, it bears the next number in the sequence.  So the next sequential znode will be called <znode name>0000000002.
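The naming rule can be sketched in a few lines of Java. This is only an illustration of the format (a zero-padded 10-digit counter appended to the requested name); the class and method names are hypothetical, and in a real ensemble the counter is chosen by the server, not by the client.

```java
import java.util.Locale;

// Illustrative only: reproduces the naming format of sequential znodes.
final class SequentialNameSketch {
    // Appends a zero-padded 10-digit counter to the requested name.
    static String sequentialName(String requestedName, int counter) {
        return requestedName + String.format(Locale.ENGLISH, "%010d", counter);
    }

    public static void main(String[] args) {
        System.out.println(sequentialName("sznode", 1)); // sznode0000000001
        System.out.println(sequentialName("sznode", 2)); // sznode0000000002
    }
}
```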

2. Znode “Anatomy”

Each znode has a variety of stats, such as its path, its name, the data it stores, when it was created, its own permission set, etc.  For the sake of simplicity and demonstration, I will only talk about the path, name and data.

The path of every znode ALWAYS starts with the root, which is a slash symbol.  In this example, the path of znode1 would be /znode1, and its name is znode1.  Every znode holds data (which can also be blank).  Keep in mind that the data is stored as a byte array, because this will be important once we deal with znode data in Java.

In order to do anything with a znode, you must specify its path.  If you only specify its name, the client will report a syntax error, so keep that in mind.
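Here is a small sketch of the path, name, and byte-array ideas in plain Java. The path check is deliberately simplified (ZooKeeper itself enforces more rules than these), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a simplified view of znode paths, names, and data.
final class ZnodePathSketch {
    // Simplified check: must start at the root, no trailing slash
    // (except for the root itself), and no empty path segments.
    static boolean looksLikeValidPath(String path) {
        if (path == null || !path.startsWith("/")) return false;
        if (path.equals("/")) return true;
        return !path.endsWith("/") && !path.contains("//");
    }

    // Returns the last path segment, e.g. "znode1" for "/znode1".
    static String nameOf(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeValidPath("/znode1")); // true
        System.out.println(looksLikeValidPath("znode1"));  // false
        System.out.println(nameOf("/parent/child"));       // child

        // Znode data travels as bytes, so strings are encoded and decoded.
        byte[] data = "mydata".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);                              // 6
        System.out.println(new String(data, StandardCharsets.UTF_8)); // mydata
    }
}
```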

3. Creating a Znode

As mentioned earlier, znodes are persistent by default.  In a ZooKeeper client, we type in commands to perform different actions with the znode, such as create, delete, update its data, etc.

To create a znode, we need to specify its path (see above).  Now remember, a path of any znode ALWAYS starts with the root znode.  The command syntax for creating a znode is as follows:

create -<options> /<znode-name> <znode-data>

With that in mind, here are examples of creating the different types of znodes:

Persistent (Default): create /znode mydata

Ephemeral: create -e /eznode mydata

Sequential: create -s /sznode mydata

Each znode can also have child znodes, depending on the permission set of that particular znode.  For instance, if a znode named nochild has the RDWA permission set, where create is not allowed, then nochild cannot have any child znodes.  Please note that in the following syntax:

create /<parent-znode>/<child-znode> <child-znode-data>

the <parent-znode> MUST exist in order to create the <child-znode>; otherwise, it will not work.

** NOTE: ephemeral znodes CANNOT contain any child znodes!  Because ephemeral znodes are temporary, they are destroyed when the creator client logs out of the ZooKeeper server, which would mean all child znodes under that ephemeral znode being destroyed automatically as well. To prevent that, ephemeral znodes cannot have children.
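The two rules above (the parent must exist, and an ephemeral znode cannot be a parent) can be sketched with a tiny in-memory model. This is not how ZooKeeper is implemented; CreateRulesSketch and its methods are hypothetical names used only to show the checks.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: an in-memory model of the create rules.
final class CreateRulesSketch {
    // Maps each path to whether it is ephemeral; the root always exists.
    private final Map<String, Boolean> nodes = new HashMap<>();

    CreateRulesSketch() {
        nodes.put("/", false);
    }

    static String parentOf(String path) {
        int lastSlash = path.lastIndexOf('/');
        return lastSlash == 0 ? "/" : path.substring(0, lastSlash);
    }

    void create(String path, boolean ephemeral) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("Path must start with /: " + path);
        }
        if (nodes.containsKey(path)) {
            throw new IllegalStateException("Node exists: " + path);
        }
        String parent = parentOf(path);
        if (!nodes.containsKey(parent)) {
            throw new IllegalStateException("No parent: " + parent);
        }
        if (nodes.get(parent)) {
            throw new IllegalStateException("Ephemeral parent: " + parent);
        }
        nodes.put(path, ephemeral);
    }

    boolean exists(String path) {
        return nodes.containsKey(path);
    }

    public static void main(String[] args) {
        CreateRulesSketch tree = new CreateRulesSketch();
        tree.create("/parent", false);
        tree.create("/parent/child", false); // works: the parent exists
        tree.create("/eznode", true);
        try {
            tree.create("/eznode/child", false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Ephemeral parent: /eznode
        }
    }
}
```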

4. Deleting a Znode

There are 2 ways to delete a znode.  The first is a plain deletion, and the second is a recursive deletion.  The plain deletion uses the following syntax:

delete /<znode>

Say we have a deleteme znode. As you may have guessed, the following command deletes it:

delete /deleteme

If we wanted to delete a child znode (i_am_bug in this example) instead, we just need to specify the path of that child znode, like so:

delete /i_have_bug/i_am_bug

The second way to delete a znode is recursive deletion.  This method is necessary for a znode that has child znodes, because the plain delete command refuses to delete a znode that still has children.

Recursion in general means solving a big problem by applying the same step to smaller and smaller pieces of it.  In this case, it starts deleting znodes at the lowermost level first, one by one, then works its way up to the znode you named.

rmr /<znode-with-child>

I really like this command, because it works on znodes without children as well. You might as well use rmr for any znode; just be careful not to delete a child znode you need to keep. (In newer ZooKeeper releases, 3.5 and later, rmr is deprecated in favor of deleteall.)
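The bottom-up order of a recursive deletion can be sketched with an in-memory set of paths. This only models the idea; RecursiveDeleteSketch and its methods are hypothetical and are not part of the ZooKeeper client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: models plain and recursive deletion on a set of paths.
final class RecursiveDeleteSketch {
    private final TreeSet<String> paths = new TreeSet<>();

    void add(String path) { paths.add(path); }

    boolean exists(String path) { return paths.contains(path); }

    // Direct children only: "/a/b" is a child of "/a", "/a/b/c" is not.
    List<String> childrenOf(String path) {
        String prefix = path + "/";
        List<String> children = new ArrayList<>();
        for (String candidate : paths) {
            boolean isBelow = candidate.startsWith(prefix);
            boolean isDirect = candidate.indexOf('/', prefix.length()) < 0;
            if (isBelow && isDirect) children.add(candidate);
        }
        return children;
    }

    // Plain delete: refuses to remove a node that still has children.
    void delete(String path) {
        if (!childrenOf(path).isEmpty()) {
            throw new IllegalStateException("Node not empty: " + path);
        }
        paths.remove(path);
    }

    // Recursive delete: removes the children first, then the node itself.
    // The order list records the sequence in which nodes were removed.
    void deleteRecursively(String path, List<String> order) {
        for (String child : childrenOf(path)) {
            deleteRecursively(child, order);
        }
        delete(path);
        order.add(path);
    }

    public static void main(String[] args) {
        RecursiveDeleteSketch tree = new RecursiveDeleteSketch();
        tree.add("/i_have_bug");
        tree.add("/i_have_bug/i_am_bug");
        List<String> order = new ArrayList<>();
        tree.deleteRecursively("/i_have_bug", order);
        System.out.println(order); // [/i_have_bug/i_am_bug, /i_have_bug]
    }
}
```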

5. Reading Znode Data

Here, we use the get command to fetch the data of a particular znode.  As always, specify the path of the znode you want to read the data from.

get /<znode-name>

This command also returns what is known as the stat.  The data is on the top line.  The stat provides detailed information about the znode, such as when it was created or last rewritten, its version, the number of children it contains, etc.

The only time the get command will not work on a znode is when read permission is not granted in its permission set, or when its ACL (Access Control List) restricts access through a scheme such as digest, hosts (depends on the host), or IP (depends on the IP address).  I will talk more about the ACL in my later blog posts.

6. (Re)Writing Znode Data

We use the set command to overwrite the znode data.  As always, specify the path of the znode you want to overwrite.

set /<znode-name> <new-data>

Once executed, it will return the stat of the znode, excluding the new data you have set.  Like the get command, the set command will not work if the write permission is not allowed in the permission set, or if its ACL has been configured accordingly.

By now, you should be able to execute basic commands for the znodes. In the next blog post, I will talk about how to do them in Java.

[ Tutorial ] Maven – Install Maven and Create New Maven Project via Command Prompt

This tutorial covers how to install Maven and create a Maven project.  At the time of writing (May 28, 2014), the stable version of Maven is 3.2.1.

1. Download and Install Apache Maven

1.1. Download Maven

First things first: download Maven (http://maven.apache.org/download.cgi).

Under the section that says This is the current stable version of Maven, click on “apache-maven-3.2.1-bin.zip”.  It will download a zip file for you to extract; Maven has to be installed manually this way.  What I prefer to do is create a folder called dev on the C drive (I’m using Windows 7), and inside that folder, create another folder called tools.  This is where I extract Maven and all my other tools.

1.2. Set up System Environment Variables

The next step is to set the environment variable for Maven.  Open up ‘Computer’, then at the top of the window, click on the button that says ‘System Properties’.

On the left is a link that says ‘Advanced System Settings’; clicking it brings up a pop-up window.

At the bottom right of the new pop-up window, click on the button that says ‘Environment Variables…’, which brings up another pop-up window.

Under the System Variables section, we need to create a new variable; this tells your computer which directory to look at to run maven.  Click ‘New…’ and do the following:

Variable Name: MAVEN_HOME
Variable Value: C:\dev\tools\apache-maven-3.2.1 (Only if you extracted maven in the C:\dev\tools directory; different if you extracted it elsewhere)


Click OK.  Still under the System Variables section, scroll down until you find PATH variable, and click ‘Edit…’

Environment variables, such as PATH, can also have multiple values; each value needs to be separated by a semicolon. At the end of the Variable value, add a semicolon if there isn’t any, and type the following:

%MAVEN_HOME%\bin;

Please do not delete the value that was already set for the PATH variable. If you do, other programs that depend on that value might not work properly! If you’re doing it correctly, the value should now end with %MAVEN_HOME%\bin; after the entries that were already there.

(WARNING: Windows does not check the syntax of what you type here.  So if you typed it wrong, your computer will not know where to look to run Maven!)

At this point, you are all set and ready to run maven and create a new maven project.  Click OK.
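Since nothing validates the PATH value for you, here is a minimal Java sketch of how such a check could look: it splits a PATH-style string on semicolons and reports whether a given entry is present. The class and method names are hypothetical, and the sample values are only examples.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: checks a Windows-style PATH value for an entry.
final class PathEntrySketch {
    // Splits the value on semicolons and drops empty entries.
    static List<String> entries(String pathValue) {
        List<String> result = new ArrayList<>();
        for (String raw : pathValue.split(";")) {
            String entry = raw.trim();
            if (!entry.isEmpty()) result.add(entry);
        }
        return result;
    }

    // Windows paths are case-insensitive, so compare without regard to case.
    static boolean containsEntry(String pathValue, String wanted) {
        for (String entry : entries(pathValue)) {
            if (entry.equalsIgnoreCase(wanted)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String good = "C:\\Windows\\system32;C:\\Windows;%MAVEN_HOME%\\bin;";
        String missingSemicolon = "C:\\Windows\\system32;C:\\Windows%MAVEN_HOME%\\bin;";
        System.out.println(containsEntry(good, "%MAVEN_HOME%\\bin"));             // true
        System.out.println(containsEntry(missingSemicolon, "%MAVEN_HOME%\\bin")); // false
    }
}
```

The second value shows the typical mistake: without the separating semicolon, the Maven entry is glued to the previous one and is no longer found.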

1.3. Test for Successful Installation

Now open up a command prompt (click the Start button, type in cmd and hit enter).

Run the following command:

mvn -version

… and you should see Maven print its version information (Apache Maven 3.2.1 in my case), along with details about your Java installation.

If you do, you installed Maven successfully.  Now let’s create a Maven project.

2. Create New Maven Project

Creating a Maven project is ridiculously easy.  The only thing to watch out for is typing the command correctly.

First off, use the cd (change directory) command to go to the directory where you will be creating a project.  What I prefer personally is to create it in C:\dev\workspace directory.  After you open up the command prompt, type in cd \dev\workspace or whichever directory you decide to work on.


Now type in the following command:

mvn archetype:generate -DgroupId=com.appname.app -DartifactId=app-name -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

and hit enter.  It should create a new project.  At the end, the command prompt will read BUILD SUCCESS.  Next step is to build the project package.  Change the directory to app-name (cd app-name) and type in the following command:

mvn package

and hit enter.  Again, the output will end with BUILD SUCCESS. At this point, you’re all set.  Thanks for reading, and happy developing.
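To get a feel for what the archetype:generate command produced, the sketch below derives where the generated source files should end up, assuming the quickstart archetype's usual behavior of using the groupId as the Java package and generating App.java and AppTest.java. The helper class and its method names are hypothetical.

```java
// Illustrative only: derives the expected file layout of a project
// generated with maven-archetype-quickstart.
final class ArchetypeLayoutSketch {
    // The package defaults to the groupId, with dots turned into folders.
    static String mainClassPath(String groupId, String artifactId) {
        return artifactId + "/src/main/java/" + groupId.replace('.', '/') + "/App.java";
    }

    static String testClassPath(String groupId, String artifactId) {
        return artifactId + "/src/test/java/" + groupId.replace('.', '/') + "/AppTest.java";
    }

    public static void main(String[] args) {
        // app-name/src/main/java/com/appname/app/App.java
        System.out.println(mainClassPath("com.appname.app", "app-name"));
        // app-name/src/test/java/com/appname/app/AppTest.java
        System.out.println(testClassPath("com.appname.app", "app-name"));
    }
}
```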

[ Tutorial ] Maven – Connect to a ZooKeeper in Java

This tutorial covers how to connect to ZooKeeper in Java with a Maven project.

(Click here to download the source code for this tutorial)

1. What is Apache Maven?

Apache Maven (or just Maven) is simply “a software project management and comprehension tool” (from http://maven.apache.org).  For our purposes, Maven helps you download what are called dependencies; think of them as the external JAR files that you would otherwise need to download yourself and add to your build path.

1.1. Create Maven Project

Maven uses a POM (Project Object Model) file; this is what makes Maven very useful.  You specify the required dependencies there, and after you update the project, Maven automatically downloads them for you.

The first step is to create a Maven project (see my tutorial on how to create a Maven project).  In the project you just created, you will see an XML file called pom.xml.

1.2. Configure Project

Open pom.xml and edit it as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>zook</groupId>
    <artifactId>zook</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    
    <name>zook</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
            <version>3.4.6</version>
        </dependency>
    </dependencies>
</project>

In Eclipse, right-click on your project > Maven > Update Project (or press Alt + F5).  This usually works for me if “Force Update of Snapshots/Releases” is checked.

Then click OK.

This lets you use any class included in the zookeeper dependency.
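If you are curious where the downloaded JAR ends up, the sketch below maps a dependency's coordinates to its path inside the local Maven repository, which by default lives in the .m2/repository folder of your home directory. The class and method names are hypothetical; only the path layout reflects Maven's standard convention.

```java
// Illustrative only: maps Maven coordinates to the JAR's relative path
// inside the local repository (by default <home>/.m2/repository).
final class LocalRepoPathSketch {
    static String jarPath(String groupId, String artifactId, String version) {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version
                + "/" + artifactId + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        // org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.6.jar
        System.out.println(jarPath("org.apache.zookeeper", "zookeeper", "3.4.6"));
    }
}
```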

2. Connect to ZooKeeper

2.1. ZooKeeper Connector Class

The next step is to create a new class; let’s call it ZkConnector.

Here, you are going to import the following 6 classes:

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.ZooKeeper.States;

The ZkConnector class lets your program interact with the ZooKeeper server.  We will declare two fields (instance variables) here:

ZooKeeper zookeeper;
java.util.concurrent.CountDownLatch connectedSignal = new java.util.concurrent.CountDownLatch(1);

and write three methods.  The first method connects to the ZooKeeper server, taking the host as an argument.  The second closes the client's connection (session) to the ZooKeeper server; it does not shut the server itself down.  The third returns the connected ZooKeeper client object.

2.2. Connect method

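public void connect(String host) throws IOException, InterruptedException {
    // 5000 is the session timeout in milliseconds.
    zookeeper = new ZooKeeper(host, 5000, new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            // Release the latch once the client reports that it is connected.
            if (event.getState() == KeeperState.SyncConnected) {
                connectedSignal.countDown();
            }
        }
    });
    // Block until the watcher above has seen the SyncConnected event.
    connectedSignal.await();
}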
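The waiting pattern in connect can be tried out without a ZooKeeper server. In the sketch below, a background thread stands in for the ZooKeeper client and fires a "connected" callback, while the caller blocks on a CountDownLatch. Everything here (LatchSketch, ConnectionListener, connectAsync) is a hypothetical stand-in; it also waits with a timeout, which is one possible variation so the caller does not block forever if the event never arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: the latch pattern from connect(), without ZooKeeper.
final class LatchSketch {
    interface ConnectionListener {
        void onConnected();
    }

    // Stand-in for the client: reports "connected" from another thread.
    static void connectAsync(ConnectionListener listener) {
        Thread worker = new Thread(listener::onConnected);
        worker.setDaemon(true);
        worker.start();
    }

    // Returns true if the callback fired before the timeout elapsed.
    static boolean connectAndWait(long timeoutMillis) {
        CountDownLatch connectedSignal = new CountDownLatch(1);
        connectAsync(connectedSignal::countDown);
        try {
            return connectedSignal.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(connectAndWait(5000)); // true
    }
}
```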

2.3. Close method

public void close() throws InterruptedException {
    zookeeper.close();
}

2.4. Getter method

public ZooKeeper getZooKeeper() {
    if (zookeeper == null || !zookeeper.getState().equals(States.CONNECTED)) {
        throw new IllegalStateException("ZooKeeper is not connected.");
    }
    return zookeeper;
}

3. Test ZooKeeper Connection

3.1. Tester Class

Let’s test this class to see if it works.

We’re going to create another class; call it ZkConnectTest. This class imports the following 5 classes:

import java.io.IOException;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

Many methods in the ZooKeeper API throw exceptions.  We need to add a throws declaration for three exceptions to the main method: IOException, InterruptedException, and KeeperException.  Another way is to enclose the exception-throwing statements in try... catch, but we’ll make our main method throw those exceptions.
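For comparison, here is a minimal sketch of the two styles using only the JDK. The connect method is a hypothetical stand-in for any call that declares a checked exception; it is not the ZooKeeper API.

```java
import java.io.IOException;

// Illustrative only: "throws" versus try/catch for a checked exception.
final class ExceptionHandlingSketch {
    // Hypothetical stand-in for a call that may fail.
    static String connect(String host) throws IOException {
        if (host.isEmpty()) {
            throw new IOException("No host given");
        }
        return "connected to " + host;
    }

    // Style 1: declare the exception and let the caller deal with it.
    static String connectOrThrow(String host) throws IOException {
        return connect(host);
    }

    // Style 2: handle the exception right where it occurs.
    static String connectOrReport(String host) {
        try {
            return connect(host);
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(connectOrReport("localhost")); // connected to localhost
        System.out.println(connectOrReport(""));          // failed: No host given
    }
}
```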

First, let’s declare a new ZooKeeper variable, call it zk.

ZooKeeper zk;

Next, create a new instance of the ZkConnector class, let’s call it zkc.

ZkConnector zkc = new ZkConnector();

In the last 3 lines, we are going to:

  • connect to ZooKeeper
  • let zk be the connected ZooKeeper instance
  • and create a new znode, just to see if it worked correctly.

zkc.connect("localhost");
zk = zkc.getZooKeeper();
zk.create("/newznode", "new znode".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

3.2. Run an Instance of ZooKeeper

Save your source code, and let’s start a ZooKeeper server.  An instance of ZooKeeper must be running for your program to work. Open up a command prompt (Windows) or a terminal (Mac or Linux), go to the bin directory of your ZooKeeper installation, and type in the following command:

Windows users: zkServer.cmd
Mac/Linux users:  ./zkServer.sh start

3.3. Interact With ZooKeeper as Client

Next, we are going to connect to ZooKeeper as a client.  Open up another command prompt (Windows) or terminal (Mac or Linux) and type in the following command:

Windows users: zkCli.cmd
Mac/Linux users: ./zkCli.sh -server localhost:2181

In the client window, type in the following command: ls /

If you have not created any znodes yet, you should only see the zookeeper znode.  ZooKeeper creates it by default, and it cannot be deleted.  What we just did was print a list of all child znodes of the root (/) znode.

Go back to the program we wrote, and run it.

Then go back to the client window, and type in the ls / command.  If you see the newznode, then we successfully wrote a program to connect to the zookeeper. Thanks for reading, and happy zookeeping.
