IBM InfoSphere DataStage Introduction

I recently started a new job, where I learned a little about the IBM InfoSphere DataStage software and did some research on it.

IBM InfoSphere DataStage (I’ll call it DataStage from now on for short) is an ETL (Extract, Transform, Load) tool used for data integration, regardless of the type of data. It uses a high-performance parallel framework to process data simultaneously. Thanks to its scalable platform, it offers flexible data integration, including big data both at rest and in motion on distributed file systems.
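To make the idea of parallel ETL a bit more concrete, here is a minimal sketch in plain Python. It is not DataStage code, and it is not how DataStage is implemented; the function names, the sample rows, and the round-robin partitioning are all my own assumptions, chosen only to illustrate extracting rows, splitting them into partitions, transforming the partitions at the same time, and loading the result.

```python
from concurrent.futures import ThreadPoolExecutor


def extract():
    # Hypothetical source: in a real job this would be a file or a database table.
    return [{"id": i, "name": f"user{i}"} for i in range(8)]


def partition(rows, n):
    # Round-robin partitioning: row 0 goes to partition 0, row 1 to partition 1, ...
    return [rows[i::n] for i in range(n)]


def transform(part):
    # Example transformation: upper-case the name column of every row.
    return [{**row, "name": row["name"].upper()} for row in part]


def load(parts):
    # Collect the partitions back into one target, ordered by id.
    target = [row for part in parts for row in part]
    return sorted(target, key=lambda row: row["id"])


def run_job(n_partitions=4):
    rows = extract()
    parts = partition(rows, n_partitions)
    # Each partition is transformed by its own worker.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        transformed = list(pool.map(transform, parts))
    return load(transformed)


if __name__ == "__main__":
    for row in run_job():
        print(row)
```

The point of the sketch is only that the same transformation runs on several slices of the data at once, which is the general idea behind a partitioned parallel job.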

DataStage uses a client-server design: the client communicates with the server to create jobs (so far I have mostly been creating Parallel Jobs, which I will talk about in more detail later), and those jobs are administered against the central repository located on the server.

Repository In a Nutshell

Every time a client connects to the server, the repository panel appears on the left side by default. It is essentially a set of directories, much like a file system (very similar to that of ZooKeeper), where clients administer batches of jobs. In addition to the jobs, this is also where the data schemas are stored, as well as, as far as I can tell, the data that the jobs read from. I believe there is a lot more to it, but that is as much as I have been able to gather from my experience so far.
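The file-system comparison can be pictured with a small Python sketch. This is only a mental model, not the actual repository format or any DataStage API; the folder and job names below are made up for illustration.

```python
# Hypothetical repository layout: folders are dicts, leaves are jobs or schemas.
repository = {
    "Jobs": {
        "Nightly": {
            "load_customers": "parallel job",
            "load_orders": "parallel job",
        },
    },
    "Table Definitions": {
        "Sales": {
            "customers": ["id", "name"],
        },
    },
}


def list_paths(node, prefix=""):
    # Walk the tree and return one slash-separated path per leaf item.
    paths = []
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            paths.extend(list_paths(child, path))
        else:
            paths.append(path)
    return paths


if __name__ == "__main__":
    for path in list_paths(repository):
        print(path)
```

Each item ends up with a path such as `/Jobs/Nightly/load_customers`, which is roughly how I think about navigating the repository panel.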

Key Strengths, Benefits and Features

DataStage appears to be a good fit for data integration, whether for so-called big data or for system migration. For any job the client creates, it can import, export, or create the metadata. Somewhat like an operating system, it administers the jobs: scheduling individual jobs, monitoring the progress of running jobs through the log, and executing them. Also, thanks to its graphical notation for data integration, development and job execution can be handled in a single environment.

Additional benefits and features of DataStage include, but are not limited to:

  • Its high scalability allows the integration and transformation of large volumes of data.
  • It can directly access big data on a distributed file system. JSON support and a new JDBC connector are also provided.
  • It speeds up building, deploying, updating, and managing data integration, while keeping the process flexible and effective.

Relationship With Apache Hadoop and ZooKeeper?

DataStage's relationship to Apache Hadoop (to which ZooKeeper is also linked, having started out as a sub-project of Hadoop) caught my attention. After further research, I learned that DataStage's ability to integrate big data relies on the same concept of parallelism that Hadoop is built around, and that it can work with data stored on a Hadoop distributed file system. In addition, ZooKeeper is well known for making distributed systems easier to build. DataStage is able to integrate big data at rest or in motion on distributed and mainframe platforms.