Hadoop and Swift integration is the essential continuation of the Hadoop/OpenStack marriage. The key component in making this marriage work is the Hadoop Swift filesystem implementation. Although this implementation has been merged into the upstream Hadoop project, Sahara maintains a version with the most current features enabled.
You may build the jar file yourself by choosing the latest patch from the Sahara Extra repository and using Maven to build with the pom.xml file provided. Or you may get the latest jar pre-built from the CDN at http://sahara-files.mirantis.com/hadoop-swift/hadoop-swift-latest.jar
You will need to put this file into the Hadoop libraries directory (e.g. /usr/lib/share/hadoop/lib) on each JobTracker and TaskTracker node in the cluster for Hadoop 1.x, or on each ResourceManager and NodeManager node for Hadoop 2.x.
In general, when Sahara runs a job on a cluster, it handles configuring the Hadoop installation. Where a user requires more in-depth configuration, all the data is set in the core-site.xml file on the cluster instances, using this template:
<property>
    <name>${name} + ${config}</name>
    <value>${value}</value>
    <description>${not mandatory description}</description>
</property>
There are two types of configs here:
General. The ${name} in this case is fs.swift. Here is the list of ${config}:
Provider-specific. The patch for Hadoop supports different cloud providers. The ${name} in this case is fs.swift.service.${provider}.
Here is the list of ${config}:
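As an illustration of the template, the following Python sketch renders one property element. It is only a sketch under stated assumptions: the helper name render_property is hypothetical, and it assumes that ${name} and ${config} are joined with a dot, which matches the key fs.swift.service.sahara.username used in the distcp example in this document.

```python
import xml.etree.ElementTree as ET


def render_property(name, config, value, description=None):
    """Render one core-site.xml <property> element as a string.

    Assumption: ${name} and ${config} are joined with a dot.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = f"{name}.{config}"
    ET.SubElement(prop, "value").text = value
    # The description is not mandatory, so it is only emitted when given.
    if description is not None:
        ET.SubElement(prop, "description").text = description
    return ET.tostring(prop, encoding="unicode")


# Provider-specific example using the "sahara" provider and the "admin" user.
print(render_property("fs.swift.service.sahara", "username", "admin"))
```

The resulting element would be placed inside the configuration element of core-site.xml on each cluster instance.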
For this example, it is assumed that you have set up a Hadoop instance with a valid configuration and the Swift filesystem component. It is further assumed that there is a Swift container named integration holding an object named temp, and a Keystone user named admin with a password of swordfish.
The following example illustrates how to copy an object to a new location in the same container. We will use Hadoop’s distcp command (http://hadoop.apache.org/docs/r0.19.0/distcp.html) to accomplish the copy. Note that the service provider for our Swift access is sahara, and that we will not need to specify the project of our Swift container as it will be provided in the Hadoop configuration.
Swift paths are expressed in Hadoop according to the following template: swift://${container}.${provider}/${object}. For our example source this will appear as swift://integration.sahara/temp.
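The path template can be expressed as a small helper. This is a minimal Python sketch; the function name swift_path is hypothetical and not part of Hadoop or Sahara.

```python
def swift_path(container, provider, obj):
    """Build a Hadoop Swift path: swift://${container}.${provider}/${object}."""
    # Strip a leading slash so the object name is not doubled up in the path.
    return f"swift://{container}.{provider}/{obj.lstrip('/')}"


# Source and destination used in the example from this document.
print(swift_path("integration", "sahara", "temp"))
print(swift_path("integration", "sahara", "temp1"))
```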
Let’s run the job:
$ hadoop distcp -D fs.swift.service.sahara.username=admin \
-D fs.swift.service.sahara.password=swordfish \
swift://integration.sahara/temp swift://integration.sahara/temp1
Afterwards, confirm that temp1 has been created in our integration container.
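For scripted use, the same invocation can be assembled programmatically. The Python sketch below builds the distcp command shown above, plus a follow-up listing command. The helper names are hypothetical. Listing the container with hadoop fs -ls is an assumption: it is one possible way to confirm the copy, not a step prescribed by this document. The sketch only prints the commands and does not execute them.

```python
import shlex


def swift_credentials(provider, username, password):
    """Return the -D options carrying the Swift credentials for a provider."""
    return [
        "-D", f"fs.swift.service.{provider}.username={username}",
        "-D", f"fs.swift.service.{provider}.password={password}",
    ]


def distcp_command(provider, username, password, source, destination):
    """Argument list for copying source to destination with distcp."""
    creds = swift_credentials(provider, username, password)
    return ["hadoop", "distcp", *creds, source, destination]


def list_command(provider, username, password, path):
    """Argument list for listing a Swift path (assumed confirmation step)."""
    creds = swift_credentials(provider, username, password)
    return ["hadoop", "fs", *creds, "-ls", path]


copy = distcp_command("sahara", "admin", "swordfish",
                      "swift://integration.sahara/temp",
                      "swift://integration.sahara/temp1")
check = list_command("sahara", "admin", "swordfish",
                     "swift://integration.sahara/")

# Print shell-quoted commands; pass the lists to subprocess.run to execute.
print(shlex.join(copy))
print(shlex.join(check))
```

Keeping the arguments as a list avoids shell-quoting problems if a password contains special characters.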
Note: Container names must form a valid URI, because the container name is used in the host part of the Swift path (swift://${container}.${provider}/${object}).
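A container name can be checked before use with a sketch like the one below. It rests on an assumption that this document does not spell out: that "valid" means the name can serve as a single hostname label, made of letters, digits and hyphens, and neither starting nor ending with a hyphen. The function name is hypothetical.

```python
import re

# Assumption: a usable container name is a single hostname label.
_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$")


def is_valid_container_name(name):
    """Return True if name looks usable as the host label of a swift:// path."""
    return bool(_LABEL.match(name))


for candidate in ("integration", "my_container", "data-2014"):
    print(candidate, is_valid_container_name(candidate))
```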