Survey through Hadoop: How to cross the Sahara on top of an elephant and not die on the way – Part I

In this first part I will explain how to install Hadoop on your machine. This explanation is for Ubuntu; for other Linux distributions you just need to adapt the package commands.

Hadoop has several user-interface problems, like most Apache projects, and that's why even reaching the tarball to download is quite hard, but in this post I include a link to the latest stable release.

First, ensure that you have the basic tooling that Hadoop will need:

$ sudo apt-get install ssh
$ sudo apt-get install rsync
$ sudo apt-get install jsvc # OPTIONAL

Usually you will have the first two already installed and up to date, but you never know.
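
If you want to double-check what you already have, the versions are easy to query:

$ ssh -V
$ rsync --version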

After that, you just need to fetch Hadoop. Today the stable release is 2.4.1; you probably want to check the mirror for the current stable version, because if it changed, the following steps may not work as written.

$ wget http://ftp.cixug.es/apache/hadoop/common/stable/hadoop-2.4.1.tar.gz
$ tar -xvf hadoop-2.4.1.tar.gz
$ mv hadoop-2.4.1 hadoop
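
If you want to be sure the download is not corrupted, Apache publishes checksums alongside each release; you can compute yours and compare it by hand:

$ sha1sum hadoop-2.4.1.tar.gz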

If we check the layout of directories inside the hadoop folder:

$ ls -ls hadoop

4 drwxr-xr-x 2 santiago santiago 4096 Jun 21 08:05 bin
4 drwxr-xr-x 3 santiago santiago 4096 Jun 21 08:05 etc
4 drwxr-xr-x 2 santiago santiago 4096 Jun 21 08:05 include
4 drwxr-xr-x 3 santiago santiago 4096 Jun 21 08:05 lib
4 drwxr-xr-x 2 santiago santiago 4096 Jun 21 08:05 libexec
16 -rw-r--r-- 1 santiago santiago 15458 Jun 21 08:38 LICENSE.txt
4 -rw-r--r-- 1 santiago santiago 101 Jun 21 08:38 NOTICE.txt
4 -rw-r--r-- 1 santiago santiago 1366 Jun 21 08:38 README.txt
4 drwxr-xr-x 2 santiago santiago 4096 Jun 21 08:05 sbin
4 drwxr-xr-x 4 santiago santiago 4096 Jun 21 08:05 share

We can see that Apache suggests untarring this onto the root directory: the layout mirrors /bin, /etc, /lib and so on. This decision is of course up to you, and your choice will have repercussions on the environment file we are about to edit.

In order to work, Hadoop needs some care, and we will give it by editing the hadoop/etc/hadoop/hadoop-env.sh file:

$ vim hadoop/etc/hadoop/hadoop-env.sh

In this file we face several points of service configuration, such as file locations and JVM parameters.

Reading the file you will notice that you can, and probably want to, configure the JAVA_HOME environment variable. If you are not new to Java, you may already have it set up in your own environment. This JAVA_HOME is the one that the Hadoop services will use.

Since I am working with Scala on Java 1.7 and Hadoop suggests using 1.6, I will redefine the default JAVA_HOME parameter in this file, from

export JAVA_HOME=${JAVA_HOME}

to

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
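
If you are not sure where your JDK lives on Ubuntu, you can ask the system before picking a value:

$ readlink -f "$(which java)"   # resolves the symlink chain to the real java binary
$ ls /usr/lib/jvm/              # lists the JVMs installed from the repositories

JAVA_HOME should point to the JVM directory itself (the path up to and including something like java-1.6.0-openjdk-amd64), not to its bin folder.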

Also while reading this file you will notice Hadoop's dependency on jsvc (in case we want to run secure datanodes). Jsvc is a small tool and set of libraries that lets Unix daemons do some black magic (running as root, for example).
Thankfully it is available in the Ubuntu repositories and we can install it from there, as we already did. So if you need to go secure, install it. In any case, if you are just starting, you can leave it for later.

If you want to keep going with jsvc, point the variable to the directory that contains the binary (the Hadoop scripts append /jsvc themselves):

export JSVC_HOME=/usr/bin
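
If you are not sure where the Ubuntu package put the binary, you can check before setting the variable:

$ which jsvc
$ dpkg -L jsvc | grep bin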

We are almost done with the deployment; now we just need to move all the folders to their final place. Since I want to follow the Apache suggestion, I will run

$ rm hadoop/*.txt
$ sudo cp -r hadoop/* /
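
If you would rather not spread Hadoop over the root filesystem, an alternative (just a sketch, assuming you keep the tree at ~/hadoop; the variable and paths below are my choice, not something the tarball sets up for you) is to leave everything where it is and only extend your environment, for example in ~/.bashrc:

export HADOOP_PREFIX=$HOME/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin

With that layout, the start scripts used later in this post live in ~/hadoop/sbin instead of /sbin.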

Since the commands are now located in standard folders (or on your PATH, if you went for the alternative above), you can do something like

$ hadoop version

Hadoop 2.4.1
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1604318
Compiled by jenkins on 2014-06-21T05:43Z
Compiled with protoc 2.5.0
From source with checksum bb7ac0a3c73dc131f4844b873c74b630
This command was run using /share/hadoop/common/hadoop-common-2.4.1.jar

Great, it is already installed and working. It is nice to know that by default it uses the running machine as a single node, so you can already start doing your first experiments if you are a beginner.
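
If you want a quick first experiment in this default single-node mode, the distribution ships an examples jar; assuming it ended up under /share/hadoop/mapreduce just like the common jar shown above, you can run the classic pi estimator:

$ hadoop jar /share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 10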

If you want a more complex installation, in order to understand better the limitations of Hadoop and how to deal with the cloud definition itself, you can start heading to the configuration files: core-site.xml, hdfs-site.xml and mapred-site.xml, all of them at /etc/hadoop

$ vim /etc/hadoop/core-site.xml

Here you'll find the default configuration in all of them: an empty one, just the XML declaration and an empty <configuration> element.

To add the several services, all of them mapped to your own computer, you just need to add the following configuration inside the <configuration> element of each file:

conf/core-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
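
A small caveat: the 2.4.x tarball only ships a mapred-site.xml.template, so if mapred-site.xml is not there yet, create it from the template first (assuming the files ended up in /etc/hadoop as in this setup):

$ cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml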

We will go deeper into what each setting means in other posts; for now just believe me :).

We are almost there; now we just need an ssh DSA (not RSA) key configured on our machine. You may have one already; if you haven't, you can do the following:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

And then, add your public key to the authorized keys:

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
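
You can quickly verify that the key is picked up; logging into localhost should no longer ask for a password for your own user (root will still be prompted when we run the start scripts with sudo below):

$ ssh localhost
$ exit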

Finally, to start the services we need to execute two scripts located in sbin.

Be prepared to be prompted for the root user's password... several times. If you are on Ubuntu, you probably never configured it:

$ sudo su
$ passwd
Enter new UNIX password:
Retype new UNIX password:
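
One more step before launching anything: on a fresh installation HDFS wants its namenode formatted once. A minimal sketch, assuming the binaries landed in /bin as above:

$ sudo /bin/hdfs namenode -format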

Then, as promised, start the services.

Starting the distributed filesystem

$ sudo /sbin/start-dfs.sh

14/07/07 01:24:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
root@localhost's password:
localhost: starting namenode, logging to ///logs/hadoop-root-namenode-Tifa.out
root@localhost's password:
localhost: starting datanode, logging to ///logs/hadoop-root-datanode-Tifa.out

Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is ed:01:28:4d:70:8f:8f:1b:7f:91:e8:85:61:0a:a2:87.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
root@0.0.0.0's password:
0.0.0.0: starting secondarynamenode, logging to ///logs/hadoop-root-secondarynamenode-Tifa.out
14/07/07 01:25:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Starting the YARN services

$ sudo /sbin/start-yarn.sh

starting yarn daemons
starting resourcemanager, logging to //logs/yarn-root-resourcemanager-Tifa.out
root@localhost's password:
localhost: starting nodemanager, logging to //logs/yarn-root-nodemanager-Tifa.out
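
A quick way to check that everything came up is jps, the process lister that ships with the JDK; run as root (since we started the daemons as root) it should list a NameNode, a DataNode, a SecondaryNameNode, a ResourceManager and a NodeManager:

$ sudo jps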

That's it :). In the next post we will see some extra configuration, a troubleshooting FAQ, and then some basic usage of HDFS and YARN.
