Hadoop Environment Setup
Hadoop installation on Linux
This tutorial demonstrates how to install Hadoop in a Linux environment. Running Hadoop on a Mac is possible, but Hadoop officially supports only GNU/Linux and Windows. Linux is also preferred for its configuration flexibility and existing documentation.
If you are using something other than Linux, we recommend installing a Linux VM to follow this tutorial. If you don't want to use a VM, understand that the following configuration steps are similar for other OS implementations.
Adding a user
It's best practice to create a separate user for your Hadoop instance. This keeps Hadoop's files and processes separate from those of other users on the machine. Run the following commands in the Linux terminal:
$ sudo su
# useradd hadoop
# passwd hadoop
As the root user, you can add a user with the useradd <username> command. You can then set that user's password with the passwd <username> command.
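If you rerun the setup, useradd fails when the user already exists. The following is a minimal sketch of a guard around that step. The function names user_exists and ensure_user are hypothetical, and the -m flag (create a home directory) is an assumption about what you want; on some distributions useradd does not create a home directory without it.

```shell
# Return success if the named account exists on this machine.
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

# Create the account only when it is missing (must be run as root).
# -m asks useradd to create the home directory; this is an assumption,
# drop it if your distribution already does so by default.
ensure_user() {
  new_user="$1"
  if user_exists "$new_user"; then
    echo "user $new_user already exists"
  else
    useradd -m "$new_user" && echo "created user $new_user"
  fi
}
```

As root, you would call ensure_user hadoop and then run passwd hadoop as before.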
Configuring SSH
Hadoop requires SSH to perform operations on a cluster of shared server nodes. The Hadoop user needs password-less login capabilities for accessing these nodes.
For these reasons, you must generate a public/private key pair that is then shared across the cluster.
To generate the key:
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
This generates an SSH public/private key pair, appends the public key to the authorized_keys file, and restricts that file's permissions so only its owner can read and write it.
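Running the cat command above more than once appends the same key again. As a rough sketch, the append and chmod steps could be wrapped so they are safe to repeat. The function name add_authorized_key is hypothetical, and the default paths assume the key locations used above.

```shell
# Append a public key to authorized_keys only if it is not already there,
# then restrict the file to its owner. Paths default to the ones used above.
add_authorized_key() {
  pub_key="${1:-$HOME/.ssh/id_rsa.pub}"
  auth_file="${2:-$HOME/.ssh/authorized_keys}"
  key_line=$(cat "$pub_key")
  touch "$auth_file"
  # -x matches the whole line, -F treats the key as a fixed string.
  if ! grep -qxF -e "$key_line" "$auth_file"; then
    printf '%s\n' "$key_line" >> "$auth_file"
  fi
  chmod 0600 "$auth_file"
}
```

After calling add_authorized_key, running ssh localhost should log you in without a password prompt if the setup worked.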
Downloading Java
Java is required for Hadoop. You'll want to find the latest stable version of the Java JDK and install it on your machine.
1. Download Java
To download the latest JDK, visit Oracle. Find the latest stable JDK available, download it, and extract the archive.
2. Move JDK to /usr/local
After extracting, you'll want to move the jdk folder to /usr/local. This makes Java available to all users on the system.
3. Update Path
Finally, you'll want to update your PATH variable. To update this through the command line, run the following:
$ echo 'export JAVA_HOME=/usr/local/jdk1.7.0_71' >> ~/.bashrc
$ echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
This creates a JAVA_HOME environment variable and appends its bin directory to PATH, so the java command can be found from any directory. Replace jdk1.7.0_71 with the name of the JDK folder you extracted.
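Because >> always appends, repeating the echo commands adds duplicate export lines to ~/.bashrc. One possible way to avoid that is sketched below. The helper name append_once is hypothetical and not part of any standard tool.

```shell
# Append a line to a file only if that exact line is not already present.
# The target file defaults to the current user's ~/.bashrc.
append_once() {
  new_line="$1"
  target_file="${2:-$HOME/.bashrc}"
  touch "$target_file"
  grep -qxF -e "$new_line" "$target_file" || printf '%s\n' "$new_line" >> "$target_file"
}
```

For example, append_once 'export PATH=$PATH:$JAVA_HOME/bin' can be run any number of times and leaves a single copy of the line in ~/.bashrc.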
4. Apply changes
To apply the changes, run:
$ source ~/.bashrc
5. Confirm Installation
To make sure everything is working, run java -version. You should see output similar to this:
$ java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) Client VM (build 25.144-b01, mixed mode)
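If java -version fails or reports an unexpected version, a common cause is a JAVA_HOME that points at the wrong folder. The sketch below checks that the variable is set and that it contains an executable java binary. The function name check_java_home is hypothetical.

```shell
# Verify that JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable java found under $JAVA_HOME/bin"
    return 1
  fi
  echo "JAVA_HOME looks valid: $JAVA_HOME"
}
```

Run check_java_home in a new terminal after sourcing ~/.bashrc; a non-zero exit status means the export line needs correcting.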
Download Hadoop
You will follow a similar process for downloading Hadoop. The steps below use wget to easily retrieve the latest tar.gz file.
1. Navigate to /usr/local
$ cd /usr/local
Like Java, you want to install Hadoop in the /usr/local directory. You can now download and extract Hadoop from this directory.
2. Download Hadoop
Run wget to download the latest stable version of Hadoop.
# wget http://apache.claz.org/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
# tar xzf hadoop-2.7.4.tar.gz
# mkdir hadoop
# mv hadoop-2.7.4/* hadoop/
This retrieves Hadoop v2.7.4 and extracts it with tar. Notice how you've also moved everything into a separate hadoop directory for easy reference.
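The download and extract steps can also be parameterized by version so the same commands work for a different release. This is only a sketch: hadoop_url and extract_hadoop are hypothetical helper names, the URL pattern is assumed from the single mirror link above, and --strip-components is a GNU tar option that drops the top-level hadoop-<version> folder while extracting.

```shell
# Build the download URL for a given version, following the pattern
# of the mirror link used above (assumed to hold for other versions).
hadoop_url() {
  hadoop_version="$1"
  echo "http://apache.claz.org/hadoop/common/hadoop-${hadoop_version}/hadoop-${hadoop_version}.tar.gz"
}

# Extract a Hadoop tarball straight into the target directory,
# removing the leading hadoop-<version>/ path component.
extract_hadoop() {
  tarball="$1"
  target_dir="$2"
  mkdir -p "$target_dir"
  tar xzf "$tarball" -C "$target_dir" --strip-components=1
}
```

With these, wget "$(hadoop_url 2.7.4)" followed by extract_hadoop hadoop-2.7.4.tar.gz /usr/local/hadoop would leave the files directly under /usr/local/hadoop.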
3. Update Path
Update your ~/.bashrc file with the Hadoop environment variables using the same technique:
$ echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc
4. Apply changes
To apply the changes, run:
$ source ~/.bashrc
5. Confirm Installation
To make sure everything is working correctly, run hadoop version and you should see something like this:
Hadoop 2.7.4
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1523454
Compiled by hortonmu on 2017-08-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum
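If you want to confirm the version in a script rather than by eye, the first line of the output above can be parsed. The sketch below assumes the first line has the form "Hadoop <version>" as shown; parse_hadoop_version and expect_hadoop_version are hypothetical names.

```shell
# Read `hadoop version` output on stdin and print the version number
# taken from the first line, which is expected to look like "Hadoop 2.7.4".
parse_hadoop_version() {
  awk 'NR == 1 && $1 == "Hadoop" { print $2 }'
}

# Compare the parsed version against the one you intended to install.
expect_hadoop_version() {
  expected_version="$1"
  actual_version=$(parse_hadoop_version)
  if [ "$actual_version" = "$expected_version" ]; then
    echo "ok: Hadoop $actual_version"
  else
    echo "mismatch: expected $expected_version, got '$actual_version'"
    return 1
  fi
}
```

A typical call would be hadoop version | expect_hadoop_version 2.7.4.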
Conclusion
Your Linux environment is now configured for Hadoop. Next, we'll look at configuring Hadoop and HDFS.