Install Hadoop on the Orange Pi

Let’s put the yellow elephant on the Orange Pi! Hadoop is a framework for distributed storage and processing of large data sets, widely used in big data applications. We will set up a single-node Hadoop cluster on the Orange Pi. This may or may not be practical in a production environment, but it’s a good way to learn how to configure Hadoop, load a file onto the HDFS and run a MapReduce job.

Install Java on Orange Pi

The first thing to do is to install Oracle Java, because the OpenJDK version the Orange Pi ships with is not compatible with Hadoop. Oracle Java also generally runs faster on the Orange Pi than OpenJDK: at first I upgraded OpenJDK to a suitable version, then installed Oracle Java and noticed a significant increase in performance.

To install Java, visit the Oracle JDK download page and select the Linux ARM 32 Hard Float ABI version, which is the one to use for the Orange Pi Plus 2e. Following the steps from there, unpack the archive:

sudo tar zxvf jdk-8u101-linux-arm32-vfp-hflt.tar.gz -C /opt

Next, run the following commands and select the newly installed Java version:

sudo update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_101/bin/javac 1
sudo update-alternatives --install /usr/bin/java java /opt/jdk1.8.0_101/bin/java 1
sudo update-alternatives --config javac
sudo update-alternatives --config java
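
To confirm that the Oracle build is now the active one, run:

java -version

The output should report version 1.8.0_101.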

With the correct Java set up you can now install Hadoop.

Install and configure Hadoop on Orange Pi

In this section I follow the steps for a Raspberry Pi Hadoop installation, presented on Jonas Widriksson’s blog and on the ‘Because we can geek’ blog. First, I create an additional user dedicated to running the Hadoop jobs:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

Next, get the latest Hadoop version:

wget ftp://apache.belnet.be/mirrors/ftp.apache.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
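
Mirror links come and go; if the one above is no longer available, the same release is kept in the Apache archive:

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz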

Unpack it and assign the newly created user as the owner of the hadoop folder:

sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
cd /opt
sudo chown -R hduser:hadoop hadoop-2.7.3/

Then log in as the new user:

su hduser
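
Hadoop’s start-up scripts connect over SSH even on a single node, so hduser needs passwordless SSH access to localhost. A minimal setup, assuming an SSH server is already running on the board, looks like this:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost

The last command should log you in without asking for a password.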

Next we need to add some Hadoop environment variables. Let’s do it by adding them to the end of the .bashrc file:

nano ~/.bashrc

Then add:

export JAVA_HOME=/opt/jdk1.8.0_101
export HADOOP_HOME=/opt/hadoop-2.7.3
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin

To apply the changes run:

source ~/.bashrc

We also need to edit the hadoop-env.sh file and set the JAVA_HOME variable there as well.

cd $HADOOP_CONF_DIR 
nano hadoop-env.sh

Here we change the JAVA_HOME variable to:

export JAVA_HOME=/opt/jdk1.8.0_101/

To test that everything went ok you can type:

hadoop version
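
If the environment variables are set up correctly, the output should start with the version line:

Hadoop 2.7.3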

Next we need to edit some of Hadoop’s configuration files. The first one is core-site.xml:

nano core-site.xml

Insert the following between the configuration tags:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hdfs/tmp</value>
</property>
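
For reference, after this change the file should look roughly like this (the <configuration> tags are already present in the stock file):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
</configuration>

Note that fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, but it still works in 2.7.3.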

For a single node in pseudo-distributed mode, set the replication factor to 1 in hdfs-site.xml, since there is only one DataNode to hold the blocks:

nano hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

To run MapReduce jobs on YARN, we create mapred-site.xml from its template and set the framework name:

cp mapred-site.xml.template mapred-site.xml
nano mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Next we configure YARN in yarn-site.xml. Here you can adjust the CPU and RAM resources to suit your device:

nano yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>
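
The properties above only cover the CPU. If containers later get killed for exceeding memory limits, you can cap RAM in the same file. As a rough sketch for the 2 GB of the Plus 2e (these values are assumptions to tune for your board):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1536</value>
</property>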

Let’s create the directory backing the Hadoop distributed file system and format the namenode:

sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
chmod 750 /hdfs/tmp
hdfs namenode -format
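
The format command prints a lot of log output; near the end you should see a confirmation along the lines of:

INFO common.Storage: Storage directory /hdfs/tmp/dfs/name has been successfully formatted.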

Next, start HDFS and YARN:

cd $HADOOP_HOME/sbin
start-dfs.sh
start-yarn.sh

To check that everything started properly, type in:

jps

You should see the NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and Jps processes in the output.

Now let’s put a file on the distributed file system. I chose to upload the small text file from this Raspberry Pi Hadoop cluster post, so I can compare the execution times. Put the file in your home directory, then:

cd
hdfs dfs -copyFromLocal smallfile.txt /smallfile.txt
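
You can confirm the upload by listing the root of the HDFS:

hdfs dfs -ls /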

With the file on the HDFS, let’s run the classic word count example on it:

hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /smallfile.txt /smallfile-result
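
When the job finishes, the results are written to the /smallfile-result directory on the HDFS. You can print the word counts with:

hdfs dfs -cat /smallfile-result/part-r-00000

Note that Hadoop refuses to run a job if its output directory already exists; remove it with hdfs dfs -rm -r /smallfile-result before re-running.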

You can check the status of your Hadoop jobs in the YARN web UI at http://localhost:8088/cluster. The HDFS NameNode also has a web UI, at http://localhost:50070.

The execution time (mm:ss) for the small text file was 01:25, an improvement over the 02:17 on a single Raspberry Pi 1 and even over the 01:41 it took on a cluster of three Raspberry Pi 2 boards, as seen here. Furthermore, running the word count on the Gutenberg books text file from ‘Because we can geek’ took about 02:03, compared to 03:25. Overall the increase in performance makes sense, given the faster storage, RAM and CPU of the Orange Pi Plus 2e.
