
Install Hadoop on the Orange Pi
Install Java on the Orange Pi
Let’s put the yellow elephant on the Orange Pi! Hadoop is a framework for distributed data storage and processing used in big data applications. We will set up a single node Hadoop cluster on the Orange Pi. A single node is hardly a production setup, but it’s a good way to learn how to configure Hadoop, load a file onto the hdfs and run a MapReduce job.
The first thing to do is to install Oracle Java, because the version of the OpenJDK that the Orange Pi comes with is not compatible with Hadoop. Oracle Java also generally runs faster than OpenJDK on the Orange Pi: I first upgraded OpenJDK to a suitable version, then switched to Oracle Java and noticed a significant increase in performance.
To install Java, visit the Oracle JDK download page and select the Linux ARM 32 Hard Float ABI version, which is the one to use for the Orange Pi Plus 2e. Then unpack the archive:
sudo tar zxvf jdk-8u101-linux-arm32-vfp-hflt.tar.gz -C /opt
Next, run the following commands and select the newly installed Java version:
sudo update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_101/bin/javac 1
sudo update-alternatives --install /usr/bin/java java /opt/jdk1.8.0_101/bin/java 1
sudo update-alternatives --config javac
sudo update-alternatives --config java
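To verify that the Oracle JDK is now the active one, you can check the reported version (the exact build string will depend on the archive you downloaded):
java -version
javac -version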
With the correct Java set up you can now install Hadoop.
Install and configure Hadoop on Orange Pi
In this section I will follow the steps for a Raspberry Pi Hadoop installation, presented on Jonas Widriksson’s blog and on the ‘Because we can geek’ blog. First, I create an additional user dedicated to running the Hadoop jobs:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
Next, get the latest Hadoop version:
wget ftp://apache.belnet.be/mirrors/ftp.apache.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Unpack it and assign the newly created user as the owner of the hadoop folder:
sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
cd /opt
sudo chown -R hduser:hadoop hadoop-2.7.3/
Then switch to the newly created hduser:
su hduser
Next we need to add some Hadoop environment variables. Let’s do it by adding them to the end of the .bashrc file:
nano ~/.bashrc
Then add:
export JAVA_HOME=/opt/jdk1.8.0_101
export HADOOP_HOME=/opt/hadoop-2.7.3
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
To apply the changes run:
source ~/.bashrc
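You can quickly check that the new variables are visible in the shell, for example:
echo $HADOOP_HOME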
We also need to configure the hadoop-env.sh file and set the JAVA_HOME variable there:
cd $HADOOP_CONF_DIR
nano hadoop-env.sh
Here we change the JAVA_HOME variable to:
export JAVA_HOME=/opt/jdk1.8.0_101/
To test that everything went OK, you can type:
hadoop version
Next we need to edit some configuration files for Hadoop. The first one is:
nano core-site.xml
Insert the following between the configuration tags:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hdfs/tmp</value>
</property>
To set up a single node in pseudo-distributed mode, set the replication factor to 1 in hdfs-site.xml:
nano hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
To run a MapReduce job on YARN we need to configure the following files:
cp mapred-site.xml.template mapred-site.xml
nano mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Next we configure YARN. In this file you can adjust the CPU or RAM resources to suit your device:
nano yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>
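The properties above only constrain the CPU vcores. If you also want to bound the RAM that YARN hands out, properties along these lines can be added to the same file; the values below are only an illustration for a board with around 2 GB of memory, so tune them to your device:
<!-- illustrative memory limits, adjust to the RAM actually available -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1536</value>
</property>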
Let’s create the Hadoop distributed file system:
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
chmod 750 /hdfs/tmp
hdfs namenode -format
Next start the hdfs and yarn:
cd $HADOOP_HOME/sbin
start-dfs.sh
start-yarn.sh
To check that everything started properly, type in:
jps
You should see the NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager processes, plus the Jps tool itself.
Now let’s put a file on the distributed file system. I chose to upload the small text file from this Raspberry Pi Hadoop cluster post, so I can compare the execution times. Put the file in your home directory. Then:
cd
hdfs dfs -copyFromLocal smallfile.txt /smallfile.txt
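This copies the file onto the hdfs. To confirm it landed there, you can list the root of the file system:
hdfs dfs -ls /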
Now let’s run the classic word count example on this file:
hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /smallfile.txt /smallfile-result
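When the job finishes, the results are written to the /smallfile-result directory on the hdfs. You can inspect them with the command below; part-r-00000 is the usual name for a single reducer’s output, but check with hdfs dfs -ls /smallfile-result if yours differs:
hdfs dfs -cat /smallfile-result/part-r-00000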
You can check the status of your Hadoop jobs at http://localhost:8088/cluster
The execution time (mm:ss) for the small text file was 01:25, which is an improvement on the 02:17 on a single Raspberry Pi 1 and even on the 01:41 it took on a cluster of three Raspberry Pi 2 boards, as seen here. Furthermore, when running the word count on the Gutenberg books text file from ‘Because we can geek’, it took about 02:03 compared to 03:25. Overall the increase in performance makes sense, given the faster storage technology, RAM and CPU of the Orange Pi Plus 2e.