Install Apache Hadoop/HBase on Ubuntu 20.04
This tutorial explains the steps to install Hadoop and HBase on an Ubuntu 20.04 (Focal Fossa) Linux server. HBase is an open source, distributed, non-relational database written in Java that runs on top of the Hadoop Distributed File System (HDFS). HBase lets you run huge clusters hosting very large tables with billions of rows and millions of columns on commodity hardware.
This installation guide is not intended for highly available production setups; it is meant for lab environments where you can develop and experiment. Our HBase installation will be done on a single-node Hadoop cluster. The server is an Ubuntu 20.04 virtual machine with the following specifications:
- 16 GB RAM
- 8 vCPUs
- 20 GB boot disk
- 100 GB raw disk for data storage
If your resources don't match these lab specifications, you can work with what you have and simply check whether the services start.
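To see what your machine actually provides, you can check memory, CPU count, and attached disks with standard Linux tools, for example:

free -h     # total and available memory
nproc       # number of CPU cores
lsblk       # attached disks and partitions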
For CentOS 7, please refer to How to install Apache Hadoop/HBase on CentOS 7.
Install Hadoop on Ubuntu 20.04
The first part covers the installation of a single-node Hadoop cluster on Ubuntu 20.04 LTS Server. Installing the Ubuntu 20.04 server itself is beyond the scope of this guide; consult your virtualization environment's documentation for that.
Step 1: Update the system
Update and selectively upgrade all packages installed on the Ubuntu system:
sudo apt update
sudo apt -y upgrade
sudo reboot
Step 2: Install Java on Ubuntu 20.04
If Java is not yet installed on Ubuntu 20.04, install it:
sudo apt update
sudo apt install default-jdk default-jre
After successfully installing Java on Ubuntu 20.04, confirm the version with the java command:
$ java -version
openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-3ubuntu1, mixed mode, sharing)
Set the JAVA_HOME variable:

cat <<EOF | sudo tee /etc/profile.d/hadoop_java.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=\$PATH:\$JAVA_HOME/bin
EOF
Update your $PATH and environment:
source /etc/profile.d/hadoop_java.sh
Then test:
$ echo $JAVA_HOME
/usr/lib/jvm/java-11-openjdk-amd64
Reference:
How to set the default Java version on Ubuntu/Debian
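If more than one Java version is installed and you need to change the system default, the standard update-alternatives tool can be used, for example:

sudo update-alternatives --config java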
Step 3: Create a user account for Hadoop
Let's create a separate user for Hadoop in order to maintain isolation between the Hadoop file system and the Unix file system.
sudo adduser hadoop
sudo usermod -aG sudo hadoop
After adding the user, generate an SSH key pair for it.
$ sudo su - hadoop
$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:mA1b0nzdKcwv/LPktvlA5R9LyNe9UWt+z1z0AjzySt4 hadoop@hbase
The key's randomart image is:
+---[RSA 2048]----+
|                 |
| o + . .         |
| o + . = o      o|
| O . o.o.o=      |
| + S . *ooB=     |
|        o *=.B   |
|       . . *+=   |
|      o o o.O+   |
|       o E.=o=   |
+----[SHA256]-----+
Add this user's key to the list of authorized SSH keys.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Verify that the added key can be used for ssh.
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:42Mx+I3isUOWTzFsuA0ikhNN+cJhxUYzttlZ879y+QI.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 20.04 LTS (GNU/Linux 5.4.0-28-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
$ exit
Step 4: Download and install Hadoop
Check the latest release of Hadoop before downloading the version specified here. At the time of writing, this is version 3.2.1.
Save the latest version to a variable.
RELEASE="3.2.1"
Then download the Hadoop archive file to the local system.
wget https://www-eu.apache.org/dist/hadoop/common/hadoop-$RELEASE/hadoop-$RELEASE.tar.gz
Extract the archive.
tar -xzvf hadoop-$RELEASE.tar.gz
Move the resulting directory to /usr/local/hadoop.
sudo mv hadoop-$RELEASE/ /usr/local/hadoop
sudo mkdir /usr/local/hadoop/logs
sudo chown -R hadoop:hadoop /usr/local/hadoop
Set HADOOP_HOME and add the directory containing the Hadoop binaries to your $PATH:

cat <<EOF | sudo tee -a /etc/profile.d/hadoop_java.sh
export HADOOP_HOME=/usr/local/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
Source the file.
source /etc/profile.d/hadoop_java.sh
Confirm your Hadoop version.
$ hadoop version
Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar
Step 5: Configure Hadoop
All Hadoop configuration files are located in the /usr/local/hadoop/etc/hadoop/ directory.
Many configuration files need to be modified to complete the installation of Hadoop on Ubuntu 20.04.
First, set JAVA_HOME in the hadoop-env.sh shell script:

$ sudo vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# Set JAVA_HOME - Line 54
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
Then configure:
1. core-site.xml
The core-site.xml file contains Hadoop cluster information used at startup. These properties include:
- Port number used by the Hadoop instance
- Memory allocated for the file system
- Memory limits for data storage
- The size of the read/write buffer.
Open core-site.xml:
sudo vim /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following properties between the <configuration> and </configuration> tags:

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
   <description>The default file system URI</description>
</property>
2. hdfs-site.xml
This file needs to be configured for each host to be used in the cluster. The file contains the following information:
- The namenode and datanode paths on the local file system.
- The replication factor for the data
In this setup, I want to store the Hadoop infrastructure on a secondary disk, /dev/sdb.
$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0 76.3G  0 disk
└─sda1   8:1    0 76.3G  0 part /
sdb      8:16   0  100G  0 disk
sr0     11:0    1 1024M  0 rom
I will partition this disk and mount it on the /hadoop directory.
sudo parted -s -- /dev/sdb mklabel gpt
sudo parted -s -a optimal -- /dev/sdb mkpart primary 0% 100%
sudo parted -s -- /dev/sdb align-check optimal 1
sudo mkfs.xfs /dev/sdb1
sudo mkdir /hadoop
echo "/dev/sdb1 /hadoop xfs defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a
Confirm:

$ df -hT | grep /dev/sdb1
/dev/sdb1      xfs    50G   84M  100G   1% /hadoop
Create the namenode and datanode directories.
sudo mkdir -p /hadoop/hdfs/{namenode,datanode}
Set ownership to hadoop users and groups.
sudo chown -R hadoop:hadoop /hadoop
Now open the file:
sudo vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Then add the following properties between the <configuration> and </configuration> tags:

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>
<property>
   <name>dfs.name.dir</name>
   <value>file:///hadoop/hdfs/namenode</value>
</property>
<property>
   <name>dfs.data.dir</name>
   <value>file:///hadoop/hdfs/datanode</value>
</property>
3. mapred-site.xml
Set the MapReduce framework to be used here.
sudo vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
The settings are as follows:

<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
4. yarn-site.xml
The settings in this file override the configuration of Hadoop YARN. It defines resource management and job scheduling logic.
sudo vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
Add:

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
Step 6: Verify the Hadoop configuration
Initialize Hadoop infrastructure storage.
sudo su - hadoop
hdfs namenode -format
Test HDFS configuration.
$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [hbase]
hbase: Warning: Permanently added 'hbase' (ECDSA) to the list of known hosts.
Finally verify the YARN configuration:
$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
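As an optional check, you can confirm the Hadoop daemons are running with the jps tool that ships with the JDK; on this single-node setup you would expect to see roughly the following processes (PIDs will vary):

$ jps
# Expected: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager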
Hadoop 3.x provides the following web UI ports:
- Name node – The default HTTP port is 9870.
- Resource manager – The default HTTP port is 8088.
- MapReduce JobHistory server – The default HTTP port is 19888.
You can check the port used by hadoop with the following command:
$ ss -tunelp
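You can also do a quick smoke test of the web interfaces from the command line with curl; a successful request returns HTML from the NameNode and ResourceManager UIs:

curl -s http://localhost:9870 | head
curl -s http://localhost:8088 | head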
Access the Hadoop web dashboard at http://ServerIP:9870.
View the Hadoop cluster overview at http://ServerIP:8088.
Test to see if you can create a directory.
$ hadoop fs -mkdir /test
$ hadoop fs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-05-29 15:41 /test
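As an optional extra sanity check, copy a local file into the new HDFS directory and read it back; /etc/hosts is just a convenient small file to use:

hadoop fs -put /etc/hosts /test/
hadoop fs -cat /test/hosts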
Stop the Hadoop services
Use the following commands:
$ stop-dfs.sh
$ stop-yarn.sh
Install HBase on Ubuntu 20.04
You can choose to install HBase in standalone mode or pseudo-distributed mode. The setup process is similar to our Hadoop installation.
Step 1: Download and install HBase
Check the latest release or the stable version before you download. For production use, I recommend the stable version.
VER="2.2.4"
wget http://apache.mirror.gtcomm.net/hbase/stable/hbase-$VER-bin.tar.gz
Extract the downloaded HBase archive.
tar xvf hbase-$VER-bin.tar.gz
sudo mv hbase-$VER/ /usr/local/HBase/
Set HBASE_HOME and update the $PATH value:

cat <<EOF | sudo tee -a /etc/profile.d/hadoop_java.sh
export HBASE_HOME=/usr/local/HBase
export PATH=\$PATH:\$HBASE_HOME/bin
EOF
Update your shell environment values.
$ source /etc/profile.d/hadoop_java.sh
$ echo $HBASE_HOME
/usr/local/HBase
Set JAVA_HOME in the hbase-env.sh shell script:

$ sudo vim /usr/local/HBase/conf/hbase-env.sh

# Set JAVA_HOME - Line 27
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
Step 2: Configure HBase
We will configure it just as we did Hadoop. All HBase configuration files are located in the /usr/local/HBase/conf/ directory.
hbase-site.xml
Set the data directory to an appropriate location in this file.
Option 1: Install HBase in standalone mode (not recommended)
In standalone mode, all daemons (HMaster, HRegionServer, and ZooKeeper) run in a single JVM process/instance.
Create HBase root directory.
sudo mkdir -p /hadoop/HBase/HFiles
sudo mkdir -p /hadoop/zookeeper
sudo chown -R hadoop:hadoop /hadoop/
Open the file for editing.
sudo vim /usr/local/HBase/conf/hbase-site.xml
Now add the following configuration between the <configuration> and </configuration> tags:

<property>
   <name>hbase.rootdir</name>
   <value>file:/hadoop/HBase/HFiles</value>
</property>
<property>
   <name>hbase.zookeeper.property.dataDir</name>
   <value>/hadoop/zookeeper</value>
</property>
By default, unless you configure the hbase.rootdir property, your data is stored in /tmp/.
Now start HBase using the start-hbase.sh script in the HBase bin directory.
$ sudo su - hadoop
$ start-hbase.sh
running master, logging to /usr/local/HBase/logs/hbase-hadoop-master-hbase.out
Option 2: Install HBase in pseudo-distributed mode (recommended)
The hbase.rootdir value set earlier will start HBase in standalone mode. Pseudo-distributed mode means that HBase still runs entirely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process.
To install HBase in pseudo-distributed mode, set the properties as follows (hbase.rootdir must point at the HDFS URI configured as fs.default.name in core-site.xml):
<property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
   <name>hbase.zookeeper.property.dataDir</name>
   <value>/hadoop/zookeeper</value>
</property>
<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>
In this setup, the data is stored in HDFS.
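Because HBase now writes to HDFS, the Hadoop services must be running before you start HBase. If you stopped them earlier, start them again with the same commands used in the Hadoop section:

$ start-dfs.sh
$ start-yarn.sh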
Make sure the ZooKeeper directory is created.
sudo mkdir -p /hadoop/zookeeper
sudo chown -R hadoop:hadoop /hadoop/
Now start HBase using the start-hbase.sh script in the HBase bin directory.
$ sudo su - hadoop
$ start-hbase.sh
localhost: running zookeeper, logging to /usr/local/HBase/bin/../logs/hbase-hadoop-zookeeper-hbase.out
running master, logging to /usr/local/HBase/logs/hbase-hadoop-master-hbase.out
: running regionserver, logging to /usr/local/HBase/logs/hbase-hadoop-regionserver-hbase.out
Check the HBase directory in HDFS:
$ hadoop fs -ls /hbase
Found 9 items
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:19 /hbase/.tmp
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:19 /hbase/MasterProcWALs
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:19 /hbase/WALs
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:17 /hbase/corrupt
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:16 /hbase/data
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:16 /hbase/hbase
-rw-r--r--   1 hadoop supergroup         42 2019-04-07 09:16 /hbase/hbase.id
-rw-r--r--   1 hadoop supergroup          7 2019-04-07 09:16 /hbase/hbase.version
drwxr-xr-x   - hadoop supergroup          0 2019-04-07 09:17 /hbase/oldWALs
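You can also open the HBase Master web UI, which listens on port 16010 by default:

http://ServerIP:16010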
Step 3: Manage HMaster and HRegionServer
The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which gives a total of 10 HMasters including the primary.
HRegionServer manages the data in its StoreFiles according to the instructions of HMaster. Typically, each node in the cluster runs an HRegionServer. Running multiple HRegionServers on the same system is very useful for testing in pseudo-distributed mode.
You can start and stop backup masters and RegionServers with the local-master-backup.sh and local-regionservers.sh scripts, respectively.
$ local-master-backup.sh start 2    # Start backup HMaster
$ local-regionservers.sh start 3    # Start multiple RegionServers
- Each HMaster uses two ports (16000 and 16010 by default). The port offset is added to these ports, so with an offset of 2, the backup HMaster uses ports 16002 and 16012.
The following command uses ports 16002/16012, 16003/16013, and 16005/16015 to start three backup servers.
$ local-master-backup.sh start 2 3 5
- Each RegionServer requires two ports; the defaults are 16020 and 16030.
The following command starts four additional RegionServers, which run on sequential ports starting from 16022/16032 (base port 16020/16030 plus 2).
$ local-regionservers.sh start 2 3 4 5
To stop a server, replace the start parameter with stop in each command, followed by the offset of the server to stop. Example:
$ local-regionservers.sh stop 5
Start HBase Shell
Before the HBase shell can be used, Hadoop and HBase should already be running. Here is the correct sequence for starting the services:
$ start-all.sh
$ start-hbase.sh
Then use the HBase shell.
hadoop@hbase:~$ hbase shell
2019-04-07 10:44:43,821 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/HBase/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.4.9, rd625b212e46d01cb17db9ac2e9e927fdb201afa1, Wed Dec  5 11:54:10 PST 2018
hbase(main):001:0>
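From the shell prompt you can run a quick end-to-end test with standard HBase shell commands; the table name test_table and column family cf below are arbitrary example names:

create 'test_table', 'cf'
put 'test_table', 'row1', 'cf:greeting', 'Hello HBase'
scan 'test_table'
get 'test_table', 'row1'
disable 'test_table'
drop 'test_table'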
Stop HBase.
stop-hbase.sh
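When shutting down the whole stack, stop HBase before the underlying Hadoop services, since HBase depends on HDFS:

stop-hbase.sh
stop-yarn.sh
stop-dfs.sh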
You have successfully installed Hadoop and HBase on Ubuntu 20.04.
Recommended reading:
- Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale
- Hadoop Explained
- Hadoop Application Architectures
- HBase: The Definitive Guide: Random Access to Your Planet-Size Data
- Big Data: Principles and Best Practices of Scalable Realtime Data Systems
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems