How to install Apache Spark cluster computing framework on Debian 10

Apache Spark is a free and open source cluster computing framework for analysis, machine learning and graph processing of large amounts of data. Spark comes with over 80 advanced operators that let you build parallel applications and use it interactively from Scala, Python, R, and SQL Shell. It is a lightning-fast in-memory data processing engine designed for data science. It provides a rich feature set, including speed, fault tolerance, real-time stream processing, in-memory computing, advanced analytics, and more.

In this tutorial, we will show you how to install Apache Spark on a Debian 10 server.

Prerequisites

  • A server running Debian 10 with 2 GB of RAM.
  • A root password is configured on your server.

Getting Started

Before you begin, it is recommended to update the server's packages to their latest versions. You can do so with the following commands:

apt-get update -y
apt-get upgrade -y

After updating the server, reboot it to apply the changes.
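
You can trigger the reboot with:

reboot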

Install Java

Apache Spark is written in Scala and runs on the Java Virtual Machine (JVM), so you will need to install Java on your system. By default, a recent version of Java is available in the default Debian 10 repository. You can install it with the following command:

apt-get install default-jdk -y

After installing Java, use the following command to verify the installed version of Java:

java --version

You should get the following output:

openjdk 11.0.5 2019-10-15
OpenJDK Runtime Environment (build 11.0.5+10-post-Debian-1deb10u1)
OpenJDK 64-Bit Server VM (build 11.0.5+10-post-Debian-1deb10u1, mixed mode, sharing)
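
Spark will use the java binary it finds on your PATH, so no further Java configuration is strictly required. If you keep several JDKs installed, you can optionally point Spark at a specific one through JAVA_HOME. A minimal sketch, assuming the default-jdk package installed above (which provides the /usr/lib/jvm/default-java symlink on Debian):

echo "export JAVA_HOME=/usr/lib/jvm/default-java" >> ~/.bashrc
source ~/.bashrc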

Download Apache Spark

First, you need to download the latest version of Apache Spark from its official website. At the time of writing, the latest available release is 3.0.0-preview2. You can download it to the /opt directory using the following commands:

cd /opt
wget http://apachemirror.wuchna.com/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
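
Optionally, before extracting, you can verify the integrity of the download by computing its SHA-512 checksum and comparing it with the value published on the Apache download page:

sha512sum spark-3.0.0-preview2-bin-hadoop2.7.tgz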

After the download is complete, extract the downloaded file using the following command:

tar -xvzf spark-3.0.0-preview2-bin-hadoop2.7.tgz

Next, rename the extracted directory to spark as follows:

mv spark-3.0.0-preview2-bin-hadoop2.7 spark

Next, you will need to set up the environment for Spark. You can do this by editing the ~/.bashrc file:

nano ~/.bashrc

Add the following lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file when you are finished. Then, activate the environment using the following command:

source ~/.bashrc
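
To confirm that the environment is set up correctly, you can print the variable and query the Spark version; spark-submit is one of the scripts that the PATH entry above makes available:

echo $SPARK_HOME
spark-submit --version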

Start master server

You can now start the master server with:

start-master.sh

You should get the following output:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-debian10.out

By default, the Spark master's web UI listens on port 8080. You can use the following command to verify:

netstat -ant | grep 8080

Output:

tcp6       0      0 :::8080                 :::*                    LISTEN
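
If port 8080 is already occupied by another service, the master's web UI can be moved to a different port using the --webui-port option of the standalone master script (stop the running master with stop-master.sh first), for example:

start-master.sh --webui-port 8081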

Now, open a web browser and enter the URL http://server-ip-address:8080. You should see the Spark master's web UI.

Make a note of the “Spark URL” shown there (spark://debian10:7077); it will be used to start the Spark worker process.

Start Spark Worker process

You can now start the Spark worker process using:

start-slave.sh spark://debian10:7077

You should get the following output:

starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-debian10.out
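
To verify that both daemons are running, you can list the JVM processes with the jps tool that ships with the JDK; a Master and a Worker entry should appear (the process IDs will vary):

jps

Output:

4545 Master
4702 Worker
4810 Jps

The worker should also show up under the “Workers” section of the master's web UI. If you want to cap the resources the worker offers to the cluster, start-slave.sh also accepts --cores and --memory options, for example start-slave.sh spark://debian10:7077 --cores 1 --memory 512M.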

Access Spark Shell

Spark Shell is an interactive environment that provides an easy way to learn APIs and analyze data interactively. You can access the Spark Shell using the following command:

spark-shell

You should see the following output:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.0-preview2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/12/29 15:53:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://debian10:4040
Spark context available as 'sc' (master = local[*], app id = local-1577634806690).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.5)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
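
As a quick sanity check, you can run a small computation from the prompt. The following parallelizes the integers 1 to 1000 with the pre-created SparkContext sc and sums them:

scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0

Type :quit to leave the shell.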

From here, you can start exploring Spark's APIs interactively and get a feel for the framework.

If you want to stop the Spark worker and master processes, run the following commands:

stop-slave.sh
stop-master.sh

That’s it, you have successfully installed Apache Spark on your Debian 10 server. For more information, see the official Spark documentation at https://spark.apache.org/documentation.html.
