Prerequisites for Installing Apache Spark on Ubuntu
Before diving into the installation process, make sure your system meets the necessary prerequisites. Apache Spark requires a Java runtime; Python and Scala are optional and only needed if you use the corresponding APIs. Here is what to prepare on your Ubuntu system:
- Java: Install a Java Development Kit (JDK), version 8 or later. Recent Spark 3.x releases support Java 8, 11, and 17.
- Python: Not mandatory, but Python 3 (3.8 or later for recent Spark releases) enables PySpark, the Python API for Spark.
- Scala: If you plan to write applications in Scala, install a version matching your Spark build; Spark 3.x is built against Scala 2.12 or 2.13.
Installing Java
To install Java on Ubuntu, you can use the following commands:
sudo apt update
sudo apt install default-jdk
Verify the installation by checking the Java version:
java -version
Installing Python and Scala
Python 3 usually comes pre-installed on Ubuntu. To check whether it's installed and determine the version, run:
python3 --version
Note that modern Ubuntu releases no longer ship a bare python command by default; use python3.
For Scala, you can install it using:
sudo apt install scala
Again, verify the installation with:
scala -version
Downloading Apache Spark
The next step is to download the latest version of Apache Spark from the official website. You can do this either through the browser or via the command line using wget.
Using wget to Download Spark
First, navigate to the Apache Spark downloads page (https://spark.apache.org/downloads.html) to find the link to the latest Spark release tarball. Then, use wget to download it directly to your server:
wget [URL_OF_SPARK_TARBALL]
Replace [URL_OF_SPARK_TARBALL] with the actual URL you obtained from the downloads page.
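As a concrete sketch (the version number below is illustrative; always copy the current URL from the downloads page), you can also verify the download against its published SHA-512 checksum:

```shell
# Hypothetical release version -- substitute the one listed on the
# downloads page before running.
SPARK_VERSION=3.5.1
TARBALL="spark-${SPARK_VERSION}-bin-hadoop3.tgz"

# Download the tarball from the Apache CDN
wget "https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/${TARBALL}"

# Fetch the published checksum and verify the download against it
wget "https://downloads.apache.org/spark/spark-${SPARK_VERSION}/${TARBALL}.sha512"
sha512sum -c "${TARBALL}.sha512"
```

If the checksum does not match, discard the file and download it again.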
Installing Apache Spark
Once you have downloaded the tarball, follow these steps to install Apache Spark on your Ubuntu system.
Extracting the Tarball
Use the tar command to extract the Spark package:
tar xvf [SPARK_TARBALL_NAME].tgz
Make sure to replace [SPARK_TARBALL_NAME] with the name of the file you downloaded.
Moving Spark to a Permanent Location
It’s a good practice to move the extracted Spark directory to a more permanent location. The /opt directory is commonly used for such purposes:
sudo mv [SPARK_DIRECTORY_NAME] /opt/spark
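A variant worth considering (directory names below are illustrative) is to keep the versioned directory and point a stable symlink at it, which simplifies later upgrades:

```shell
# Move the versioned directory into /opt, then point a stable
# symlink at it; upgrading later only means repointing the symlink.
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark-3.5.1
sudo ln -sfn /opt/spark-3.5.1 /opt/spark
```

With this layout, SPARK_HOME can stay /opt/spark across upgrades.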
Configuring Environment Variables
To run Spark conveniently from any location, you need to set up some environment variables by editing your .bashrc or .profile file.
Editing .bashrc or .profile
Open your .bashrc file in a text editor:
nano ~/.bashrc
Add the following lines at the end of the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Save and exit the editor, then source the .bashrc file to apply the changes:
source ~/.bashrc
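A quick way to confirm the variables took effect in the current shell (assuming Spark lives in /opt/spark):

```shell
# SPARK_HOME should now resolve to the install directory
echo "$SPARK_HOME"            # expected: /opt/spark

# spark-submit is on PATH via $SPARK_HOME/bin; printing its
# version banner confirms the layout is correct
spark-submit --version
```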
Starting Apache Spark
With Apache Spark installed and environment variables configured, you can now start the Spark master and worker processes.
Starting the Master Server
To start the Spark master server, use the following command:
start-master.sh
This starts the master daemon and writes its output to a log file under $SPARK_HOME/logs. The master web UI is available at http://localhost:8080 by default; it also displays the spark:// URL that workers use to connect.
Starting Worker Nodes
You can connect one or more workers to the master by running:
start-worker.sh spark://[MASTER_HOST]:7077
Replace [MASTER_HOST] with the hostname or IP address of the machine where the master is running (the full spark:// URL is shown on the master web UI). Note that on Spark versions before 3.1, this script is named start-slave.sh.
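When you're finished, Spark ships matching stop scripts in $SPARK_HOME/sbin. A minimal single-machine session might look like this (the master host below is illustrative):

```shell
# Bring up a local standalone cluster
start-master.sh
start-worker.sh spark://localhost:7077

# ...run your jobs...

# Shut the daemons down again
stop-worker.sh
stop-master.sh
```

On a multi-node cluster, sbin/stop-all.sh stops every configured daemon at once.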
Running a Spark Shell
Apache Spark provides interactive shells for exploring data and testing code: one for Scala and one for Python (PySpark).
Starting the Scala Shell
To start the Scala shell, simply run:
spark-shell
Starting the PySpark Shell
For the Python shell, run:
pyspark
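Beyond the interactive shells, a quick end-to-end smoke test is the bundled SparkPi example, which Spark's run-example helper submits for you:

```shell
# Runs the SparkPi example locally with 10 partitions; look for a
# line like "Pi is roughly 3.14..." amid the log output
$SPARK_HOME/bin/run-example SparkPi 10
```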
Troubleshooting Common Installation Issues
Sometimes, you might encounter issues during the installation process. Here are some common problems and their solutions:
- Java Not Found: Ensure JAVA_HOME is set correctly and points to the JDK installation directory.
- Permission Denied: Make sure you have the necessary permissions to move files to /opt or other directories.
- Master/Worker Not Starting: Check the logs located in $SPARK_HOME/logs for any error messages.
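For the first item, one common way to set JAVA_HOME (assuming the default-jdk package installed earlier) is to resolve it from the java binary and export it in ~/.bashrc:

```shell
# Resolve the real JDK directory behind the /usr/bin/java symlink
export JAVA_HOME=$(dirname "$(dirname "$(readlink -f "$(which java)")")")
echo "$JAVA_HOME"   # e.g. /usr/lib/jvm/java-17-openjdk-amd64
```

The exact path printed depends on which JDK package and version is installed.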
Frequently Asked Questions
Do I need Hadoop to run Apache Spark?
No, Apache Spark does not require Hadoop; it can run in standalone mode. However, if you want to process data stored in HDFS, YARN, or other Hadoop-supported systems, you’ll need to have Hadoop installed.
Can I install Apache Spark on a virtual machine?
Yes, you can install Apache Spark on a virtual machine as long as the VM runs a supported operating system like Ubuntu.
How do I update Apache Spark to a newer version?
To update Apache Spark, you would download the new version’s tarball and repeat the installation process. Don’t forget to update the SPARK_HOME variable to point to the new directory.
Is it possible to run Spark on multiple nodes?
Yes, Spark can be configured to run on a cluster of machines for distributed data processing. This involves setting up a master node and multiple worker nodes.
What should I do if I encounter “Out of Memory” errors?
If you encounter “Out of Memory” errors, consider increasing the memory allocation for Spark or optimizing your Spark job to use resources more efficiently.
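As a sketch of the first remedy, driver and executor memory can be raised per job with spark-submit flags (the values and the app.py script name are illustrative):

```shell
# Allocate 4 GiB to the driver and to each executor for this job
spark-submit \
  --driver-memory 4g \
  --executor-memory 4g \
  app.py
```

The same settings can be made permanent via spark.driver.memory and spark.executor.memory in $SPARK_HOME/conf/spark-defaults.conf.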
References
- Apache Spark Official Documentation: https://spark.apache.org/docs/latest/
- Download Apache Spark: https://spark.apache.org/downloads.html
- Ubuntu Documentation: https://ubuntu.com/server/docs
- Oracle Java Documentation: https://docs.oracle.com/en/java/
- Python Official Website: https://www.python.org/
- Scala Official Website: https://www.scala-lang.org/