Install PySpark from scratch and integrate with Jupyter

Why this note?

Recently, I had to do some data analysis for a personal project, and the data size (compressed) is about 40 GB. What does that mean?

  • I used to do analysis tasks like this at work, but this time I cannot use the company’s infrastructure (Hadoop, YARN, Spark, …) since this is a personal matter.
  • I have a Jupyter notebook on my poor local machine; however, the data is not small enough to run on this little guy, which usually has only about 100 MB of memory free.
  • I wanted to use PySpark DataFrames instead of Pandas DataFrames because I’m more familiar with PySpark’s API and documentation.


Assumptions

  • You have a fresh Ubuntu 16.04 (Xenial Xerus) server: To be clear, it doesn’t have to be fresh, and it doesn’t have to be Ubuntu 16.04. But having a clean machine with a known version/distribution will make your life easier when installing and debugging. Believe me, messing with an existing server full of packages at different versions will drive you crazy and can easily steal a couple of days from you.
  • The server has a public IP address: Note that if you intend to run everything locally, this is unnecessary; a local IP or even localhost would be more than enough.
  • The server’s firewall must open at least two ports, 8888 (Jupyter) and 18080 (Spark History Server): Similar to the above, if this is a local environment, you can completely ignore this.
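On a self-managed box, opening those two ports might look like the following ufw rules. This is just a sketch (on AWS EC2 you would open the ports in the instance’s Security Group instead), and the snippet only prints the commands so running it changes nothing:

```shell
# Ports used later in this guide (assuming the defaults are kept).
jupyter_port=8888        # Jupyter notebook server
history_port=18080       # Spark History Server

# With ufw, the rules would be (printed here rather than executed):
echo "sudo ufw allow ${jupyter_port}/tcp"
echo "sudo ufw allow ${history_port}/tcp"
```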


Swap Space

For this tutorial I used an AWS EC2 Free Tier instance, which has only 1 GB of memory, not enough to handle a heavy system like Spark. Besides, having swap space to back up RAM is a good safety net whenever memory fills up for any reason.

In this case I used only 2 GB of SSD for swap space (out of the 30 GB of free EC2 storage). Feel free to change it if you want more.

sudo dd if=/dev/zero of=/var/swapfile bs=1M count=2048
sudo chmod 600 /var/swapfile
sudo mkswap /var/swapfile
echo /var/swapfile none swap defaults 0 0 | sudo tee -a /etc/fstab
sudo swapon -a
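As a quick sanity check: dd above writes count × bs bytes, so the swap file should come out at exactly 2 GiB, and after swapon -a it should show up in the swap listing:

```shell
# dd wrote 2048 blocks of 1 MiB each, so the swapfile should be exactly:
expected_bytes=$((2048 * 1024 * 1024))
echo "expected /var/swapfile size: ${expected_bytes} bytes"

# After 'sudo swapon -a', either of these should list /var/swapfile:
#   swapon --show
#   free -h
```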

Prerequisite packages

There are some utility packages we’ll need later, so let’s install them now:

  • unzip: for uncompressing archives
  • python-pip: Python and the Python package manager (2.7.12)
  • default-jdk: JDK (8)

sudo apt-get update
sudo apt-get install -y unzip python-pip default-jdk


SBT

Spark uses sbt (alongside Maven) as its package-management and build tool.

echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt

sbt about # to verify


Build Spark from source

We’re going to build Spark from source. Well, it’s not as scary as it sounds: the code base is mostly Scala (JVM), so building from source is far easier and more stable than building a C code base (which relies heavily on the operating system and distribution).

The build process (sbt package) will take some time, about 30-45 minutes. So feel free to grab a coffee and relax instead of staring at the console.

git clone https://github.com/apache/spark.git ~/spark
cd ~/spark
git checkout v2.1.1
sbt package # this guy takes time

After the binaries and packages are built, let’s add them to our PATH. Remember to reload your shell configuration afterwards.

tee -a ~/.bash_aliases <<EOF
export PATH=\$PATH:/home/ubuntu/spark/sbin:/home/ubuntu/spark/bin
export SPARK_HOME=/home/ubuntu/spark/
EOF
source ~/.bashrc

pyspark --version # to verify
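If pyspark is not found at this point, the usual culprit is the PATH export above. This little check (assuming the /home/ubuntu/spark location used throughout) confirms the Spark directories are visible:

```shell
# Assumes Spark was cloned to /home/ubuntu/spark as in the steps above.
SPARK_HOME=/home/ubuntu/spark
PATH="$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin"

# The bin directory should now appear on PATH:
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark bin on PATH" ;;
  *) echo "spark bin MISSING from PATH" ;;
esac
```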

Spark History Server

This server needs a place to keep event logs. By default the directory is /tmp/spark-events, so let’s create it and start the server.

mkdir -p /tmp/spark-events
bash ~/spark/sbin/start-history-server.sh

You can now access the Spark History Server at port 18080 of your server and see the (still empty) application list.


Jupyter

It’s time to install Jupyter, our super-powered analysis tool.

sudo python -m pip install --upgrade pip
sudo python -m pip install jupyter

jupyter notebook list # to verify

And we need to connect it to PySpark. In order not to mess up the existing Spark/PySpark setup (something I have seen plenty of times), we create a separate shell script to start Jupyter and use it as the “contact point” to launch the Jupyter server later.

# the script name (~/start_jupyter.sh) is arbitrary; pick any path you like
tee ~/start_jupyter.sh <<EOF
export PYSPARK_DRIVER_PYTHON="jupyter"
# --ip=0.0.0.0 makes the notebook listen on all interfaces, so it is reachable remotely
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --no-browser"
pyspark --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=file:/tmp/spark-events "\$@"
EOF
chmod +x ~/start_jupyter.sh
From now on, all you need to do to start Jupyter is run this script, and you will be able to access it at port 8888 of your server. But how can you verify that your Jupyter is connected to Spark? Run sc.applicationId in a notebook cell; it prints the ID of the running Spark application.

Also, look back at your Spark History Server: the application ID should match perfectly.
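You can also check the match without a browser: the History Server exposes a REST API at /api/v1/applications that returns JSON application records. The sample below is made up for illustration; with the server running you would fetch the real thing with curl:

```shell
# Hypothetical sample of what the History Server returns; with the server up:
#   curl -s http://localhost:18080/api/v1/applications
sample='[{"id":"local-1623456789012","name":"PySparkShell"}]'

# Extract the application id with plain sed, no extra tools required:
app_id=$(echo "$sample" | sed 's/.*"id":"\([^"]*\)".*/\1/')
echo "$app_id"   # local-1623456789012
```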


Nginx reverse proxy (optional)

This is absolutely optional, but it would be cool to have a domain or subdomain for Jupyter and the Spark History Server instead of an IP address. In this case I use Nginx as a reverse proxy, assuming you have configured your DNS so that two subdomains point to your server.

So first thing first, we need to install Nginx:

sudo apt-get install -y nginx

Nginx configuration for Jupyter

upstream notebook {
  server localhost:8888;
}

server {
  listen 80;
  server_name jupyter.example.com;  # replace with your own subdomain

  location / {
    proxy_pass            http://notebook;
    proxy_set_header      Host $host;
  }

  location ~ /api/kernels/ {
    proxy_pass            http://notebook;
    proxy_set_header      Host $host;
    # websocket support
    proxy_http_version    1.1;
    proxy_set_header      Upgrade "websocket";
    proxy_set_header      Connection "Upgrade";
    proxy_read_timeout    86400;
  }

  location ~ /terminals/ {
    proxy_pass            http://notebook;
    proxy_set_header      Host $host;
    # websocket support
    proxy_http_version    1.1;
    proxy_set_header      Upgrade "websocket";
    proxy_set_header      Connection "Upgrade";
    proxy_read_timeout    86400;
  }
}

Nginx configuration for Spark History Server

server {
  listen 80;
  server_name history.example.com;  # replace with your own subdomain

  location / {
    proxy_pass            http://localhost:18080;
  }
}