The materials below are for students of our Taming Big Data with Apache Spark and Python online course. If you’re not enrolled and somehow found your way here, please do the right thing and consider enrolling.

Course Materials

On Udemy, you’ll find the materials attached to each lecture as resources. If you’d like to get them all at once, you can grab it all from https://s3.amazonaws.com/media.sundog-soft.com/Udemy/SparkCourse.zip

A copy of the slides is at https://s3.amazonaws.com/media.sundog-soft.com/Udemy/SparkSlides.zip – please do not redistribute these; they are for your personal reference only.

Installing Apache Spark and Python

Windows: (keep scrolling for MacOS and Linux)

Install a JDK (Java Development Kit) from http://www.oracle.com/technetwork/java/javase/downloads/index.htm l . SPARK 4 IS ONLY COMPATIBLE WITH JAVA 17 or 21 at this time.
Download a pre-built version of Apache Spark 4 from https://spark.apache.org/downloads.htm l
If necessary, download and install WinRAR so you can extract the .tgz file you downloaded. http://www.rarlab.com/download.ht m
Extract the Spark archive, and copy its contents into C:\spark after creating that directory. You should end up with directories like c:\spark\bin, c:\spark\conf, etc.
Open the the c:\spark\conf folder, and make sure “File Name Extensions” is checked in the “view” tab of Windows Explorer. Rename the log4j2.properties.template file to log4j2.properties. Edit this file (using Wordpad or something similar) and change the error level from “info” to “error” for log4j.rootCategory
Download hadoop.zip from https://s3.amazonaws.com/media.sundog-soft.com/Udemy/hadoop.zip and unzip it to c:\hadoop (you should end up with c:\hadoop\bin directory containing winutils.exe and hadoop.dll)
Right-click your Windows menu, select Control Panel, System and Security, and then System. Click on “Advanced System Settings” and then the “Environment Variables” button.
Add the following new USER variables:
1. SPARK_HOME c:\spark
2. PYSPARK_PYTHON python
3. HADOOP_HOME c:\hadoop
Add the following path to your PATH user variable:

%SPARK_HOME%\bin

Close the environment variable screen and the control panels.
Install the latest Anaconda for Python 3 from anaconda.com. If you already use some other Python environment, that’s OK – you can use it instead, as long as it is a Python 3 environment.
Test it out!
1. Open up your Start menu and select “Anaconda Prompt” from the Anaconda3 menu.
2. Currently, Spark is NOT compatible with Python 3.12 or newer! Create a Python 3.10 environment for use with this course:
  conda create -n py310 python=3.10
  conda activate py310
  pip install py4j
  conda install pandas
  conda install pyarrow
  Remember to run “conda activate py310” whenever working with this course.
3. Enter cd c:\spark and then dir to get a directory listing.
4. Look for a text file we can play with, like README.md or CHANGES.txt
5. Enter pyspark
6. At this point you should have a >>> prompt. If not, double check the steps above.
7. Enter rdd = sc.textFile(“README.md”) (or whatever text file you’ve found) Enter rdd.count()
8. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
9. Enter quit() to exit the spark shell, and close the console window
10. You’ve got everything set up! Hooray!

MacOS

Step 1: Install Apache Spark

Using Homebrew

Install Homebrew if you don’t have it already by entering this from a terminal prompt:
/usr/bin/ruby -e “$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)”
Enter:
brew install openjdk@17
brew install scala
brew install apache-spark
Create a log4j.properties file via
1. cd /opt/homebrew/Cellar/apache-spark/3.5.2/libexec/conf (substitute 3.5.2 for the version actually installed – the path may be slightly different on your system.)
2. cp log4j2.properties.template log4j2.properties
Edit the log4j2.properties file and change the log level from “info” to “error” on log4j.rootCategory.

Step 2: Install Anaconda

Install the latest Anaconda for Python 3 from anaconda.com, if you don’t already have Python installed.

Step 3: Test it out!

Open up a terminal
Currently, Spark is NOT compatible with Python 3.12 or newer! Create a Python 3.10 environment for use with this course:
conda create -n py310 python=3.10
conda activate py310
pip install py4j
conda install pandas
conda install pyarrow
cd into the directory where you installed Spark, and then ls to get a directory listing.
Look for a text file we can play with, like README.md or CHANGES.txt
Enter pyspark
At this point you should have a >>> prompt. If not, double check the steps above.
Enter rdd = sc.textFile(“README.md”) (or whatever text file you’ve found) Enter rdd.count()
You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
Enter quit() to exit the spark shell, and close the terminal window
You’ve got everything set up! Hooray!

Linux

Install Java, Scala, and Spark according to the particulars of your specific OS. For example, on Ubuntu, you might do the following:
– Install Java with sudo apt install openjdk-21-jdk -y
– Download the latest Spark distribution from https://spark.apache.org/, and extract it with the tar -xvzf command.
– Move it somewhere permanent, like under /opt/spark
– Add the necessary environment variables to the end of your ~/.bashrc file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_PYTHON=python3
– source ~/.bashrc to apply those new environment variables
Install the latest Anaconda for Python 3 from anaconda.com
Test it out!
1. Open up a terminal
2. Currently, Spark is NOT compatible with Python 3.12 or newer! Create a Python 3.10 environment for use with this course:
  conda create -n py310 python=3.10
  conda activate py310
  pip install py4j
  conda install pandas
  conda install pyarrow
3. cd into the directory you installed Spark, and do an ls to see what’s in there.
4. Look for a text file we can play with, like README.md or CHANGES.txt
5. Enter pyspark
6. At this point you should have a >>> prompt. If not, double check the steps above.
7. Enter rdd = sc.textFile(“README.md”) (or whatever text file you’ve found) Enter rdd.count()
8. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
9. Enter quit() to exit the spark shell, and close the console window
10. You’ve got everything set up! Hooray!

Optional: Join Our List

Join our low-frequency mailing list to stay informed on new courses and promotions from Sundog Education. As a thank you, we’ll send you a free course on Deep Learning and Neural Networks with Python, and discounts on all of Sundog Education’s other courses! Just click the button to get started.

Taming Big Data with Apache Spark and Python – Getting Started