Taming Big Data with Spark Streaming and Scala – Getting Started

(We have discontinued our Facebook group due to abuse.)

Get the Course Materials

If you’d prefer to get all of the course materials at once instead of downloading them with individual lectures, you’ll find a zip package at:

http://media.sundog-soft.com/SparkStreaming/SparkStreamingFiles.zip

Installing Apache Spark and Scala

Windows: (keep scrolling for MacOS and Linux)

  1. Install a JDK (Java Development Kit) from http://www.oracle.com/technetwork/java/javase/downloads/index.html . Keep track of where you installed the JDK; you’ll need that later. DO NOT INSTALL THE LATEST RELEASE – INSTALL JAVA 8. Spark is not compatible with Java 9 or newer. And BE SURE TO INSTALL JAVA TO A PATH WITH NO SPACES IN IT. Don’t accept the default path that goes into “Program Files” on Windows, as that has a space.
  2. Download a pre-built version of Apache Spark 3.0.0 or 2.4.4 (depending on the version of the course you are taking – If your course title says “Spark 3” in it then you want Spark 3.0.0; otherwise stick with 2.4.4) from https://spark.apache.org/downloads.html
  3. If necessary, download and install WinRAR so you can extract the .tgz file you downloaded. http://www.rarlab.com/download.htm
  4. Extract the Spark archive, and copy its contents into C:\spark after creating that directory. You should end up with directories like c:\spark\bin, c:\spark\conf, etc.
  5. Download winutils.exe from https://sundogs3.amazonaws.com/winutils.exe and move it into a C:\winutils\bin folder that you’ve created. (note, this is a 64-bit application. If you are on a 32-bit version of Windows, you’ll need to search for a 32-bit build of winutils.exe for Hadoop.)
  6. Create a c:\tmp\hive directory, then cd into c:\winutils\bin and run winutils.exe chmod 777 c:\tmp\hive
  7. Open the c:\spark\conf folder, and make sure “File Name Extensions” is checked in the “View” tab of Windows Explorer. Rename the log4j.properties.template file to log4j.properties. Edit this file (using WordPad or something similar) and change the log level from INFO to ERROR for log4j.rootCategory.
  8. Right-click your Windows menu, select Control Panel, System and Security, and then System. Click on “Advanced System Settings” and then the “Environment Variables” button.
  9. Add the following new USER variables:
    1. SPARK_HOME c:\spark
    2. JAVA_HOME (the path you installed the JDK to in step 1, for example C:\JDK)
    3. HADOOP_HOME c:\winutils
  10. Add the following paths to your PATH user variable:

%SPARK_HOME%\bin

%JAVA_HOME%\bin
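If you prefer the command line, the same user variables can be set from an administrator command prompt with setx (a sketch; the paths assume the install locations from the steps above, and your JDK path may differ):

```
setx SPARK_HOME c:\spark
setx JAVA_HOME c:\jdk
setx HADOOP_HOME c:\winutils
```

Note that setx writes to the user environment, so open a new command prompt afterward for the changes to take effect. Extending PATH is best done through the GUI, since setx truncates values longer than 1024 characters.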

  11. Close the environment variable screen and the control panels.
  12. Install the latest Scala IDE from http://scala-ide.org/download/sdk.html
  13. Test it out!
    1. Open up a Windows command prompt in administrator mode.
    2. Enter cd c:\spark and then dir to get a directory listing.
    3. Look for a text file we can play with, like README.md or CHANGES.txt
    4. Enter spark-shell
    5. At this point you should have a scala> prompt. If not, double check the steps above.
    6. Enter val rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
    7. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
    8. Hit control-D to exit the spark shell, and close the console window
    9. You’ve got everything set up! Hooray!
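The spark-shell test above is just a distributed line count: sc.textFile splits the file into lines, and count() tallies them. As a plain-Scala sketch of the same idea (no Spark required; sample.md is a stand-in file the snippet creates itself):

```scala
import java.nio.file.{Files, Paths}
import scala.io.Source

// Write a small stand-in for README.md
Files.write(Paths.get("sample.md"), "line one\nline two\nline three\n".getBytes)

// What sc.textFile("sample.md").count() computes, in plain (non-distributed) Scala:
val lineCount = Source.fromFile("sample.md").getLines().size
println(lineCount)  // prints 3
```

The difference in spark-shell is that the file is split into partitions and the count runs across them in parallel, which is what makes the same one-liner scale to files far too big for one machine.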

MacOS

Step 1: Install Spark

Method A: By Hand

The best setup instructions for Spark “the hard way” on MacOS are at the following link:

https://medium.com/luckspark/installing-spark-2-3-0-on-macos-high-sierra-276a127b8b85

Spark 2.3.0 is no longer available, but the same process should work with Spark 2.4.4 or Spark 3. If your course title says Spark 3 in it, then you want Spark 3 – otherwise stick with Spark 2.

Method B: Using Homebrew

An alternative on MacOS is using a tool called Homebrew to install Java, Scala, and Spark – it’s easier, but first you need to install Homebrew itself. Make sure you end up with the correct version of Spark for the course you’re taking, though. Newer editions of the course use Spark 3.0.0.

Step by step instructions are at https://www.tutorialkart.com/apache-spark/how-to-install-spark-on-mac-os/

Step 2: Install the Scala IDE

Install the Scala IDE from http://scala-ide.org/download/sdk.html

Step 3: Test it out!

  1. cd to the directory apache-spark was installed to and then ls to get a directory listing.
  2. Look for a text file we can play with, like README.md or CHANGES.txt
  3. Enter spark-shell
  4. At this point you should have a scala> prompt. If not, double check the steps above.
  5. Enter val rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
  6. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
  7. Hit control-D to exit the spark shell, and close the console window
  8. You’ve got everything set up! Hooray!

Linux

  1. Install Java, Scala, and Spark according to the particulars of your specific OS. A good starting point is http://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm (but be sure to install Spark 3.0.0 or 2.4.4 depending on which version of the course you’re taking – if the course title doesn’t say Spark 3 in it, then you want Spark 2.)
  2. Install the Scala IDE from http://scala-ide.org/download/sdk.html
  3. Test it out!
    1. cd to the directory apache-spark was installed to and then ls to get a directory listing.
    2. Look for a text file we can play with, like README.md or CHANGES.txt
    3. Enter spark-shell
    4. At this point you should have a scala> prompt. If not, double check the steps above.
    5. Enter val rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
    6. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
    7. Hit control-D to exit the spark shell, and close the console window
    8. You’ve got everything set up! Hooray!

Optional: Join Our List

Join our low-frequency mailing list to stay informed on new courses and promotions from Sundog Education. As a thank you, we’ll send you a free course on Deep Learning and Neural Networks with Python, and discounts on all of Sundog Education’s other courses! Just click the button to get started.