Spark on Mac

Introduction to Apache Spark

Spark is getting its due attention as the lightning fast distributed computing engine. It is an improvement over the Hadoop Map Reduce. There are significant improvements in Spark that makes it superior for performing analytics on big data.

  1. In-memory storage and computation for iterative algorithms
  2. Supports object-oriented and functional programming paradigms with scala, python and java
  3. interactive shell for quick test and deployment as binaries
  4. Works with Yarn / Mesos – Resource Managers next-generation Hadoop


Install Java

$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

Install brew

ruby -e "$(curl -fsSL" 

Install Hadoop

$ brew install hadoop

Brew installs it in the below folder. Add this folder to your bash_profile path

$ vi ~/.bash_profile
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.6.0
$ source ~/.bash_profile
  • Edit Configuration files
/usr/local/Cellar/hadoop/2.6.0$ cd libexec/etc/hadoop/
vi ~/core-site.xml















vi yarn-site.xml







vi mapred-site.xml







vi hdfs-site,xml







Install Apache Maven

$ brew install maven

$ vi ~/.bash_profile
export MAVEN_HOME=/usr/local/Cellar/maven/3.2.5

Install Apache Spark

download the latest tarball for Spark  and unzip in /usr/local

$ tar xvf ~/Downloads/spark-1.3.1.tar.gz /usr/local
/user/local/spark-1.3.1 $ mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

Once the spark installation is complete and successful, add the following

$ vi ~/.bash_profile
export SPARK_HOME=/usr/local/spark-1.3.1
$ source ~/.bash_profile

Resolving Errors

Error1: Detected Maven Version: 3.2.5 is not in the allowed range 3.3.3.
Solution: Upgrade to higher version of maven

$brew update
$brew install maven

Error2: [error] missing or invalid dependency detected while loading class file ‘WebUI.class’.
Solution: Switch to Scala 2.11 version in the spark directory

$ cd /usr/local/spark-1.5.2
$ ./dev/ 2.11

Data Exploration

  • Start hadoop
$ hadoop dfsadmin -safemode leave
$ hadoop fs -mkdir -p /user/<username>/datasets/

Download the restaurant_ratings

Open the file in Excel and save the data as restaurant_ratings.csv

Copy the data in hadoop that you want to parse.

$ hadoop fs -copyFromLocal ./restaurant_ratings.csv /user/<username>/datasets/
  • Start yarn
  • Start spark interactive shell
$ spark-shell --master yarn-client

scala >

// load data
 val ratings = sc.textFile("/user/<username>/datasets/restaurant_ratings.csv")
// sample 10 records
// remove header
 def isHeader(line: String) = {if (line.contains("userID")) false else true}
// apply the remove header filter
 val rhRatings = ratings.filter(isHeader)
// define a template for each class
 case class restRating (userId: String, placeId: String, rating: Int)
// parse the data
 def parsed (line: String): restRating = {
 val words = line.split(",")
 val userId = words(0)
 val placeId = words(1)
 val rating = words(2).toInt
 restRating(userId, placeId, rating)
// apply the parsed function to the RDD[String] to convert into RDD[restRatings]
 val parsedRecords =
// extract just the ratings
 val ratings = => rr.rating)
// cache into memory
// sample data
 val samples = parsedRecords.take(20)
// pick two columns
 val parsedRDD = parsedRecords.take(20).map(rr => (rr.userId, rr.rating))
import org.apache.spark.SparkContext._
// group by user Id _1 indicates column 1
 val grouped = parsedRDD.groupBy(rr => (rr._1))
// output results
 grouped.mapValues(x => x.size).foreach(println)
 // Summary Statistics
// Count by value
 val matchedCounts = => md.rating).countByValue()
// scala.collection.Map[Int,Long] = Map(0 -> 254, 2 -> 486, 1 -> 421)
// Convert them to sequence so we can sort them by column 1,2...N
val matchedCountSeq = matchedCounts.toSeq
// Sort by column 1
// sort by column 2
// descending order sorting
// statistics => md.rating).stats()

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s