Apache Spark Interpreter

Apache Spark is an open source processing engine built around speed, ease of use, and sophisticated analytics. ZEPL provides several interpreters for Apache Spark.

A single Spark context is shared among the %spark, %spark.pyspark, %spark.sql, and %spark.r sessions. ZEPL currently runs a single-node Apache Spark 2.1.0 inside each notebook's container.
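
Because the same Spark session backs all of these interpreters, data registered in one paragraph is visible to the others. A minimal sketch (the view name nums is illustrative, not part of the ZEPL tutorial):

%spark
// create a small DataFrame in Scala and expose it as a temporary view
val nums = spark.range(1, 6).toDF("n")
nums.createOrReplaceTempView("nums")

%spark.sql
-- the shared session sees the view registered above
select n, n * n as n_squared from nums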

Load data from AWS S3

To create a dataset from AWS S3, it is recommended to use s3a. First, you need to configure your access key and secret key.

%spark
sc.hadoopConfiguration.set("fs.s3a.access.key", "[YOUR_ACCESS_KEY]")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "[YOUR_SECRET_KEY]")

Your Spark context will then be able to create a dataset from S3.

%spark
val data = spark.read.text("s3a://apache-zeppelin/tutorial/bank/bank.csv")

Alternatively, you can load data using s3n. In this case, the access key and secret key can be configured in the following way.

%spark
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "[YOUR_ACCESS_KEY]")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "[YOUR_SECRET_KEY]")

Your Spark context will then be able to access data from S3.

%spark
val data = sc.textFile("s3n://....")

Check the data loading example notebook.
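
Once the file is reachable, you can also read it as a DataFrame and register it for SQL. A minimal sketch, assuming the tutorial bank.csv is semicolon-delimited with a header row (adjust the options for your own data):

%spark
// read the CSV into a DataFrame instead of a plain text Dataset
val bank = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("inferSchema", "true")
  .csv("s3a://apache-zeppelin/tutorial/bank/bank.csv")
bank.printSchema()
bank.createOrReplaceTempView("bank")  // now queryable from %spark.sql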

Load dependencies

When your code requires an external library, %spark.dep helps load additional libraries from a Maven repository. The %spark.dep interpreter leverages the Scala environment, so you can write Scala expressions to call the dependency-loading APIs.

Note that %spark.dep should be the first interpreter run in the notebook, before %spark, %spark.pyspark, and %spark.sql. Otherwise, %spark.dep will print an error message and you will need to shut down and start the container for the notebook again.

Check the dependency loading example notebook.

Usage

%spark.dep
z.reset() // clean up previously added artifact and repository

// add maven repository
z.addRepo("RepoName").url("RepoURL")

// add maven snapshot repository
z.addRepo("RepoName").url("RepoURL").snapshot()

// add credentials for private maven repository
z.addRepo("RepoName").url("RepoURL").username("username").password("password")

// add artifact from filesystem
z.load("/path/to.jar")

// add artifact from maven repository, with no dependency
z.load("groupId:artifactId:version").excludeAll()

// add artifact recursively
z.load("groupId:artifactId:version")

// add artifact recursively except comma separated GroupID:ArtifactId list
z.load("groupId:artifactId:version").exclude("groupId:artifactId,groupId:artifactId, ...")

// exclude with pattern
z.load("groupId:artifactId:version").exclude(*)
z.load("groupId:artifactId:version").exclude("groupId:artifactId:*")
z.load("groupId:artifactId:version").exclude("groupId:*")

// local() skips adding artifact to spark clusters (skipping sc.addJar())
z.load("groupId:artifactId:version").local()

SparkR

Installed libraries: ggplot2, googleVis, knitr, data.table, devtools

%spark.r
library(googleVis)
bubble <- gvisBubbleChart(Fruits, idvar="Fruit",
                          xvar="Sales", yvar="Expenses",
                          colorvar="Year", sizevar="Profit",
                          options=list(
                            hAxis='{minValue:75, maxValue:125}'))
print(bubble, tag = 'chart')

%spark.r
plot(iris, col = heat.colors(3))

%spark.r
library(ggplot2)
pres_rating <- data.frame(
  rating = as.numeric(presidents),
  year = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents))
)
p <- ggplot(pres_rating, aes(x=year, y=quarter, fill=rating))
p + geom_raster()  # render the quarterly ratings as a filled raster