Local Apache Spark Quickstart
I wanted to get started with Apache Spark and SparkSQL on my local machine so that I could learn how to operate on large-scale geospatial data. This guide walks through the Docker configuration, a sample data set in CSV format, and a dead-simple script that we’ll submit to Spark via Docker.
Docker Configuration
This is a minimal Docker Compose configuration for running Spark on your local machine.
services:
  spark-master:
    image: bitnami/spark:3.5.0
    ports:
      - "8080:8080" # Spark UI
      - "7077:7077" # Spark Master port
    environment:
      - SPARK_MODE=master
    volumes:
      - ./app:/opt/bitnami/spark/app
  spark-worker-1:
    image: bitnami/spark:3.5.0
    depends_on:
      - spark-master
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    volumes:
      - ./app:/opt/bitnami/spark/app
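Assuming you save this as docker-compose.yml in your project root (that filename is my assumption; use whatever you point docker compose at), you can bring the cluster up and confirm it's alive:

docker compose up -d    # start spark-master and spark-worker-1 in the background
docker compose ps       # both containers should show as running

The Spark master UI should then be reachable at http://localhost:8080, thanks to the 8080:8080 port mapping above.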
Data Schema
I have a hilariously small data set for Apache Spark, 1671 rows, but this is the general shape. The data should live in an ./app/ directory inside your current working directory, which is the directory mounted into the Docker Compose services above.
API,FTP_LATITUDE,FTP_LONGITUDE,LTP_LATITUDE,LTP_LONGITUDE,SPUD_DATE,CUM_OIL_06_MONTH
4238932145,31.3672544,-103.2766876,31.3602855,-103.2783914,1998-03-12 00:00:00,382.0
4238932290,31.1431862,-103.3475596,31.1557462,-103.3462224,2004-01-06 00:00:00,1.0
4238932446,31.2874187,-103.4865769,31.2867779,-103.493138,2006-12-29 00:00:00,27486.0
4238932457,31.2749213,-103.3917072,31.2746103,-103.4038642,2007-03-04 00:00:00,2773.0
4238932486,31.3818528,-103.3206018,31.3751552,-103.3222759,2007-08-30 00:00:00,12009.0
4238932498,31.3796621,-103.3130481,31.3875774,-103.3110596,2007-12-20 00:00:00,14062.0
4238932504,31.4099656,-103.2628631,31.4126112,-103.2619259,2008-02-03 00:00:00,36509.0
4238932522,31.2887026,-103.7434503,31.2789202,-103.7478164,2008-03-26 00:00:00,11.0
4238932532,31.417923,-103.4118079,31.4267432,-103.4098413,2008-04-18 00:00:00,4676.0
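The Python script below lets Spark infer this schema, but if you'd rather not rely on inference, here's a minimal sketch of an explicit schema for these columns. This is my own addition, not something the script needs; the type choices just mirror what inferSchema ends up guessing.

# sketch: an explicit schema for data.csv; the script below uses
# inferSchema=True instead and arrives at the same types
from pyspark.sql.types import (
    StructType, StructField, LongType, DoubleType, TimestampType,
)

WELL_SCHEMA = StructType([
    StructField("API", LongType()),
    StructField("FTP_LATITUDE", DoubleType()),
    StructField("FTP_LONGITUDE", DoubleType()),
    StructField("LTP_LATITUDE", DoubleType()),
    StructField("LTP_LONGITUDE", DoubleType()),
    StructField("SPUD_DATE", TimestampType()),
    StructField("CUM_OIL_06_MONTH", DoubleType()),
])

# df = spark.read.csv("/opt/bitnami/spark/app/data.csv", header=True, schema=WELL_SCHEMA)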
Python Code
This script just prints out the number of rows in ./app/data.csv and the column names in the CSV. It should sit next to the data file in ./app/ so that the containers can “see” both the script and the data through the volume mount.
# filename: ./app/analysis_01.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Analysis 01").getOrCreate()

    # header=True uses the first CSV row as column names;
    # inferSchema=True makes Spark guess each column's type from the data
    df = spark.read.csv("/opt/bitnami/spark/app/data.csv", header=True, inferSchema=True)

    print("Number of rows:", df.count())
    print("Schema:")
    df.printSchema()

    spark.stop()
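The script sticks to the DataFrame API, but since SparkSQL is the goal, it's one line away. As a sketch of a next step you could drop in before spark.stop() (the wells view name is my own choice):

# sketch: register the DataFrame as a temp view and query it with SQL
df.createOrReplaceTempView("wells")
spark.sql("""
    SELECT API, SPUD_DATE, CUM_OIL_06_MONTH
    FROM wells
    ORDER BY CUM_OIL_06_MONTH DESC
    LIMIT 5
""").show()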
Send Code to Spark
To run this, we’ll send a command to the spark-master container using docker exec -it. I created a run.bash script to do this; it takes a single CLI argument, the container ID of the spark-master container. You can find that ID by running docker ps, then copy/paste it when running bash run.bash <docker-container-id>. Here are the contents of run.bash.
#!/usr/bin/bash
# filename: run.bash
docker exec -it "$1" spark-submit --master local[*] /opt/bitnami/spark/app/analysis_01.py
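One thing to note: --master local[*] runs the whole job inside the spark-master container and never touches the worker. If you want the worker to pick up the tasks instead, a sketch would swap the --master flag in run.bash for the master URL from the Compose file (the script path stays the same, since both containers mount ./app); I haven't tested this variant beyond the happy path.

docker exec -it "$1" spark-submit --master spark://spark-master:7077 /opt/bitnami/spark/app/analysis_01.py

The log output below is from the local[*] invocation.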
At the end of it all, you should see log statements from the Executor, the TaskScheduler, the DAGScheduler, and friends, then our output, and finally some chatter about shutting everything down.
25/01/27 21:53:30 INFO Executor: Finished task 0.0 in stage 5.0 (TID 4). 3995 bytes result sent to driver
25/01/27 21:53:30 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 4) in 41 ms on fc849b91cd54 (executor driver) (1/1)
25/01/27 21:53:30 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
25/01/27 21:53:30 INFO DAGScheduler: ResultStage 5 (count at NativeMethodAccessorImpl.java:0) finished in 0.051 s
25/01/27 21:53:30 INFO DAGScheduler: Job 4 is finished. Cancelling potential speculative or zombie tasks for this job
25/01/27 21:53:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 5: Stage finished
25/01/27 21:53:30 INFO DAGScheduler: Job 4 finished: count at NativeMethodAccessorImpl.java:0, took 0.057968 s
Number of rows: 1671
Schema:
root
|-- API: long (nullable = true)
|-- FTP_LATITUDE: double (nullable = true)
|-- FTP_LONGITUDE: double (nullable = true)
|-- LTP_LATITUDE: double (nullable = true)
|-- LTP_LONGITUDE: double (nullable = true)
|-- SPUD_DATE: timestamp (nullable = true)
|-- CUM_OIL_06_MONTH: double (nullable = true)
25/01/27 21:53:30 INFO SparkContext: SparkContext is stopping with exitCode 0.
25/01/27 21:53:30 INFO SparkUI: Stopped Spark web UI at http://fc849b91cd54:4040
25/01/27 21:53:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
25/01/27 21:53:30 INFO MemoryStore: MemoryStore cleared
25/01/27 21:53:30 INFO BlockManager: BlockManager stopped
25/01/27 21:53:30 INFO BlockManagerMaster: BlockManagerMaster stopped
25/01/27 21:53:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/01/27 21:53:30 INFO SparkContext: Successfully stopped SparkContext
25/01/27 21:53:30 INFO ShutdownHookManager: Shutdown hook called
25/01/27 21:53:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-294fb3d4-6f43-4ffb-962a-5226b2e88219
25/01/27 21:53:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-b18b2d8a-91c2-48df-ae39-aca16aec9da5
25/01/27 21:53:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-b18b2d8a-91c2-48df-ae39-aca16aec9da5/pyspark-a218847b-d988-4df2-8514-f635f2855f04
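When you're done poking around, you can tear the cluster back down:

docker compose down    # stop and remove spark-master and spark-worker-1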
So, that’s not a whole lot of computation, but it’s the Hello World for SparkSQL: real-world data, running locally with Docker Compose.