Job Dependencies
Overview
With the platform release 23.4.1 (and all previous releases), dynamic provisioning of dependencies using the Spark packages field doesn’t work. This is a known problem with Spark and is tracked here. |
The Stackable Spark-on-Kubernetes operator enables users to run Apache Spark workloads in a Kubernetes cluster easily by eliminating the requirement of having a local Spark installation. For this purpose, Stackable provides ready-made Docker images with recent versions of Apache Spark and Python (for PySpark jobs) that provide the basis for running those workloads. Users of the Stackable Spark-on-Kubernetes operator can run their workloads on any recent Kubernetes cluster by applying a SparkApplication custom resource in which the job code, job dependencies, and input and output data locations can be specified. The Stackable operator translates the user’s SparkApplication manifest into a Kubernetes Job object and hands control to the Apache Spark scheduler for Kubernetes, which constructs the necessary driver and executor Pods.
When the job is finished, the Pods are terminated and the Kubernetes Job is completed.
The base images provided by Stackable contain only a minimal set of components needed to run Spark workloads. This is done mostly for performance and compatibility reasons. Many Spark workloads build on top of third-party libraries and frameworks and thus depend on additional packages that are not included in the Stackable images. This guide explains how users can provision their Spark jobs with additional dependencies.
Dependency provisioning
There are multiple ways to submit Apache Spark jobs with external dependencies. Each has its own advantages and disadvantages, and the choice of one over the others depends on existing technical and managerial constraints.
To provision job dependencies in their workloads, users have to construct their SparkApplication with one of the following dependency specifications:
- Hardened or encapsulated job images
- Dependency volumes
- Spark native package coordinates and Python requirements
The following table provides a high-level overview of the relevant aspects of each method.
Dependency specification | Job image size | Reproducibility | Dev-op cost |
---|---|---|---|
Encapsulated job images | Large | Guaranteed | Medium to High |
Dependency volumes | Small | Guaranteed | Small to Medium |
Spark and Python packages | Small | Not guaranteed | Small |
Hardened or encapsulated job images
With this method, users submit a SparkApplication for which the sparkImage refers to a Docker image containing Apache Spark itself, the job code, and all dependencies required by the job. It is recommended that users base their image on one of the Stackable images to ensure compatibility with the Stackable operator.
Since all packages required to run the Spark job are bundled in the image, such images tend to be very large, but reproducibility between submissions is guaranteed.
Example:
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  sparkImage: (1)
    productVersion: 3.5.1
  mode: cluster
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: /stackable/spark/examples/jars/spark-examples.jar (2)
  executor:
    instances: 3
1 | Name of the encapsulated image. |
2 | Name of the Spark job to run. |
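If the encapsulated image has been pushed to your own registry, the sparkImage block can reference it explicitly. The following sketch is illustrative only: the custom field and the image name my-registry.example/spark-with-deps:3.5.1 are assumptions, so check the SparkApplication CRD of your operator release for the exact image specification it supports. The rest of the manifest stays the same as in the example above.
spec:
  sparkImage:
    # Field name and image name below are illustrative placeholders.
    custom: my-registry.example/spark-with-deps:3.5.1
    productVersion: 3.5.1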
Dependency volumes
With this method, the user provisions the job dependencies from a PersistentVolume as shown in this example:
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-pvc
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny-tlc-report-1.0-SNAPSHOT.jar (1)
  mainClass: org.example.App (2)
  args:
    - "'s3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'"
  sparkConf: (3)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (4)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (5)
1 | Job artifact located on S3. |
2 | Job main class. |
3 | Spark dependencies: the credentials provider (the user knows what is relevant here) plus dependencies needed to access external resources (in this case on S3, accessed without credentials). |
4 | The name of the volume mount backed by a PersistentVolumeClaim that must be pre-existing. |
5 | The path on the volume mount: this is referenced in the sparkConf section where the extra class path is defined for the driver and executors. |
The Spark operator has no control over the contents of the dependency volume. It is the responsibility of the user to make sure all required dependencies are installed in the correct versions. |
A PersistentVolumeClaim and the associated PersistentVolume can be defined like this:
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-ksv (1)
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  hostPath:
    path: /some-host-location
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-ksv (2)
spec:
  volumeName: pv-ksv (1)
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: aws-deps
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: job-deps (3)
          persistentVolumeClaim:
            claimName: pvc-ksv (2)
      containers:
        - name: aws-deps
          # An image and a command that download the required dependency jars
          # into the mounted volume need to be specified here.
          volumeMounts:
            - name: job-deps (4)
              mountPath: /stackable/spark/dependencies
1 | Reference to a PersistentVolume, defining some cluster-reachable storage. |
2 | The name of the PersistentVolumeClaim that references the PV. |
3 | Defines a Volume backed by the PVC, local to the Custom Resource. |
4 | Defines the VolumeMount that is used by the Custom Resource. |
Spark native package coordinates and Python requirements
The last and most flexible way to provision dependencies is to use the built-in spark-submit support for Maven package coordinates.
The snippet below showcases how to add Apache Iceberg support to a Spark (version 3.4.x) application.
spec:
  sparkConf:
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type: hive
    spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.type: hadoop
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  deps:
    packages:
      - org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
Currently it’s not possible to provision dependencies that are loaded by the JVM’s [system class loader](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ClassLoader.html#getSystemClassLoader()). Such dependencies include JDBC drivers. If you need access to JDBC sources from your Spark application, consider building your own custom Spark image. |
Spark version 3.3.x has a known bug that prevents this mechanism from working. |
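Depending on the operator release, the deps section may also support additional fields such as repositories (extra Maven repositories to resolve packages from) and excludePackages (coordinates excluded from transitive resolution). The sketch below is an assumption and uses placeholder values; verify the exact field names against the SparkApplication CRD of your installed version.
spec:
  deps:
    packages:
      - org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1
    repositories:  # assumed field: extra Maven repositories searched during resolution
      - https://repo.example.com/maven2  # placeholder repository URL
    excludePackages:  # assumed field: group:artifact coordinates to exclude
      - org.slf4j:slf4j-api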
When submitting PySpark jobs, users can specify pip requirements that are installed before the driver and executor pods are created.
Here is an example:
---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: example-sparkapp-external-dependencies
  namespace: default
spec:
  sparkImage:
    productVersion: 3.5.1
  mode: cluster
  mainApplicationFile: s3a://stackable-spark-k8s-jars/jobs/ny_tlc_report.py (1)
  args:
    - "--input 's3a://nyc-tlc/trip data/yellow_tripdata_2021-07.csv'" (2)
  deps:
    requirements:
      - tabulate==0.8.9 (3)
  sparkConf: (4)
    "spark.hadoop.fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    "spark.driver.extraClassPath": "/dependencies/jars/*"
    "spark.executor.extraClassPath": "/dependencies/jars/*"
  volumes:
    - name: job-deps (5)
      persistentVolumeClaim:
        claimName: pvc-ksv
  driver:
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (6)
  executor:
    replicas: 3
    config:
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies (6)
Note the requirements section. Also note that in this case, a sparkImage that bundles Python has to be provisioned.
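Multiple requirements can be listed, and since they are handed to pip, standard pip requirement specifiers should work. A minimal sketch, with placeholder package names and versions:
spec:
  deps:
    requirements:
      - tabulate==0.8.9   # exact version pin
      - numpy>=1.24,<2    # placeholder: version range specifier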