dgroomes/spark-playground

spark-playground

📚 Learning and exploring Apache Spark.

Standalone subprojects

This repository illustrates different concepts, patterns and examples via standalone subprojects. Each subproject is completely independent of the others and does not depend on the root project. This standalone constraint forces each subproject to be complete and maximizes the reader's chances of successfully running, understanding, and re-using the code.

The subprojects include:

hello-world/

Get up and running with Spark in an interactive way by using the Spark SQL CLI and Spark shell.

See the README in hello-world/.

commandline/

Use Spark in a way optimized for ad hoc commandline data-wrangling: less logging verbosity and a smaller file footprint.

See the README in commandline/.

Wish List

General clean-ups, TODOs and things I wish to implement for this project:

  • DONE hello world-style example
    • Let's start with the basics: Spark shell?
    • I already forgot why I had decided to use sbt instead of Gradle.
  • DONE Pare down interactive/ to just spark-sql and spark-shell, and move the external table concept into its own project. interactive/ will become hello-world/ and the new project will be something like light-weight.
    • DONE Logging config.
  • Iceberg example (docker? or the Iceberg Java test impl?)
  • Distributed example? Docker?
  • Make some high-level notes about de-coupling from Hadoop, etc.
  • Hive example. Hive is an important component in the general Spark culture. The official Hive Docker example should be useful here. I was able to build Hive from source, but sadly it requires Java 8. That's a sign that we need to move on from it a bit and cordon it off into a Docker container.
  • [commandline/] Explore https://openjdk.org/jeps/483 for improved startup time in the commandline/ project.
  • [commandline/] Consider ejecting from the builtin spark-sql and spark-shell runners and making my own. The printing of startup messages like "Spark Web UI ..." makes it impossible to capture the output of the command. I'm curious how much of the core can be re-used and how much wrapper machinery gets in the way.
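
Until a custom runner exists, one possible stopgap for the output-capture problem in the last item is to post-filter the command's stdout. The sketch below is not part of this repository and is not the custom runner described above. The names `NOISE_MARKERS`, `strip_startup_noise` and `run_and_capture` are hypothetical, and it assumes the startup chatter can be recognized by a substring match on "Spark Web UI" (the only marker taken from the text above) and that discarding stderr is acceptable.

```python
import subprocess
from typing import Iterable, List, Sequence

# Assumption: startup chatter is recognizable by substring. Only
# "Spark Web UI" comes from the wish list item; extend as needed.
NOISE_MARKERS = ("Spark Web UI",)


def strip_startup_noise(
    lines: Iterable[str], markers: Sequence[str] = NOISE_MARKERS
) -> List[str]:
    """Return only the lines that contain none of the noise markers."""
    return [line for line in lines if not any(m in line for m in markers)]


def run_and_capture(command: List[str]) -> List[str]:
    """Run a command, capture its stdout, and drop startup noise.

    stderr is discarded on the assumption that it carries only log
    output. A non-zero exit code raises CalledProcessError.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=True,
    )
    return strip_startup_noise(result.stdout.splitlines())
```

Assuming spark-sql is on the PATH, a call might look like `run_and_capture(["spark-sql", "-e", "SELECT 1"])`. Substring filtering is fragile: a result row that happens to contain a marker would also be dropped, which is one reason a purpose-built runner may still be the better long-term option.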
