Documentation Index

Fetch the complete documentation index at: https://docs.wasabi.com/llms.txt

Use this file to discover all available pages before exploring further.

Apache Iceberg With Wasabi

Prev Next

This guide explains how to configure Apache Spark with Apache Iceberg to use Wasabi Hot Cloud Storage as an S3-compatible object storage backend. Apache Iceberg is an open table format for large analytic datasets.

Requirements

  • This solution was tested on Ubuntu Server 26.04 LTS with Java 21 (OpenJDK 21.0.11) and Apache Spark 4.1.2.

  • A Wasabi account with a bucket created. Creating a Bucket for details on this procedure.

  • Wasabi access key and secret key. See 3—Creating a User Account and Access Key for further information.

Find Your Bundled Versions

Spark ships with a bundled Hadoop client. The hadoop-aws package version must exactly match it. Find the version by checking the JAR filename, along with the Scala and Spark versions.

ls $SPARK_HOME/jars/hadoop-client-*.jar
ls $SPARK_HOME/jars/scala-library-*.jar
ls $SPARK_HOME/jars/spark-core_*.jar

The Hadoop filename (for example, hadoop-client-runtime-3.4.2.jar) tells you the version to use in the next step with the Scala and Spark versions as well.

Configuring spark-defaults.conf

Use your favorite Linux text editor such as vi or nano to edit the Spark configuration file.

vi /opt/spark/conf/spark-defaults.conf

Add the following lines with the following information.

  • Wasabi access and secret keys. Replace YOUR_WASABI_ACCESS_KEY and YOUR_WASABI_SECRET_KEY with your keys.

  • Bucket name (replace “mt-iceberg” with your bucket name).

  • Wasabi region endpoint.

  • The Hadoop version found above.  Replace 3.4.2 in hadoop-aws in the last line with the version found above.

  • The iceberg-spark-runtime version (4.1_2.13:1.11.0) in the last line must match your Spark version (4.1.x) and Scala version (2.13) found above.

spark.hadoop.fs.s3a.endpoint                    https://s3.us-east-2.wasabisys.com
spark.hadoop.fs.s3a.access.key                  YOUR_WASABI_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key                  YOUR_WASABI_SECRET_KEY
spark.hadoop.fs.s3a.aws.credentials.provider    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.path.style.access           true
spark.sql.extensions                            org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg_catalog               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_catalog.type          hadoop
spark.sql.catalog.iceberg_catalog.warehouse     s3a://mt-iceberg/
spark.jars.packages                             org.apache.hadoop:hadoop-aws:3.4.2,org.apache.iceberg:iceberg-spark-runtime-4.1_2.13:1.11.0

This example uses Wasabi’s us-east-2 storage region. Use the endpoint URL for the region where your bucket is located. For a list of regions and their endpoint URLs, see Service URLs for Wasabi’s Storage Regions.

Starting spark-shell

Start the Spark shell. The packages defined in spark-defaults.conf will be downloaded automatically on first launch.

spark-shell

Creating a Database and Table

Once inside the Spark shell, create a database (namespace) and an Iceberg table.

spark.sql("CREATE DATABASE iceberg_catalog.db")

spark.sql("CREATE TABLE iceberg_catalog.db.test (id INT, name STRING) USING iceberg")

Inserting and Querying Data

Insert some rows and query them back.

spark.sql("INSERT INTO iceberg_catalog.db.test VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie')")

spark.sql("SELECT * FROM iceberg_catalog.db.test").show()

Verifying Iceberg Metadata

Query the Iceberg snapshots table to confirm that metadata is being stored in your Wasabi bucket.

spark.sql("SELECT * FROM iceberg_catalog.db.test.snapshots").show()

The output will show a snapshot with a manifest_list path beginning with s3a://mt-iceberg/, confirming that Iceberg is managing the table and storing all data and metadata in Wasabi.

Verifying Your Bucket Data

Verify that Iceberg has written data and metadata to your Wasabi bucket using the Wasabi Console. Log in to the console, click Buckets in the left navigation. Click the name of your bucket (mt-iceberg in this example):

Drill down through the folder structure to see your data in the bucket. Navigate to db/test to see the data and metadata folders created by Iceberg.

Catalog Structure

The fully qualified name for any Iceberg table follows this hierarchy.

catalog.database.table

In the example above:

  • iceberg_catalog is the catalog, defined in spark-defaults.conf and pointing to your Wasabi bucket.

  • db is the database (also called a namespace).

  • test is the table.

You can create multiple databases under the same catalog without any additional configuration.

spark.sql("CREATE DATABASE iceberg_catalog.project1")

spark.sql("CREATE DATABASE iceberg_catalog.project2")

Apache Polaris (Optional — Out of Scope)

The setup described in this article uses a Hadoop-based catalog, which stores Iceberg metadata directly in your Wasabi bucket. This is simple to configure and works well for single-engine environments where only Spark is reading and writing tables.

For more advanced use cases, Apache Polaris is an open-source REST catalog for Apache Iceberg that acts as a central metadata service. Rather than each engine connecting directly to object storage to find table metadata, all engines connect to Polaris over a standard REST API. This enables:

  • Multi-engine access—Spark, Trino, Flink, and other engines can all read and write the same Iceberg tables simultaneously through a single catalog

  • Centralized access control—Role-based permissions managed in one place rather than through storage bucket policies

  • Credential vending—Polaris issues short-lived, scoped credentials to engines rather than requiring each engine to hold long-lived storage keys

  • Audit and governance—All catalog operations are logged through a single service

Apache Polaris became a top-level Apache Software Foundation project in February 2026. It supports S3-compatible storage including Wasabi, and can use PostgreSQL as a persistent metastore backend.

Configuring Polaris is significantly more involved than the Hadoop catalog approach and is outside the scope of this article. For more information, see: