Big Data Processing for the AI Era — LakeSail

LakeSail is an open-source computation framework designed to unify batch processing, stream processing, and compute-intensive AI workloads. It provides a drop-in replacement for Apache Spark SQL and the Spark DataFrame API in both single-host and distributed environments, enabling fast, efficient data processing with minimal code changes.

Overview

LakeSail (Sail) aims to accelerate data processing tasks while reducing hardware costs and preserving ease of use. In the project's benchmark evaluations, it reports a 4x processing speed improvement with no changes to existing code, which it attributes to a Rust-based in-house SQL parser and an architecture designed for performance.
Sail offers compatibility with Spark SQL and the Spark DataFrame API, allowing users to run existing Spark workflows with minimal disruption.
It provides tooling and guidance to get started, including installation commands, server setup, and examples of connecting PySpark to a Sail-backed SQL engine.

Get Started

Sail is installed as a Python package with pip install "pysail[spark]", which also provides the sail command-line tool used to start a server.
Run a Sail server locally or in a cluster (e.g., Kubernetes) and connect to it from PySpark. Existing PySpark code stays the same apart from pointing the session at the Sail server.

Example setup snippets:

CLI installation and server start

```bash
pip install "pysail[spark]"
sail spark server --port 50051
```

Connecting PySpark to Sail

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
spark.sql("SELECT 1 + 1").show()
```

Kubernetes deployment

```bash
kubectl apply -f sail.yaml
kubectl -n sail port-forward service/sail-spark-server 50051:50051
```
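
The snippets above hard-code sc://localhost:50051. When the same script has to run against both a local server and a port-forwarded cluster, it can help to keep the endpoint out of the code. The following is a minimal sketch, not part of Sail itself: the helper names sail_remote_url, resolve_remote, and connect are hypothetical, and the default port is simply the one used in the examples above. It relies on SPARK_REMOTE, the environment variable that PySpark's Spark Connect client reads for the remote URL.

```python
import os

DEFAULT_HOST = "localhost"
DEFAULT_PORT = 50051  # assumption: the port used in the examples above


def sail_remote_url(host: str = DEFAULT_HOST, port: int = DEFAULT_PORT) -> str:
    """Build a Spark Connect URL of the form sc://host:port."""
    if not host:
        raise ValueError("host must be a non-empty string")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"sc://{host}:{port}"


def resolve_remote(env=None) -> str:
    """Prefer SPARK_REMOTE when it is set, else fall back to the local default."""
    env = os.environ if env is None else env
    return env.get("SPARK_REMOTE") or sail_remote_url()


def connect(url=None):
    """Open a session against the server (hypothetical helper).

    Needs pysail[spark] installed and a server listening at the URL.
    """
    # Imported lazily so the URL helpers work without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url or resolve_remote()).getOrCreate()
```

With a server running, connect().sql("SELECT 1 + 1").show() should behave like the PySpark snippet above. Setting SPARK_REMOTE in the shell then switches the target without touching the script.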

Architecture and Capabilities

Sail targets distributed data processing with a focus on performance, scalability, and ease of integration.
It provides a drop-in replacement for Spark SQL and the DataFrame API, enabling seamless migration from Spark-based workloads.
The architecture supports both single-host and distributed deployment models.

Features

Drop-in replacement for Spark SQL and Spark DataFrame API
High performance with Rust-based SQL parsing and optimization
PySpark compatibility: existing code runs unchanged once the session points at a Sail server
Supports both batch and streaming workloads (unified processing)
Kubernetes-ready deployment options for scalable environments
PyPI and CLI tooling for easy installation and operation
Commercial support options available with flexible coverage

How It Works

Install Sail and run a Sail server.
Connect your existing PySpark code to the Sail server using a Spark Connect URL, such as sc://localhost:50051 for a local server.
Execute SQL and DataFrame operations; Sail handles execution planning and distributed computation under the hood.
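
The steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not an official client: wait_for_port and run_example are hypothetical helper names, the host and port are the ones from the Get Started examples, and the DataFrame calls are standard PySpark API that Sail's stated compatibility is assumed to cover. The TCP check only confirms that something is listening on the port; it does not prove the server is ready to accept queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing listening yet: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def run_example(host: str = "localhost", port: int = 50051) -> None:
    """Connect to a running server, then run one DataFrame and one SQL query."""
    if not wait_for_port(host, port):
        raise RuntimeError(f"no server listening on {host}:{port}")

    # Imported lazily so wait_for_port works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(f"sc://{host}:{port}").getOrCreate()

    # DataFrame API: build a small in-memory frame and aggregate it.
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
    df.groupBy("label").count().show()

    # SQL API: the query is planned and executed on the server side.
    spark.sql("SELECT 1 + 1 AS two").show()

    spark.stop()
```

Calling run_example() after starting the server with sail spark server --port 50051 exercises all three steps. The readiness check is mainly useful in scripts that start the server and the client back to back.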

Support and Community

LakeSail offers commercial support with various options tailored to user needs.
Community resources include a public issue tracker and public Slack channel; enterprise plans provide private issue tracking and dedicated channels with guaranteed response times.

Safety and Legal Considerations

Use Sail in compliance with its license and terms of service. Refer to official documentation for deployment best practices, security configurations, and data governance.

Core Information

Name: Sail (LakeSail)
Purpose: Unify batch, stream, and AI workloads with a Spark-compatible interface
Language and Runtime: Rust-based core; Python (PySpark) integration
Deployment: Single-host or distributed; Kubernetes support
Licensing: Open source with commercial support options
