# PySpark Data Sources

Custom Spark data sources for reading and writing data in Apache Spark, built with the Python Data Source API.
## Installation

If you want to install all extra dependencies, use:
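Assuming the package is published on PyPI as `pyspark-data-sources` (the import path below is `pyspark_datasources`) and exposes an `all` extra for the optional dependencies listed in the table further down, installation would look like:

```shell
# Base install -- the PyPI package name is an assumption based on the import path
pip install pyspark-data-sources

# Install with all optional data source dependencies (faker, datasets, etc.)
# -- assumes the project defines an "all" extra
pip install "pyspark-data-sources[all]"
```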
## Usage

```python
from pyspark_datasources.fake import FakeDataSource

# Register the data source
spark.dataSource.register(FakeDataSource)

# Batch: read generated fake data
spark.read.format("fake").load().show()

# Streaming: generate fake data continuously and write it to the console
spark.readStream.format("fake").load().writeStream.format("console").start()
```
## Data Sources

| Data Source | Short Name | Description | Dependencies |
|---|---|---|---|
| `GithubDataSource` | `github` | Read pull requests from a GitHub repository | None |
| `FakeDataSource` | `fake` | Generate fake data using the Faker library | `faker` |
| `HuggingFaceDatasets` | `huggingface` | Read datasets from the HuggingFace Hub | `datasets` |
| `StockDataSource` | `stock` | Read stock data from Alpha Vantage | None |
| `SimpleJsonDataSource` | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
| `GoogleSheetsDataSource` | `googlesheets` | Read a table from a public Google Sheets document | None |
| `KaggleDataSource` | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |