PySpark Data Sources

Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API.

Installation

pip install pyspark-data-sources

To install all optional dependencies as well, use the following (the quotes keep shells such as zsh from interpreting the brackets):

pip install "pyspark-data-sources[all]"

Usage

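from pyspark.sql import SparkSession
from pyspark_datasources import GithubDataSource

# Create or reuse a Spark session
spark = SparkSession.builder.getOrCreate()

# Register the data source
spark.dataSource.register(GithubDataSource)

# Read pull requests from the apache/spark repository
spark.read.format("github").load("apache/spark").show()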

Data Sources

| Data Source | Short Name | Description | Dependencies |
|---|---|---|---|
| GithubDataSource | github | Read pull requests from a GitHub repository | None |
| FakeDataSource | fake | Generate fake data using the Faker library | faker |
| HuggingFaceDatasets | huggingface | Read datasets from the Hugging Face Hub | datasets |
| StockDataSource | stock | Read stock data from Alpha Vantage | None |
| SimpleJsonDataSource | simplejson | Read JSON data from a file | databricks-sdk |
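
For context on how sources like these are put together, here is a minimal sketch of a custom source written against the Python Data Source API. The `CounterDataSource` and `CounterReader` classes, the `counter` short name, and the `rows` option are hypothetical and not part of this package. The import fallback is also an assumption, included only so the sketch can be imported where PySpark 4.0+ is not installed.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Assumption: PySpark 4.0+ is unavailable, so use plain base classes
    # to keep this sketch importable.
    DataSource = DataSourceReader = object


class CounterReader(DataSourceReader):
    """Hypothetical reader that yields `rows` consecutive integers with a label."""

    def __init__(self, options):
        # Option values arrive as strings, so convert explicitly.
        self.rows = int(options.get("rows", 3))

    def read(self, partition):
        # Each yielded tuple must match the schema declared by the data source.
        for i in range(self.rows):
            yield (i, f"row-{i}")


class CounterDataSource(DataSource):
    """Hypothetical data source exposed under the short name `counter`."""

    @classmethod
    def name(cls):
        # Short name used in spark.read.format(...)
        return "counter"

    def schema(self):
        # Schema expressed as a DDL string
        return "id INT, label STRING"

    def reader(self, schema):
        # self.options is populated by Spark when the source is instantiated
        return CounterReader(self.options)
```

With Spark 4.0 or later, a class like this would be registered and read the same way as the packaged sources: `spark.dataSource.register(CounterDataSource)` followed by `spark.read.format("counter").option("rows", 5).load()`.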