# PySpark Data Sources

Custom Spark data sources for reading and writing data in Apache Spark, built with the Python Data Source API.
## Installation

If you want to install all extra dependencies, use:
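Assuming the package is published on PyPI as `pyspark-data-sources` (the import path below is `pyspark_datasources`) and exposes an `all` extra for the optional dependencies listed in the table further down, installation would look like:

```shell
# Base install -- the PyPI package name is an assumption based on the import path
pip install pyspark-data-sources

# Install with all optional data source dependencies (faker, datasets, etc.)
# -- assumes the project defines an "all" extra
pip install "pyspark-data-sources[all]"
```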
## Usage

```python
from pyspark_datasources.fake import FakeDataSource

# Register the data source
spark.dataSource.register(FakeDataSource)

# Batch: read generated fake data
spark.read.format("fake").load().show()

# Streaming: generate fake data continuously and write it to the console
spark.readStream.format("fake").load().writeStream.format("console").start()
```
## Data Sources

| Data Source | Short Name | Description | Dependencies |
|---|---|---|---|
| `GithubDataSource` | `github` | Read pull requests from a GitHub repository | None |
| `FakeDataSource` | `fake` | Generate fake data using the Faker library | `faker` |
| `HuggingFaceDatasets` | `huggingface` | Read datasets from the HuggingFace Hub | `datasets` |
| `StockDataSource` | `stock` | Read stock data from Alpha Vantage | None |
| `SimpleJsonDataSource` | `simplejson` | Write JSON data to Databricks DBFS | `databricks-sdk` |
| `GoogleSheetsDataSource` | `googlesheets` | Read a table from a public Google Sheets document | None |
| `KaggleDataSource` | `kaggle` | Read datasets from Kaggle | `kagglehub`, `pandas` |