PySpark Data Sources
Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API.
Installation
If you want to install all extra dependencies, use:
Usage
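from pyspark.sql import SparkSession
from pyspark_datasources import GithubDataSource

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# Register the data source
spark.dataSource.register(GithubDataSource)

# Read pull requests from the apache/spark repository
spark.read.format("github").load("apache/spark").show()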
Data Sources
Data Source | Short Name | Description | Dependencies |
---|---|---|---|
GithubDataSource | github | Read pull requests from a Github repository | None |
FakeDataSource | fake | Generate fake data using the Faker library | faker |
HuggingFaceDatasets | huggingface | Read datasets from the HuggingFace Hub | datasets |
StockDataSource | stock | Read stock data from Alpha Vantage | None |
SimpleJsonDataSource | simplejson | Read JSON data from a file | databricks-sdk |
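
Since some data sources in the table need an extra dependency, it can help to check what is importable before registering them. The following is a minimal sketch and not part of this package: the `REQUIRED_MODULES` mapping and both helper functions are hypothetical, and the module names (`faker`, `datasets`, `databricks.sdk`) are assumptions about how each listed dependency is imported.

```python
from importlib.util import find_spec

# Short name -> module provided by its extra dependency (None = no extra needed).
# Hypothetical mapping; the module names are assumed import names, not confirmed
# by this README.
REQUIRED_MODULES = {
    "github": None,
    "fake": "faker",
    "huggingface": "datasets",
    "stock": None,
    "simplejson": "databricks.sdk",
}


def is_available(short_name: str) -> bool:
    """Return True if the extra dependency for `short_name` can be imported."""
    module = REQUIRED_MODULES[short_name]
    if module is None:
        return True
    try:
        return find_spec(module) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. `databricks`) is missing
        return False


def missing_dependencies() -> list[str]:
    """List the short names whose extra dependency is not installed."""
    return [name for name in REQUIRED_MODULES if not is_available(name)]
```

Calling `missing_dependencies()` returns the short names that would likely fail at read time in the current environment; sources with no extra dependency (`github`, `stock`) are never reported.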