Bases: DataSource
A DataSource for reading a table from a public Google Sheets document.
Name: googlesheets
Schema: By default, all columns are treated as strings and the header row defines the column names.
Options

- `url`: The URL of the Google Sheets document.
- `path`: The ID of the Google Sheets document.
- `sheet_id`: The ID of the worksheet within the document.
- `has_header`: Whether the sheet has a header row. Default is `true`.

Either `url` or `path` must be specified, but not both.
Examples:
Register the data source.
>>> from pyspark_datasources import GoogleSheetsDataSource
>>> spark.dataSource.register(GoogleSheetsDataSource)
Load data from a public Google Sheets document using `path` and an optional `sheet_id`.
>>> spreadsheet_id = "10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0"
>>> spark.read.format("googlesheets").options(sheet_id="0").load(spreadsheet_id).show()
+-------+---------+---------+-------+
|country| latitude|longitude|   name|
+-------+---------+---------+-------+
|     AD|42.546245| 1.601554|Andorra|
|    ...|      ...|      ...|    ...|
+-------+---------+---------+-------+
Load data from a public Google Sheets document using `url`.
>>> url = "https://docs.google.com/spreadsheets/d/10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0/edit?gid=0#gid=0"
>>> spark.read.format("googlesheets").options(url=url).load().show()
+-------+---------+---------+-------+
|country| latitude|longitude|   name|
+-------+---------+---------+-------+
|     AD|42.546245| 1.601554|Andorra|
|    ...|      ...|      ...|    ...|
+-------+---------+---------+-------+
Specify a custom schema.
>>> schema = "id string, lat double, long double, name string"
>>> spark.read.format("googlesheets").schema(schema).options(url=url).load().show()
+---+---------+--------+-------+
| id|      lat|    long|   name|
+---+---------+--------+-------+
| AD|42.546245|1.601554|Andorra|
|...|      ...|     ...|    ...|
+---+---------+--------+-------+
Treat the first row as data instead of a header.
>>> schema = "c1 string, c2 string, c3 string, c4 string"
>>> spark.read.format("googlesheets").schema(schema).options(url=url, has_header="false").load().show()
+-------+---------+---------+-------+
|     c1|       c2|       c3|     c4|
+-------+---------+---------+-------+
|country| latitude|longitude|   name|
|     AD|42.546245| 1.601554|Andorra|
|    ...|      ...|      ...|    ...|
+-------+---------+---------+-------+
Source code in pyspark_datasources/googlesheets.py
class GoogleSheetsDataSource(DataSource):
    """
    A DataSource for reading a table from a public Google Sheets document.

    Name: `googlesheets`

    Schema: By default, all columns are treated as strings and the header row defines the column names.

    Options
    --------
    - `url`: The URL of the Google Sheets document.
    - `path`: The ID of the Google Sheets document.
    - `sheet_id`: The ID of the worksheet within the document.
    - `has_header`: Whether the sheet has a header row. Default is `true`.

    Either `url` or `path` must be specified, but not both.

    Examples
    --------
    Register the data source.

    >>> from pyspark_datasources import GoogleSheetsDataSource
    >>> spark.dataSource.register(GoogleSheetsDataSource)

    Load data from a public Google Sheets document using `path` and an optional `sheet_id`.

    >>> spreadsheet_id = "10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0"
    >>> spark.read.format("googlesheets").options(sheet_id="0").load(spreadsheet_id).show()
    +-------+---------+---------+-------+
    |country| latitude|longitude|   name|
    +-------+---------+---------+-------+
    |     AD|42.546245| 1.601554|Andorra|
    |    ...|      ...|      ...|    ...|
    +-------+---------+---------+-------+

    Load data from a public Google Sheets document using `url`.

    >>> url = "https://docs.google.com/spreadsheets/d/10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0/edit?gid=0#gid=0"
    >>> spark.read.format("googlesheets").options(url=url).load().show()
    +-------+---------+---------+-------+
    |country| latitude|longitude|   name|
    +-------+---------+---------+-------+
    |     AD|42.546245| 1.601554|Andorra|
    |    ...|      ...|      ...|    ...|
    +-------+---------+---------+-------+

    Specify a custom schema.

    >>> schema = "id string, lat double, long double, name string"
    >>> spark.read.format("googlesheets").schema(schema).options(url=url).load().show()
    +---+---------+--------+-------+
    | id|      lat|    long|   name|
    +---+---------+--------+-------+
    | AD|42.546245|1.601554|Andorra|
    |...|      ...|     ...|    ...|
    +---+---------+--------+-------+

    Treat the first row as data instead of a header.

    >>> schema = "c1 string, c2 string, c3 string, c4 string"
    >>> spark.read.format("googlesheets").schema(schema).options(url=url, has_header="false").load().show()
    +-------+---------+---------+-------+
    |     c1|       c2|       c3|     c4|
    +-------+---------+---------+-------+
    |country| latitude|longitude|   name|
    |     AD|42.546245| 1.601554|Andorra|
    |    ...|      ...|      ...|    ...|
    +-------+---------+---------+-------+
    """

    @classmethod
    def name(cls):
        return "googlesheets"

    def __init__(self, options: Dict[str, str]):
        if "url" in options:
            sheet = Sheet.from_url(options.pop("url"))
        elif "path" in options:
            sheet = Sheet(options.pop("path"), options.pop("sheet_id", None))
        else:
            raise ValueError("You must specify either `url` or `path` (spreadsheet ID).")
        has_header = options.pop("has_header", "true").lower() == "true"
        self.parameters = Parameters(sheet, has_header)

    def schema(self) -> StructType:
        if not self.parameters.has_header:
            raise ValueError("Custom schema is required when `has_header` is false")
        import pandas as pd

        # Infer column names from the first row of the sheet.
        df = pd.read_csv(self.parameters.sheet.get_query_url("select * limit 1"))
        return StructType([StructField(col, StringType()) for col in df.columns])

    def reader(self, schema: StructType) -> DataSourceReader:
        return GoogleSheetsReader(self.parameters, schema)
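The listing above relies on a `Sheet` helper (`Sheet.from_url`, `get_query_url`) that is defined elsewhere in `googlesheets.py`. The following is a minimal sketch of what such a helper can look like, not the packaged implementation: it parses the spreadsheet ID and worksheet `gid` out of a Google Sheets URL and builds the public gviz CSV-export URL that `pandas.read_csv` can fetch. The class and method names mirror the ones used above; the regexes and URL layout are assumptions based on the standard Google Sheets URL format.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass
class Sheet:
    spreadsheet_id: str
    sheet_id: Optional[str] = None  # worksheet gid; None means the default sheet

    @classmethod
    def from_url(cls, url: str) -> "Sheet":
        # The spreadsheet ID sits between "/spreadsheets/d/" and the next slash;
        # the worksheet gid may appear in the query string or the URL fragment.
        m = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
        if not m:
            raise ValueError(f"Not a Google Sheets URL: {url}")
        gid = re.search(r"gid=(\d+)", url)
        return cls(m.group(1), gid.group(1) if gid else None)

    def get_query_url(self, query: Optional[str] = None) -> str:
        # The gviz endpoint returns query results as CSV when tqx=out:csv.
        url = (
            f"https://docs.google.com/spreadsheets/d/{self.spreadsheet_id}"
            f"/gviz/tq?tqx=out:csv"
        )
        if self.sheet_id is not None:
            url += f"&gid={self.sheet_id}"
        if query is not None:
            url += f"&tq={quote(query)}"
        return url


# Example: build the CSV-export URL for the spreadsheet used in the docs above.
sheet = Sheet.from_url(
    "https://docs.google.com/spreadsheets/d/"
    "10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0/edit?gid=0#gid=0"
)
print(sheet.get_query_url("select * limit 1"))
```

With a helper shaped like this, `schema()` only has to fetch one row (`select * limit 1`) to discover the column names, which keeps schema inference cheap even for large sheets.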