GoogleSheetsDataSource

Bases: DataSource

A DataSource for reading tables from public Google Sheets.

Name: googlesheets

Schema: By default, all columns are treated as strings and the header row defines the column names.

Options
  • url: The URL of the Google Sheets document.
  • path: The ID of the Google Sheets document.
  • sheet_id: The ID of the worksheet within the document.
  • has_header: Whether the sheet has a header row. Default is true.

Either url or path must be specified, but not both.

Examples:

Register the data source.

>>> from pyspark_datasources import GoogleSheetsDataSource
>>> spark.dataSource.register(GoogleSheetsDataSource)

Load data from a public Google Sheets document using path and optional sheet_id.

>>> spreadsheet_id = "10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0"
>>> spark.read.format("googlesheets").options(sheet_id="0").load(spreadsheet_id).show()
+-------+---------+---------+-------+
|country| latitude|longitude|   name|
+-------+---------+---------+-------+
|     AD|42.546245| 1.601554|Andorra|
|    ...|      ...|      ...|    ...|
+-------+---------+---------+-------+

Load data from a public Google Sheets document using url.

>>> url = "https://docs.google.com/spreadsheets/d/10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0/edit?gid=0#gid=0"
>>> spark.read.format("googlesheets").options(url=url).load().show()
+-------+---------+---------+-------+
|country| latitude|longitude|   name|
+-------+---------+---------+-------+
|     AD|42.546245| 1.601554|Andorra|
|    ...|      ...|      ...|    ...|
+-------+---------+---------+-------+
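The URL above encodes both identifiers: the spreadsheet ID sits in the path after `/d/`, and the worksheet ID is the `gid` parameter. A sketch of extracting them (illustrative only; the data source's own URL parsing may handle more cases):

```python
import re
from typing import Optional, Tuple


def parse_sheet_url(url: str) -> Tuple[str, Optional[str]]:
    """Extract the spreadsheet ID and worksheet gid from a Google Sheets URL."""
    m = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
    if m is None:
        raise ValueError(f"Not a Google Sheets URL: {url}")
    g = re.search(r"gid=(\d+)", url)
    return m.group(1), g.group(1) if g else None
```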

Specify custom schema.

>>> schema = "id string, lat double, long double, name string"
>>> spark.read.format("googlesheets").schema(schema).options(url=url).load().show()
+---+---------+--------+-------+
| id|      lat|    long|   name|
+---+---------+--------+-------+
| AD|42.546245|1.601554|Andorra|
|...|      ...|     ...|    ...|
+---+---------+--------+-------+

Treat the first row as data instead of a header.

>>> schema = "c1 string, c2 string, c3 string, c4 string"
>>> spark.read.format("googlesheets").schema(schema).options(url=url, has_header="false").load().show()
+-------+---------+---------+-------+
|     c1|       c2|       c3|     c4|
+-------+---------+---------+-------+
|country| latitude|longitude|   name|
|     AD|42.546245| 1.601554|Andorra|
|    ...|      ...|      ...|    ...|
+-------+---------+---------+-------+
Source code in pyspark_datasources/googlesheets.py
class GoogleSheetsDataSource(DataSource):
    """
    A DataSource for reading tables from public Google Sheets.

    Name: `googlesheets`

    Schema: By default, all columns are treated as strings and the header row defines the column names.

    Options
    --------
    - `url`: The URL of the Google Sheets document.
    - `path`: The ID of the Google Sheets document.
    - `sheet_id`: The ID of the worksheet within the document.
    - `has_header`: Whether the sheet has a header row. Default is `true`.

    Either `url` or `path` must be specified, but not both.

    Examples
    --------
    Register the data source.

    >>> from pyspark_datasources import GoogleSheetsDataSource
    >>> spark.dataSource.register(GoogleSheetsDataSource)

    Load data from a public Google Sheets document using `path` and optional `sheet_id`.

    >>> spreadsheet_id = "10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0"
    >>> spark.read.format("googlesheets").options(sheet_id="0").load(spreadsheet_id).show()
    +-------+---------+---------+-------+
    |country| latitude|longitude|   name|
    +-------+---------+---------+-------+
    |     AD|42.546245| 1.601554|Andorra|
    |    ...|      ...|      ...|    ...|
    +-------+---------+---------+-------+

    Load data from a public Google Sheets document using `url`.

    >>> url = "https://docs.google.com/spreadsheets/d/10pD8oRN3RTBJq976RKPWHuxYy0Qa_JOoGFpsaS0Lop0/edit?gid=0#gid=0"
    >>> spark.read.format("googlesheets").options(url=url).load().show()
    +-------+---------+---------+-------+
    |country| latitude|longitude|   name|
    +-------+---------+---------+-------+
    |     AD|42.546245| 1.601554|Andorra|
    |    ...|      ...|      ...|    ...|
    +-------+---------+---------+-------+

    Specify custom schema.

    >>> schema = "id string, lat double, long double, name string"
    >>> spark.read.format("googlesheets").schema(schema).options(url=url).load().show()
    +---+---------+--------+-------+
    | id|      lat|    long|   name|
    +---+---------+--------+-------+
    | AD|42.546245|1.601554|Andorra|
    |...|      ...|     ...|    ...|
    +---+---------+--------+-------+

    Treat the first row as data instead of a header.

    >>> schema = "c1 string, c2 string, c3 string, c4 string"
    >>> spark.read.format("googlesheets").schema(schema).options(url=url, has_header="false").load().show()
    +-------+---------+---------+-------+
    |     c1|       c2|       c3|     c4|
    +-------+---------+---------+-------+
    |country| latitude|longitude|   name|
    |     AD|42.546245| 1.601554|Andorra|
    |    ...|      ...|      ...|    ...|
    +-------+---------+---------+-------+
    """

    @classmethod
    def name(cls):
        return "googlesheets"

    def __init__(self, options: Dict[str, str]):
        if "url" in options:
            sheet = Sheet.from_url(options.pop("url"))
        elif "path" in options:
            sheet = Sheet(options.pop("path"), options.pop("sheet_id", None))
        else:
            raise ValueError(
                "You must specify either `url` or `path` (spreadsheet ID)."
            )
        has_header = options.pop("has_header", "true").lower() == "true"
        self.parameters = Parameters(sheet, has_header)

    def schema(self) -> StructType:
        if not self.parameters.has_header:
            raise ValueError("Custom schema is required when `has_header` is false")

        import pandas as pd

        # Read schema from the first row of the sheet
        df = pd.read_csv(self.parameters.sheet.get_query_url("select * limit 1"))
        return StructType([StructField(col, StringType()) for col in df.columns])

    def reader(self, schema: StructType) -> DataSourceReader:
        return GoogleSheetsReader(self.parameters, schema)
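The `schema` method above reads a single row through a Google Visualization (gviz) CSV query URL produced by `get_query_url`. A hedged sketch of how such a URL might be assembled (the library's actual helper may differ):

```python
from urllib.parse import quote


def gviz_csv_url(spreadsheet_id: str, query: str, gid: str = "0") -> str:
    """Build a Google Visualization API URL that returns the query result
    as CSV. Illustrative sketch of what `get_query_url` could produce."""
    return (
        f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}"
        f"/gviz/tq?tqx=out:csv&gid={gid}&tq={quote(query)}"
    )
```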