GithubDataSource

Bases: DataSource

A DataSource for reading pull request data from GitHub.

Name: github

Schema: id int, title string, author string, created_at string, updated_at string

Examples:

Register the data source.

>>> from pyspark_datasources import GithubDataSource
>>> spark.dataSource.register(GithubDataSource)

Load pull request data from a public GitHub repository.

>>> spark.read.format("github").load("apache/spark").show()
+---+--------------------+--------+--------------------+--------------------+
| id|               title|  author|          created_at|          updated_at|
+---+--------------------+--------+--------------------+--------------------+
|  1|Initial commit      |  matei |2014-02-03T18:47:...|2014-02-03T18:47:...|
|...|                 ...|     ...|                 ...|                 ...|
+---+--------------------+--------+--------------------+--------------------+

Load pull request data from a private GitHub repository. The `token` option supplies a GitHub access token.

>>> spark.read.format("github").option("token", "your-token").load("owner/repo").show()
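Hard-coding a token in the call, as in the example above, is easy to leak into version control. The sketch below shows one way to build the reader options from an environment variable instead. The helper name `github_reader_options` and the variable name `GITHUB_TOKEN` are assumptions chosen for illustration; they are not part of `pyspark_datasources`.

```python
import os
from typing import Dict, Mapping, Optional


def github_reader_options(
    env: Optional[Mapping[str, str]] = None,
    var: str = "GITHUB_TOKEN",  # assumed variable name, not defined by the library
) -> Dict[str, str]:
    """Return {"token": ...} when the variable is set and non-empty, else {}."""
    env = os.environ if env is None else env
    token = env.get(var, "").strip()
    return {"token": token} if token else {}


# Usage with an active SparkSession (not executed here):
# opts = github_reader_options()
# spark.read.format("github").options(**opts).load("owner/repo").show()
```

When the variable is unset the helper returns an empty dict, so the same call also works for public repositories.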
Source code in pyspark_datasources/github.py
class GithubDataSource(DataSource):
    """
    A DataSource for reading pull request data from GitHub.

    Name: `github`

    Schema: `id int, title string, author string, created_at string, updated_at string`

    Examples
    --------
    Register the data source.

    >>> from pyspark_datasources import GithubDataSource
    >>> spark.dataSource.register(GithubDataSource)

    Load pull request data from a public GitHub repository.

    >>> spark.read.format("github").load("apache/spark").show()
    +---+--------------------+--------+--------------------+--------------------+
    | id|               title|  author|          created_at|          updated_at|
    +---+--------------------+--------+--------------------+--------------------+
    |  1|Initial commit      |  matei |2014-02-03T18:47:...|2014-02-03T18:47:...|
    |...|                 ...|     ...|                 ...|                 ...|
    +---+--------------------+--------+--------------------+--------------------+

    Load pull request data from a private GitHub repository.

    >>> spark.read.format("github").option("token", "your-token").load("owner/repo").show()
    """

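    @classmethod
    def name(cls):
        # Short name used in `spark.read.format("github")`.
        return "github"

    def schema(self):
        # Default schema, expressed as a DDL string.
        return "id int, title string, author string, created_at string, updated_at string"

    def reader(self, schema):
        # Hand the user-supplied options (such as `token`) to the reader.
        return GithubPullRequestReader(self.options)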
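
`GithubPullRequestReader` is not shown on this page. As a rough illustration of what such a reader has to do, the sketch below builds request headers from an optional token and maps one pull request object, shaped like a GitHub REST API response, to a tuple in the schema's column order. The helper names `build_headers` and `pull_request_to_row`, the use of the `number` field for `id`, and the sample record are all assumptions made for illustration, not the library's actual implementation.

```python
from typing import Any, Dict, Optional, Tuple

# Column order follows the schema string:
# id int, title string, author string, created_at string, updated_at string
PullRequestRow = Tuple[
    Optional[int], Optional[str], Optional[str], Optional[str], Optional[str]
]


def build_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build HTTP headers; add Authorization only when a token is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def pull_request_to_row(pr: Dict[str, Any]) -> PullRequestRow:
    """Map one pull request JSON object to a tuple matching the schema."""
    user = pr.get("user") or {}  # tolerate a missing or null user
    return (
        pr.get("number"),  # assumption: `id` holds the pull request number
        pr.get("title"),
        user.get("login"),
        pr.get("created_at"),
        pr.get("updated_at"),
    )


# Hypothetical record for illustration only.
sample = {
    "number": 42,
    "title": "Fix typo in README",
    "user": {"login": "octocat"},
    "created_at": "2024-01-01T00:00:00Z",
    "updated_at": "2024-01-02T00:00:00Z",
}
row = pull_request_to_row(sample)
```

Keeping the mapping in a small pure function makes it testable without a Spark session or network access; a reader's `read` method would then yield one such tuple per pull request.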