FakeDataSource

Requires the Faker library. Install it directly with pip install faker, or install the extra with pip install pyspark-data-sources[faker].

Bases: DataSource

A fake data source for PySpark to generate synthetic data using the faker library.

This data source allows specifying a schema with field names that correspond to faker providers to generate random data for testing and development purposes.

The default schema is name string, date string, zipcode string, state string, and the default number of rows is 3. Both can be customized by users.

Name: fake

Notes
  • The fake data source relies on the faker library. Make sure it is installed and accessible.
  • Only string type fields are supported, and each field name must correspond to a method name in the faker library.
  • When using the stream reader, numRows is the number of rows per microbatch.

Examples:

Register the data source.

>>> from pyspark_datasources import FakeDataSource
>>> spark.dataSource.register(FakeDataSource)

Use the fake datasource with the default schema and default number of rows:

>>> spark.read.format("fake").load().show()
+-----------+----------+-------+-------+
|       name|      date|zipcode|  state|
+-----------+----------+-------+-------+
|Carlos Cobb|2018-07-15|  73003|Indiana|
| Eric Scott|1991-08-22|  10085|  Idaho|
| Amy Martin|1988-10-28|  68076| Oregon|
+-----------+----------+-------+-------+

Use the fake datasource with a custom schema:

>>> spark.read.format("fake").schema("name string, company string").load().show()
+---------------------+--------------+
|name                 |company       |
+---------------------+--------------+
|Tanner Brennan       |Adams Group   |
|Leslie Maxwell       |Santiago Group|
|Mrs. Jacqueline Brown|Maynard Inc   |
+---------------------+--------------+

Use the fake datasource with a different number of rows:

>>> spark.read.format("fake").option("numRows", 5).load().show()
+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|  Pam Mitchell|1988-10-20|  23788|   Tennessee|
|Melissa Turner|1996-06-14|  30851|      Nevada|
|  Brian Ramsey|2021-08-21|  55277|  Washington|
|  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
| Douglas James|2007-01-18|  46226|     Alabama|
+--------------+----------+-------+------------+

Streaming fake data:

>>> stream = spark.readStream.format("fake").load().writeStream.format("console").start()
Batch: 0
+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|    Tommy Diaz|1976-11-17|  27627|South Dakota|
|Jonathan Perez|1986-02-23|  81307|Rhode Island|
|  Julia Farmer|1990-10-10|  40482|    Virginia|
+--------------+----------+-------+------------+
Batch: 1
...
>>> stream.stop()
Source code in pyspark_datasources/fake.py
class FakeDataSource(DataSource):
    """
    A fake data source for PySpark to generate synthetic data using the `faker` library.

    This data source allows specifying a schema with field names that correspond to `faker`
    providers to generate random data for testing and development purposes.

    The default schema is `name string, date string, zipcode string, state string`, and the
    default number of rows is `3`. Both can be customized by users.

    Name: `fake`

    Notes
    -----
    - The fake data source relies on the `faker` library. Make sure it is installed and accessible.
    - Only string type fields are supported, and each field name must correspond to a method name in
      the `faker` library.
    - When using the stream reader, `numRows` is the number of rows per microbatch.

    Examples
    --------
    Register the data source.

    >>> from pyspark_datasources import FakeDataSource
    >>> spark.dataSource.register(FakeDataSource)

    Use the fake datasource with the default schema and default number of rows:

    >>> spark.read.format("fake").load().show()
    +-----------+----------+-------+-------+
    |       name|      date|zipcode|  state|
    +-----------+----------+-------+-------+
    |Carlos Cobb|2018-07-15|  73003|Indiana|
    | Eric Scott|1991-08-22|  10085|  Idaho|
    | Amy Martin|1988-10-28|  68076| Oregon|
    +-----------+----------+-------+-------+

    Use the fake datasource with a custom schema:

    >>> spark.read.format("fake").schema("name string, company string").load().show()
    +---------------------+--------------+
    |name                 |company       |
    +---------------------+--------------+
    |Tanner Brennan       |Adams Group   |
    |Leslie Maxwell       |Santiago Group|
    |Mrs. Jacqueline Brown|Maynard Inc   |
    +---------------------+--------------+

    Use the fake datasource with a different number of rows:

    >>> spark.read.format("fake").option("numRows", 5).load().show()
    +--------------+----------+-------+------------+
    |          name|      date|zipcode|       state|
    +--------------+----------+-------+------------+
    |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
    |Melissa Turner|1996-06-14|  30851|      Nevada|
    |  Brian Ramsey|2021-08-21|  55277|  Washington|
    |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
    | Douglas James|2007-01-18|  46226|     Alabama|
    +--------------+----------+-------+------------+

    Streaming fake data:

    >>> stream = spark.readStream.format("fake").load().writeStream.format("console").start()
    Batch: 0
    +--------------+----------+-------+------------+
    |          name|      date|zipcode|       state|
    +--------------+----------+-------+------------+
    |    Tommy Diaz|1976-11-17|  27627|South Dakota|
    |Jonathan Perez|1986-02-23|  81307|Rhode Island|
    |  Julia Farmer|1990-10-10|  40482|    Virginia|
    +--------------+----------+-------+------------+
    Batch: 1
    ...
    >>> stream.stop()
    """

    @classmethod
    def name(cls):
        return "fake"

    def schema(self):
        return "name string, date string, zipcode string, state string"

    def reader(self, schema: StructType) -> "FakeDataSourceReader":
        _validate_faker_schema(schema)
        return FakeDataSourceReader(schema, self.options)

    def streamReader(self, schema) -> "FakeDataSourceStreamReader":
        _validate_faker_schema(schema)
        return FakeDataSourceStreamReader(schema, self.options)