Skip to content

FakeDataSource

Requires the Faker library. You can install it manually: pip install faker or use pip install pyspark-data-sources[faker].

Bases: DataSource

A fake data source for PySpark to generate synthetic data using the faker library.

This data source allows specifying a schema with field names that correspond to faker providers to generate random data for testing and development purposes.

The default schema is name string, date string, zipcode string, state string, and the default number of rows is 3. Both can be customized by users.

Name: fake

Notes
  • The fake data source relies on the faker library. Make sure it is installed and accessible.
  • Only string type fields are supported, and each field name must correspond to a method name in the faker library.

Examples:

Register the data source.

>>> from pyspark_datasources import FakeDataSource
>>> spark.dataSource.register(FakeDataSource)

Use the fake datasource with the default schema and default number of rows:

>>> spark.read.format("fake").load().show()
+-----------+----------+-------+-------+
|       name|      date|zipcode|  state|
+-----------+----------+-------+-------+
|Carlos Cobb|2018-07-15|  73003|Indiana|
| Eric Scott|1991-08-22|  10085|  Idaho|
| Amy Martin|1988-10-28|  68076| Oregon|
+-----------+----------+-------+-------+

Use the fake datasource with a custom schema:

>>> spark.read.format("fake").schema("name string, company string").load().show()
+---------------------+--------------+
|name                 |company       |
+---------------------+--------------+
|Tanner Brennan       |Adams Group   |
|Leslie Maxwell       |Santiago Group|
|Mrs. Jacqueline Brown|Maynard Inc   |
+---------------------+--------------+

Use the fake datasource with a different number of rows:

>>> spark.read.format("fake").option("numRows", 5).load().show()
+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|  Pam Mitchell|1988-10-20|  23788|   Tennessee|
|Melissa Turner|1996-06-14|  30851|      Nevada|
|  Brian Ramsey|2021-08-21|  55277|  Washington|
|  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
| Douglas James|2007-01-18|  46226|     Alabama|
+--------------+----------+-------+------------+
Source code in pyspark_datasources/fake.py
class FakeDataSource(DataSource):
    """
    A fake data source for PySpark to generate synthetic data using the `faker` library.

    This data source allows specifying a schema with field names that correspond to `faker`
    providers to generate random data for testing and development purposes.

    The default schema is `name string, date string, zipcode string, state string`, and the
    default number of rows is `3`. Both can be customized by users.

    Name: `fake`

    Notes
    -----
    - The fake data source relies on the `faker` library. Make sure it is installed and accessible.
    - Only string type fields are supported, and each field name must correspond to a method name in
      the `faker` library.

    Examples
    --------
    Register the data source.

    >>> from pyspark_datasources import FakeDataSource
    >>> spark.dataSource.register(FakeDataSource)

    Use the fake datasource with the default schema and default number of rows:

    >>> spark.read.format("fake").load().show()
    +-----------+----------+-------+-------+
    |       name|      date|zipcode|  state|
    +-----------+----------+-------+-------+
    |Carlos Cobb|2018-07-15|  73003|Indiana|
    | Eric Scott|1991-08-22|  10085|  Idaho|
    | Amy Martin|1988-10-28|  68076| Oregon|
    +-----------+----------+-------+-------+

    Use the fake datasource with a custom schema:

    >>> spark.read.format("fake").schema("name string, company string").load().show()
    +---------------------+--------------+
    |name                 |company       |
    +---------------------+--------------+
    |Tanner Brennan       |Adams Group   |
    |Leslie Maxwell       |Santiago Group|
    |Mrs. Jacqueline Brown|Maynard Inc   |
    +---------------------+--------------+

    Use the fake datasource with a different number of rows:

    >>> spark.read.format("fake").option("numRows", 5).load().show()
    +--------------+----------+-------+------------+
    |          name|      date|zipcode|       state|
    +--------------+----------+-------+------------+
    |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
    |Melissa Turner|1996-06-14|  30851|      Nevada|
    |  Brian Ramsey|2021-08-21|  55277|  Washington|
    |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
    | Douglas James|2007-01-18|  46226|     Alabama|
    +--------------+----------+-------+------------+
    """

    @classmethod
    def name(cls):
        return "fake"

    def schema(self):
        return "name string, date string, zipcode string, state string"

    def reader(self, schema: StructType):
        # Verify the library is installed correctly.
        try:
            from faker import Faker
        except ImportError:
            raise Exception("You need to install `faker` to use the fake datasource.")

        # Check the schema is valid before proceed to reading.
        fake = Faker()
        for field in schema.fields:
            try:
                getattr(fake, field.name)()
            except AttributeError:
                raise Exception(
                    f"Unable to find a method called `{field.name}` in faker. "
                    f"Please check Faker's documentation to see supported methods."
                )
            if field.dataType != StringType():
                raise Exception(
                    f"Field `{field.name}` is not a StringType. "
                    f"Only StringType is supported in the fake datasource."
                )

        return FakeDataSourceReader(schema, self.options)