Bases: DataSource
A fake data source for PySpark to generate synthetic data using the `faker` library.

This data source allows specifying a schema with field names that correspond to `faker` providers to generate random data for testing and development purposes.

The default schema is `name string, date string, zipcode string, state string`, and the default number of rows is `3`. Both can be customized by users.

Name: `fake`
Notes
- The fake data source relies on the `faker` library. Make sure it is installed and accessible.
- Only string type fields are supported, and each field name must correspond to a method name in the `faker` library (a quick check is shown after these notes).
- When using the stream reader, `numRows` is the number of rows per microbatch.
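Because each schema field must resolve to a `faker` method, you can vet a custom schema up front by probing a `Faker` instance directly. A minimal check, assuming `faker` is installed (`pip install faker`):

>>> from faker import Faker
>>> fake = Faker()
>>> hasattr(fake, "company")  # "company" is a usable field name
True
>>> hasattr(fake, "not_a_provider")  # this one would fail schema validation
False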
Examples:
Register the data source.
>>> from pyspark_datasources import FakeDataSource
>>> spark.dataSource.register(FakeDataSource)
Use the fake datasource with the default schema and default number of rows:
>>> spark.read.format("fake").load().show()
+-----------+----------+-------+-------+
|       name|      date|zipcode|  state|
+-----------+----------+-------+-------+
|Carlos Cobb|2018-07-15|  73003|Indiana|
| Eric Scott|1991-08-22|  10085|  Idaho|
| Amy Martin|1988-10-28|  68076| Oregon|
+-----------+----------+-------+-------+
Use the fake datasource with a custom schema:
>>> spark.read.format("fake").schema("name string, company string").load().show()
+---------------------+--------------+
|name                 |company       |
+---------------------+--------------+
|Tanner Brennan       |Adams Group   |
|Leslie Maxwell       |Santiago Group|
|Mrs. Jacqueline Brown|Maynard Inc   |
+---------------------+--------------+
Use the fake datasource with a different number of rows:
>>> spark.read.format("fake").option("numRows", 5).load().show()
+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|  Pam Mitchell|1988-10-20|  23788|   Tennessee|
|Melissa Turner|1996-06-14|  30851|      Nevada|
|  Brian Ramsey|2021-08-21|  55277|  Washington|
|  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
| Douglas James|2007-01-18|  46226|     Alabama|
+--------------+----------+-------+------------+
Streaming fake data:
>>> stream = spark.readStream.format("fake").load().writeStream.format("console").start()
Batch: 0
+--------------+----------+-------+------------+
|          name|      date|zipcode|       state|
+--------------+----------+-------+------------+
|    Tommy Diaz|1976-11-17|  27627|South Dakota|
|Jonathan Perez|1986-02-23|  81307|Rhode Island|
|  Julia Farmer|1990-10-10|  40482|    Virginia|
+--------------+----------+-------+------------+
Batch: 1
...
>>> stream.stop()
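Because `numRows` is the per-microbatch row count for the stream reader, the same option tunes streaming batch size:

>>> stream = (
...     spark.readStream.format("fake")
...     .option("numRows", 10)  # 10 rows per microbatch
...     .load()
...     .writeStream.format("console")
...     .start()
... )
>>> stream.stop()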
Source code in pyspark_datasources/fake.py
class FakeDataSource(DataSource):
    """
    A fake data source for PySpark to generate synthetic data using the `faker` library.

    This data source allows specifying a schema with field names that correspond to `faker`
    providers to generate random data for testing and development purposes.

    The default schema is `name string, date string, zipcode string, state string`, and the
    default number of rows is `3`. Both can be customized by users.

    Name: `fake`

    Notes
    -----
    - The fake data source relies on the `faker` library. Make sure it is installed and accessible.
    - Only string type fields are supported, and each field name must correspond to a method name in
      the `faker` library.
    - When using the stream reader, `numRows` is the number of rows per microbatch.

    Examples
    --------
    Register the data source.

    >>> from pyspark_datasources import FakeDataSource
    >>> spark.dataSource.register(FakeDataSource)

    Use the fake datasource with the default schema and default number of rows:

    >>> spark.read.format("fake").load().show()
    +-----------+----------+-------+-------+
    |       name|      date|zipcode|  state|
    +-----------+----------+-------+-------+
    |Carlos Cobb|2018-07-15|  73003|Indiana|
    | Eric Scott|1991-08-22|  10085|  Idaho|
    | Amy Martin|1988-10-28|  68076| Oregon|
    +-----------+----------+-------+-------+

    Use the fake datasource with a custom schema:

    >>> spark.read.format("fake").schema("name string, company string").load().show()
    +---------------------+--------------+
    |name                 |company       |
    +---------------------+--------------+
    |Tanner Brennan       |Adams Group   |
    |Leslie Maxwell       |Santiago Group|
    |Mrs. Jacqueline Brown|Maynard Inc   |
    +---------------------+--------------+

    Use the fake datasource with a different number of rows:

    >>> spark.read.format("fake").option("numRows", 5).load().show()
    +--------------+----------+-------+------------+
    |          name|      date|zipcode|       state|
    +--------------+----------+-------+------------+
    |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
    |Melissa Turner|1996-06-14|  30851|      Nevada|
    |  Brian Ramsey|2021-08-21|  55277|  Washington|
    |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
    | Douglas James|2007-01-18|  46226|     Alabama|
    +--------------+----------+-------+------------+

    Streaming fake data:

    >>> stream = spark.readStream.format("fake").load().writeStream.format("console").start()
    Batch: 0
    +--------------+----------+-------+------------+
    |          name|      date|zipcode|       state|
    +--------------+----------+-------+------------+
    |    Tommy Diaz|1976-11-17|  27627|South Dakota|
    |Jonathan Perez|1986-02-23|  81307|Rhode Island|
    |  Julia Farmer|1990-10-10|  40482|    Virginia|
    +--------------+----------+-------+------------+
    Batch: 1
    ...
    >>> stream.stop()
    """

    @classmethod
    def name(cls):
        return "fake"

    def schema(self):
        return "name string, date string, zipcode string, state string"

    def reader(self, schema: StructType) -> "FakeDataSourceReader":
        _validate_faker_schema(schema)
        return FakeDataSourceReader(schema, self.options)

    def streamReader(self, schema) -> "FakeDataSourceStreamReader":
        _validate_faker_schema(schema)
        return FakeDataSourceStreamReader(schema, self.options)
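The helper `_validate_faker_schema` and the reader classes referenced above live elsewhere in fake.py and are not shown here. Below is a minimal sketch of what the validator and the batch reader might look like, built on PySpark's Python Data Source API (pyspark.sql.datasource); the bodies are illustrative assumptions, not the project's exact code:

from typing import Iterator, Tuple

from pyspark.sql.datasource import DataSourceReader
from pyspark.sql.types import StringType, StructType


def _validate_faker_schema(schema: StructType) -> None:
    # Fail fast if `faker` is missing or a field cannot map to a faker method.
    try:
        from faker import Faker
    except ImportError:
        raise Exception("The fake data source requires the `faker` library.")
    fake = Faker()
    for field in schema.fields:
        if not hasattr(fake, field.name):
            raise Exception(f"Unsupported field name: {field.name}")
        if field.dataType != StringType():
            raise Exception(f"Field {field.name} is not a string type.")


class FakeDataSourceReader(DataSourceReader):
    def __init__(self, schema: StructType, options: dict):
        self.schema = schema
        self.options = options

    def read(self, partition) -> Iterator[Tuple]:
        from faker import Faker

        fake = Faker()
        # `numRows` defaults to 3, matching the documented behavior.
        num_rows = int(self.options.get("numRows", 3))
        for _ in range(num_rows):
            # Each field name resolves to a faker method, e.g. fake.name().
            yield tuple(getattr(fake, field.name)() for field in self.schema.fields)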