Introduction

In this section we'll explore the generation of synthetic data using the SDV (Synthetic Data Vault) library. SDV offers multiple machine learning models, ranging from classical statistical methods (copulas) to deep learning methods (GANs). Synthetic data is useful for a number of reasons:

Software Testing
Access Expansion
Pilot New Products
Augmented Data
Plan scenarios

Personally Identifiable Information (PII)

One main issue with data is its sensitivity. We can therefore classify each data column as either PII (Personally Identifiable Information) or non-PII. According to the University of Pittsburgh:

Personally Identifiable Information (PII) includes:

Any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and

Any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information.

Examples of PII include, but are not limited to:

Name: full name, maiden name, mother’s maiden name, or alias

Personal identification numbers: social security number (SSN), passport number, driver’s license number, taxpayer identification number, patient identification number, financial account number, or credit card number

Personal address information: street address, or email address

Personal telephone numbers

Personal characteristics: photographic images (particularly of face or other identifying characteristics), fingerprints, or handwriting

Biometric data: retina scans, voice signatures, or facial geometry

Information identifying personally owned property: VIN number or title number

Asset information: Internet Protocol (IP) or Media Access Control (MAC) addresses that consistently link to a particular person

Source
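As a first pass at separating PII from non-PII columns, a simple keyword check on column names can help. The sketch below is only an illustration: the keyword list and the function name flag_pii_columns are assumptions made here, not part of SDV or of the definition quoted above, and a name-based check will miss PII stored under unexpected column names.

```python
# Hypothetical keyword list, loosely based on the PII examples quoted above.
PII_KEYWORDS = (
    "name", "ssn", "passport", "license", "email", "address",
    "phone", "credit_card", "account", "ip", "mac", "vin",
)


def flag_pii_columns(column_names):
    """Split column names into (pii, non_pii) using a keyword match."""
    pii, non_pii = [], []
    for column in column_names:
        lowered = column.lower()
        tokens = lowered.split("_")
        # Single-word keywords must match a whole token (so "ip" does not
        # match "zip"); multi-word keywords are matched as substrings.
        is_pii = any(
            keyword in tokens or ("_" in keyword and keyword in lowered)
            for keyword in PII_KEYWORDS
        )
        (pii if is_pii else non_pii).append(column)
    return pii, non_pii


columns = ["guest_email", "room_type", "credit_card_number", "checkin_date"]
pii, non_pii = flag_pii_columns(columns)
print(pii)      # ['guest_email', 'credit_card_number']
print(non_pii)  # ['room_type', 'checkin_date']
```

The result is a starting point for manual review, not a replacement for it.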

Requirements

To install the SDV library we should be working with Python >= 3.10 and < 3.11. For the analysis done on this library we used Python 3.10.9.

pip install sdv

Flow

Usually the flow of generating synthetic data is the following:

flowchart TD
    subgraph Synthetic Data Flow
        ld[Load Data]
        cm[Create Metadata]
        em[Edit Metadata]
        cs[Create Synthesizer]
        ts[Train Synthesizer]
        gsd[Generate Synthetic Data]
        ld-->cm
        cm-->em
        em-->cs
        cs-->ts
        ts-->gsd
    end

Load Data

First we need to load the data. If it is stored as CSV files, we can use the built-in load_csvs function, which reads every CSV file in the given folder and returns a dictionary of tables keyed by file name (without the extension).

from sdv.datasets.local import load_csvs

# assume that my_folder contains 1 CSV file named 'guests.csv'
datasets = load_csvs(folder_name='my_folder/')
# pick the table by its file name, without the .csv extension
data = datasets['guests']

Since SDV works with pandas DataFrames, we can also load the data with pandas directly.

import pandas as pd

data = pd.read_excel('file://localhost/path/to/table.xlsx')

Create Metadata

Metadata is an object that describes the skeleton of our data: mainly column types, keys, and so on. As a second step we create this metadata from the data loaded previously, using the SDV method detect_from_dataframe. At this stage we must also tell SDV whether we want single-table or multi-table metadata.

from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=data)

Edit Metadata

After this, the metadata object is populated. It is strongly advised to check this object and edit it where needed, because auto-detection only infers the column types and some of them may be wrong. For example, a column flag that holds only the values 0 and 1 is detected as numerical even though the correct type is most probably boolean. To edit the metadata we use the methods update_column, set_primary_key and add_alternate_keys.

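The update_column method takes a parameter called sdtype, which sets the semantic type of the column. In addition to the common types listed below, SDV accepts sdtypes for sensitive concepts such as email or phone_number, which are generated with the Faker Python library. The most common ones are the following: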

Boolean: Sdtype boolean describes columns that contain TRUE or FALSE values and may contain some missing data.
Categorical: Sdtype categorical describes columns that contain distinct categories. The defining aspect of a categorical column is that only the values that appear in the real data are valid. The categories may be ordered or unordered.
Datetime: Sdtype datetime describes columns that indicate a point of time. This can be at any granularity: to the nearest day, minute, second or even nanosecond. Typically, the datetime will be represented as a string.
Numerical: Sdtype numerical describes data with numbers. The defining aspect of numerical data is that there is an order and you can apply a variety of mathematical computations to the values (average, sum, etc.). The actual values may follow a specific format, such as being rounded to 2 decimal digits and remaining between min/max bounds.
ID: Sdtype id describes columns that are used to identify rows (e.g. as a primary or foreign key). ID columns do not have any other mathematical or special meaning. Typically, an ID column follows a particular structure, for example being exactly 8 digits long with a - in the middle.

Also, note that this object can be saved or loaded locally.

metadata.update_column(
    column_name='room_type',
    sdtype='categorical')

metadata.set_primary_key(column_name='guest_email')

metadata.add_alternate_keys(column_names=['credit_card_number'])

metadata.save_to_json(filepath='my_metadata_v1.json')

# Load the saved metadata back from the JSON file (requires SingleTableMetadata to be imported)
metadata_obj = SingleTableMetadata.load_from_json(filepath='my_metadata_v1.json')
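Auto-detected types are worth checking programmatically as well. Returning to the 0/1 flag example, the sketch below lists columns whose non-missing values are all 0 or 1, so they can be reviewed and, if appropriate, updated to sdtype boolean with update_column. The helper name find_boolean_candidates and the plain-dictionary input are assumptions made for illustration; with pandas, data.to_dict('list') produces a compatible mapping.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_boolean_candidates(columns):
    """Return names of columns whose non-missing values are all 0 or 1."""
    candidates = []
    for name, values in columns.items():
        present = [v for v in values if not _is_missing(v)]
        # Real True/False values are skipped: they are already boolean-like.
        if present and all(
            not isinstance(v, bool) and v in (0, 1) for v in present
        ):
            candidates.append(name)
    return candidates


# Hypothetical table, written as a mapping of column name to values.
table = {
    "flag": [0, 1, 1, None, 0],
    "amenities_fee": [0, 1, 37.5, 12.0, 0],
    "room_type": ["BASIC", "SUITE", "BASIC", "DELUXE", "SUITE"],
}
print(find_boolean_candidates(table))  # ['flag']
```

A column reported here is only a candidate: a genuine count that happens to contain just 0 and 1 should stay numerical.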

Create Synthesizer

The synthesizer is the tool that uses machine learning to understand your data and create synthetic data based on it. There are several different models of synthesizers. For single tables we have:

GaussianCopulaSynthesizer
Fast ML Preset, a preset that uses the GaussianCopulaSynthesizer in the background.
CTGANSynthesizer
TVAESynthesizer
CopulaGANSynthesizer

For multi tables we have:

HMASynthesizer (note that we can set the synthesizer used for each table on a multi-table synthesizer)

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)

Train Synthesizer

To train the synthesizer we use the method fit.

# data is the real table loaded in the Load Data step
synthesizer.fit(data)

You can also save and load the synthesizer.
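A minimal sketch of that round trip, assuming the synthesizer exposes save(filepath=...) and its class exposes load(filepath=...), which is the interface assumed here for the SDV single-table synthesizers. The helper name save_and_reload and the file name are placeholders chosen for this example.

```python
def save_and_reload(synthesizer, filepath="my_synthesizer.pkl"):
    """Save a fitted synthesizer to disk and load it back.

    Assumes the object offers `save(filepath=...)` and its class offers
    `load(filepath=...)`; check the documentation of your SDV version.
    """
    synthesizer.save(filepath=filepath)
    # Load through the same class so the restored object has the same type.
    return type(synthesizer).load(filepath=filepath)


# Usage with the synthesizer fitted above (assumed to exist):
# restored = save_and_reload(synthesizer)
# synthetic_data = restored.sample(num_rows=100)
```

Saving after fit avoids retraining, which matters most for the slower deep learning synthesizers.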

Generate Synthetic Data

To generate synthetic data with our synthesizer we should use the method sample.

synthetic_data = synthesizer.sample(num_rows=100)

It's also possible to generate conditional synthetic data.

from sdv.sampling import Condition

suite_guests_with_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': True}
)

suite_guests_without_rewards = Condition(
    num_rows=250,
    column_values={'room_type': 'SUITE', 'has_rewards': False}
)

synthetic_data = synthesizer.sample_from_conditions(
    conditions=[suite_guests_with_rewards, suite_guests_without_rewards],
    output_file_path='synthetic_simulated_scenario.csv'
)

That covers the basic flow of the SDV library. Each of these steps has more options than shown here; for those, check the official documentation.

Evaluation

The SDV library also provides an evaluation module, which we can use to compare the newly generated synthetic data with the real data. This helps us check the quality of the synthetic data and decide whether to use it.
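To give a feel for what such a comparison measures, the sketch below scores how closely the category frequencies of a single column match between real and synthetic data, using the complement of the total variation distance (1.0 means identical frequencies, 0.0 means no overlap). This is a hand-written illustration with a hypothetical function name, tv_complement, and made-up values; it is not SDV's evaluation API, which computes its own set of metrics.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """Return 1 - total variation distance between two category columns."""
    n_real, n_synth = len(real_values), len(synthetic_values)
    if n_real == 0 or n_synth == 0:
        raise ValueError("both columns must contain at least one value")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    # Half the summed absolute difference of the relative frequencies.
    distance = 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )
    return 1.0 - distance


real = ["BASIC", "BASIC", "SUITE", "DELUXE"]
synthetic = ["BASIC", "SUITE", "SUITE", "DELUXE"]
print(tv_complement(real, synthetic))  # 0.75
```

A single-column score like this says nothing about relationships between columns, so it complements rather than replaces a full quality report.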

Metrics

To study the performance of this library's methods (fit to train and sample to create data), we used a Jupyter Notebook in VSCode on an M1 MacBook Air with 16GB of RAM. The results are the following:

Synthesizer model	Method	Number of rows used/generated	Time
Gaussian - FastML	fit	5000	0.1s
Gaussian - FastML	sample	100000	0.9s
Gaussian - FastML	sample	1000000	8.0s
Gaussian	fit	5000	1.2s
Gaussian	sample	100000	1.9s
Gaussian	sample	1000000	17.4s
CTGAN	fit	5000	1min 21s
CTGAN	sample	100000	2.3s
CTGAN	sample	1000000	22.1s
TVAE	fit	5000	28.1s
TVAE	sample	100000	1.3s
TVAE	sample	1000000	13s
CopulaGAN	fit	5000	1min 31s
CopulaGAN	sample	100000	2.8s
CopulaGAN	sample	1000000	28.3s
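Timings like these can be reproduced with a small wall-clock helper. The sketch below is one possible approach, not necessarily how the numbers above were measured; the helper name timed is hypothetical, and the commented calls assume the synthesizer and data objects from the earlier steps.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the example runs on its own.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.3f}s")

# With SDV objects from the previous steps (assumed to exist):
# _, fit_seconds = timed(synthesizer.fit, data)
# synthetic_data, sample_seconds = timed(synthesizer.sample, num_rows=100_000)
```

Single runs are noisy, so repeating each measurement a few times gives a more reliable picture.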

Notebooks

Two notebooks are available that showcase the SDV library's single-table and multi-table workflows.