Multi Table Synthetic Data¶
Data analysis¶
For the multi-table example we'll be using the strokes table from the single table analysis together with a people table. Each entry in the people table has an id that matches the person_id column in the strokes table. The people table also contains PII (Personally Identifiable Information), which is important for us to mask.
Load Data¶
First, we load all CSV files from the content folder.
from sdv.datasets.local import load_csvs
try:
    datasets = load_csvs(folder_name='content/')
except ValueError as e:
    print(e)
Then, we access both tables and display the first 20 rows of the people table.
print(datasets.keys())
strokes_table = datasets['strokes']
people_table = datasets['people']
people_table.head(20)
dict_keys(['people', 'strokes'])
  | id | name | address | city |
---|---|---|---|---|
0 | 56420 | Marcelo Holmes | 49 Walt Whitman Lane | New York |
1 | 51856 | Aleena Hahn | Apple Valley, CA 92307 | Los Angeles |
2 | 41097 | Jocelyn Hancock | 10 West Church St. | Chicago |
3 | 545 | Marcel Underwood | Hastings, MN 55033 | Miami |
4 | 37759 | Jazlyn Davila | 7444 South Pine Dr. | Dallas |
5 | 66333 | Teagan Randall | Malden, MA 02148 | Houston |
6 | 70670 | Antony Graham | 666 Windfall Dr. | Philadelphia |
7 | 20292 | Evelyn Becker | Niagara Falls, NY 14304 | Atlanta |
8 | 72784 | Arturo Dillon | 334 Grove Street | Washington |
9 | 65895 | Alyssa Peters | Moncks Corner, SC 29461 | Boston |
10 | 5131 | Damarion Colon | 58 Canterbury Street | Phoenix |
11 | 72911 | Holden Mccarthy | Lake Jackson, TX 77566 | Detroit |
12 | 1307 | Colten Costa | 345 East Brandywine St. | Seattle |
13 | 23047 | Turner Mcdaniel | Halethorpe, MD 21227 | San Francisco |
14 | 32604 | Jett Knox | 9 Eagle Dr. | San Diego |
15 | 63915 | Talia Olson | Maumee, OH 43537 | Minneapolis |
16 | 25405 | Kate Hale | 59 Vale St. | Brooklyn |
17 | 3590 | Lilliana Warren | Tuckerton, NJ 08087 | Tampa |
18 | 2898 | Ramon Dillon | 95 Cross Ave. | Denver |
19 | 60675 | Ayanna Tyler | Chardon, OH 44024 | Queens |
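The claim that every person_id in strokes points at an existing row in people can be checked with a quick sketch. The tiny DataFrames below are toy stand-ins for the real tables; only the column names come from the tutorial:

```python
import pandas as pd

# Toy stand-ins for the real tables; column names follow the tutorial.
people = pd.DataFrame({"id": [56420, 51856, 41097]})
strokes = pd.DataFrame({"id": [1, 2, 3, 4],
                        "person_id": [56420, 51856, 56420, 41097]})

# Every foreign key in strokes should point at an existing person.
orphans = strokes.loc[~strokes["person_id"].isin(people["id"])]
print(len(orphans))  # 0 -> referential integrity holds
```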
Create metadata¶
We then need to create the metadata object used when building the synthesizer. SDV detects column types from the table content, but the detection may not be correct, so it's always best to review the metadata and fix whatever needs to be fixed.
from sdv.metadata import MultiTableMetadata
metadata = MultiTableMetadata()
metadata.detect_table_from_dataframe(
table_name='strokes',
data=strokes_table
)
metadata.detect_table_from_dataframe(
table_name='people',
data=people_table
)
print('Auto detected data:\n')
metadata
Auto detected data:
{ "tables": { "strokes": { "columns": { "id": { "sdtype": "numerical" }, "person_id": { "sdtype": "numerical" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical" }, "hypertension": { "sdtype": "numerical" }, "heart_disease": { "sdtype": "numerical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "numerical" } } }, "people": { "columns": { "id": { "sdtype": "numerical" }, "name": { "sdtype": "categorical" }, "address": { "sdtype": "categorical" }, "city": { "sdtype": "categorical" } } } }, "relationships": [], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
Edit Metadata¶
Strokes Table¶
Below, we make the following changes:
- Change column id to type id;
- Change column age to a numerical type of integers;
- Change column bmi to a numerical type of floats;
- Change column stroke to a categorical column since the information is either 0 or 1;
- Change hypertension and heart_disease columns to categorical as well for the same reason.
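The reasoning behind the last two bullets can be automated: a numeric column whose values are only 0 and 1 is usually better modeled as categorical. A small sketch of that check on toy data (the helper is invented for illustration, not part of SDV):

```python
import pandas as pd

# Toy columns mimicking the 0/1 flags (hypertension, heart_disease, stroke).
df = pd.DataFrame({"stroke": [0, 0, 1, 0, 1],
                   "age": [41, 22, 42, 71, 2]})

def looks_binary(series):
    """A numeric column whose values are only 0/1 is a candidate for categorical."""
    return set(series.dropna().unique()) <= {0, 1}

binary_cols = [c for c in df.columns if looks_binary(df[c])]
print(binary_cols)  # ['stroke']
```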
metadata.update_column(
table_name='strokes',
column_name='id',
sdtype='id'
)
metadata.update_column(
table_name='strokes',
column_name='person_id',
sdtype='id'
)
metadata.update_column(
table_name='strokes',
column_name='age',
sdtype='numerical',
computer_representation="Int64"
)
metadata.update_column(
table_name='strokes',
column_name='bmi',
sdtype='numerical',
computer_representation="Float"
)
metadata.update_column(
table_name='strokes',
column_name='stroke',
sdtype='categorical',
)
metadata.update_column(
table_name='strokes',
column_name='hypertension',
sdtype='categorical',
)
metadata.update_column(
table_name='strokes',
column_name='heart_disease',
sdtype='categorical',
)
print(metadata)
{ "tables": { "strokes": { "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "id" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }, "people": { "columns": { "id": { "sdtype": "numerical" }, "name": { "sdtype": "categorical" }, "address": { "sdtype": "categorical" }, "city": { "sdtype": "categorical" } } } }, "relationships": [], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
People Table¶
Below, we make a few changes to the people table:
- Change column id to type id;
- Change column name to type name and mark as PII;
- Change column address to type address and mark as PII;
- Change column city to type city and mark as PII.
We need to mark these columns as pii because they contain personally identifiable information.
metadata.update_column(
table_name='people',
column_name='id',
sdtype='id'
)
metadata.update_column(
table_name='people',
column_name='name',
sdtype='name',
pii=True
)
metadata.update_column(
table_name='people',
column_name='address',
sdtype='address',
pii=True
)
metadata.update_column(
table_name='people',
column_name='city',
sdtype='city',
pii=True
)
print(metadata)
{ "tables": { "strokes": { "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "id" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }, "people": { "columns": { "id": { "sdtype": "id" }, "name": { "sdtype": "name", "pii": true }, "address": { "sdtype": "address", "pii": true }, "city": { "sdtype": "city", "pii": true } } } }, "relationships": [], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
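Conceptually, marking a column as PII tells the synthesizer to replace its values with fake ones instead of learning from them. A minimal pure-Python sketch of that idea (the name pool and helper below are invented for illustration; SDV delegates the real work to the Faker library):

```python
import random

random.seed(0)

# Tiny pool of fake names; SDV uses Faker instead of a hand-made pool.
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe"]

def mask_pii(rows, column):
    """Replace a PII column with random fake values, keeping other columns."""
    return [{**row, column: random.choice(FAKE_NAMES)} for row in rows]

people = [{"id": 56420, "name": "Marcelo Holmes"},
          {"id": 51856, "name": "Aleena Hahn"}]
masked = mask_pii(people, "name")
print(all(row["name"] in FAKE_NAMES for row in masked))  # True
```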
Then we need to connect the two tables: we set a primary key on each table and add a parent-child relationship between them.
metadata.set_primary_key(
table_name='strokes',
column_name='id'
)
metadata.set_primary_key(
table_name='people',
column_name='id'
)
metadata.add_relationship(
parent_table_name='people',
child_table_name='strokes',
parent_primary_key='id',
child_foreign_key='person_id'
)
print(metadata)
{ "tables": { "strokes": { "primary_key": "id", "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "id" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }, "people": { "primary_key": "id", "columns": { "id": { "sdtype": "id" }, "name": { "sdtype": "name", "pii": true }, "address": { "sdtype": "address", "pii": true }, "city": { "sdtype": "city", "pii": true } } } }, "relationships": [ { "parent_table_name": "people", "child_table_name": "strokes", "parent_primary_key": "id", "child_foreign_key": "person_id" } ], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
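Before calling set_primary_key it can be useful to confirm the column actually qualifies as a primary key, i.e. it is unique and has no missing values. A quick sketch with toy data (only the column name comes from the tutorial):

```python
import pandas as pd

# Toy stand-in for the people table.
people = pd.DataFrame({"id": [56420, 51856, 41097]})

# A primary key must be non-null and unique.
is_valid_pk = people["id"].notna().all() and people["id"].is_unique
print(is_valid_pk)  # True
```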
Create Synthesizer¶
Having created the metadata object, we then create the synthesizer, which will be trained to generate the synthetic data. Here we use the only multi-table synthesizer available outside the enterprise edition: HMA. Note that you can configure which synthesizer each table uses.
from sdv.multi_table import HMASynthesizer
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(datasets)
We cannot choose the number of rows when sampling multiple tables, but we can set a scale:
- <1: shrink the data. For example, 0.9 creates synthetic data that is roughly 90% of the size of the original data;
- =1: don't scale the data. The model creates synthetic data roughly the same size as the original data;
- >1: grow the data by the given factor. For example, 2.5 creates synthetic data roughly 2.5x the size of the original data.
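The effect of scale on table sizes is simple arithmetic; the row count below is a made-up example size, not the dataset's actual row count:

```python
# Rough expected synthetic table sizes for different scale values.
# n_real = 4800 is a hypothetical example size for illustration.
n_real = 4800
for scale in (0.9, 1.0, 2.5):
    print(scale, round(n_real * scale))
# 0.9 4320
# 1.0 4800
# 2.5 12000
```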
synthetic_data = synthesizer.sample(
scale=2.5
)
synthetic_data
{'people':        id                name                                            address               city
 0          0       Brent Collins  6157 Clark Rest\nSouth Christophershire, MT 67360       New Johnfurt
 1          1         Carlos Mata                   PSC 8737, Box 1740\nAPO AA 35924          Jacobland
 ...      ...                 ...                                                ...                ...
 11999  11999          Todd Price      13150 Pamela Walk\nNorth Joshuaside, ND 17432    East Dianeville

 [12000 rows x 4 columns],
 'strokes':     id  person_id  gender  age  ...   bmi   smoking_status  stroke
 0           0          0    Male   41  ...  27.1     never smoked       0
 1           1          1  Female   22  ...  22.5          Unknown       0
 ...       ...        ...     ...  ...  ...   ...              ...     ...
 11999   11999      11999    Male   59  ...  33.2  formerly smoked       0

 [12000 rows x 13 columns]}
If you search for these values in the original data you won't find them, since we marked name, address and city as PII and they were replaced with fake values.
Evaluation¶
from sdv.evaluation.multi_table import evaluate_quality
quality_report = evaluate_quality(
real_data=datasets,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|██████████| 5/5 [00:00<00:00, 15.89it/s]
Overall Quality Score: 91.12%

Properties:
Column Shapes: 90.35%
Column Pair Trends: 83.01%
Parent Child Relationships: 100.0%
from sdv.evaluation.multi_table import run_diagnostic
diagnostic_report = run_diagnostic(
real_data=datasets,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|██████████| 4/4 [00:28<00:00, 7.15s/it]
DiagnosticResults:

SUCCESS:
✓ The synthetic data covers over 90% of the numerical ranges present in the real data
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
diagnostic_report.get_results()
{'SUCCESS': ['The synthetic data covers over 90% of the numerical ranges present in the real data', 'The synthetic data covers over 90% of the categories present in the real data', 'Over 90% of the synthetic rows are not copies of the real data', 'The synthetic data follows over 90% of the min/max boundaries set by the real data'], 'WARNING': [], 'DANGER': []}
diagnostic_report.get_properties()
{'Coverage': 0.9733928657693466, 'Synthesis': 1.0, 'Boundaries': 1.0}
PII Locales¶
It's also possible to generate locale-specific data, i.e. data in a given language for a given country. For example, here is how to generate only French Canadian names.
We create an AnonymizedFaker, which takes a provider name and function name from the Python Faker library plus a locales array of possible language_country codes. We then refit the synthesizer so the change takes effect.
from rdt.transformers.pii import AnonymizedFaker
synthesizer.update_transformers(table_name="people", column_name_to_transformer={
'name': AnonymizedFaker(provider_name='person', function_name='name', locales=['fr_CA'])
})
synthesizer.fit(datasets)
/Users/vascopais/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/single_table/base.py:292: UserWarning: For this change to take effect, please refit the synthesizer using `fit`. warnings.warn(msg, UserWarning)
Then we generate the data once again (this time a smaller sample). We can see that the names in the people table are now French Canadian.
synthesizer.sample(scale=0.01)
{'people':     id                         name                                            address               city
 0        0           Emmanuelle Poirier  6157 Clark Rest\nSouth Christophershire, MT 67360       New Johnfurt
 1        1                  Maude Blais                   PSC 8737, Box 1740\nAPO AA 35924          Jacobland
 2        2     Caroline Trottier-Lemire          922 Smith Union\nPort Shawnbury, DE 51299          Holdenton
 ...    ...                          ...                                                ...                ...
 47      47            Constance Leclerc   3376 Garrett Crescent\nBenjaminborough, MO 03106     Austinborough

 [48 rows x 4 columns],
 'strokes':  id  person_id  gender  age  ...   bmi   smoking_status  stroke
 0        0          0    Male   59  ...  26.2     never smoked       0
 1        1          1    Male   36  ...  24.8  formerly smoked       0
 ...    ...        ...     ...  ...  ...   ...              ...     ...
 47      47         47   Other    7  ...  23.4          Unknown       0

 [48 rows x 13 columns]}
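Conceptually, Faker's locales amount to drawing from per-locale value pools. A minimal pure-Python sketch of that idea (the pools and helper below are invented for illustration and are much simpler than what Faker actually does):

```python
import random

random.seed(1)

# Minimal stand-in for Faker's locale mechanism: one name pool per locale code.
NAME_POOLS = {
    "en_US": ["Brent Collins", "Deborah Smith"],
    "fr_CA": ["Maude Blais", "Julien Larose", "Michèle Bérubé"],
}

def fake_name(locale):
    """Draw a name from the pool registered for the given locale."""
    return random.choice(NAME_POOLS[locale])

print(fake_name("fr_CA") in NAME_POOLS["fr_CA"])  # True
```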
Conclusion¶
In this multi-table run we got a better coverage percentage on the quality report than in the single table example. This may be because of:
- Having 2.5x the synthetic data compared to the single table example;
- HMA fitting a GaussianCopula model per table, versus the single table example using the GaussianCopula model with the FastML preset. The preset uses a normal distribution, no enforced rounding, and a FrequencyEncoder for categorical columns instead of a LabelEncoder. More about these encoders in the source code.
It's also important to note that, even though the SDV library ships its own tools to evaluate the quality of the generated data, we should also test that quality with a third-party library in order to be as unbiased as possible.
Now we know how to create multi-table synthetic data, and how quick and secure the process can be.