Multi Table Synthetic Data¶
Data analysis¶
For the multi-table example we'll be using the strokes table from the single table analysis together with a people table. Each entry in the people table has an id that matches the person_id column in the strokes table. The people table also contains PII (Personally Identifiable Information), which is important for us to mask.
Load Data¶
First, we load all CSV files from the content folder.
from sdv.datasets.local import load_csvs
try:
    datasets = load_csvs(folder_name='content/')
except ValueError as e:
    print(e)
Then, we access both tables and display the first 20 rows of the people table.
print(datasets.keys())
strokes_table = datasets['strokes']
people_table = datasets['people']
people_table.head(20)
dict_keys(['people', 'strokes'])
  | id | name | address | city |
---|---|---|---|---|
0 | 56420 | Marcelo Holmes | 49 Walt Whitman Lane | New York |
1 | 51856 | Aleena Hahn | Apple Valley, CA 92307 | Los Angeles |
2 | 41097 | Jocelyn Hancock | 10 West Church St. | Chicago |
3 | 545 | Marcel Underwood | Hastings, MN 55033 | Miami |
4 | 37759 | Jazlyn Davila | 7444 South Pine Dr. | Dallas |
5 | 66333 | Teagan Randall | Malden, MA 02148 | Houston |
6 | 70670 | Antony Graham | 666 Windfall Dr. | Philadelphia |
7 | 20292 | Evelyn Becker | Niagara Falls, NY 14304 | Atlanta |
8 | 72784 | Arturo Dillon | 334 Grove Street | Washington |
9 | 65895 | Alyssa Peters | Moncks Corner, SC 29461 | Boston |
10 | 5131 | Damarion Colon | 58 Canterbury Street | Phoenix |
11 | 72911 | Holden Mccarthy | Lake Jackson, TX 77566 | Detroit |
12 | 1307 | Colten Costa | 345 East Brandywine St. | Seattle |
13 | 23047 | Turner Mcdaniel | Halethorpe, MD 21227 | San Francisco |
14 | 32604 | Jett Knox | 9 Eagle Dr. | San Diego |
15 | 63915 | Talia Olson | Maumee, OH 43537 | Minneapolis |
16 | 25405 | Kate Hale | 59 Vale St. | Brooklyn |
17 | 3590 | Lilliana Warren | Tuckerton, NJ 08087 | Tampa |
18 | 2898 | Ramon Dillon | 95 Cross Ave. | Denver |
19 | 60675 | Ayanna Tyler | Chardon, OH 44024 | Queens |
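The claim that every person_id in strokes points at an existing row in people can be checked with a quick sketch. The tiny DataFrames below are toy stand-ins for the real tables; only the column names come from the tutorial:

```python
import pandas as pd

# Toy stand-ins for the real tables; column names follow the tutorial.
people = pd.DataFrame({"id": [56420, 51856, 41097]})
strokes = pd.DataFrame({"id": [1, 2, 3, 4],
                        "person_id": [56420, 51856, 56420, 41097]})

# Every foreign key in strokes should point at an existing person.
orphans = strokes.loc[~strokes["person_id"].isin(people["id"])]
print(len(orphans))  # 0 -> referential integrity holds
```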
Create metadata¶
We then need to create the metadata object used when building the synthesizer. SDV detects column types from the table content, but the detection may not be correct, so it's always best to review the metadata and fix whatever needs to be fixed.
from sdv.metadata import MultiTableMetadata
metadata = MultiTableMetadata()
metadata.detect_table_from_dataframe(
table_name='strokes',
data=strokes_table
)
metadata.detect_table_from_dataframe(
table_name='people',
data=people_table
)
print('Auto detected data:\n')
metadata
Auto detected data:
{ "tables": { "strokes": { "columns": { "id": { "sdtype": "numerical" }, "person_id": { "sdtype": "numerical" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical" }, "hypertension": { "sdtype": "numerical" }, "heart_disease": { "sdtype": "numerical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "numerical" } } }, "people": { "columns": { "id": { "sdtype": "numerical" }, "name": { "sdtype": "categorical" }, "address": { "sdtype": "categorical" }, "city": { "sdtype": "categorical" } } } }, "relationships": [], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
Edit Metadata¶
Strokes Table¶
Below, we make the following changes:
- Change column id to type id;
- Change column age to a numerical type of integers;
- Change column bmi to a numerical type of floats;
- Change column stroke to a categorical column since the information is either 0 or 1;
- Change hypertension and heart_disease columns to categorical as well for the same reason.
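The reasoning behind the last two bullets can be automated: a numeric column whose values are only 0 and 1 is usually better modeled as categorical. A small sketch of that check on toy data (the helper is invented for illustration, not part of SDV):

```python
import pandas as pd

# Toy columns mimicking the 0/1 flags (hypertension, heart_disease, stroke).
df = pd.DataFrame({"stroke": [0, 0, 1, 0, 1],
                   "age": [41, 22, 42, 71, 2]})

def looks_binary(series):
    """A numeric column whose values are only 0/1 is a candidate for categorical."""
    return set(series.dropna().unique()) <= {0, 1}

binary_cols = [c for c in df.columns if looks_binary(df[c])]
print(binary_cols)  # ['stroke']
```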
metadata.update_column(
table_name='strokes',
column_name='id',
sdtype='id'
)
metadata.update_column(
table_name='strokes',
column_name='person_id',
sdtype='id'
)
metadata.update_column(
table_name='strokes',
column_name='age',
sdtype='numerical',
computer_representation="Int64"
)
metadata.update_column(
table_name='strokes',
column_name='bmi',
sdtype='numerical',
computer_representation="Float"
)
metadata.update_column(
table_name='strokes',
column_name='stroke',
sdtype='categorical',
)
metadata.update_column(
table_name='strokes',
column_name='hypertension',
sdtype='categorical',
)
metadata.update_column(
table_name='strokes',
column_name='heart_disease',
sdtype='categorical',
)
print(metadata)
{ "tables": { "strokes": { "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "id" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }, "people": { "columns": { "id": { "sdtype": "numerical" }, "name": { "sdtype": "categorical" }, "address": { "sdtype": "categorical" }, "city": { "sdtype": "categorical" } } } }, "relationships": [], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
People Table¶
Below, we make a few changes to the people table:
- Change column id to type id;
- Change column name to type name and mark as PII;
- Change column address to type address and mark as PII;
- Change column city to type city and mark as PII.
We need to mark these columns as pii because they contain personally identifiable information.
metadata.update_column(
table_name='people',
column_name='id',
sdtype='id'
)
metadata.update_column(
table_name='people',
column_name='name',
sdtype='name',
pii=True
)
metadata.update_column(
table_name='people',
column_name='address',
sdtype='address',
pii=True
)
metadata.update_column(
table_name='people',
column_name='city',
sdtype='city',
pii=True
)
print(metadata)
{ "tables": { "strokes": { "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "id" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }, "people": { "columns": { "id": { "sdtype": "id" }, "name": { "sdtype": "name", "pii": true }, "address": { "sdtype": "address", "pii": true }, "city": { "sdtype": "city", "pii": true } } } }, "relationships": [], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
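Conceptually, marking a column as PII tells the synthesizer to replace its values with fake ones instead of learning from them. A minimal pure-Python sketch of that idea (the name pool and helper below are invented for illustration; SDV delegates the real work to the Faker library):

```python
import random

random.seed(0)

# Tiny pool of fake names; SDV uses Faker instead of a hand-made pool.
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe"]

def mask_pii(rows, column):
    """Replace a PII column with random fake values, keeping other columns."""
    return [{**row, column: random.choice(FAKE_NAMES)} for row in rows]

people = [{"id": 56420, "name": "Marcelo Holmes"},
          {"id": 51856, "name": "Aleena Hahn"}]
masked = mask_pii(people, "name")
print(all(row["name"] in FAKE_NAMES for row in masked))  # True
```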
Then we need to connect the two tables: we set a primary key on each table and add a parent-child relationship between them.
metadata.set_primary_key(
table_name='strokes',
column_name='id'
)
metadata.set_primary_key(
table_name='people',
column_name='id'
)
metadata.add_relationship(
parent_table_name='people',
child_table_name='strokes',
parent_primary_key='id',
child_foreign_key='person_id'
)
print(metadata)
{ "tables": { "strokes": { "primary_key": "id", "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "id" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }, "people": { "primary_key": "id", "columns": { "id": { "sdtype": "id" }, "name": { "sdtype": "name", "pii": true }, "address": { "sdtype": "address", "pii": true }, "city": { "sdtype": "city", "pii": true } } } }, "relationships": [ { "parent_table_name": "people", "child_table_name": "strokes", "parent_primary_key": "id", "child_foreign_key": "person_id" } ], "METADATA_SPEC_VERSION": "MULTI_TABLE_V1" }
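Before calling set_primary_key it can be useful to confirm the column actually qualifies as a primary key, i.e. it is unique and has no missing values. A quick sketch with toy data (only the column name comes from the tutorial):

```python
import pandas as pd

# Toy stand-in for the people table.
people = pd.DataFrame({"id": [56420, 51856, 41097]})

# A primary key must be non-null and unique.
is_valid_pk = people["id"].notna().all() and people["id"].is_unique
print(is_valid_pk)  # True
```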
Create Synthesizer¶
Having created the metadata object, we then create the synthesizer, which will be trained to generate the synthetic data. Here we use the only multi-table synthesizer available outside the enterprise edition: HMA. Note that you can configure which synthesizer each table uses.
from sdv.multi_table import HMASynthesizer
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(datasets)
We cannot choose the number of rows when sampling multiple tables, but we can set a scale:
- <1: shrink the data. For example, 0.9 creates synthetic data that is roughly 90% of the size of the original data;
- =1: don't scale the data. The model creates synthetic data roughly the same size as the original data;
- >1: grow the data by the given factor. For example, 2.5 creates synthetic data roughly 2.5x the size of the original data.
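The effect of scale on table sizes is simple arithmetic; the row count below is a made-up example size, not the dataset's actual row count:

```python
# Rough expected synthetic table sizes for different scale values.
# n_real = 4800 is a hypothetical example size for illustration.
n_real = 4800
for scale in (0.9, 1.0, 2.5):
    print(scale, round(n_real * scale))
# 0.9 4320
# 1.0 4800
# 2.5 12000
```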
synthetic_data = synthesizer.sample(
scale=2.5
)
synthetic_data
{'people':        id                name                                            address               city
 0          0       Brent Collins  6157 Clark Rest\nSouth Christophershire, MT 67360       New Johnfurt
 1          1         Carlos Mata                   PSC 8737, Box 1740\nAPO AA 35924          Jacobland
 ...      ...                 ...                                                ...                ...
 11999  11999          Todd Price      13150 Pamela Walk\nNorth Joshuaside, ND 17432    East Dianeville

 [12000 rows x 4 columns],
 'strokes':     id  person_id  gender  age  ...   bmi   smoking_status  stroke
 0           0          0    Male   41  ...  27.1     never smoked       0
 1           1          1  Female   22  ...  22.5          Unknown       0
 ...       ...        ...     ...  ...  ...   ...              ...     ...
 11999   11999      11999    Male   59  ...  33.2  formerly smoked       0

 [12000 rows x 13 columns]}
If you search for these values in the original data you won't find them, since we marked name, address and city as PII and they were replaced with fake values.
Evaluation¶
from sdv.evaluation.multi_table import evaluate_quality
quality_report = evaluate_quality(
real_data=datasets,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|██████████| 5/5 [00:00<00:00, 15.89it/s]
Overall Quality Score: 91.12%

Properties:
Column Shapes: 90.35%
Column Pair Trends: 83.01%
Parent Child Relationships: 100.0%
from sdv.evaluation.multi_table import run_diagnostic
diagnostic_report = run_diagnostic(
real_data=datasets,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|██████████| 4/4 [00:28<00:00, 7.15s/it]
DiagnosticResults:

SUCCESS:
✓ The synthetic data covers over 90% of the numerical ranges present in the real data
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
diagnostic_report.get_results()
{'SUCCESS': ['The synthetic data covers over 90% of the numerical ranges present in the real data', 'The synthetic data covers over 90% of the categories present in the real data', 'Over 90% of the synthetic rows are not copies of the real data', 'The synthetic data follows over 90% of the min/max boundaries set by the real data'], 'WARNING': [], 'DANGER': []}
diagnostic_report.get_properties()
{'Coverage': 0.9733928657693466, 'Synthesis': 1.0, 'Boundaries': 1.0}
PII Locales¶
It's also possible to generate locale-specific data, i.e. data in a given language for a given country. For example, here is how to generate only French Canadian names.
We create an AnonymizedFaker, which takes a provider name and function name from the Python Faker library plus a locales array of possible language_country codes. We then refit the synthesizer so the change takes effect.
from rdt.transformers.pii import AnonymizedFaker
synthesizer.update_transformers(table_name="people", column_name_to_transformer={
'name': AnonymizedFaker(provider_name='person', function_name='name', locales=['fr_CA'])
})
synthesizer.fit(datasets)
/Users/vascopais/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/single_table/base.py:292: UserWarning: For this change to take effect, please refit the synthesizer using `fit`. warnings.warn(msg, UserWarning)
Then we generate the data once again (this time a smaller sample). We can see that the names in the people table are now French Canadian.
synthesizer.sample(scale=0.01)
{'people':     id                         name                                            address               city
 0        0           Emmanuelle Poirier  6157 Clark Rest\nSouth Christophershire, MT 67360       New Johnfurt
 1        1                  Maude Blais                   PSC 8737, Box 1740\nAPO AA 35924          Jacobland
 2        2     Caroline Trottier-Lemire          922 Smith Union\nPort Shawnbury, DE 51299          Holdenton
 ...    ...                          ...                                                ...                ...
 47      47            Constance Leclerc   3376 Garrett Crescent\nBenjaminborough, MO 03106     Austinborough

 [48 rows x 4 columns],
 'strokes':  id  person_id  gender  age  ...   bmi   smoking_status  stroke
 0        0          0    Male   59  ...  26.2     never smoked       0
 1        1          1    Male   36  ...  24.8  formerly smoked       0
 ...    ...        ...     ...  ...  ...   ...              ...     ...
 47      47         47   Other    7  ...  23.4          Unknown       0

 [48 rows x 13 columns]}
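Conceptually, Faker's locales amount to drawing from per-locale value pools. A minimal pure-Python sketch of that idea (the pools and helper below are invented for illustration and are much simpler than what Faker actually does):

```python
import random

random.seed(1)

# Minimal stand-in for Faker's locale mechanism: one name pool per locale code.
NAME_POOLS = {
    "en_US": ["Brent Collins", "Deborah Smith"],
    "fr_CA": ["Maude Blais", "Julien Larose", "Michèle Bérubé"],
}

def fake_name(locale):
    """Draw a name from the pool registered for the given locale."""
    return random.choice(NAME_POOLS[locale])

print(fake_name("fr_CA") in NAME_POOLS["fr_CA"])  # True
```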
Conclusion¶
In this multi-table run we got a better coverage percentage on the quality report than in the single table example. This may be because of:
- Having 2.5x the synthetic data compared to the single table example;
- HMA fitting a GaussianCopula model per table, versus the single table example using the GaussianCopula model with the FastML preset. The preset uses a normal distribution, no enforced rounding, and a FrequencyEncoder for categorical columns instead of a LabelEncoder. More about these encoders in the source code.
It's also important to note that, even though the SDV library ships its own tools to evaluate the quality of the generated data, we should also test that quality with a third-party library in order to be as unbiased as possible.
Now we know how to create multi-table synthetic data, and how quick and secure the process can be.