Multi Table Synthetic Data¶
Data analysis¶
For the multi-table example we'll use the strokes table from the single table analysis together with a people table. Each entry in the people table has an id that matches the person_id column of the strokes table. The people table also contains PII (Personally Identifiable Information), which is important for us to mask.
Load Data¶
First, we go to the content folder and load all the CSV files there.
from sdv.datasets.local import load_csvs
try:
datasets = load_csvs(folder_name='content/')
except ValueError as e:
print(e)
Then, we access both tables and display the first 20 rows of the people table.
print(datasets.keys())
strokes_table = datasets['strokes']
people_table = datasets['people']
people_table.head(20)
dict_keys(['people', 'strokes'])
|   | id | name | address | city |
|---|---|---|---|---|
| 0 | 56420 | Marcelo Holmes | 49 Walt Whitman Lane | New York |
| 1 | 51856 | Aleena Hahn | Apple Valley, CA 92307 | Los Angeles |
| 2 | 41097 | Jocelyn Hancock | 10 West Church St. | Chicago |
| 3 | 545 | Marcel Underwood | Hastings, MN 55033 | Miami |
| 4 | 37759 | Jazlyn Davila | 7444 South Pine Dr. | Dallas |
| 5 | 66333 | Teagan Randall | Malden, MA 02148 | Houston |
| 6 | 70670 | Antony Graham | 666 Windfall Dr. | Philadelphia |
| 7 | 20292 | Evelyn Becker | Niagara Falls, NY 14304 | Atlanta |
| 8 | 72784 | Arturo Dillon | 334 Grove Street | Washington |
| 9 | 65895 | Alyssa Peters | Moncks Corner, SC 29461 | Boston |
| 10 | 5131 | Damarion Colon | 58 Canterbury Street | Phoenix |
| 11 | 72911 | Holden Mccarthy | Lake Jackson, TX 77566 | Detroit |
| 12 | 1307 | Colten Costa | 345 East Brandywine St. | Seattle |
| 13 | 23047 | Turner Mcdaniel | Halethorpe, MD 21227 | San Francisco |
| 14 | 32604 | Jett Knox | 9 Eagle Dr. | San Diego |
| 15 | 63915 | Talia Olson | Maumee, OH 43537 | Minneapolis |
| 16 | 25405 | Kate Hale | 59 Vale St. | Brooklyn |
| 17 | 3590 | Lilliana Warren | Tuckerton, NJ 08087 | Tampa |
| 18 | 2898 | Ramon Dillon | 95 Cross Ave. | Denver |
| 19 | 60675 | Ayanna Tyler | Chardon, OH 44024 | Queens |
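Before building metadata, it's worth confirming that the link between the tables actually holds, i.e. that every person_id in strokes points at an existing people.id. A minimal sketch, using hypothetical miniature versions of the two tables:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables, for illustration only.
people = pd.DataFrame({'id': [1, 2, 3]})
strokes = pd.DataFrame({'id': [10, 11, 12], 'person_id': [1, 2, 3]})

# Every person_id in strokes should reference an existing people.id,
# otherwise the relationship we add later would be invalid.
orphans = set(strokes['person_id']) - set(people['id'])
print(orphans)  # an empty set means the foreign keys are consistent
```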
Create metadata¶
We then need to create the metadata object used when creating the synthesizer. SDV detects some information from the table contents, but it may not be correct. It's always best to review the metadata and fix whatever needs fixing.
from sdv.metadata import MultiTableMetadata
metadata = MultiTableMetadata()
metadata.detect_table_from_dataframe(
table_name='strokes',
data=strokes_table
)
metadata.detect_table_from_dataframe(
table_name='people',
data=people_table
)
print('Auto detected data:\n')
metadata
Auto detected data:
{
"tables": {
"strokes": {
"columns": {
"id": {
"sdtype": "numerical"
},
"person_id": {
"sdtype": "numerical"
},
"gender": {
"sdtype": "categorical"
},
"age": {
"sdtype": "numerical"
},
"hypertension": {
"sdtype": "numerical"
},
"heart_disease": {
"sdtype": "numerical"
},
"ever_married": {
"sdtype": "categorical"
},
"work_type": {
"sdtype": "categorical"
},
"Residence_type": {
"sdtype": "categorical"
},
"avg_glucose_level": {
"sdtype": "numerical"
},
"bmi": {
"sdtype": "numerical"
},
"smoking_status": {
"sdtype": "categorical"
},
"stroke": {
"sdtype": "numerical"
}
}
},
"people": {
"columns": {
"id": {
"sdtype": "numerical"
},
"name": {
"sdtype": "categorical"
},
"address": {
"sdtype": "categorical"
},
"city": {
"sdtype": "categorical"
}
}
}
},
"relationships": [],
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
Edit Metadata¶
Strokes Table¶
Below, we make a few changes:
- Change the id and person_id columns to type id;
- Change column age to numerical with an integer representation;
- Change column bmi to numerical with a float representation;
- Change column stroke to categorical, since its values are only 0 or 1;
- Change the hypertension and heart_disease columns to categorical for the same reason.
metadata.update_column(
table_name='strokes',
column_name='id',
sdtype='id'
)
metadata.update_column(
table_name='strokes',
column_name='person_id',
sdtype='id'
)
metadata.update_column(
table_name='strokes',
column_name='age',
sdtype='numerical',
computer_representation="Int64"
)
metadata.update_column(
table_name='strokes',
column_name='bmi',
sdtype='numerical',
computer_representation="Float"
)
metadata.update_column(
table_name='strokes',
column_name='stroke',
sdtype='categorical',
)
metadata.update_column(
table_name='strokes',
column_name='hypertension',
sdtype='categorical',
)
metadata.update_column(
table_name='strokes',
column_name='heart_disease',
sdtype='categorical',
)
print(metadata)
{
"tables": {
"strokes": {
"columns": {
"id": {
"sdtype": "id"
},
"person_id": {
"sdtype": "id"
},
"gender": {
"sdtype": "categorical"
},
"age": {
"sdtype": "numerical",
"computer_representation": "Int64"
},
"hypertension": {
"sdtype": "categorical"
},
"heart_disease": {
"sdtype": "categorical"
},
"ever_married": {
"sdtype": "categorical"
},
"work_type": {
"sdtype": "categorical"
},
"Residence_type": {
"sdtype": "categorical"
},
"avg_glucose_level": {
"sdtype": "numerical"
},
"bmi": {
"sdtype": "numerical",
"computer_representation": "Float"
},
"smoking_status": {
"sdtype": "categorical"
},
"stroke": {
"sdtype": "categorical"
}
}
},
"people": {
"columns": {
"id": {
"sdtype": "numerical"
},
"name": {
"sdtype": "categorical"
},
"address": {
"sdtype": "categorical"
},
"city": {
"sdtype": "categorical"
}
}
}
},
"relationships": [],
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
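A quick way to spot 0/1 flag columns that auto-detection labelled numerical (like stroke, hypertension, and heart_disease above) is to count distinct values. A minimal sketch over a hypothetical slice of the strokes table:

```python
import pandas as pd

def suggest_categorical(df, max_uniques=2):
    """Flag numeric columns that look categorical because they hold
    very few distinct values (e.g. 0/1 flags)."""
    return [c for c in df.select_dtypes('number').columns
            if df[c].nunique() <= max_uniques]

# Hypothetical slice of the strokes table, for illustration only.
sample = pd.DataFrame({'age': [41, 22, 42], 'stroke': [0, 0, 1]})
print(suggest_categorical(sample))  # ['stroke']
```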
People Table¶
Below, we make a few changes to the people table:
- Change column id to type id;
- Change column name to type name and mark it as PII;
- Change column address to type address and mark it as PII;
- Change column city to type city and mark it as PII.
We set pii on these columns because they contain personally identifiable information.
metadata.update_column(
table_name='people',
column_name='id',
sdtype='id'
)
metadata.update_column(
table_name='people',
column_name='name',
sdtype='name',
pii=True
)
metadata.update_column(
table_name='people',
column_name='address',
sdtype='address',
pii=True
)
metadata.update_column(
table_name='people',
column_name='city',
sdtype='city',
pii=True
)
print(metadata)
{
"tables": {
"strokes": {
"columns": {
"id": {
"sdtype": "id"
},
"person_id": {
"sdtype": "id"
},
"gender": {
"sdtype": "categorical"
},
"age": {
"sdtype": "numerical",
"computer_representation": "Int64"
},
"hypertension": {
"sdtype": "categorical"
},
"heart_disease": {
"sdtype": "categorical"
},
"ever_married": {
"sdtype": "categorical"
},
"work_type": {
"sdtype": "categorical"
},
"Residence_type": {
"sdtype": "categorical"
},
"avg_glucose_level": {
"sdtype": "numerical"
},
"bmi": {
"sdtype": "numerical",
"computer_representation": "Float"
},
"smoking_status": {
"sdtype": "categorical"
},
"stroke": {
"sdtype": "categorical"
}
}
},
"people": {
"columns": {
"id": {
"sdtype": "id"
},
"name": {
"sdtype": "name",
"pii": true
},
"address": {
"sdtype": "address",
"pii": true
},
"city": {
"sdtype": "city",
"pii": true
}
}
}
},
"relationships": [],
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
Then we need to connect the two tables by setting their primary keys and adding the relationship between them.
metadata.set_primary_key(
table_name='strokes',
column_name='id'
)
metadata.set_primary_key(
table_name='people',
column_name='id'
)
metadata.add_relationship(
parent_table_name='people',
child_table_name='strokes',
parent_primary_key='id',
child_foreign_key='person_id'
)
print(metadata)
{
"tables": {
"strokes": {
"primary_key": "id",
"columns": {
"id": {
"sdtype": "id"
},
"person_id": {
"sdtype": "id"
},
"gender": {
"sdtype": "categorical"
},
"age": {
"sdtype": "numerical",
"computer_representation": "Int64"
},
"hypertension": {
"sdtype": "categorical"
},
"heart_disease": {
"sdtype": "categorical"
},
"ever_married": {
"sdtype": "categorical"
},
"work_type": {
"sdtype": "categorical"
},
"Residence_type": {
"sdtype": "categorical"
},
"avg_glucose_level": {
"sdtype": "numerical"
},
"bmi": {
"sdtype": "numerical",
"computer_representation": "Float"
},
"smoking_status": {
"sdtype": "categorical"
},
"stroke": {
"sdtype": "categorical"
}
}
},
"people": {
"primary_key": "id",
"columns": {
"id": {
"sdtype": "id"
},
"name": {
"sdtype": "name",
"pii": true
},
"address": {
"sdtype": "address",
"pii": true
},
"city": {
"sdtype": "city",
"pii": true
}
}
}
},
"relationships": [
{
"parent_table_name": "people",
"child_table_name": "strokes",
"parent_primary_key": "id",
"child_foreign_key": "person_id"
}
],
"METADATA_SPEC_VERSION": "MULTI_TABLE_V1"
}
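The relationship only makes sense if its keys actually exist in the referenced tables. A minimal sketch of that consistency check, using a plain dict shaped like the metadata printed above (this is an illustration, not SDV's internal validation):

```python
# A dict shaped like the metadata above, trimmed down for illustration.
meta = {
    'tables': {
        'people': {'primary_key': 'id', 'columns': {'id': {}, 'name': {}}},
        'strokes': {'columns': {'id': {}, 'person_id': {}}},
    },
    'relationships': [{
        'parent_table_name': 'people',
        'child_table_name': 'strokes',
        'parent_primary_key': 'id',
        'child_foreign_key': 'person_id',
    }],
}

def relationships_valid(meta):
    """Check every relationship references existing tables and columns."""
    for rel in meta['relationships']:
        parent = meta['tables'][rel['parent_table_name']]
        child = meta['tables'][rel['child_table_name']]
        if rel['parent_primary_key'] not in parent['columns']:
            return False
        if rel['child_foreign_key'] not in child['columns']:
            return False
    return True

print(relationships_valid(meta))  # True
```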
Create Synthesizer¶
Having created the metadata object, we then create the synthesizer, which will be trained to generate the synthetic data. Here we use HMA, the only multi-table synthesizer available outside the enterprise edition. Note that you can configure which single-table synthesizer each table uses.
from sdv.multi_table import HMASynthesizer
synthesizer = HMASynthesizer(metadata)
synthesizer.fit(datasets)
We cannot define the number of rows when using multi-table synthesis, but we can define a scale:
- <1: shrink the data by the specified proportion. For example, 0.9 will create synthetic data that is roughly 90% of the size of the original data;
- =1: don't scale the data. The model will create synthetic data that is roughly the same size as the original data;
- >1: scale the data by the specified factor. For example, 2.5 will create synthetic data that is roughly 2.5x the size of the original data.
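Under these rules, the expected synthetic row counts are simple to sketch. The 12,000 synthetic rows produced at scale=2.5 suggest a real table of roughly 4,800 rows (an inference, used here only for illustration):

```python
# Hypothetical real table size, inferred from the scale=2.5 output below.
real_rows = 4800

# Approximate synthetic row count for each scale value.
for scale in (0.9, 1.0, 2.5):
    print(scale, round(real_rows * scale))  # 4320, 4800, 12000
```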
synthetic_data = synthesizer.sample(
scale=2.5
)
synthetic_data
{'people': id name
0 0 Brent Collins \
1 1 Carlos Mata
2 2 Mrs. Crystal Blair
3 3 Nathaniel Murphy
4 4 Shannon Mitchell
... ... ...
11995 11995 Misty Dominguez
11996 11996 Roberto Brown
11997 11997 Deborah Smith
11998 11998 William Mason
11999 11999 Todd Price
address city
0 6157 Clark Rest\nSouth Christophershire, MT 67360 New Johnfurt
1 PSC 8737, Box 1740\nAPO AA 35924 Jacobland
2 922 Smith Union\nPort Shawnbury, DE 51299 Holdenton
3 3937 William Mount\nCohenborough, KY 49299 West Mariah
4 09073 Manning Vista\nGravesport, MT 89179 South Thomashaven
... ... ...
11995 7945 Wright Ford\nMartinbury, AK 83553 New Steven
11996 335 Williams Mills Suite 707\nSarahland, GA 05413 Garciashire
11997 971 Timothy Coves\nPort Daniel, UT 86478 West Calvinchester
11998 499 Erica Drive\nLake Carolyn, PA 58154 Morganmouth
11999 13150 Pamela Walk\nNorth Joshuaside, ND 17432 East Dianeville
[12000 rows x 4 columns],
'strokes': id person_id gender age hypertension heart_disease
0 0 0 Male 41 0 0 \
1 1 1 Female 22 0 0
2 2 2 Male 42 1 1
3 3 3 Male 71 0 0
4 4 4 Female 2 0 0
... ... ... ... ... ... ...
11995 11995 11995 Other 52 0 0
11996 11996 11996 Male 26 0 0
11997 11997 11997 Female 40 0 0
11998 11998 11998 Female 79 0 0
11999 11999 11999 Male 59 0 0
ever_married work_type Residence_type avg_glucose_level bmi
0 No children Rural 68.03 27.1 \
1 Yes Never_worked Urban 189.96 22.5
2 No Govt_job Rural 90.20 36.4
3 Yes Never_worked Rural 175.26 31.3
4 No Never_worked Rural 64.72 16.2
... ... ... ... ... ...
11995 No Never_worked Urban 62.05 29.4
11996 Yes Govt_job Urban 122.06 17.0
11997 Yes Private Urban 187.24 39.3
11998 Yes Self-employed Rural 147.39 27.7
11999 Yes Govt_job Urban 71.11 33.2
smoking_status stroke
0 never smoked 0
1 Unknown 0
2 formerly smoked 1
3 Unknown 0
4 Unknown 0
... ... ...
11995 Unknown 0
11996 Unknown 0
11997 formerly smoked 0
11998 never smoked 0
11999 formerly smoked 0
[12000 rows x 13 columns]}
If you search for this information in the original data you can see that it's not there, since we marked name, address and city as PII.
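That check can be automated with a simple set intersection. A minimal sketch, using hypothetical name columns in place of the full tables:

```python
import pandas as pd

# Hypothetical real and synthetic name columns, for illustration only.
real_names = pd.Series(['Marcelo Holmes', 'Aleena Hahn'])
synthetic_names = pd.Series(['Brent Collins', 'Carlos Mata'])

# Any overlap would mean a real identity leaked into the synthetic output.
leaked = set(real_names) & set(synthetic_names)
print(len(leaked))  # 0
```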
Evaluation¶
from sdv.evaluation.multi_table import evaluate_quality
quality_report = evaluate_quality(
real_data=datasets,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|██████████| 5/5 [00:00<00:00, 15.89it/s]
Overall Quality Score: 91.12% Properties: Column Shapes: 90.35% Column Pair Trends: 83.01% Parent Child Relationships: 100.0%
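The overall score appears to be the plain mean of the three property scores (an assumption on our part, but it is consistent with the numbers reported above):

```python
# Property scores from the quality report above.
properties = {
    'Column Shapes': 90.35,
    'Column Pair Trends': 83.01,
    'Parent Child Relationships': 100.0,
}

# The unweighted mean reproduces the reported overall score.
overall = sum(properties.values()) / len(properties)
print(round(overall, 2))  # 91.12
```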
from sdv.evaluation.multi_table import run_diagnostic
diagnostic_report = run_diagnostic(
real_data=datasets,
synthetic_data=synthetic_data,
metadata=metadata)
Creating report: 100%|██████████| 4/4 [00:28<00:00, 7.15s/it]
DiagnosticResults: SUCCESS: ✓ The synthetic data covers over 90% of the numerical ranges present in the real data ✓ The synthetic data covers over 90% of the categories present in the real data ✓ Over 90% of the synthetic rows are not copies of the real data ✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
diagnostic_report.get_results()
{'SUCCESS': ['The synthetic data covers over 90% of the numerical ranges present in the real data',
'The synthetic data covers over 90% of the categories present in the real data',
'Over 90% of the synthetic rows are not copies of the real data',
'The synthetic data follows over 90% of the min/max boundaries set by the real data'],
'WARNING': [],
'DANGER': []}
diagnostic_report.get_properties()
{'Coverage': 0.9733928657693466, 'Synthesis': 1.0, 'Boundaries': 1.0}
PII Locales¶
It's also possible to generate locale-specific data, i.e. data in a certain language for a certain country. For example, if we want only French Canadian names, here's how to proceed.
We create an AnonymizedFaker, which receives the provider and function names from the Python Faker library, plus a locales array of possible language/country codes. We then retrain the synthesizer so our changes are learned.
from rdt.transformers.pii import AnonymizedFaker
synthesizer.update_transformers(table_name="people", column_name_to_transformer={
'name': AnonymizedFaker(provider_name='person', function_name='name', locales=['fr_CA'])
})
synthesizer.fit(datasets)
/Users/vascopais/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/single_table/base.py:292: UserWarning: For this change to take effect, please refit the synthesizer using `fit`. warnings.warn(msg, UserWarning)
Then we generate the data once again (this time a smaller sample), and we can see that the names in the people table are now French Canadian.
synthesizer.sample(scale=0.01)
{'people': id name
0 0 Emmanuelle Poirier \
1 1 Maude Blais
2 2 Caroline Trottier-Lemire
3 3 Édouard Soucy
4 4 Alexis-Emmanuel Séguin
5 5 Pénélope Gervais
6 6 Dorothée Marcoux-Larivière
7 7 Julien Larose
8 8 Michèle Couture
9 9 Susanne St-Onge
10 10 Jeannine-Manon Daigle
11 11 Martin Gingras-Provost
12 12 Emmanuel Dionne-Rodrigue
13 13 Michèle Bérubé
14 14 Josette Bérubé
15 15 Juliette-Sylvie Bernier
16 16 Pauline Nadeau
17 17 Marcel Létourneau
18 18 Jacques Fortier
19 19 Louis Lepage
20 20 Céline Arsenault
21 21 Xavier-Jean Larose
22 22 William Bédard-Germain
23 23 Maude Lepage
24 24 Mathieu Chabot
25 25 Thomas Champagne-Morissette
26 26 Timothée Robitaille
27 27 Robert Arsenault
28 28 Henriette Lévesque
29 29 Rémy Morin
30 30 Nathan Blouin
31 31 Alex Bernard-Rousseau
32 32 Henri Bisson
33 33 Maxime Savard
34 34 Roger Robitaille
35 35 Yves Béland-Paradis
36 36 Tristan Roberge-Dumont
37 37 Noël Dufour
38 38 Aurore Lebel
39 39 Bertrand Lemire
40 40 Thomas Lemieux-Fournier
41 41 Bertrand Gilbert
42 42 Nicolas Lussier
43 43 Alexandre Lacroix
44 44 Thomas Gauthier
45 45 Olivia Beauchamp
46 46 Jules Boucher
47 47 Constance Leclerc
address city
0 6157 Clark Rest\nSouth Christophershire, MT 67360 New Johnfurt
1 PSC 8737, Box 1740\nAPO AA 35924 Jacobland
2 922 Smith Union\nPort Shawnbury, DE 51299 Holdenton
3 3937 William Mount\nCohenborough, KY 49299 West Mariah
4 09073 Manning Vista\nGravesport, MT 89179 South Thomashaven
5 2619 White Fields Apt. 532\nSouth Amandaport, ... New Taraville
6 23276 Billy Plains\nMartinezberg, LA 87469 Michaelshire
7 0704 Smith Walks Apt. 228\nMillerburgh, WY 89418 Danielburgh
8 521 Sara Street\nHectorside, DC 24587 Robertsland
9 174 Velasquez Court\nEast Jenniferport, ID 40915 Marvinchester
10 153 Yu Island\nPalmermouth, MS 89535 South Kimberly
11 48142 Timothy Summit\nShannonview, MI 72357 Gutierrezshire
12 6397 Melissa Circle\nPort Benjamin, DE 42448 New Jack
13 55144 Brooks Walk\nSouth Andrew, OK 36891 Nelsonton
14 9230 Garza Parks\nPort Nataliefurt, TX 81751 Martinton
15 61788 Freeman Mill\nMillerland, WI 27888 Lake Jeffrey
16 409 Hodges Street\nParrishborough, ME 97469 Port Cody
17 89649 Joseph Path\nRickyland, NC 57649 West Andrea
18 4685 Hawkins Haven Suite 592\nRiveraberg, KS 0... Port Tanya
19 83913 Patricia Gardens\nEricksonville, IN 08631 Andersonshire
20 2086 Mahoney Unions\nEricaburgh, WV 53247 Port Katie
21 26457 Wendy Wells Apt. 903\nNorth Megan, OH 70806 North Thomas
22 33704 Tiffany Tunnel\nLake Anthonyfort, TX 93085 East Sarah
23 714 Wendy Estate\nJoseborough, KY 03379 Jameston
24 Unit 5426 Box 7165\nDPO AP 37872 West Jacquelinefurt
25 9589 Elizabeth Springs\nKevinville, MI 72107 East Jason
26 336 Franklin Crossroad\nMelissaberg, KY 57917 Lake Melissa
27 5328 Lauren Valley\nNew Ryanport, TN 14377 South Karenville
28 106 Carney Views Suite 400\nGarciamouth, CO 56012 Smithview
29 599 Jason Ford Suite 531\nEast Tammy, OK 38545 Hamiltonshire
30 2642 Graham Plains Apt. 732\nMullenchester, MN... West Lindsey
31 3418 Byrd Loop Suite 726\nWardhaven, MO 11008 New Kathleenborough
32 9012 Laura Viaduct Apt. 726\nGrahamland, MI 86340 North Glendafurt
33 2656 Mcmillan Wall\nEast Douglasborough, TX 75230 Brandonside
34 1191 David Rest\nEast Lauren, MO 75586 Brownmouth
35 0479 Jensen Alley\nChristophermouth, MO 57533 East Matthewbury
36 USNS Ross\nFPO AE 38922 East Jennifer
37 533 Wang Junction Suite 546\nLake Michaelshire... East Richard
38 115 Webb Springs Suite 300\nAngelaville, FL 58789 Port Heatherfort
39 PSC 5539, Box 0143\nAPO AA 19955 Melissashire
40 69349 Lisa Mountains Apt. 851\nRobinsonfurt, C... Wangport
41 76314 Hernandez Lock\nHolmestown, MO 88896 South Jilltown
42 390 John Orchard Apt. 594\nRobertside, MO 87023 South Robert
43 Unit 8573 Box 9313\nDPO AE 22469 Schultzbury
44 144 Richard Fields Apt. 599\nWeissside, MO 19466 Christopherview
45 227 Doyle Islands\nTimothychester, NM 14247 Calebview
46 397 Thompson Springs Apt. 535\nDestinymouth, N... South Mitchellberg
47 3376 Garrett Crescent\nBenjaminborough, MO 03106 Austinborough ,
'strokes': id person_id gender age hypertension heart_disease ever_married
0 0 0 Male 59 0 0 Yes \
1 1 1 Male 36 0 0 Yes
2 2 2 Male 33 0 0 No
3 3 3 Female 39 0 0 Yes
4 4 4 Male 65 0 0 Yes
5 5 5 Female 72 0 0 Yes
6 6 6 Male 65 0 0 Yes
7 7 7 Male 49 0 1 No
8 8 8 Female 75 0 0 Yes
9 9 9 Female 37 0 0 Yes
10 10 10 Male 36 0 0 No
11 11 11 Male 58 0 0 Yes
12 12 12 Female 60 0 0 No
13 13 13 Male 48 0 0 No
14 14 14 Male 78 1 1 Yes
15 15 15 Male 63 0 0 Yes
16 16 16 Male 11 1 0 No
17 17 17 Female 13 0 0 No
18 18 18 Female 3 0 0 No
19 19 19 Female 35 0 0 Yes
20 20 20 Female 18 0 0 Yes
21 21 21 Female 47 0 0 Yes
22 22 22 Male 33 1 0 No
23 23 23 Male 17 0 1 No
24 24 24 Female 52 0 0 No
25 25 25 Male 72 0 0 Yes
26 26 26 Male 35 0 0 Yes
27 27 27 Male 61 0 0 No
28 28 28 Female 71 0 0 Yes
29 29 29 Female 76 0 0 Yes
30 30 30 Other 2 0 0 No
31 31 31 Male 30 1 1 No
32 32 32 Female 29 0 0 No
33 33 33 Female 81 1 0 Yes
34 34 34 Male 10 0 1 No
35 35 35 Male 68 0 0 No
36 36 36 Female 21 0 0 No
37 37 37 Female 44 0 0 Yes
38 38 38 Female 64 0 0 No
39 39 39 Female 37 0 0 Yes
40 40 40 Other 63 0 0 No
41 41 41 Female 35 0 0 Yes
42 42 42 Male 11 0 0 No
43 43 43 Male 27 0 0 Yes
44 44 44 Female 82 0 0 Yes
45 45 45 Female 70 0 0 No
46 46 46 Female 81 0 0 Yes
47 47 47 Other 7 0 0 No
work_type Residence_type avg_glucose_level bmi smoking_status
0 Private Rural 105.99 26.2 never smoked \
1 children Urban 124.18 24.8 formerly smoked
2 Private Urban 56.59 21.5 Unknown
3 Self-employed Urban 92.89 22.2 never smoked
4 children Rural 149.89 31.9 Unknown
5 Self-employed Rural 134.48 23.2 never smoked
6 children Urban 139.36 34.5 smokes
7 children Rural 167.76 31.2 formerly smoked
8 children Urban 102.79 37.2 never smoked
9 Private Urban 87.34 35.7 Unknown
10 Private Urban 63.28 18.4 never smoked
11 children Rural 205.38 28.4 Unknown
12 Private Urban 132.14 23.3 smokes
13 Self-employed Urban 81.61 14.0 never smoked
14 children Rural 114.62 30.1 smokes
15 Self-employed Urban 137.29 37.7 smokes
16 Never_worked Urban 135.24 20.4 Unknown
17 Private Urban 87.55 24.0 never smoked
18 Private Rural 55.88 24.1 Unknown
19 Self-employed Urban 56.21 38.0 never smoked
20 children Urban 94.29 21.9 Unknown
21 Private Urban 147.00 36.8 formerly smoked
22 children Rural 127.40 23.7 formerly smoked
23 Govt_job Urban 57.79 19.4 never smoked
24 children Rural 121.97 38.2 formerly smoked
25 Self-employed Urban 57.00 15.8 never smoked
26 children Rural 212.33 54.7 never smoked
27 Self-employed Rural 88.19 35.0 smokes
28 Private Urban 98.38 32.7 never smoked
29 Self-employed Urban 73.53 32.4 never smoked
30 Never_worked Rural 182.09 21.4 smokes
31 children Rural 117.16 23.3 smokes
32 Govt_job Rural 121.17 22.7 never smoked
33 Private Rural 202.05 32.9 never smoked
34 Self-employed Urban 62.05 18.7 Unknown
35 Private Rural 71.35 32.8 never smoked
36 children Urban 171.71 26.4 never smoked
37 Self-employed Rural 123.41 29.3 smokes
38 Private Urban 78.75 21.8 never smoked
39 Never_worked Rural 69.86 19.8 Unknown
40 Self-employed Rural 126.36 23.7 never smoked
41 Self-employed Urban 155.57 46.4 smokes
42 Private Urban 75.05 22.3 Unknown
43 Self-employed Urban 77.96 30.5 never smoked
44 Private Urban 113.75 26.0 formerly smoked
45 Self-employed Urban 105.87 41.3 smokes
46 Self-employed Rural 220.69 29.4 never smoked
47 children Rural 103.11 23.4 Unknown
stroke
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 1
23 1
24 0
25 0
26 0
27 0
28 0
29 0
30 1
31 1
32 0
33 0
34 0
35 1
36 1
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0 }
Conclusion¶
This time, with multi-table generation, we got a better coverage percentage in the quality report. This may be because of:
- Having 2.5x the synthetic data compared to the single table example;
- Using the GaussianCopula model for each table under HMA vs. the GaussianCopula model via the FastML preset in the single table example. The preset uses a normal distribution, no enforced rounding, and a FrequencyEncoder instead of a LabelEncoder for categorical columns. More about these encoders can be found in the source code.
It's also important to note that even though the SDV library includes tools to evaluate the quality of the generated data, we should also use a third-party library to test that quality, in order to be as unbiased as possible.
Now we know how to create synthetic data, and how quick and secure it can be.