Single Table Synthetic Data¶
Data analysis¶
We'll be using a single table with stroke data per person, extracted from Kaggle. Each entry has an id, a person_id, several health-related fields, and a stroke column indicating whether that person has had a stroke. As you can see below, all of the columns are in conformity with GDPR, meaning you cannot identify a person from any single column or combination of columns. However, the table does contain sensitive data that is important for us to mask, so that is what we'll be doing.
Load Data¶
First, we load all CSV files from the content folder.
from sdv.datasets.local import load_csvs
try:
    datasets = load_csvs(folder_name='content/')
except ValueError:
    # load_csvs raises a ValueError when the folder contains no CSV files.
    print('You have not uploaded any csv files. Using some demo data instead.')
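Note that the except branch above only prints a message. An actual demo-data fallback could look like the sketch below, assuming SDV's download_demo helper; the dataset name is illustrative.
from sdv.datasets.demo import download_demo

# Hypothetical fallback: fetch a bundled single-table demo dataset.
# 'fake_hotel_guests' is an illustrative dataset name.
demo_data, demo_metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)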
Then, we access the strokes table and display the first 20 rows.
print(datasets.keys())
strokes_table = datasets['strokes']
strokes_table.head(20)
dict_keys(['people', 'strokes'])
| id | person_id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4799 | 3205 | Female | 79 | 0 | 0 | Yes | Self-employed | Urban | 79.03 | 11.3 | Unknown | 0 |
1 | 4798 | 59993 | Male | 40 | 0 | 0 | Yes | Private | Rural | 60.96 | 11.5 | never smoked | 0 |
2 | 4797 | 20364 | Female | 4 | 0 | 0 | No | children | Urban | 107.25 | 12.0 | Unknown | 0 |
3 | 4796 | 45893 | Female | 8 | 0 | 0 | No | children | Urban | 106.51 | 12.3 | Unknown | 0 |
4 | 4795 | 52859 | Female | 4 | 0 | 0 | No | children | Urban | 61.54 | 13.2 | Unknown | 0 |
5 | 4794 | 4789 | Male | 8 | 0 | 0 | No | children | Rural | 91.54 | 13.4 | Unknown | 0 |
6 | 4793 | 25391 | Female | 10 | 0 | 0 | No | children | Rural | 69.84 | 13.7 | Unknown | 0 |
7 | 4792 | 48435 | Female | 2 | 0 | 0 | No | children | Rural | 155.14 | 13.7 | Unknown | 0 |
8 | 4791 | 60926 | Male | 5 | 0 | 0 | No | children | Urban | 79.89 | 13.8 | Unknown | 0 |
9 | 4790 | 6107 | Female | 5 | 0 | 0 | No | children | Urban | 77.88 | 13.8 | Unknown | 0 |
10 | 4789 | 24736 | Female | 4 | 0 | 0 | No | children | Urban | 94.27 | 14.0 | Unknown | 0 |
11 | 4788 | 28309 | Female | 67 | 0 | 0 | Yes | Private | Urban | 82.09 | 14.1 | never smoked | 0 |
12 | 4787 | 32560 | Female | 8 | 0 | 0 | No | children | Rural | 87.92 | 14.1 | Unknown | 0 |
13 | 4786 | 52447 | Female | 3 | 0 | 0 | No | children | Rural | 131.81 | 14.1 | Unknown | 0 |
14 | 4785 | 59762 | Male | 61 | 0 | 0 | Yes | Private | Urban | 227.98 | 14.2 | Unknown | 0 |
15 | 4784 | 18352 | Female | 3 | 0 | 0 | No | children | Rural | 108.32 | 14.2 | Unknown | 0 |
16 | 4783 | 72701 | Male | 2 | 0 | 0 | No | children | Rural | 112.66 | 14.2 | Unknown | 0 |
17 | 4782 | 51162 | Female | 11 | 0 | 0 | No | children | Rural | 122.75 | 14.3 | Unknown | 0 |
18 | 4781 | 33876 | Male | 10 | 0 | 0 | No | children | Urban | 87.09 | 14.3 | Unknown | 0 |
19 | 4780 | 61672 | Female | 11 | 0 | 0 | No | children | Urban | 69.68 | 14.4 | Unknown | 0 |
Create metadata¶
We then need to create the metadata object that will be used when creating the synthesizer. SDV detects some information from the table contents, but it may not all be correct; it's always best to check the metadata and fix whatever needs fixing.
from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(
    data=strokes_table
)
print('Auto detected data:\n')
print(metadata)
Auto detected data:

{
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "columns": {
        "id": { "sdtype": "numerical" },
        "person_id": { "sdtype": "numerical" },
        "gender": { "sdtype": "categorical" },
        "age": { "sdtype": "numerical" },
        "hypertension": { "sdtype": "numerical" },
        "heart_disease": { "sdtype": "numerical" },
        "ever_married": { "sdtype": "categorical" },
        "work_type": { "sdtype": "categorical" },
        "Residence_type": { "sdtype": "categorical" },
        "avg_glucose_level": { "sdtype": "numerical" },
        "bmi": { "sdtype": "numerical" },
        "smoking_status": { "sdtype": "categorical" },
        "stroke": { "sdtype": "numerical" }
    }
}
Edit Metadata¶
Below, we make a few changes:
- Change column id to the id sdtype;
- Change column age to a numerical type with an integer representation;
- Change column bmi to a numerical type with a float representation;
- Change column stroke to categorical, since its values are either 0 or 1;
- Change the hypertension and heart_disease columns to categorical as well, for the same reason.
metadata.update_column(
    column_name='id',
    sdtype='id'
)
metadata.update_column(
    column_name='age',
    sdtype='numerical',
    computer_representation='Int64'
)
metadata.update_column(
    column_name='bmi',
    sdtype='numerical',
    computer_representation='Float'
)
metadata.update_column(
    column_name='stroke',
    sdtype='categorical',
)
metadata.update_column(
    column_name='hypertension',
    sdtype='categorical',
)
metadata.update_column(
    column_name='heart_disease',
    sdtype='categorical',
)
print(metadata)
{ "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1", "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "numerical" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }
metadata.update_column(
    column_name='age',
    pii=True
)
---------------------------------------------------------------------------
InvalidMetadataError                      Traceback (most recent call last)
Cell In[6], line 1
----> 1 metadata.update_column(
      2     column_name='age',
      3     pii=True
      4 )

File ~/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/metadata/single_table.py:228, in SingleTableMetadata.update_column(self, column_name, **kwargs)
    226 sdtype = self.columns[column_name]['sdtype']
    227 _kwargs['sdtype'] = sdtype
--> 228 self._validate_column(column_name, sdtype, **kwargs)
    229 self.columns[column_name] = _kwargs

File ~/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/metadata/single_table.py:149, in SingleTableMetadata._validate_column(self, column_name, sdtype, **kwargs)
    147 def _validate_column(self, column_name, sdtype, **kwargs):
    148     self._validate_sdtype(sdtype)
--> 149     self._validate_unexpected_kwargs(column_name, sdtype, **kwargs)
    150     if sdtype == 'categorical':
    151         self._validate_categorical(column_name, **kwargs)

File ~/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/metadata/single_table.py:134, in SingleTableMetadata._validate_unexpected_kwargs(self, column_name, sdtype, **kwargs)
    132 unexpected_kwargs = sorted(unexpected_kwargs)
    133 unexpected_kwargs = ', '.join(unexpected_kwargs)
--> 134 raise InvalidMetadataError(
    135     f"Invalid values '({unexpected_kwargs})' for {sdtype} column '{column_name}'.")

InvalidMetadataError: Invalid values '(pii)' for numerical column 'age'.
As you can see above, we then tried to flag age as a PII column (Personally Identifiable Information), but SDV rejects the pii setting for numerical columns; it only applies to certain sdtypes, as discussed in the Synthetic Data - Introduction.
If you have a numerical value that is sensitive and needs to be protected but is not a PII, there are a few options you can consider:
- Anonymization: Modify the numerical values to remove any direct or indirect identifiers. For example, you can apply techniques such as generalization, suppression, or randomization to de-identify the sensitive values. This approach allows you to retain the statistical properties of the data while protecting individual privacy.
- Tokenization or Encoding: If the sensitive numerical values represent categorical or discrete data, you can consider tokenizing or encoding them. This involves replacing the original values with unique tokens or numeric representations, ensuring that the sensitive information cannot be directly inferred.
- Aggregation or Binning: If the specific values of the numerical data are not crucial and the main focus is on preserving statistical properties, you can aggregate or group the values into ranges or bins. For example, you can convert age values into age groups (e.g., 20-30, 31-40) or income values into income brackets (a short pandas sketch follows after this list). This approach can help to maintain the overall distribution while adding a level of privacy protection.
- Differential Privacy: Differential privacy is a concept that provides a rigorous mathematical framework for privacy protection. It involves injecting noise into the data or query responses in a controlled manner, ensuring that the privacy of individual data points is preserved. Differential privacy techniques can be applied to numerical data to protect sensitive information.
When handling sensitive numerical data, it's essential to comply with relevant privacy regulations and consider the specific requirements of your use case. Consult with legal and privacy experts to ensure that the chosen approach aligns with applicable laws and regulations, and adequately protects the privacy of individuals.
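For instance, here is a minimal binning sketch with pandas; the bin edges and labels are illustrative choices, not values taken from this notebook.
import pandas as pd

# Illustrative age bins; the edges and labels are arbitrary choices.
bins = [0, 20, 30, 40, 50, 60, 100]
labels = ['0-20', '21-30', '31-40', '41-50', '51-60', '61+']

ages = pd.Series([4, 25, 37, 61, 79])
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())  # ['0-20', '21-30', '31-40', '61+', '61+']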
Since age is not a PII column, and neither are any of the other columns in this table, we'll test this PII setting in our multi-table notebook once we have a people table with personal information.
We then set the primary key column; in this case it is id.
metadata.set_primary_key(column_name='id')
print(metadata)
{ "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1", "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "numerical" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } }, "primary_key": "id" }
Create Synthesizer¶
Having created the metadata object, we then need to create the synthesizer, which will be trained to generate the synthetic data. For a first attempt we use the FAST_ML preset, which is backed by a GaussianCopulaSynthesizer; since we are generating data for a single table, we use the SingleTablePreset from the SDV library. We train the synthesizer with the fit method, then call the sample method to generate the synthetic data shown below. Besides num_rows, the parameters of sample are:
- batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
- max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to 100.
- output_file_path: A string describing a CSV filepath for writing the synthetic data. Set it to None to skip writing to a file. Defaults to None.
from sdv.lite import SingleTablePreset
synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer.fit(strokes_table)
synthetic_data = synthesizer.sample(num_rows=5000)
synthetic_data
| id | person_id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 18483 | Female | 42 | 0 | 0 | Yes | Private | Rural | 138.949724 | 34.448243 | Unknown | 0 |
1 | 1 | 40268 | Female | 49 | 0 | 0 | Yes | Self-employed | Rural | 106.397658 | 41.427455 | smokes | 0 |
2 | 2 | 39773 | Male | 72 | 0 | 0 | Yes | Private | Urban | 146.924737 | 36.470968 | smokes | 0 |
3 | 3 | 29169 | Female | 31 | 0 | 0 | No | Private | Rural | 137.243082 | 20.695570 | never smoked | 0 |
4 | 4 | 23725 | Female | 8 | 0 | 0 | Yes | Self-employed | Rural | 127.135125 | 37.828633 | Unknown | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | 4995 | 31770 | Female | 12 | 0 | 0 | Yes | Private | Urban | 80.745171 | 31.964878 | Unknown | 0 |
4996 | 4996 | 47371 | Male | 48 | 0 | 0 | Yes | Private | Rural | 133.521052 | 27.476454 | formerly smoked | 0 |
4997 | 4997 | 59809 | Male | 82 | 0 | 0 | Yes | Private | Rural | 127.595544 | 39.734157 | Unknown | 0 |
4998 | 4998 | 62847 | Female | 31 | 0 | 0 | No | Private | Rural | 110.706109 | 38.073594 | never smoked | 0 |
4999 | 4999 | 61172 | Female | 31 | 0 | 0 | Yes | Private | Urban | 133.384555 | 30.610890 | never smoked | 0 |
5000 rows × 13 columns
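If you are sampling a large number of rows, the parameters listed above can be combined, for example as in the sketch below; the output file name is illustrative.
# Sample in batches of 1000 to see incremental progress, and write the
# result to a CSV file ('synthetic_strokes.csv' is an illustrative name).
synthetic_data = synthesizer.sample(
    num_rows=5000,
    batch_size=1000,
    output_file_path='synthetic_strokes.csv'
)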
We can also condition the generated data on values we are interested in. In the case below, we create a table of 250 rows containing only people who have had strokes. However, conditional sampling sometimes fails to produce valid rows; if that happens, the troubleshooting section of the SDV documentation is a good starting point (see also the sketch after the output below).
from sdv.sampling import Condition
with_stroke = Condition(
    num_rows=250,
    column_values={'stroke': 1}
)
synthetic_data_with_stroke = synthesizer.sample_from_conditions(
    conditions=[with_stroke],
)
synthetic_data_with_stroke
Sampling conditions: 100%|██████████| 250/250 [00:00<00:00, 4936.73it/s]
| id | person_id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5000 | 77 | Male | 60 | 0 | 0 | Yes | Private | Rural | 171.181426 | 26.095976 | never smoked | 1 |
1 | 5001 | 14831 | Female | 40 | 0 | 0 | No | Self-employed | Urban | 55.120000 | 20.928930 | Unknown | 1 |
2 | 5002 | 18073 | Male | 36 | 0 | 0 | Yes | Self-employed | Rural | 148.844070 | 39.191698 | never smoked | 1 |
3 | 5003 | 47225 | Male | 41 | 0 | 0 | No | Private | Rural | 105.012975 | 18.172498 | never smoked | 1 |
4 | 5004 | 34256 | Female | 60 | 0 | 0 | Yes | Private | Rural | 184.591717 | 41.743158 | Unknown | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
245 | 5245 | 55172 | Male | 59 | 0 | 0 | No | Private | Urban | 125.669804 | 27.232660 | never smoked | 1 |
246 | 5246 | 42483 | Male | 28 | 0 | 0 | Yes | Private | Rural | 148.901177 | 26.345644 | never smoked | 1 |
247 | 5247 | 5554 | Female | 2 | 0 | 0 | Yes | Self-employed | Rural | 55.120000 | 36.977610 | smokes | 1 |
248 | 5248 | 63486 | Female | 82 | 0 | 0 | Yes | Private | Rural | 151.333343 | 28.523270 | never smoked | 1 |
249 | 5249 | 48875 | Female | 61 | 0 | 0 | Yes | Private | Rural | 177.863760 | 41.056034 | never smoked | 1 |
250 rows × 13 columns
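If conditional sampling fails to produce enough valid rows, one knob worth trying before digging into the troubleshooting guide is max_tries_per_batch, which defaults to 100; the value below is illustrative.
# Give the sampler more attempts per batch to satisfy the condition.
synthetic_data_with_stroke = synthesizer.sample_from_conditions(
    conditions=[with_stroke],
    max_tries_per_batch=500,
)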
Evaluation¶
The SDV library includes powerful tools that let you evaluate the quality of the data generated by your synthesizer and create a diagnostic report for that data.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    metadata=metadata
)
Creating report: 100%|██████████| 4/4 [00:00<00:00, 10.81it/s]
Overall Quality Score: 89.33%

Properties:
Column Shapes: 92.66%
Column Pair Trends: 86.01%
from sdv.evaluation.single_table import run_diagnostic
diagnostic_report = run_diagnostic(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    metadata=metadata
)
Creating report: 100%|██████████| 4/4 [00:15<00:00, 3.89s/it]
DiagnosticResults:

SUCCESS:
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data

WARNING:
! The synthetic data is missing more than 10% of the numerical ranges present in the real data
diagnostic_report.get_results()
{'SUCCESS': ['The synthetic data covers over 90% of the categories present in the real data', 'Over 90% of the synthetic rows are not copies of the real data', 'The synthetic data follows over 90% of the min/max boundaries set by the real data'], 'WARNING': ['The synthetic data is missing more than 10% of the numerical ranges present in the real data'], 'DANGER': []}
diagnostic_report.get_properties()
{'Coverage': 0.9592604702286464, 'Synthesis': 1.0, 'Boundaries': 1.0}
diagnostic_report.get_details(property_name='Coverage')
| Column | Metric | Diagnostic Score |
---|---|---|---|
0 | person_id | RangeCoverage | 1.000000 |
1 | age | RangeCoverage | 1.000000 |
2 | avg_glucose_level | RangeCoverage | 1.000000 |
3 | bmi | RangeCoverage | 0.511126 |
4 | gender | CategoryCoverage | 1.000000 |
5 | hypertension | CategoryCoverage | 1.000000 |
6 | heart_disease | CategoryCoverage | 1.000000 |
7 | ever_married | CategoryCoverage | 1.000000 |
8 | work_type | CategoryCoverage | 1.000000 |
9 | Residence_type | CategoryCoverage | 1.000000 |
10 | smoking_status | CategoryCoverage | 1.000000 |
11 | stroke | CategoryCoverage | 1.000000 |
It also allows you to visualize that comparison.
from sdv.evaluation.single_table import get_column_pair_plot
fig = get_column_pair_plot(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    column_names=['age', 'bmi'],
    metadata=metadata
)
fig.show()
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    column_name='bmi',
    metadata=metadata
)
fig.show()
BMI Fix¶
As you can see, bmi was the column with the worst range coverage, and it is probably what is dragging the evaluation down. We can start by recalling what BMI actually is: an index calculated as $$\mathrm{BMI} = \frac{\text{weight}}{\text{height}^2}$$ So we'll add constraints to keep the generated values within the same range as the original data. We'll also use the truncnorm distribution, which is essentially a normal distribution truncated to a range. The low and high values used in the constraints are the minimum and maximum values of the original data.
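Rather than trusting hardcoded bounds, you can read them directly off the original data; the printed values should match the constants used in the constraints below.
# Derive the constraint bounds from the real data instead of hardcoding them.
print(strokes_table['bmi'].min(), strokes_table['bmi'].max())  # 11.3 97.6
print(strokes_table['avg_glucose_level'].min(), strokes_table['avg_glucose_level'].max())  # 55.12 271.74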
from sdv.single_table import GaussianCopulaSynthesizer
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'bmi',
        'low_value': 11.3,
        'high_value': 97.6,
        'strict_boundaries': False
    }
}
my_constraint2 = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'avg_glucose_level',
        'low_value': 55.12,
        'high_value': 271.74,
        'strict_boundaries': False
    }
}
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions={'bmi': 'truncnorm', 'avg_glucose_level': 'truncnorm'}
)
synthesizer.add_constraints(constraints=[my_constraint, my_constraint2])
synthesizer.fit(strokes_table)
synthetic_data_2 = synthesizer.sample(num_rows=4000)
Sampling rows: 100%|██████████| 4000/4000 [00:00<00:00, 28692.59it/s]
New Evaluation¶
As you can see below, using a customized GaussianCopulaSynthesizer was not enough to improve the overall quality score, although the diagnostic no longer warns about missing numerical ranges. We could also use neural-network-based synthesizers such as CTGAN, TVAE, or CopulaGAN, but that analysis is not tested here; a sketch follows.
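For reference, swapping in a neural-network-based synthesizer is mostly a one-line change; this is an untested sketch, and the epochs value is illustrative.
from sdv.single_table import CTGANSynthesizer

# Untested alternative: a GAN-based synthesizer built from the same metadata.
# Training is considerably slower than the copula approach.
ctgan_synthesizer = CTGANSynthesizer(metadata, epochs=300)
ctgan_synthesizer.fit(strokes_table)
synthetic_data_ctgan = ctgan_synthesizer.sample(num_rows=4000)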
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=strokes_table,
    synthetic_data=synthetic_data_2,
    metadata=metadata
)
Creating report: 100%|██████████| 4/4 [00:00<00:00, 13.10it/s]
Overall Quality Score: 88.48%

Properties:
Column Shapes: 91.81%
Column Pair Trends: 85.16%
from sdv.evaluation.single_table import run_diagnostic
diagnostic_report = run_diagnostic(
    real_data=strokes_table,
    synthetic_data=synthetic_data_2,
    metadata=metadata
)
synthesizer.get_learned_distributions()
Creating report: 100%|██████████| 4/4 [00:12<00:00, 3.23s/it]
DiagnosticResults:

SUCCESS:
✓ The synthetic data covers over 90% of the numerical ranges present in the real data
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
{'person_id': {'distribution': 'beta', 'learned_parameters': {'loc': 80.25890325886454, 'scale': 72859.74109674116, 'a': 0.9969547367188392, 'b': 0.9716645550572578}}, 'gender': {'distribution': 'beta', 'learned_parameters': {'loc': -2.5252690911627557e-05, 'scale': 2.429965774475864, 'a': 1.1674347481757146, 'b': 2.031096048032575}}, 'age': {'distribution': 'beta', 'learned_parameters': {'loc': 1.2374170728627083, 'scale': 80.7625829271373, 'a': 0.9923267833484788, 'b': 0.8396338204631779}}, 'hypertension': {'distribution': 'beta', 'learned_parameters': {'loc': -0.0015033493750864929, 'scale': 2.6734823331569064, 'a': 1.298928537009436, 'b': 4.583705567012164}}, 'heart_disease': {'distribution': 'beta', 'learned_parameters': {'loc': -0.0011121211788436343, 'scale': 2.5118242709872747, 'a': 1.3920048676282983, 'b': 5.027246739017915}}, 'ever_married': {'distribution': 'beta', 'learned_parameters': {'loc': 0.00015138900059352697, 'scale': 2.000541483317777, 'a': 0.9231752283513573, 'b': 1.2510098798236158}}, 'work_type': {'distribution': 'beta', 'learned_parameters': {'loc': -0.19329558094923285, 'scale': 136617.51361846982, 'a': 3.08290050373147, 'b': 205131.79060185206}}, 'Residence_type': {'distribution': 'beta', 'learned_parameters': {'loc': 0.0005564993489894335, 'scale': 1.9990261104896918, 'a': 0.9592938015075503, 'b': 0.9884101783243173}}, 'smoking_status': {'distribution': 'beta', 'learned_parameters': {'loc': 0.00016896520737423191, 'scale': 4.015279314682994, 'a': 1.0265460598108, 'b': 1.3662274288516505}}, 'stroke': {'distribution': 'beta', 'learned_parameters': {'loc': 0.000195975643300648, 'scale': 2.3818244739318652, 'a': 1.3553887073649618, 'b': 4.673247660622257}}, 'bmi#11.3#97.6': {'distribution': 'beta', 'learned_parameters': {'loc': -16.755951983495105, 'scale': 51.377743910445545, 'a': 630.7390090100525, 'b': 1469.4336390811734}}, 'avg_glucose_level#55.12#271.74': {'distribution': 'beta', 'learned_parameters': {'loc': -4.597277640616451, 'scale': 625222.6174629636, 'a': 9.019129645706993, 'b': 1728046.3185030504}}}
diagnostic_report.get_details(property_name='Coverage')
| Column | Metric | Diagnostic Score |
---|---|---|---|
0 | person_id | RangeCoverage | 0.999671 |
1 | age | RangeCoverage | 1.000000 |
2 | avg_glucose_level | RangeCoverage | 1.000000 |
3 | bmi | RangeCoverage | 0.631518 |
4 | gender | CategoryCoverage | 1.000000 |
5 | hypertension | CategoryCoverage | 1.000000 |
6 | heart_disease | CategoryCoverage | 1.000000 |
7 | ever_married | CategoryCoverage | 1.000000 |
8 | work_type | CategoryCoverage | 1.000000 |
9 | Residence_type | CategoryCoverage | 1.000000 |
10 | smoking_status | CategoryCoverage | 1.000000 |
11 | stroke | CategoryCoverage | 1.000000 |
quality_report.get_details(property_name='Column Shapes')
| Column | Metric | Quality Score |
---|---|---|---|
0 | person_id | KSComplement | 0.987458 |
1 | age | KSComplement | 0.912625 |
2 | avg_glucose_level | KSComplement | 0.913750 |
3 | bmi | KSComplement | 0.966333 |
4 | gender | TVComplement | 0.954792 |
5 | hypertension | TVComplement | 0.923458 |
6 | heart_disease | TVComplement | 0.910125 |
7 | ever_married | TVComplement | 0.924000 |
8 | work_type | TVComplement | 0.706875 |
9 | Residence_type | TVComplement | 0.996333 |
10 | smoking_status | TVComplement | 0.902667 |
11 | stroke | TVComplement | 0.918542 |