Single Table Synthetic Data¶
Data analysis¶
We'll be using a single table with stroke data per person, extracted from Kaggle. Each entry has an id, a person_id, several health-related fields, and a stroke column indicating whether that person has had a stroke. As you can see below, all of the columns are in conformity with GDPR, meaning you cannot identify a person from any single column or combination of columns. However, the table does contain sensitive data that is important for us to mask, so that is what we'll be doing.
Load Data¶
First, we load all CSV files from the content folder.
from sdv.datasets.local import load_csvs
try:
    datasets = load_csvs(folder_name='content/')
except ValueError:
    # load_csvs raises a ValueError when the folder contains no CSV files.
    print('You have not uploaded any csv files. Using some demo data instead.')
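Note that the except branch above only prints a message. An actual demo-data fallback could look like the sketch below, assuming SDV's download_demo helper; the dataset name is illustrative.
from sdv.datasets.demo import download_demo

# Hypothetical fallback: fetch a bundled single-table demo dataset.
# 'fake_hotel_guests' is an illustrative dataset name.
demo_data, demo_metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)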
Then, we access the strokes table and display the first 20 rows.
print(datasets.keys())
strokes_table = datasets['strokes']
strokes_table.head(20)
dict_keys(['people', 'strokes'])
| id | person_id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4799 | 3205 | Female | 79 | 0 | 0 | Yes | Self-employed | Urban | 79.03 | 11.3 | Unknown | 0 |
1 | 4798 | 59993 | Male | 40 | 0 | 0 | Yes | Private | Rural | 60.96 | 11.5 | never smoked | 0 |
2 | 4797 | 20364 | Female | 4 | 0 | 0 | No | children | Urban | 107.25 | 12.0 | Unknown | 0 |
3 | 4796 | 45893 | Female | 8 | 0 | 0 | No | children | Urban | 106.51 | 12.3 | Unknown | 0 |
4 | 4795 | 52859 | Female | 4 | 0 | 0 | No | children | Urban | 61.54 | 13.2 | Unknown | 0 |
5 | 4794 | 4789 | Male | 8 | 0 | 0 | No | children | Rural | 91.54 | 13.4 | Unknown | 0 |
6 | 4793 | 25391 | Female | 10 | 0 | 0 | No | children | Rural | 69.84 | 13.7 | Unknown | 0 |
7 | 4792 | 48435 | Female | 2 | 0 | 0 | No | children | Rural | 155.14 | 13.7 | Unknown | 0 |
8 | 4791 | 60926 | Male | 5 | 0 | 0 | No | children | Urban | 79.89 | 13.8 | Unknown | 0 |
9 | 4790 | 6107 | Female | 5 | 0 | 0 | No | children | Urban | 77.88 | 13.8 | Unknown | 0 |
10 | 4789 | 24736 | Female | 4 | 0 | 0 | No | children | Urban | 94.27 | 14.0 | Unknown | 0 |
11 | 4788 | 28309 | Female | 67 | 0 | 0 | Yes | Private | Urban | 82.09 | 14.1 | never smoked | 0 |
12 | 4787 | 32560 | Female | 8 | 0 | 0 | No | children | Rural | 87.92 | 14.1 | Unknown | 0 |
13 | 4786 | 52447 | Female | 3 | 0 | 0 | No | children | Rural | 131.81 | 14.1 | Unknown | 0 |
14 | 4785 | 59762 | Male | 61 | 0 | 0 | Yes | Private | Urban | 227.98 | 14.2 | Unknown | 0 |
15 | 4784 | 18352 | Female | 3 | 0 | 0 | No | children | Rural | 108.32 | 14.2 | Unknown | 0 |
16 | 4783 | 72701 | Male | 2 | 0 | 0 | No | children | Rural | 112.66 | 14.2 | Unknown | 0 |
17 | 4782 | 51162 | Female | 11 | 0 | 0 | No | children | Rural | 122.75 | 14.3 | Unknown | 0 |
18 | 4781 | 33876 | Male | 10 | 0 | 0 | No | children | Urban | 87.09 | 14.3 | Unknown | 0 |
19 | 4780 | 61672 | Female | 11 | 0 | 0 | No | children | Urban | 69.68 | 14.4 | Unknown | 0 |
Create metadata¶
We then need to create the metadata object that will be used when creating the synthesizer. SDV detects some information from the table contents, but it may not all be correct; it's always best to check the metadata and fix whatever needs fixing.
from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(
    data=strokes_table
)
print('Auto detected data:\n')
print(metadata)
Auto detected data:

{
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "columns": {
        "id": { "sdtype": "numerical" },
        "person_id": { "sdtype": "numerical" },
        "gender": { "sdtype": "categorical" },
        "age": { "sdtype": "numerical" },
        "hypertension": { "sdtype": "numerical" },
        "heart_disease": { "sdtype": "numerical" },
        "ever_married": { "sdtype": "categorical" },
        "work_type": { "sdtype": "categorical" },
        "Residence_type": { "sdtype": "categorical" },
        "avg_glucose_level": { "sdtype": "numerical" },
        "bmi": { "sdtype": "numerical" },
        "smoking_status": { "sdtype": "categorical" },
        "stroke": { "sdtype": "numerical" }
    }
}
Edit Metadata¶
Below, we make a few changes:
- Change column id to the id sdtype;
- Change column age to a numerical type with an integer representation;
- Change column bmi to a numerical type with a float representation;
- Change column stroke to categorical, since its values are either 0 or 1;
- Change the hypertension and heart_disease columns to categorical as well, for the same reason.
metadata.update_column(
    column_name='id',
    sdtype='id'
)
metadata.update_column(
    column_name='age',
    sdtype='numerical',
    computer_representation='Int64'
)
metadata.update_column(
    column_name='bmi',
    sdtype='numerical',
    computer_representation='Float'
)
metadata.update_column(
    column_name='stroke',
    sdtype='categorical',
)
metadata.update_column(
    column_name='hypertension',
    sdtype='categorical',
)
metadata.update_column(
    column_name='heart_disease',
    sdtype='categorical',
)
print(metadata)
{ "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1", "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "numerical" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } } }
metadata.update_column(
    column_name='age',
    pii=True
)
---------------------------------------------------------------------------
InvalidMetadataError                      Traceback (most recent call last)
Cell In[6], line 1
----> 1 metadata.update_column(
      2     column_name='age',
      3     pii=True
      4 )

File ~/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/metadata/single_table.py:228, in SingleTableMetadata.update_column(self, column_name, **kwargs)
    226 sdtype = self.columns[column_name]['sdtype']
    227 _kwargs['sdtype'] = sdtype
--> 228 self._validate_column(column_name, sdtype, **kwargs)
    229 self.columns[column_name] = _kwargs

File ~/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/metadata/single_table.py:149, in SingleTableMetadata._validate_column(self, column_name, sdtype, **kwargs)
    147 def _validate_column(self, column_name, sdtype, **kwargs):
    148     self._validate_sdtype(sdtype)
--> 149     self._validate_unexpected_kwargs(column_name, sdtype, **kwargs)
    150     if sdtype == 'categorical':
    151         self._validate_categorical(column_name, **kwargs)

File ~/Library/Caches/pypoetry/virtualenvs/synthetic-data-EqHpLbmO-py3.10/lib/python3.10/site-packages/sdv/metadata/single_table.py:134, in SingleTableMetadata._validate_unexpected_kwargs(self, column_name, sdtype, **kwargs)
    132 unexpected_kwargs = sorted(unexpected_kwargs)
    133 unexpected_kwargs = ', '.join(unexpected_kwargs)
--> 134 raise InvalidMetadataError(
    135     f"Invalid values '({unexpected_kwargs})' for {sdtype} column '{column_name}'.")

InvalidMetadataError: Invalid values '(pii)' for numerical column 'age'.
As you can see above, we then tried to flag age as a PII column (Personally Identifiable Information), but SDV rejects the pii setting for numerical columns; it only applies to certain sdtypes, as discussed in the Synthetic Data - Introduction.
If you have a numerical value that is sensitive and needs to be protected but is not a PII, there are a few options you can consider:
- Anonymization: Modify the numerical values to remove any direct or indirect identifiers. For example, you can apply techniques such as generalization, suppression, or randomization to de-identify the sensitive values. This approach allows you to retain the statistical properties of the data while protecting individual privacy.
- Tokenization or Encoding: If the sensitive numerical values represent categorical or discrete data, you can consider tokenizing or encoding them. This involves replacing the original values with unique tokens or numeric representations, ensuring that the sensitive information cannot be directly inferred.
- Aggregation or Binning: If the specific values of the numerical data are not crucial and the main focus is on preserving statistical properties, you can aggregate or group the values into ranges or bins. For example, you can convert age values into age groups (e.g., 20-30, 31-40) or income values into income brackets (a short pandas sketch follows after this list). This approach can help to maintain the overall distribution while adding a level of privacy protection.
- Differential Privacy: Differential privacy is a concept that provides a rigorous mathematical framework for privacy protection. It involves injecting noise into the data or query responses in a controlled manner, ensuring that the privacy of individual data points is preserved. Differential privacy techniques can be applied to numerical data to protect sensitive information.
When handling sensitive numerical data, it's essential to comply with relevant privacy regulations and consider the specific requirements of your use case. Consult with legal and privacy experts to ensure that the chosen approach aligns with applicable laws and regulations, and adequately protects the privacy of individuals.
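For instance, here is a minimal binning sketch with pandas; the bin edges and labels are illustrative choices, not values taken from this notebook.
import pandas as pd

# Illustrative age bins; the edges and labels are arbitrary choices.
bins = [0, 20, 30, 40, 50, 60, 100]
labels = ['0-20', '21-30', '31-40', '41-50', '51-60', '61+']

ages = pd.Series([4, 25, 37, 61, 79])
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())  # ['0-20', '21-30', '31-40', '61+', '61+']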
Since age is not a PII column, and neither are any of the other columns in this table, we'll test this PII setting in our multi-table notebook once we have a people table with personal information.
We then set the primary key column; in this case it is id.
metadata.set_primary_key(column_name='id')
print(metadata)
{ "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1", "columns": { "id": { "sdtype": "id" }, "person_id": { "sdtype": "numerical" }, "gender": { "sdtype": "categorical" }, "age": { "sdtype": "numerical", "computer_representation": "Int64" }, "hypertension": { "sdtype": "categorical" }, "heart_disease": { "sdtype": "categorical" }, "ever_married": { "sdtype": "categorical" }, "work_type": { "sdtype": "categorical" }, "Residence_type": { "sdtype": "categorical" }, "avg_glucose_level": { "sdtype": "numerical" }, "bmi": { "sdtype": "numerical", "computer_representation": "Float" }, "smoking_status": { "sdtype": "categorical" }, "stroke": { "sdtype": "categorical" } }, "primary_key": "id" }
Create Synthesizer¶
Having created the metadata object, we then need to create the synthesizer, which will be trained to generate the synthetic data. For a first attempt we use the FAST_ML preset, which is backed by a GaussianCopulaSynthesizer; since we are generating data for a single table, we use the SingleTablePreset from the SDV library. We train the synthesizer with the fit method, then call the sample method to generate the synthetic data shown below. Besides num_rows, the parameters of sample are:
- batch_size: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.
- max_tries_per_batch: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to 100.
- output_file_path: A string describing a CSV filepath for writing the synthetic data. Set it to None to skip writing to a file. Defaults to None.
from sdv.lite import SingleTablePreset
synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer.fit(strokes_table)
synthetic_data = synthesizer.sample(num_rows=5000)
synthetic_data
| id | person_id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 18483 | Female | 42 | 0 | 0 | Yes | Private | Rural | 138.949724 | 34.448243 | Unknown | 0 |
1 | 1 | 40268 | Female | 49 | 0 | 0 | Yes | Self-employed | Rural | 106.397658 | 41.427455 | smokes | 0 |
2 | 2 | 39773 | Male | 72 | 0 | 0 | Yes | Private | Urban | 146.924737 | 36.470968 | smokes | 0 |
3 | 3 | 29169 | Female | 31 | 0 | 0 | No | Private | Rural | 137.243082 | 20.695570 | never smoked | 0 |
4 | 4 | 23725 | Female | 8 | 0 | 0 | Yes | Self-employed | Rural | 127.135125 | 37.828633 | Unknown | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | 4995 | 31770 | Female | 12 | 0 | 0 | Yes | Private | Urban | 80.745171 | 31.964878 | Unknown | 0 |
4996 | 4996 | 47371 | Male | 48 | 0 | 0 | Yes | Private | Rural | 133.521052 | 27.476454 | formerly smoked | 0 |
4997 | 4997 | 59809 | Male | 82 | 0 | 0 | Yes | Private | Rural | 127.595544 | 39.734157 | Unknown | 0 |
4998 | 4998 | 62847 | Female | 31 | 0 | 0 | No | Private | Rural | 110.706109 | 38.073594 | never smoked | 0 |
4999 | 4999 | 61172 | Female | 31 | 0 | 0 | Yes | Private | Urban | 133.384555 | 30.610890 | never smoked | 0 |
5000 rows × 13 columns
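If you are sampling a large number of rows, the parameters listed above can be combined, for example as in the sketch below; the output file name is illustrative.
# Sample in batches of 1000 to see incremental progress, and write the
# result to a CSV file ('synthetic_strokes.csv' is an illustrative name).
synthetic_data = synthesizer.sample(
    num_rows=5000,
    batch_size=1000,
    output_file_path='synthetic_strokes.csv'
)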
We can also condition the generated data on values we are interested in. In the case below, we create a table of 250 rows containing only people who have had strokes. However, conditional sampling sometimes fails to produce valid rows; if that happens, the troubleshooting section of the SDV documentation is a good starting point (see also the sketch after the output below).
from sdv.sampling import Condition
with_stroke = Condition(
    num_rows=250,
    column_values={'stroke': 1}
)
synthetic_data_with_stroke = synthesizer.sample_from_conditions(
    conditions=[with_stroke],
)
synthetic_data_with_stroke
Sampling conditions: 100%|██████████| 250/250 [00:00<00:00, 4936.73it/s]
| id | person_id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5000 | 77 | Male | 60 | 0 | 0 | Yes | Private | Rural | 171.181426 | 26.095976 | never smoked | 1 |
1 | 5001 | 14831 | Female | 40 | 0 | 0 | No | Self-employed | Urban | 55.120000 | 20.928930 | Unknown | 1 |
2 | 5002 | 18073 | Male | 36 | 0 | 0 | Yes | Self-employed | Rural | 148.844070 | 39.191698 | never smoked | 1 |
3 | 5003 | 47225 | Male | 41 | 0 | 0 | No | Private | Rural | 105.012975 | 18.172498 | never smoked | 1 |
4 | 5004 | 34256 | Female | 60 | 0 | 0 | Yes | Private | Rural | 184.591717 | 41.743158 | Unknown | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
245 | 5245 | 55172 | Male | 59 | 0 | 0 | No | Private | Urban | 125.669804 | 27.232660 | never smoked | 1 |
246 | 5246 | 42483 | Male | 28 | 0 | 0 | Yes | Private | Rural | 148.901177 | 26.345644 | never smoked | 1 |
247 | 5247 | 5554 | Female | 2 | 0 | 0 | Yes | Self-employed | Rural | 55.120000 | 36.977610 | smokes | 1 |
248 | 5248 | 63486 | Female | 82 | 0 | 0 | Yes | Private | Rural | 151.333343 | 28.523270 | never smoked | 1 |
249 | 5249 | 48875 | Female | 61 | 0 | 0 | Yes | Private | Rural | 177.863760 | 41.056034 | never smoked | 1 |
250 rows × 13 columns
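If conditional sampling fails to produce enough valid rows, one knob worth trying before digging into the troubleshooting guide is max_tries_per_batch, which defaults to 100; the value below is illustrative.
# Give the sampler more attempts per batch to satisfy the condition.
synthetic_data_with_stroke = synthesizer.sample_from_conditions(
    conditions=[with_stroke],
    max_tries_per_batch=500,
)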
Evaluation¶
The SDV library includes powerful tools that let you evaluate the quality of the data generated by your synthesizer and create a diagnostic report for that data.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    metadata=metadata
)
Creating report: 100%|██████████| 4/4 [00:00<00:00, 10.81it/s]
Overall Quality Score: 89.33%

Properties:
Column Shapes: 92.66%
Column Pair Trends: 86.01%
from sdv.evaluation.single_table import run_diagnostic
diagnostic_report = run_diagnostic(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    metadata=metadata
)
Creating report: 100%|██████████| 4/4 [00:15<00:00, 3.89s/it]
DiagnosticResults:

SUCCESS:
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data

WARNING:
! The synthetic data is missing more than 10% of the numerical ranges present in the real data
diagnostic_report.get_results()
{'SUCCESS': ['The synthetic data covers over 90% of the categories present in the real data', 'Over 90% of the synthetic rows are not copies of the real data', 'The synthetic data follows over 90% of the min/max boundaries set by the real data'], 'WARNING': ['The synthetic data is missing more than 10% of the numerical ranges present in the real data'], 'DANGER': []}
diagnostic_report.get_properties()
{'Coverage': 0.9592604702286464, 'Synthesis': 1.0, 'Boundaries': 1.0}
diagnostic_report.get_details(property_name='Coverage')
| Column | Metric | Diagnostic Score |
---|---|---|---|
0 | person_id | RangeCoverage | 1.000000 |
1 | age | RangeCoverage | 1.000000 |
2 | avg_glucose_level | RangeCoverage | 1.000000 |
3 | bmi | RangeCoverage | 0.511126 |
4 | gender | CategoryCoverage | 1.000000 |
5 | hypertension | CategoryCoverage | 1.000000 |
6 | heart_disease | CategoryCoverage | 1.000000 |
7 | ever_married | CategoryCoverage | 1.000000 |
8 | work_type | CategoryCoverage | 1.000000 |
9 | Residence_type | CategoryCoverage | 1.000000 |
10 | smoking_status | CategoryCoverage | 1.000000 |
11 | stroke | CategoryCoverage | 1.000000 |
It also allows you to visualize that comparison.
from sdv.evaluation.single_table import get_column_pair_plot
fig = get_column_pair_plot(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    column_names=['age', 'bmi'],
    metadata=metadata
)
fig.show()
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=strokes_table,
    synthetic_data=synthetic_data,
    column_name='bmi',
    metadata=metadata
)
fig.show()
BMI Fix¶
As you can see, bmi was the column with the worst range coverage, and it is probably what is dragging the evaluation down. We can start by recalling what BMI actually is: an index calculated as $$\mathrm{BMI} = \frac{\text{weight}}{\text{height}^2}$$ So we'll add constraints to keep the generated values within the same range as the original data. We'll also use the truncnorm distribution, which is essentially a normal distribution truncated to a range. The low and high values used in the constraints are the minimum and maximum values of the original data.
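Rather than trusting hardcoded bounds, you can read them directly off the original data; the printed values should match the constants used in the constraints below.
# Derive the constraint bounds from the real data instead of hardcoding them.
print(strokes_table['bmi'].min(), strokes_table['bmi'].max())  # 11.3 97.6
print(strokes_table['avg_glucose_level'].min(), strokes_table['avg_glucose_level'].max())  # 55.12 271.74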
from sdv.single_table import GaussianCopulaSynthesizer
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'bmi',
        'low_value': 11.3,
        'high_value': 97.6,
        'strict_boundaries': False
    }
}
my_constraint2 = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'avg_glucose_level',
        'low_value': 55.12,
        'high_value': 271.74,
        'strict_boundaries': False
    }
}
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions={'bmi': 'truncnorm', 'avg_glucose_level': 'truncnorm'}
)
synthesizer.add_constraints(constraints=[my_constraint, my_constraint2])
synthesizer.fit(strokes_table)
synthetic_data_2 = synthesizer.sample(num_rows=4000)
Sampling rows: 100%|██████████| 4000/4000 [00:00<00:00, 28692.59it/s]
New Evaluation¶
As you can see below, using a customized GaussianCopulaSynthesizer was not enough to improve the overall quality score, although the diagnostic no longer warns about missing numerical ranges. We could also use neural-network-based synthesizers such as CTGAN, TVAE, or CopulaGAN, but that analysis is not tested here; a sketch follows.
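For reference, swapping in a neural-network-based synthesizer is mostly a one-line change; this is an untested sketch, and the epochs value is illustrative.
from sdv.single_table import CTGANSynthesizer

# Untested alternative: a GAN-based synthesizer built from the same metadata.
# Training is considerably slower than the copula approach.
ctgan_synthesizer = CTGANSynthesizer(metadata, epochs=300)
ctgan_synthesizer.fit(strokes_table)
synthetic_data_ctgan = ctgan_synthesizer.sample(num_rows=4000)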
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=strokes_table,
    synthetic_data=synthetic_data_2,
    metadata=metadata
)
Creating report: 100%|██████████| 4/4 [00:00<00:00, 13.10it/s]
Overall Quality Score: 88.48%

Properties:
Column Shapes: 91.81%
Column Pair Trends: 85.16%
from sdv.evaluation.single_table import run_diagnostic
diagnostic_report = run_diagnostic(
    real_data=strokes_table,
    synthetic_data=synthetic_data_2,
    metadata=metadata
)
synthesizer.get_learned_distributions()
Creating report: 100%|██████████| 4/4 [00:12<00:00, 3.23s/it]
DiagnosticResults:

SUCCESS:
✓ The synthetic data covers over 90% of the numerical ranges present in the real data
✓ The synthetic data covers over 90% of the categories present in the real data
✓ Over 90% of the synthetic rows are not copies of the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
{'person_id': {'distribution': 'beta', 'learned_parameters': {'loc': 80.25890325886454, 'scale': 72859.74109674116, 'a': 0.9969547367188392, 'b': 0.9716645550572578}}, 'gender': {'distribution': 'beta', 'learned_parameters': {'loc': -2.5252690911627557e-05, 'scale': 2.429965774475864, 'a': 1.1674347481757146, 'b': 2.031096048032575}}, 'age': {'distribution': 'beta', 'learned_parameters': {'loc': 1.2374170728627083, 'scale': 80.7625829271373, 'a': 0.9923267833484788, 'b': 0.8396338204631779}}, 'hypertension': {'distribution': 'beta', 'learned_parameters': {'loc': -0.0015033493750864929, 'scale': 2.6734823331569064, 'a': 1.298928537009436, 'b': 4.583705567012164}}, 'heart_disease': {'distribution': 'beta', 'learned_parameters': {'loc': -0.0011121211788436343, 'scale': 2.5118242709872747, 'a': 1.3920048676282983, 'b': 5.027246739017915}}, 'ever_married': {'distribution': 'beta', 'learned_parameters': {'loc': 0.00015138900059352697, 'scale': 2.000541483317777, 'a': 0.9231752283513573, 'b': 1.2510098798236158}}, 'work_type': {'distribution': 'beta', 'learned_parameters': {'loc': -0.19329558094923285, 'scale': 136617.51361846982, 'a': 3.08290050373147, 'b': 205131.79060185206}}, 'Residence_type': {'distribution': 'beta', 'learned_parameters': {'loc': 0.0005564993489894335, 'scale': 1.9990261104896918, 'a': 0.9592938015075503, 'b': 0.9884101783243173}}, 'smoking_status': {'distribution': 'beta', 'learned_parameters': {'loc': 0.00016896520737423191, 'scale': 4.015279314682994, 'a': 1.0265460598108, 'b': 1.3662274288516505}}, 'stroke': {'distribution': 'beta', 'learned_parameters': {'loc': 0.000195975643300648, 'scale': 2.3818244739318652, 'a': 1.3553887073649618, 'b': 4.673247660622257}}, 'bmi#11.3#97.6': {'distribution': 'beta', 'learned_parameters': {'loc': -16.755951983495105, 'scale': 51.377743910445545, 'a': 630.7390090100525, 'b': 1469.4336390811734}}, 'avg_glucose_level#55.12#271.74': {'distribution': 'beta', 'learned_parameters': {'loc': -4.597277640616451, 'scale': 625222.6174629636, 'a': 9.019129645706993, 'b': 1728046.3185030504}}}
diagnostic_report.get_details(property_name='Coverage')
| Column | Metric | Diagnostic Score |
---|---|---|---|
0 | person_id | RangeCoverage | 0.999671 |
1 | age | RangeCoverage | 1.000000 |
2 | avg_glucose_level | RangeCoverage | 1.000000 |
3 | bmi | RangeCoverage | 0.631518 |
4 | gender | CategoryCoverage | 1.000000 |
5 | hypertension | CategoryCoverage | 1.000000 |
6 | heart_disease | CategoryCoverage | 1.000000 |
7 | ever_married | CategoryCoverage | 1.000000 |
8 | work_type | CategoryCoverage | 1.000000 |
9 | Residence_type | CategoryCoverage | 1.000000 |
10 | smoking_status | CategoryCoverage | 1.000000 |
11 | stroke | CategoryCoverage | 1.000000 |
quality_report.get_details(property_name='Column Shapes')
| Column | Metric | Quality Score |
---|---|---|---|
0 | person_id | KSComplement | 0.987458 |
1 | age | KSComplement | 0.912625 |
2 | avg_glucose_level | KSComplement | 0.913750 |
3 | bmi | KSComplement | 0.966333 |
4 | gender | TVComplement | 0.954792 |
5 | hypertension | TVComplement | 0.923458 |
6 | heart_disease | TVComplement | 0.910125 |
7 | ever_married | TVComplement | 0.924000 |
8 | work_type | TVComplement | 0.706875 |
9 | Residence_type | TVComplement | 0.996333 |
10 | smoking_status | TVComplement | 0.902667 |
11 | stroke | TVComplement | 0.918542 |