This is a presentation about data privacy and anonymization. It is aimed mostly at data practitioners, meaning people who work with data and the people who work alongside them. You can simulate data to create the insurance dataset; see the folder layout below to learn how.
Folders:
.
├── codebook - description of the simulated dataset, particularly what the columns of the dataframe mean.
│ ├── Insurance_data_ke.txt - created with csvkit's csvstat command.
│ └── insurance_report.html - generated by the pandas-profiling library, a shortcut for fast exploratory data analysis.
├── data - directory where the simulated data should be placed. Run utils/dataloader.py to generate it.
│ ├── feature_engineered_insurance2.csv - feature-engineered data used in the demo.
│ ├── feature_engineered_insurance.csv - data created for the same problem but with known issues; create a new one rather than using it.
│ ├── Insurance_data_ke.csv - The insurance dataset created by running python utils/dataloader.py
│ ├── Insurance_data_ke_featureeng.csv - Insurance dataset created as an intermediate step for feature engineering.
│ └── Organs.csv - data for a single patient recovering from surgery for heart disease; contains only their vitals from a thermometer and a pulse oximeter.
├── Dockerfile - a blueprint for running the project reproducibly; see the Docker build and run steps below.
├── environment.yml - a conda virtual environment file.
├── Kenya Data Protection Act - Quick Guide 2021.pdf - a demo for privacy engineering strategy at Deloitte.
├── Makefile - workflow orchestrator. Helps automate code formatting and repetitive tasks.
├── presentation - this directory has the presentations that were used live.
│ ├── presentation.pdf - HTML to PDF using LaTeX.
│ ├── presentation.slides.html - reveal.js presentation. Open with your browser.
│ ├── presentation2.html - Quarto version of the presentation.
│ └── presentation2.pdf - PDF version of the presentation.
├── presentation.ipynb - jupyter notebook with jupyter notebook extensions and reveal.js extension.
├── README.md - the file you are reading.
├── requirements.txt - what packages were used.
├── Screenshot from 2022-09-10 07-03-38.png - demo of PCA using the iris dataset.
└── utils - Scripts used to generate the simulated data
    ├── codebook.sh - bash script used to create the codebook Insurance_data_ke.txt
    ├── dataloader.py - data generator that uses methods from the faker library and numpy.
    └── Feature_engineering.ipynb - the feature engineering workflow I use to make the insurance dataset ready for statistical modeling, aka machine learning.
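
To give a feel for what a data generator like utils/dataloader.py does, here is a minimal sketch of simulating insurance-style records. It is not the project's actual script: that one uses faker and numpy, while this stand-in uses only the Python standard library so it runs anywhere. The column names, counties, and the premium formula are all hypothetical; the real schema is described in the codebook folder.

```python
import csv
import io
import random

# Hypothetical column names; the real schema lives in the codebook folder.
COLUMNS = ["policy_id", "age", "county", "annual_premium_kes", "has_claim"]
# Illustrative values only, not taken from the project's data.
COUNTIES = ["Nairobi", "Mombasa", "Kisumu", "Nakuru"]


def simulate_policies(n_rows, seed=0):
    """Return n_rows of fake policy records as a list of dicts."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for i in range(n_rows):
        age = rng.randint(18, 80)
        # Premium grows loosely with age, plus Gaussian noise (made-up formula).
        premium = round(20_000 + 500 * age + rng.gauss(0, 5_000), 2)
        rows.append({
            "policy_id": f"P{i:05d}",
            "age": age,
            "county": rng.choice(COUNTIES),
            "annual_premium_kes": premium,
            "has_claim": int(rng.random() < 0.2),  # roughly 20% claim rate
        })
    return rows


def to_csv(rows):
    """Serialize the simulated rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = simulate_policies(5, seed=42)
print(to_csv(rows))
```

Because every value is drawn from a random generator, no real person's record is ever involved, which is what makes simulated data convenient for a privacy demo. Fixing the seed keeps the generated file identical between runs.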
If you have Anaconda or Miniconda installed, complete the following steps in the data-privacy-pres directory.
- Create the virtual environment
conda env create -f data-privacy-env.yml
- This creates an environment called data-privacy-env. Activate it with:
source activate data-privacy-env
Build the Docker image
sudo docker build -t data-privacy-env:v1 .
Run the Docker image
sudo docker run -p 9999:9999 data-privacy-env:v1
- https://www.manning.com/books/data-privacy
- https://ethics.fast.ai/
- https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
- https://www.manning.com/books/privacy-preserving-machine-learning
- https://www.manning.com/books/grokking-deep-learning
- https://www.manning.com/books/build-a-career-in-data-science
- https://www.datacamp.com/courses/data-privacy-and-anonymization-in-python