I want to generate synthetic data based on real data.
Data sample:
| | session_id | session_date_time | session_status | mentor_domain_id | mentor_id | reg_date_mentor | region_id_mentor | mentee_id | reg_date_mentee | region_id_mentee |
---|---|---|---|---|---|---|---|---|---|---|
5528 | 9165 | 2022-09-03 00:00:00 | finished | 5 | 20410 | 2022-04-28 00:00:00 | 6 | 11557 | 2021-05-15 00:00:00 | 3 |
2370 | 3891 | 2022-05-30 00:00:00 | canceled | 1 | 20879 | 2021-10-07 00:00:00 | 1 | 10154 | 2022-05-22 00:00:00 | 1 |
6473 | 10683 | 2022-09-15 00:00:00 | finished | 2 | 21457 | 2022-01-13 00:00:00 | 1 | 14505 | 2022-09-11 00:00:00 | 1 |
1671 | 2754 | 2022-04-22 00:00:00 | canceled | 6 | 21851 | 2021-08-24 00:00:00 | 1 | 13579 | 2021-09-12 00:00:00 | 2 |
324 | 527 | 2021-10-30 00:00:00 | finished | 1 | 22243 | 2021-07-04 00:00:00 | 1 | 14096 | 2021-10-10 00:00:00 | 10 |
4500 | 7453 | 2022-08-13 00:00:00 | finished | 4 | 22199 | 2021-12-02 00:00:00 | 5 | 11743 | 2021-11-01 00:00:00 | 8 |
2356 | 3875 | 2022-05-29 00:00:00 | finished | 2 | 21434 | 2022-04-29 00:00:00 | 4 | 14960 | 2021-12-12 00:00:00 | 0 |
2722 | 4491 | 2022-06-16 00:00:00 | finished | 2 | 21462 | 2022-06-05 00:00:00 | 7 | 12627 | 2021-02-23 00:00:00 | 2 |
6016 | 9929 | 2022-09-10 00:00:00 | finished | 1 | 20802 | 2021-08-07 00:00:00 | 1 | 10121 | 2022-07-30 00:00:00 | 1 |
4899 | 8121 | 2022-08-22 00:00:00 | finished | 1 | 24920 | 2021-10-19 00:00:00 | 5 | 12223 | 2022-07-04 00:00:00 | 4 |
This data comes from several database tables merged together. I used it in my project.
For this data I have many SQL queries, a few correlation matrices, and one nonlinear regression model.
First of all, I need to generate new data with similar properties (I can't use the original data in my portfolio case). It would also be great if there were a way to generate data covering a longer time period.
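To make "similar properties" and "longer time period" concrete, here is a minimal sketch of a hand-rolled generator using only the Python standard library. The column names come from the sample above; everything else is an assumption for illustration: the helper names `make_people` and `make_sessions`, the uniform distributions, the 80% finished rate, the ID ranges, and drawing `mentor_domain_id` per session. It only keeps the structural rules (a session never happens before either registration date, and `session_id` grows with the date), not the real correlations.

```python
import random
from datetime import datetime, timedelta

def make_people(n, first_id, start, end, n_regions, rng):
    """Build a lookup table: id -> (registration date, region id)."""
    span_days = (end - start).days
    return {
        first_id + i: (start + timedelta(days=rng.randrange(span_days + 1)),
                       rng.randrange(n_regions))
        for i in range(n)
    }

def make_sessions(n, mentors, mentees, end, rng, p_finished=0.8, n_domains=6):
    """Generate n session rows dated on or after both registration dates."""
    mentor_ids, mentee_ids = list(mentors), list(mentees)
    rows = []
    for _ in range(n):
        mentor_id = rng.choice(mentor_ids)
        mentee_id = rng.choice(mentee_ids)
        reg_mentor, region_mentor = mentors[mentor_id]
        reg_mentee, region_mentee = mentees[mentee_id]
        # A session can only take place once both sides are registered.
        earliest = max(reg_mentor, reg_mentee)
        offset = rng.randrange((end - earliest).days + 1)
        rows.append({
            "session_date_time": earliest + timedelta(days=offset),
            "session_status": "finished" if rng.random() < p_finished else "canceled",
            "mentor_domain_id": rng.randint(1, n_domains),  # placeholder: drawn per session
            "mentor_id": mentor_id,
            "reg_date_mentor": reg_mentor,
            "region_id_mentor": region_mentor,
            "mentee_id": mentee_id,
            "reg_date_mentee": reg_mentee,
            "region_id_mentee": region_mentee,
        })
    # Number the sessions chronologically so session_id increases with the date.
    rows.sort(key=lambda r: r["session_date_time"])
    return [{"session_id": i, **r} for i, r in enumerate(rows, start=1)]

rng = random.Random(0)  # fixed seed so the output is reproducible
start, end = datetime(2021, 1, 1), datetime(2024, 12, 31)  # longer period than the sample
mentors = make_people(50, 20000, start, end, n_regions=11, rng=rng)
mentees = make_people(200, 10000, start, end, n_regions=11, rng=rng)
rows = make_sessions(1000, mentors, mentees, end, rng)
print(rows[0])
```

Extending the time period is then just a matter of moving `end`. What this sketch does not do is reproduce the distributions and correlations of the real data, which is the part I am unsure how to approach.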
Where should I start? Can I solve this problem with sklearn.datasets?
PS: I have already tried Synthetic Data Vault and failed. I can't use Faker because I need to preserve the data structure.