How to create synthetic data based on real data?

Question

I want to make synthetic data based on real data.

Data sample:

	session_id	session_date_time	session_status	mentor_domain_id	mentor_id	reg_date_mentor	region_id_mentor	mentee_id	reg_date_mentee	region_id_mentee
5528	9165	2022-09-03 00:00:00	finished	5	20410	2022-04-28 00:00:00	6	11557	2021-05-15 00:00:00	3
2370	3891	2022-05-30 00:00:00	canceled	1	20879	2021-10-07 00:00:00	1	10154	2022-05-22 00:00:00	1
6473	10683	2022-09-15 00:00:00	finished	2	21457	2022-01-13 00:00:00	1	14505	2022-09-11 00:00:00	1
1671	2754	2022-04-22 00:00:00	canceled	6	21851	2021-08-24 00:00:00	1	13579	2021-09-12 00:00:00	2
324	527	2021-10-30 00:00:00	finished	1	22243	2021-07-04 00:00:00	1	14096	2021-10-10 00:00:00	10
4500	7453	2022-08-13 00:00:00	finished	4	22199	2021-12-02 00:00:00	5	11743	2021-11-01 00:00:00	8
2356	3875	2022-05-29 00:00:00	finished	2	21434	2022-04-29 00:00:00	4	14960	2021-12-12 00:00:00	0
2722	4491	2022-06-16 00:00:00	finished	2	21462	2022-06-05 00:00:00	7	12627	2021-02-23 00:00:00	2
6016	9929	2022-09-10 00:00:00	finished	1	20802	2021-08-07 00:00:00	1	10121	2022-07-30 00:00:00	1
4899	8121	2022-08-22 00:00:00	finished	1	24920	2021-10-19 00:00:00	5	12223	2022-07-04 00:00:00	4

This data comes from merged database tables. I used it for my project.

I have many SQL queries, a few correlation matrices, and one nonlinear regression model built on this data.

First of all, I need to make new data with similar properties (I can't use the original data for my portfolio case). It would also be great if there were a way to generate data for a longer time period.

Where should I start? Can I solve this problem with sklearn.datasets?

PS: I already tried Synthetic Data Vault and failed. I can't use Faker because I need to keep the data structure.

Define "synthetic data". What is it going to be used for? What properties must it have? — Silver Light, Jun 26 '23 at 10:11
I need a copy of the database that I used for my project. I have many SQL queries, a few correlation matrices, and one nonlinear regression model built on this data. First of all, I need to make new data with similar properties. It would also be great if there were a way to generate data for a longer time period. — John Doe, Jun 26 '23 at 13:43
Look at using Faker to create the data. You will have to specify the properties of each data type, but that is easily done with Faker. — itprorh66, Jun 26 '23 at 14:36

score 1 · Answer 1 · answered Jun 29 '23 at 14:29

In my opinion this is the best synthetic data generation (SDG) project out there, and it has a GUI: https://github.com/ydataai/ydata-synthetic/

Gonçalo Martins Ribeiro


I saw this project; I'll try it later. =) – John Doe Jun 29 '23 at 20:02

score 0 · Answer 2 · answered Jun 26 '23 at 23:44

I am not positive this is what you are looking for, but here is a way to use Faker to create sample data that conforms to specific criteria.

import random as rnd

import pandas as pd
from faker import Faker

fake = Faker()  # Faker instance that generates the random values

dflen = 10  # number of rows to generate
df1 = pd.DataFrame()
df1 = df1.assign(
    session_id=pd.Series(fake.unique.random_int(min=800, max=5000) for i in range(dflen)),
    session_date_time=pd.Series(fake.date_between_dates(pd.to_datetime('2022-01-01'), pd.to_datetime('2022-12-31')) for i in range(dflen)),
    session_status=pd.Series(rnd.choice(['Finished', 'Canceled']) for i in range(dflen)),
    mentor_domain_id=pd.Series(fake.unique.random_int(min=1, max=35) for i in range(dflen)),
    mentor_id=pd.Series(fake.unique.random_int(min=1000, max=3000) for i in range(dflen)),
    Reg_date_mentor=pd.Series(fake.date_between_dates(pd.to_datetime('2001-01-01'), pd.to_datetime('2013-12-31')) for i in range(dflen)),
    mentor_mentee_id=pd.Series(fake.unique.random_int(min=15, max=90) for i in range(dflen)),
)

df1

This will create a df of the form:

    session_id  session_date_time   session_status  mentor_domain_id    mentor_id   Reg_date_mentor mentor_mentee_id
0   2030    2022-04-27  Canceled    24  2546    2003-08-21  77
1   4721    2022-01-29  Canceled    26  1205    2003-09-11  60
2   4208    2022-11-15  Canceled    5   1718    2010-08-10  38
3   1220    2022-02-11  Canceled    16  2864    2008-07-30  41
4   4268    2022-05-12  Canceled    30  2160    2009-08-20  67
5   3942    2022-06-02  Canceled    12  1776    2003-11-18  73
6   2229    2022-03-13  Canceled    20  2250    2003-12-28  37
7   1696    2022-06-07  Finished    31  2268    2010-06-04  44
8   3898    2022-11-03  Finished    9   1331    2012-01-08  23
9   3761    2022-11-14  Canceled    29  1682    2008-09-09  47

You can further customize the data and make the values in one column depend on another, depending on your specific needs.
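As a rough illustration of such a dependency between columns, here is a minimal sketch in which each session date is drawn on or after the mentor's registration date, and the status is sampled with weights instead of uniformly. It uses only the standard library so that it runs without Faker. The 80/20 weights, the date ranges, and the helper names `random_date` and `make_row` are assumptions made up for this example, not values taken from the real data.

```python
import random
from datetime import date, timedelta

rng = random.Random(42)  # fixed seed so the sketch is reproducible

REG_START = date(2021, 1, 1)       # assumed earliest registration date
REG_END = date(2022, 6, 30)        # assumed latest registration date
LAST_SESSION = date(2022, 12, 31)  # assumed end of the session window


def random_date(start, end):
    """Return a random date in the inclusive range [start, end]."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def make_row(session_id):
    """Build one row whose session date never precedes the registration date."""
    reg_date_mentor = random_date(REG_START, REG_END)
    session_date = random_date(reg_date_mentor, LAST_SESSION)
    # Weighted choice: 'finished' is assumed to be about four times as common.
    status = rng.choices(["finished", "canceled"], weights=[0.8, 0.2])[0]
    return {
        "session_id": session_id,
        "session_date_time": session_date,
        "session_status": status,
        "reg_date_mentor": reg_date_mentor,
    }


rows = [make_row(session_id) for session_id in range(1, 11)]
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to get a frame like the one above.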

Thanks for the answer, but I finally generated new data with GaussianCopulaSynthesizer from Synthetic Data Vault. — John Doe, Jun 29 '23 at 09:41
If you have answered your own question, it would be nice if you could share your solution by providing an Answer to your own question and marking it as accepted. In this way, we all learn. — itprorh66, Jun 29 '23 at 13:44

score 0 · Answer 3 · answered Jun 29 '23 at 14:18

I generated new data with GaussianCopulaSynthesizer from Synthetic Data Vault. I added some of the predefined constraint classes for some columns and ran conditional sampling to keep the properties of the original dataset.

# Create metadata for the dataset. The column types are detected automatically,
# but I also updated the metadata by hand for every column (one update is shown).

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=sessions_and_users1)

metadata.update_column(
    column_name='session_id',
    sdtype='id',
    regex_format='[0-9]{6}')
metadata.validate()

# Create the synthesizer (this synthesizer works better for my data).

distributions = {
    'reg_date_mentee': 'uniform',
    'mentee_id': 'uniform'
}

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distributions)

# Add constraints to the synthesizer (rules that every row in the data must follow).
# I added constraints for most columns.

my_constraint_mentee_id = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'mentee_id',
        'low_value': 20001,
        'high_value': 21847,
        'strict_boundaries': False
    }
}

synthesizer.add_constraints(constraints=[
    my_constraint_mentee_id
])

# Fit the synthesizer.

synthesizer.fit(sessions_and_users1)

# Build the conditions you need with Condition from sdv.sampling
# and keep them all in a list named `conditions` (omitted here).

# Sample data using the conditions.

synthetic_data_with_conditions = synthesizer.sample_from_conditions(
    conditions=conditions)

I won't add the full code, as it would take up too much space.
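The list of conditions is not shown above. As a minimal sketch of what it could look like, assuming you want a fixed number of rows per `session_status`, the snippet below describes the conditions as plain dicts and converts them to `Condition` objects. The 70/30 split and the helper name `build_conditions` are hypothetical and chosen only for illustration; they are not the values used in the original project.

```python
# Hypothetical condition specs: 70 finished and 30 canceled sessions.
# The counts are illustrative assumptions, not the original author's values.
condition_specs = [
    {"num_rows": 70, "column_values": {"session_status": "finished"}},
    {"num_rows": 30, "column_values": {"session_status": "canceled"}},
]


def build_conditions(specs):
    """Turn plain dict specs into sdv.sampling.Condition objects."""
    # Imported here so the specs can be inspected without SDV installed.
    from sdv.sampling import Condition

    return [
        Condition(num_rows=spec["num_rows"], column_values=spec["column_values"])
        for spec in specs
    ]


# Total number of rows requested across all conditions.
total_rows = sum(spec["num_rows"] for spec in condition_specs)
```

With SDV installed and a fitted synthesizer, `conditions = build_conditions(condition_specs)` gives a list that can be passed to `synthesizer.sample_from_conditions(conditions=conditions)`, which should then return `total_rows` rows in total.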
