0

I want to make synthetic data based on real data.

Data sample:

      session_id  session_date_time    session_status  mentor_domain_id  mentor_id  reg_date_mentor      region_id_mentor  mentee_id  reg_date_mentee      region_id_mentee
5528  9165        2022-09-03 00:00:00  finished        5                 20410      2022-04-28 00:00:00  6                 11557      2021-05-15 00:00:00  3
2370  3891        2022-05-30 00:00:00  canceled        1                 20879      2021-10-07 00:00:00  1                 10154      2022-05-22 00:00:00  1
6473  10683       2022-09-15 00:00:00  finished        2                 21457      2022-01-13 00:00:00  1                 14505      2022-09-11 00:00:00  1
1671  2754        2022-04-22 00:00:00  canceled        6                 21851      2021-08-24 00:00:00  1                 13579      2021-09-12 00:00:00  2
324   527         2021-10-30 00:00:00  finished        1                 22243      2021-07-04 00:00:00  1                 14096      2021-10-10 00:00:00  10
4500  7453        2022-08-13 00:00:00  finished        4                 22199      2021-12-02 00:00:00  5                 11743      2021-11-01 00:00:00  8
2356  3875        2022-05-29 00:00:00  finished        2                 21434      2022-04-29 00:00:00  4                 14960      2021-12-12 00:00:00  0
2722  4491        2022-06-16 00:00:00  finished        2                 21462      2022-06-05 00:00:00  7                 12627      2021-02-23 00:00:00  2
6016  9929        2022-09-10 00:00:00  finished        1                 20802      2021-08-07 00:00:00  1                 10121      2022-07-30 00:00:00  1
4899  8121        2022-08-22 00:00:00  finished        1                 24920      2021-10-19 00:00:00  5                 12223      2022-07-04 00:00:00  4

This data comes from merged database tables. I used it for my project.

I have many SQL queries, a few correlation matrices for this data, and one nonlinear regression model.

First of all, I need to generate new data with similar properties (I can't use the original data in my portfolio case). It would also be great if there were a way to generate data for a longer time period.

Where should I start? Can I solve this problem with sklearn.datasets?

PS: I already tried Synthetic Data Vault and failed. I can't use Faker because I need to keep the data structure.

John Doe
  • 1
    Define "synthetic data". What is it going to be used for? What properties must it have? – Silver Light Jun 26 '23 at 10:11
  • I need a copy of the database that I used for my project. I have many SQL queries, a few correlation matrices for this data, and one nonlinear regression model. First of all, I need to generate new data with similar properties. It would also be great if there were a way to generate data for a longer time period. – John Doe Jun 26 '23 at 13:43
  • 1
    Look at using Faker to create data. This will require you to specify the properties of each data type, but that is easily accomplished with Faker. – itprorh66 Jun 26 '23 at 14:36

3 Answers

1

This is the best synthetic data generation (SDG) project out there, and it has a GUI: https://github.com/ydataai/ydata-synthetic/

0

I am not positive this is what you are looking for, but here is a way to use Faker to create sample data that conforms to specific criteria.

from faker import Faker
import pandas as pd
import random as rnd  # used for the session_status choice

fake = Faker()  # Faker instance that generates the values

dflen = 10  # number of rows to generate
df1 = pd.DataFrame()
# Each column is built from a generator of dflen fake values
df1 = df1.assign(session_id = pd.Series(fake.unique.random_int(min=800, max=5000) for i in range(dflen)),
                 session_date_time = pd.Series(fake.date_between_dates(pd.to_datetime('2022-01-01'),pd.to_datetime('2022-12-31')) for i in range(dflen)),
                 session_status = pd.Series(rnd.choice(['Finished', 'Canceled']) for i in range(dflen)),
                 mentor_domain_id = pd.Series(fake.unique.random_int(min=1, max=35) for i in range(dflen)),
                 mentor_id = pd.Series(fake.unique.random_int(min=1000, max=3000) for i in range(dflen)),
                 Reg_date_mentor = pd.Series(fake.date_between_dates(pd.to_datetime('2001-01-01'),pd.to_datetime('2013-12-31')) for i in range(dflen)),
                 mentor_mentee_id = pd.Series(fake.unique.random_int(min=15, max=90) for i in range(dflen)))

df1

This will create a df of the form:

    session_id  session_date_time   session_status  mentor_domain_id    mentor_id   Reg_date_mentor mentor_mentee_id
0   2030    2022-04-27  Canceled    24  2546    2003-08-21  77
1   4721    2022-01-29  Canceled    26  1205    2003-09-11  60
2   4208    2022-11-15  Canceled    5   1718    2010-08-10  38
3   1220    2022-02-11  Canceled    16  2864    2008-07-30  41
4   4268    2022-05-12  Canceled    30  2160    2009-08-20  67
5   3942    2022-06-02  Canceled    12  1776    2003-11-18  73
6   2229    2022-03-13  Canceled    20  2250    2003-12-28  37
7   1696    2022-06-07  Finished    31  2268    2010-06-04  44
8   3898    2022-11-03  Finished    9   1331    2012-01-08  23
9   3761    2022-11-14  Canceled    29  1682    2008-09-09  47 

You can further customize the data and make the values in one column depend on another, depending on your specific needs.
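As a rough illustration of making one column depend on another, here is a minimal sketch that draws the session date first and then bounds the mentor's registration date by it. It uses only the standard library `random` module instead of Faker so that it runs anywhere; the date ranges, the 80/20 status weights, and the helper names `random_date` and `make_row` are assumptions for illustration, not values taken from the real data.

```python
import random
from datetime import date, timedelta

random.seed(42)  # fixed seed so the sketch is reproducible

def random_date(start, end):
    """Return a uniformly random date between start and end (inclusive)."""
    return start + timedelta(days=random.randint(0, (end - start).days))

def make_row(session_id):
    # A mentor must have registered on or before the session date,
    # so draw the session date first and use it as the upper bound.
    session_date = random_date(date(2022, 1, 1), date(2022, 12, 31))
    reg_date_mentor = random_date(date(2021, 1, 1), session_date)
    # Assumed 80/20 split between statuses; replace with the real frequencies.
    status = random.choices(["finished", "canceled"], weights=[0.8, 0.2])[0]
    return {
        "session_id": session_id,
        "session_date_time": session_date,
        "session_status": status,
        "reg_date_mentor": reg_date_mentor,
    }

rows = [make_row(i) for i in range(1, 11)]
for row in rows[:3]:
    print(row)
```

The same ordering trick should carry over to Faker by passing the already generated session date as the end date of `date_between_dates`.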

itprorh66
  • Thanks for the answer, but I finally generated the new data with GaussianCopulaSynthesizer from Synthetic Data Vault. – John Doe Jun 29 '23 at 09:41
  • Not a library I am familiar with but will check it out. – itprorh66 Jun 29 '23 at 13:43
  • If you have answered your own question, it would be nice if you could share your solution by providing an Answer to your own question and marking it as accepted. In this way, we all learn. – itprorh66 Jun 29 '23 at 13:44
0

I generated the new data with GaussianCopulaSynthesizer from Synthetic Data Vault. I added some predefined constraint classes for some columns and ran conditional sampling to keep the properties of the original dataset.

# Create metadata for the dataset (this step is optional, because metadata can be detected automatically).
# I updated the metadata for every column.

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()
# Detect metadata from the same DataFrame that is used for fitting below
metadata.detect_from_dataframe(data=sessions_and_users1)

metadata.update_column(
    column_name='session_id',
    sdtype='id',
    regex_format='[0-9]{6}')
metadata.validate()

# Create the synthesizer (this synthesizer worked better for my data).

distributions = {
    'reg_date_mentee': 'uniform',
    'mentee_id': 'uniform'
}

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions=distributions)

# Add constraints to the synthesizer (rules that every row in the data must follow).
# I added constraints for most columns.

my_constraint_mentee_id = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'mentee_id',
        'low_value': 20001,
        'high_value': 21847,
        'strict_boundaries': False
    }
}

synthesizer.add_constraints(constraints=[
    my_constraint_mentee_id
])

# Fit the synthesizer.

synthesizer.fit(sessions_and_users1)

# Build the list of conditions.

# Create the conditions you need with Condition from sdv.sampling
# and keep them all in a list called `conditions`.

# Sample data that satisfies the conditions.

synthetic_data_with_conditions = synthesizer.sample_from_conditions(
    conditions=conditions)

I won't post the full code because it would take up too much space.
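Since the list of conditions is omitted above, here is a minimal sketch of one way it could be planned: split the requested number of rows across the `session_status` values in proportion to how often each appears in the original data, then wrap each share in a `Condition` from `sdv.sampling`. The helper names `plan_condition_rows` and `build_conditions`, the choice of `session_status` as the conditioned column, and the total of 1000 rows are assumptions for illustration, not the conditions I actually used. The SDV import is done lazily inside `build_conditions`, so the planning part runs without SDV installed.

```python
from collections import Counter

def plan_condition_rows(statuses, total_rows):
    """Split total_rows across status values in proportion to their observed frequency."""
    counts = Counter(statuses)
    n = sum(counts.values())
    # Start with the floor of each exact share...
    plan = {s: (c * total_rows) // n for s, c in counts.items()}
    # ...then give the leftover rows to the largest fractional remainders.
    leftover = total_rows - sum(plan.values())
    by_remainder = sorted(counts, key=lambda s: (counts[s] * total_rows) % n, reverse=True)
    for s in by_remainder[:leftover]:
        plan[s] += 1
    return plan

def build_conditions(plan, column="session_status"):
    """Wrap the plan in SDV Condition objects (requires SDV 1.x to be installed)."""
    from sdv.sampling import Condition  # imported lazily on purpose
    return [Condition(num_rows=rows, column_values={column: value})
            for value, rows in plan.items() if rows > 0]

# Statuses of the ten sample rows shown in the question: 8 finished, 2 canceled.
sample_statuses = ["finished"] * 8 + ["canceled"] * 2
plan = plan_condition_rows(sample_statuses, total_rows=1000)
print(plan)  # {'finished': 800, 'canceled': 200}
```

With SDV installed, `conditions = build_conditions(plan)` would produce a list that can be passed to `synthesizer.sample_from_conditions(conditions=conditions)`.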

John Doe