
Here is my problem: I have to generate some synthetic data (7 or 8 columns) whose columns are correlated with each other (measured with the Pearson coefficient). I can do this easily, but next I have to insert a percentage of duplicates in each column, different for each column (yes, the Pearson coefficient will be lower). The problem is that I don't want to insert the duplicates by hand, because in my case that would be cheating.

Does anyone know how to generate correlated data that already contains duplicates? I've searched, but the questions I found are usually about dropping or avoiding duplicates.

Language: Python 3. To generate the correlated data I'm using this simple code: Generating correlated data

davide

2 Answers


Try something like this:

import numpy as np

# pick random row indices (with replacement) for the rows to duplicate
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))

# an ndarray has no append method, so stack the selected rows below the original ones
array = np.vstack([array, array[indices]])

Here I assume that your data is stored in array, an ndarray in which each row contains your 7 or 8 columns of data, and that percentage is a fraction between 0 and 1. The code above draws an array of random row indices, selects those rows and appends them to the array again.

Thomas Lang
  • This is what I would like to avoid doing. I would like to have a number generator whose output already includes duplicates. Another problem with this answer: I don't want to append rows, the array should keep the same dimensions. However, thanks for the answer. – davide Dec 08 '18 at 16:31
  • So, you don't want all new entries to be duplicates but some new data and some duplicates? – Thomas Lang Dec 08 '18 at 16:34

I found the solution. I'm posting the code, as it might be helpful to someone.

import numpy as np
import pandas as pd

# this is the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
# this array represents one column of the covariance matrix (I want correlated data, so I randomly choose numbers between 0.8 and 0.95)
# I added 7 more columns, with varying ranges of values (all above 0.7)
attr1 = np.random.uniform(0.8, .95, size = (8,1))
# attr2, ..., attr8 are built like attr1

#corr_mat is the matrix, union of columns
corr_mat = np.column_stack((attr1,attr2,attr3,attr4,attr5, attr6,attr7,attr8))

from statsmodels.stats.correlation_tools import cov_nearest
# with this function I find the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)

from scipy.linalg import cholesky

upper_chol = cholesky(a)

# Finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but it is customizable)

# next I create a pandas DataFrame with the values of ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4', 
                            'att5', 'att6', 'att7', 'att8'])

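# last step: round the float values to a different number of decimals per
# column, so each column gets a different percentage of duplicates
for col in df.columns:
    # randint(5, 12) returns an int from 5 to 11 (the upper bound is excluded)
    trunc = np.random.randint(5, 12)
    print(trunc)
    # assigning through the DataFrame avoids relying on df.values being a writable view
    df[col] = df[col].round(decimals=trunc)

# the float values of ans carry about 16 significant digits, so rounding each
# column to a randomly chosen 5 to 11 decimals makes some values collide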
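
For readers who want to try the idea at a smaller scale, here is a minimal self-contained sketch of the same two steps (Cholesky factor for the correlation, then per-column rounding for the duplicates). It is not the exact code above: the 0.9 target correlation, the 4 columns, the normally distributed input and the decimals list are all assumptions chosen for illustration, and the target matrix is built directly as a valid correlation matrix, so cov_nearest is not needed.

```python
import numpy as np
from scipy.linalg import cholesky

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 4

# Assumed target: every pair of columns has correlation 0.9.
# Ones on the diagonal and a constant value in [0, 1) elsewhere
# gives a positive definite matrix, so Cholesky succeeds.
target = np.full((n_cols, n_cols), 0.9)
np.fill_diagonal(target, 1.0)

# scipy returns the upper-triangular factor U with U.T @ U == target
upper = cholesky(target)

# Independent standard normal columns become correlated after the product
data = rng.standard_normal((n_rows, n_cols)) @ upper

# Round each column to its own (arbitrary) number of decimals to create repeats
decimals = [1, 2, 3, 4]
for col, d in enumerate(decimals):
    data[:, col] = np.round(data[:, col], d)

# Empirical correlation matrix and share of repeated values per column
corr = np.corrcoef(data, rowvar=False)
dup_rate = [100 * (1 - np.unique(data[:, c]).size / n_rows) for c in range(n_cols)]

print(np.round(corr, 3))
print(dup_rate)
```

Fewer decimals means more collisions, so the first column ends up with the highest duplicate rate, while the correlations stay close to the target because the rounding error is small compared with the spread of the data.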

Finally, these are the duplicate percentages I obtained for each column of df:

duplicate rate attribute: att1 = 5.159390000000002

duplicate rate attribute: att2 = 11.852260000000001

duplicate rate attribute: att3 = 12.036079999999998

duplicate rate attribute: att4 = 35.10611

duplicate rate attribute: att5 = 4.6471599999999995

duplicate rate attribute: att6 = 35.46553

duplicate rate attribute: att7 = 0.49115000000000464

duplicate rate attribute: att8 = 37.33252
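
The code that printed these rates is not shown above. One possible way to compute them, assuming that "duplicate rate" means the percentage of values in a column that repeat an earlier value in that column (the helper name duplicate_rate and the tiny example frame are hypothetical), is sketched here:

```python
import pandas as pd

def duplicate_rate(frame):
    """Percentage of values in each column that repeat an earlier value."""
    # nunique() counts distinct values per column; the rest are repeats
    return 100 * (1 - frame.nunique() / len(frame))

# Tiny illustrative frame, not the real data
example = pd.DataFrame({
    "att1": [0.1, 0.1, 0.2, 0.3],
    "att2": [1.0, 1.0, 1.0, 1.0],
})

rates = duplicate_rate(example)
for name, rate in rates.items():
    print(f"duplicate rate attribute: {name} = {rate}")
```

Calling duplicate_rate(df) on the rounded DataFrame would give one percentage per column in the same format as the list above, under that definition.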
davide