I need to generate a random n-dimensional dataset with m tuples. The first four dimensions should be correlated with the ground-truth vector y,
and the remaining ones should be generated arbitrarily. I will use the dataset for a regression task with scikit-learn. How can I generate this data?
For example: a dataset with m = 10000 tuples and n = 100 dimensions.
After that, I need to split the dataset so that a randomly selected 70% of the tuples are used for training and the remaining 30% are used for testing.
PS: I have found this code in scikit-learn, but I am not sure if I can use it. How can I adapt it to my problem?
from sklearn import datasets

x, y, coef = datasets.make_regression(
    n_samples=100,    # number of samples
    n_features=1,     # number of features
    n_informative=1,  # number of features actually used to build y
    noise=10,         # standard deviation of the Gaussian noise added to y
    coef=True,        # also return the true coefficients used to generate the data
    random_state=0,   # fixed seed so each run produces the same data points
)
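
A minimal sketch of how the snippet above might be adapted to the sizes in the example (m = 10000, n = 100, four informative dimensions, 70/30 split). It assumes that "correlated with y" can be read as "has a nonzero coefficient in the linear model that generates y", and it relies on `shuffle=False` so that the informative features stay in the first four columns. The noise level and the seeds are arbitrary choices, not values from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# m = 10000 tuples, n = 100 dimensions, only 4 of them drive y.
# shuffle=False keeps the informative features as columns 0..3;
# with the default shuffle=True they would be scattered among the columns.
X, y, coef = make_regression(
    n_samples=10000,
    n_features=100,
    n_informative=4,
    noise=10,         # arbitrary noise level (assumption)
    shuffle=False,
    coef=True,
    random_state=0,   # arbitrary seed (assumption)
)

# Random 70% / 30% split into training and test tuples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
print(np.flatnonzero(coef))  # indices of the features that actually influence y
```

The remaining 96 columns are drawn independently of y, so their sample correlation with y should be close to zero but will generally not be exactly zero.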