2

I have 250k text documents (tweets and newspaper articles) represented as vectors obtained with a doc2vec model. Now, I want to use a regressor (multiple linear regression) to predict continuous value outputs - in my case the UK Consumer Confidence Index. My code runs, since forever. What am I doing wrong?

I imported my data from Excel and splitted it into x_train and x_dev. The data are composed of preprocessed text and CCI continuous values.

# Import doc2vec model
dbow = Doc2Vec.load('dbow_extended.d2v')
dmm = Doc2Vec.load('dmm_extended.d2v')
concat = ConcatenatedDoc2Vec([dbow, dmm]) # model uses vector_size 400

def get_vectors(model, input_docs):
    vectors = [model.infer_vector(doc.words) for doc in input_docs]
    return vectors

# Prepare X_train and y_train
train_text = x_train["preprocessed_text"].tolist()
train_tagged = [TaggedDocument(words=str(_d).split(), tags=[str(i)]) for i, _d in list(enumerate(train_text))]
X_train = get_vectors(concat, train_tagged)
y_train=x_train['CCI_UK']

# Fit regressor 
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# Predict and evaluate
prediction=reg.predict(X_dev)
print(classification_report(y_true=y_dev,y_pred=prediction),'\n')

Since the fitting never completed, I wonder whether I am using a wrong input. However, no error message is shown and the code simply runs forever. What am I doing wrong?

Thank you so much for your help!!

1 Answers1

1

The variable X_train is a list or a list of lists (since the function get_vectors() return a list) whereas the input to sklearn's Linear Regression should be a 2-D array.

Try converting X_train to an array using this :

X_train = np.array(X_train)

This should help !

Amit
  • 101
  • 1
  • 5