I get the error below when running the code that follows. I am following a tutorial based on this page: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184


 File "reviewsML.py", line 58, in <module>
    X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
….
ValueError: Found input variables with inconsistent numbers of samples: [25707, 25000]

Here is the relevant part of the code:

import codecs
import re

reviews_train = []
for line in codecs.open('movie_data/full_train.txt', 'r', 'utf-8'):
    reviews_train.append(line.strip())

reviews_test = []
for line in codecs.open('movie_data/full_test.txt', 'r', 'utf-8'):
    reviews_test.append(line.strip())

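# Punctuation to delete outright
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# HTML line breaks, hyphens and slashes to replace with a space
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    # Lowercase each review and strip punctuation, then turn separators into spaces
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    return reviews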

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)
print(len(reviews_train_clean))

from sklearn.feature_extraction.text import CountVectorizer

# Binary bag-of-words features: 1 if a word occurs in the review, else 0
cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)  # sparse matrix with one row per review
X_test = cv.transform(reviews_test_clean)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labels assume exactly 25000 reviews: 1 for the first 12500, 0 for the rest
target = [1 if i < 12500 else 0 for i in range(25000)]
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.75)

# The hyperparameter C adjusts the regularization strength
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_val, lr.predict(X_val))))


Do you know what I am doing wrong?

I tried print(X.shape[0]) and it gives me 25707.

But I do not know why, because the original train and test files each contain 25000 lines.
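
One possible explanation, offered as an assumption because the data file itself is not shown here: Python's `str.splitlines()` treats more characters than `\n` as line boundaries (for example `\x85` and `\u2028`), and iterating over a file opened with `codecs.open` can split on those characters too. A review containing one of them would then be counted as two samples. The minimal sketch below illustrates the effect on a made-up string, not on the real file; `extra_line_breaks` is a hypothetical helper name.

```python
# Hypothetical sample: two reviews separated by "\n". The first review
# contains U+0085 (NEXT LINE), which str.splitlines() treats as a line break.
sample = "great movie\x85loved it\nterrible movie"

by_newline = sample.split("\n")      # splits on "\n" only
by_splitlines = sample.splitlines()  # splits on every Unicode line boundary

print(len(by_newline))
print(len(by_splitlines))


def extra_line_breaks(text):
    # Number of lines gained when splitting on all Unicode line boundaries
    # instead of on "\n" alone (a trailing newline is ignored).
    body = text.rstrip("\n")
    return len(body.splitlines()) - len(body.split("\n"))


print(extra_line_breaks(sample))
```

If this is the cause, reading the file with the built-in `open(path, encoding="utf-8")`, which in text mode only treats `\n`, `\r` and `\r\n` as line endings, should give one entry per line of the file.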

kely789456123
  • Your `X` and `target` are obviously of different length – desertnaut Jun 03 '19 at 14:48
  • @desertnaut I checked the initial files, and the train and test files both have 25000 lines. Do you know a way to verify whether X or target has 25000 samples? – kely789456123 Jun 03 '19 at 15:04
  • Just print(f'{len(X)}') prior to the train_test_split – MichaelD Jun 03 '19 at 15:08
  • I get this error: TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] – kely789456123 Jun 03 '19 at 15:23
  • What is `len(reviews_train_clean)`? If you are creating your dataframe from a text file, it is possible that your dataframe has more rows than your file because of separators (such as a newline character) in the middle of the line. – panktijk Jun 03 '19 at 23:15
  • @panktijk 25707 – kely789456123 Jun 04 '19 at 06:21
  • That means your input data itself has 25707 rows. There might be something in the way you're creating the dataframe. Can't tell without looking at your file and code. – panktijk Jun 05 '19 at 00:53
  • @panktijk I decided to use the code directly on my own data instead of the tutorial's data, and it worked without errors. I will continue the tutorial with my own data. Thank you for your answer. – kely789456123 Jun 05 '19 at 10:30
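
The length check discussed in the comments can be sketched as a small helper that works for both plain lists and matrix-like objects. This is only an illustration: `n_samples` and `check_same_length` are hypothetical names, and `fake_X` is a stand-in object with a `shape` attribute, used so the snippet runs without scikit-learn; in the real script `X` would be the sparse matrix returned by `cv.transform(...)`.

```python
from types import SimpleNamespace


def n_samples(data):
    # Arrays and sparse matrices expose their row count as shape[0];
    # len() raises TypeError on a scipy sparse matrix, so prefer shape.
    shape = getattr(data, "shape", None)
    if shape is not None:
        return shape[0]
    return len(data)


def check_same_length(X, y):
    # Fail early with a readable message instead of inside train_test_split.
    n_x, n_y = n_samples(X), n_samples(y)
    if n_x != n_y:
        raise ValueError("X has %d samples but y has %d" % (n_x, n_y))
    return n_x


# Stand-in for the feature matrix: only its .shape attribute is used here.
fake_X = SimpleNamespace(shape=(5, 3))
labels = [1, 1, 1, 0, 0]

print(check_same_length(fake_X, labels))
```

Calling such a check right before `train_test_split(X, target, ...)` would report the 25707 versus 25000 mismatch directly.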
