
I am new to data science and am learning about imputation and model training. Below are a few questions that came up while training models on my datasets. Please provide answers to these.

  1. Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way, I split my dataset 80/20 and train my model first on the 80% and then on the 20%. Is the result the same or different? Basically, if I train my already-trained model on new data, what does that mean?

Imputation-related

  1. Another question relates to imputation. Imagine I have a dataset of ship passengers where only first-class passengers were given cabins. There is a column that holds cabin numbers (categorical), but very few observations have a cabin number. I know this column is important, so I cannot remove it, but because it has so many missing values, most algorithms do not work with it. How do I handle imputation for this type of column?

  2. When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputation values calculated again from the validation data itself?

  3. How do I impute data in the form of a string, like a ticket number (e.g. A-123)? The column is important because the first letter indicates the passenger's class, so we cannot drop it.

Aman
You would not impute the cabin number, since someone either has a cabin or not. If a person does not have a cabin (number), you should label this with a category, e.g. every person without a cabin gets a cabin number of -1. As for your first question, I don't understand: what exactly are you doing there, and why? – user2974951 Oct 12 '18 at 08:58

1 Answer


Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way, I split my dataset 80/20 and train my model first on the 80% and then on the 20%. Is the result the same or different?

It's hard to say whether this is good or not. Generally, if your data (splits) are taken from the same distribution, you can perform additional training. However, not all model types are suited for it. I advise you to run some kind of cross-validation with an 80/20 split and compare the error measurements before and after the additional training.
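As a concrete illustration, here is a minimal sketch of that check, assuming a model that supports incremental training (scikit-learn's SGDClassifier via partial_fit) and a synthetic 1000-observation dataset; the dataset, split sizes, and metric are illustrative, not from the question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1000-observation dataset.
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a fixed test set so the before/after scores are comparable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the training portion 80/20 for the two training stages.
X_80, X_20, y_80, y_20 = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
model.partial_fit(X_80, y_80, classes=np.unique(y))  # first stage: 80%
print("after 80% stage:", accuracy_score(y_test, model.predict(X_test)))

model.partial_fit(X_20, y_20)  # second stage: continue on the remaining 20%
print("after extra 20%:", accuracy_score(y_test, model.predict(X_test)))
```

If the second score does not improve (or degrades), the additional training on the 20% split is not helping for this model.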

Basically, if I train my already trained model on new data, what does it mean?

If you take the datasets from the same distribution, you are performing additional learning, which theoretically should have a positive influence on your model.

Imagine I have a dataset of ship passengers where only first-class passengers were given cabins. There is a column that holds cabin numbers (categorical), but very few observations have a cabin number. I know this column is important, so I cannot remove it, but because it has so many missing values, most algorithms do not work with it. How do I handle imputation for this type of column?

You need to clearly understand what you want to achieve with imputation. If only the first class has values, how can you perform imputation for the second or third class? What do you need to find? The deck? The cabin number? Do you want to generate new values or impute from already existing values?
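If, as the comment above suggests, you only need the model to distinguish "has a cabin" from "no cabin", you can encode missingness as its own category instead of imputing. A minimal sketch with pandas; the DataFrame and column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# Treat "no cabin" as an explicit category instead of a missing value.
df["Cabin"] = df["Cabin"].fillna("NO_CABIN")

# Optionally add an indicator for whether the passenger had a cabin at all.
df["HasCabin"] = (df["Cabin"] != "NO_CABIN").astype(int)
print(df)
```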

When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputation values calculated again from the validation data itself?

Very generally, you run the imputation algorithm on the whole dataset you have (excluding the target column). Note, however, that a common convention is to fit the imputer on the training data only and reuse those fitted values for the validation data, which avoids leaking validation statistics into training.
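A minimal sketch of that training-fit convention, using scikit-learn's SimpleImputer on an illustrative numeric column (the arrays are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_valid = np.array([[np.nan], [3.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learn the fill value from the training data only

X_train_imp = imputer.transform(X_train)
X_valid_imp = imputer.transform(X_valid)  # reuse the training-derived mean
print(X_valid_imp)  # the NaN is filled with the training mean (~2.33)
```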

How do I impute data in the form of a string, like a ticket number (e.g. A-123)? The column is important because the first letter indicates the passenger's class, so we cannot drop it.

If you have a finite number of cases, you can simply impute the values as strings. If not, perform feature engineering: try to predict the letter, the number, the first digit of the number, the length of the number, and so on.
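A minimal sketch of splitting such a ticket string into those candidate features with pandas, assuming tickets of the form "&lt;letter&gt;-&lt;number&gt;" (e.g. A-123); the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Ticket": ["A-123", "B-45", None]})

# Split "A-123" into its letter and number parts (missing rows stay NaN).
parts = df["Ticket"].str.extract(r"^(?P<Letter>[A-Za-z])-(?P<Number>\d+)$")

df["TicketLetter"] = parts["Letter"]            # proxy for passenger class
df["TicketFirstDigit"] = parts["Number"].str[0]
df["TicketNumberLen"] = parts["Number"].str.len()
print(df)
```

Each derived feature can then be imputed or modeled separately, which is often easier than handling the raw string.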

avchauzov