2
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)

In this example (X_train, X_test) X mentioned as uppercase where as (y_train, y_test) here y mentioned in lowercase.

Is there any compelling reason to follow that naming convention?

mdaniel
  • 31,240
  • 5
  • 55
  • 58
satyavvd
  • 39
  • 1
  • I don't see **any** logical reason to follow this convention of uppercase X and lowercase y. Most of the new learners when introduced to data science starts by following some existing examples which traces back to the official sklearn docs and hence, I think, the legacy/convention continues. – Sheldore Sep 08 '18 at 16:59
  • Possible duplicate of [Is there a standard file naming convention for key-value pairs in filename?](https://stackoverflow.com/questions/10087079/is-there-a-standard-file-naming-convention-for-key-value-pairs-in-filename) – Raman Mishra Sep 08 '18 at 17:06
  • Welcome to Stack Overflow! I indented your code sample by 4 spaces so that it renders properly, and used "backticks" around the variable references to make them distinct from natural language - please see the [editing help](https://stackoverflow.com/editing-help) for more information on formatting. Good luck! – mdaniel Sep 08 '18 at 21:16
  • @Raman My question is not related to filename conventions, hence it is not duplicate. – satyavvd Sep 11 '18 at 11:35
  • See also https://stats.stackexchange.com/q/389395/232706, https://datascience.stackexchange.com/q/17598/55122. – Ben Reiniger Jun 18 '21 at 18:13

1 Answers1

6

This comes from the case where you have multiple features (inputs) and one response variable (output). Then, the input X is a matrix with number_of_features columns and number_of_samples rows, and the output y is a column vector with number_of_samples elements. Following the convention of naming matrices with uppercase letters and vectors with lowercase letters, which is widely used in mathematics and/or related fields, it makes sense that X has to be uppercase and y has to be lowercase.

If you only have one feature, so the input is a column vector and not a matrix, then x should be lowercase. And if you have multiple response variables and the output is a matrix, then Y should be uppercase.

Finally, using more descriptive names than X and y is always a good idea. Then, following the PEP 8 convention uf using snake_case for variable names - or whatever the style guide you are following recommends - is the way to go.

hbaderts
  • 14,136
  • 4
  • 41
  • 48