While splitting datasets why do people follow naming convention?

Question

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)

In this example (X_train, X_test) X mentioned as uppercase where as (y_train, y_test) here y mentioned in lowercase.

Is there any compelling reason to follow that naming convention?

I don't see **any** logical reason to follow this convention of uppercase X and lowercase y. Most of the new learners when introduced to data science starts by following some existing examples which traces back to the official sklearn docs and hence, I think, the legacy/convention continues. — Sheldore, Sep 08 '18 at 16:59
Possible duplicate of [Is there a standard file naming convention for key-value pairs in filename?](https://stackoverflow.com/questions/10087079/is-there-a-standard-file-naming-convention-for-key-value-pairs-in-filename) — Raman Mishra, Sep 08 '18 at 17:06
Welcome to Stack Overflow! I indented your code sample by 4 spaces so that it renders properly, and used "backticks" around the variable references to make them distinct from natural language - please see the [editing help](https://stackoverflow.com/editing-help) for more information on formatting. Good luck! — mdaniel, Sep 08 '18 at 21:16
@Raman My question is not related to filename conventions, hence it is not duplicate. — satyavvd, Sep 11 '18 at 11:35
See also https://stats.stackexchange.com/q/389395/232706, https://datascience.stackexchange.com/q/17598/55122. — Ben Reiniger, Jun 18 '21 at 18:13

score 6 · Answer 1 · answered Sep 08 '18 at 21:35

This comes from the case where you have multiple features (inputs) and one response variable (output). Then, the input X is a matrix with number_of_features columns and number_of_samples rows, and the output y is a column vector with number_of_samples elements. Following the convention of naming matrices with uppercase letters and vectors with lowercase letters, which is widely used in mathematics and/or related fields, it makes sense that X has to be uppercase and y has to be lowercase.

If you only have one feature, so the input is a column vector and not a matrix, then x should be lowercase. And if you have multiple response variables and the output is a matrix, then Y should be uppercase.

Finally, using more descriptive names than X and y is always a good idea. Then, following the PEP 8 convention uf using snake_case for variable names - or whatever the style guide you are following recommends - is the way to go.

While splitting datasets why do people follow naming convention?

1 Answers1

Linked