The dataframe df_problematic in PySpark has the following columns:
+------------+-----------+------------+
|sepal@length|sepal.width|petal_length|
+------------+-----------+------------+
|         5.1|        3.5|         1.4|
|         4.9|          3|         1.4|
I'd expect the dataframe either to fail to load or to throw some error, since the column names contain the characters @ and .
But it looks like it loads just fine.
How can it be loaded?
Operations on the columns with special characters throw an error unless I surround the column name with backticks (`). However, operations on the columns with normal names work just fine, e.g. sampling:
df_problematic_sampled = df_problematic.sample(fraction=0.8)
df_problematic_sampled.head(3)
Output:
[Row(sepal@length='4.7', sepal.width='3.2', petal_length='1.3', petal.width='.2', variety='Setosa'),
Row(sepal@length='4.6', sepal.width='3.4', petal_length='1.4', petal.width='.3', variety='Setosa'),
Row(sepal@length='4.4', sepal.width='2.9', petal_length='1.4', petal.width='.2', variety='Setosa')]
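For reference, the backtick workaround mentioned above looks roughly like this. It is only a minimal sketch: quote_col and select_quoted are hypothetical helper names of my own, not part of PySpark, and the sketch assumes the column names contain no backtick themselves. As far as I understand, the quoting matters for sepal.width because an unquoted dot is normally read as access to a field inside a struct column.

```python
def quote_col(name):
    """Wrap a column name in backticks so Spark reads it as one identifier.

    Assumption: the name itself contains no backtick.
    """
    return f"`{name}`"


def select_quoted(df, names):
    """Select columns by their literal names from a PySpark DataFrame `df`."""
    return df.select(*[quote_col(n) for n in names])


# Column names taken from the dataframe shown above.
special = ["sepal@length", "sepal.width"]
print([quote_col(n) for n in special])

# Hypothetical usage, with df_problematic already loaded:
# select_quoted(df_problematic, special).show(3)
```

The helpers only build the quoted names and pass them to DataFrame.select, so the same quoting can be reused anywhere a column is referred to by a string.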
Does it mean that as long as I do not use the columns with special characters, and perform operations only on the columns with normal names, the dataframe df_problematic can be e.g. sampled/grouped/saved just fine?