
When I inspect a pandas column that holds strings, the dtype returned is `object`. My DataFrame will be read-only, so I don't mind if the type has a fixed size, like s(size). And I saw in this question that a NumPy array whose dtype is `object` loses speed...

That leads to my questions:

  • When the dtype of a pandas Series is object, do I lose speed?
  • If so, how can I avoid it?
  • Is there any way to make a series have a predetermined size, like s256?
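
A minimal sketch of the observation above, assuming the pandas 2.x releases that were current when this was asked (there, a column of Python strings reports `object`; newer releases may infer a dedicated string dtype instead). The sample values are hypothetical, and `"string"` is the alias for `StringDtype` described in the pandas user guide.

```python
import pandas as pd

# Hypothetical sample data, not taken from a real dataset.
s = pd.Series(["hello", "my", "world"])

# On pandas 2.x this reports the generic object dtype for strings.
print(s.dtype)

# Opt in to the dedicated string extension dtype instead.
s_str = s.astype("string")
print(s_str.dtype)
```

Whether this changes performance depends on the pandas version and the storage backend, so it is worth timing on the real data.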
DazzRick
  • Have you read this part of the user guide? [Text data types](https://pandas.pydata.org/docs/user_guide/text.html#text-types) – wjandrea Aug 09 '23 at 17:40
  • Yes, I think that I will use `StringDtype`, but the guide says: "_Currently, the performance of object dtype arrays of strings and arrays.StringArray are about the same._" – DazzRick Aug 09 '23 at 20:41
  • If you can, avoid strings like the plague. They are inherently slow (and optimising string operations is a pain, especially Unicode ones). If the number of unique strings is small, please consider using the `category` datatype (which uses integers internally). If you know your strings are small, then a NumPy-based trick might help. For large variable-sized strings, there is basically not much to do. A `s256` NumPy type will have a huge overhead (each string will take 256 bytes in memory even if it is actually smaller). – Jérôme Richard Aug 09 '23 at 21:04
  • Based on the Pandas code, it looks like `StringDtype` is a datatype supporting both `StringArray` and `ArrowStringArray`. The former is currently as efficient as an `object`-dtyped array, but it will be optimized in the future. The latter is certainly more efficient (especially for short strings I guess). – Jérôme Richard Aug 09 '23 at 21:11
  • @JérômeRichard If I have no choice but to use strings, is `category` best? Or is `ArrowStringArray`? – DazzRick Aug 10 '23 at 14:31
  • @DazzRick I did not understand your sentence. I do not think you can specify `ArrowStringArray` as a dtype. I think this internal type is only used for arrow files (and I guess you do not use them). `category` is only useful when the number of unique strings is small compared to the number of rows. It can be much slower when this is not the case (and it would not make sense anyway). – Jérôme Richard Aug 10 '23 at 21:57
  • I understand. How small is the "small" number of unique values that you mentioned for `category`? – DazzRick Aug 11 '23 at 14:09
  • This is hard to say since it depends on your machine, but let's say the number of unique values should at the very least be less than 1/4 of the number of rows. Otherwise, categorical will certainly not be worth it. Note that it also depends on what you do with the columns. Depending on the operations, categorical can be useless or significantly faster, though the latter is more frequent. – Jérôme Richard Aug 11 '23 at 17:04
  • If I have a column with only the values `['hello', 'my', 'world']` or `['a', 'b', 'c', 'd', 'e']`, is `category` best? – DazzRick Aug 11 '23 at 18:05
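
The trade-offs discussed in the comments can be checked with a rough sketch like the one below. It assumes a hypothetical column of 300,000 rows built from only three distinct strings, and it compares memory only (not speed); actual numbers depend on the machine and the pandas version. The last part illustrates the fixed-width overhead mentioned for a 256-byte NumPy string type, written here with NumPy's `S256` dtype code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three distinct strings repeated to 300_000 rows.
values = ["hello", "my", "world"]
s_obj = pd.Series(values * 100_000, dtype=object)

# category stores small integer codes plus one copy of each unique string.
s_cat = s_obj.astype("category")

# deep=True counts the Python string objects, not just the pointers.
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")

# Fixed-width NumPy bytes dtype: every element is padded to 256 bytes,
# regardless of how short the actual string is.
fixed = np.array(values, dtype="S256")
print(fixed.itemsize, fixed.nbytes)  # 256 768
```

With so few unique values relative to the row count, the categorical column should come out far smaller; with mostly unique strings the comparison can go the other way, so measuring on the real column is the safer guide.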

0 Answers