Pandas split dataframe column for every character

Question

I have multiple DataFrame columns that look like this:

                         Day1
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

What I want is for every character to be separated into its own column:

     012345678910111213....
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

So the "Day1" column is split into 48 columns, and every column holds one of the values A/B/C/D.

I tried split, but that didn't work.

Post raw data, code to load your data into a df, in order for us to try to replicate your issue if our answers didn't work — EdChum, May 08 '17 at 13:32
It looks like you have trailing spaces; try `dataframe['Mo'] = dataframe['Mo'].str.rstrip()` to remove any trailing spaces — EdChum, May 08 '17 at 13:34
See my first comment, without data to reproduce this, this becomes a fishing expedition — EdChum, May 08 '17 at 13:40
OK, I found the problem: I had trailing spaces. Thanks!!! @EdChum — Warry S., May 08 '17 at 14:15

score 30 · Accepted Answer · edited Feb 03 '22 at 08:18

You can call apply and, for each row, call pd.Series on the list of its characters:

In [16]:

df['Day1'].apply(lambda x: pd.Series(list(x)))
Out[16]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]

It looks like you have trailing spaces; remove these using str.rstrip:

df['Day1'] = df['Day1'].str.rstrip()

Then apply the split shown above.
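A minimal, self-contained sketch of those two steps together. The frame contents are made up for illustration (they are not the asker's data), and the second row carries trailing spaces on purpose:

```python
import pandas as pd

# Illustrative data (an assumption): the second row has trailing spaces.
df = pd.DataFrame({"Day1": ["DDBBAA", "DDBBAA  "]})

# Strip trailing whitespace first so every string has the same length.
df["Day1"] = df["Day1"].str.rstrip()

# One column per character; column labels are the character positions.
chars = df["Day1"].apply(lambda s: pd.Series(list(s)))
print(chars.shape)
```

Without the rstrip step, the second row would contribute two extra columns holding spaces, and the first row would be padded with NaN in those columns.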

score 6 · Answer 2 · MaxU - stand with Ukraine · 2017-05-08T13:39:51.797

Use the Series.str.extractall() method:

In [19]: df.Day1.str.extractall('(.)', flags=re.U)[0].unstack().rename_axis(None, axis=1)
Out[19]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]
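For readers who want to see what each step of that chain does, here is a sketch on a small made-up frame. The data and the intermediate names `matches` and `wide` are assumptions for illustration; the axis is passed by keyword, which newer pandas releases appear to require:

```python
import re
import pandas as pd

# Illustrative data (an assumption, not the asker's frame).
df = pd.DataFrame({"Day1": ["DDBBAA", "DDBBAA"]})

# extractall returns one row per regex match, indexed by
# (original row label, match number).
matches = df["Day1"].str.extractall(r"(.)", flags=re.U)

# Take capture group 0, pivot the match number into columns,
# then drop the leftover "match" name from the column axis.
wide = matches[0].unstack().rename_axis(None, axis=1)
print(wide.shape)
```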

edited May 08 '17 at 13:39

answered May 08 '17 at 13:16

MaxU - stand with Ukraine


Hey, I tried your way... I edited my question; something is still wrong – Warry S. May 08 '17 at 13:28
@WarryS., do you have leading or trailing spaces? what is the output of `df.Day1.str.len()`? – MaxU - stand with Ukraine May 08 '17 at 13:30
It's 48 for every entry @MaxU, and no, there are no spaces between the values – Warry S. May 08 '17 at 13:32
@WarryS., i can't reproduce this behavior using provided sample data set :( – MaxU - stand with Ukraine May 08 '17 at 13:34
Thanks for your help @MaxU – Warry S. May 08 '17 at 14:21

score 4 · Answer 3 · answered May 27 '20 at 16:15

Try this:

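df['Day1'].str.split(pat=r"\s*", expand=True)

The result has empty first and last columns, so you have to trim it by applying .iloc[:, 1:-1] to the split result (not to df['Day1'] itself, which is a Series): df['Day1'].str.split(pat=r"\s*", expand=True).iloc[:, 1:-1]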
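The empty edge columns come from how Python's regex split treats empty matches. The sketch below shows that with `re.split` and then applies the trim in pandas. The data is made up, and `dtype=object` is an assumption chosen so the column holds plain Python strings; behaviour with other string dtypes is not checked here.

```python
import re
import pandas as pd

# re.split (Python 3.7+) also splits on empty matches, which is why
# an empty string appears at both ends of the result.
parts = re.split(r"\s*", "DDBB")
print(parts)  # ['', 'D', 'D', 'B', 'B', '']

# Illustrative data (an assumption); dtype=object keeps plain Python strings.
df = pd.DataFrame({"Day1": ["DDBB", "AABB"]}, dtype=object)

# Six columns: four characters plus an empty first and last column.
split = df["Day1"].str.split(pat=r"\s*", expand=True)

# Keep only the character columns.
chars = split.iloc[:, 1:-1].copy()

# iloc keeps the original labels (1..4), so relabel from 0 if that matters.
chars.columns = range(chars.shape[1])
```

This trim assumes all strings have the same length, as in the question. With unequal lengths, shorter rows would keep their trailing empty string inside the trimmed frame.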

arjepak


score 1 · Answer 4 · answered Jul 04 '23 at 18:08

The solution provided by @EdChum is effective for splitting strings, but it can be computationally expensive when dealing with large DataFrames. An alternative approach that offers improved performance is as follows:

df.Day1.str.split('', expand=True).iloc[:, 1:-1]

The use of .iloc[:, 1:-1] in this code is essential to remove the automatically added first and last columns that result from splitting a string with an empty delimiter ('').

To demonstrate the performance difference, consider the following benchmarking results:

import pandas as pd
df = pd.DataFrame(['asdf' + str(x) for x in range(1000)], columns=['Day1'])

%%timeit
df.Day1.apply(lambda x: pd.Series(list(x)))
# Result: 401 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.Day1.str.split(pat=r"\s*", expand=True).iloc[:, 1:-1]
# Result: 9.1 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.Day1.str.split('', expand=True).iloc[:, 1:-1]
# Result: 8.59 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can observe, the .apply(lambda x: pd.Series(list(x))) method is roughly 47 times slower (401 ms vs. 8.59 ms) than the .str.split() method. The inefficiency arises from the conversion of each value into a Pandas Series, which can be particularly taxing on large datasets.

By utilizing the .str.split() approach, you can significantly enhance the performance of string splitting operations in your DataFrame.
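To repeat this kind of comparison outside IPython, one option is the standard-library `timeit` module. This is a sketch under stated assumptions: the strings are zero-padded to equal length (unlike the benchmark data above) so both methods return the same values, `dtype=object` is chosen so the column holds plain Python strings, and the helper names `via_apply` and `via_split` are hypothetical. Timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Equal-length illustrative strings (an assumption), e.g. "asdf000".."asdf999".
df = pd.DataFrame({"Day1": [f"asdf{x:03d}" for x in range(1000)]}, dtype=object)

def via_apply():
    # One pd.Series per row, as in the accepted answer.
    return df["Day1"].apply(lambda s: pd.Series(list(s)))

def via_split():
    # Empty-pattern split, then drop the empty first and last columns.
    return df["Day1"].str.split("", expand=True).iloc[:, 1:-1]

for fn in (via_apply, via_split):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```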

score 0 · Answer 5 · answered May 04 '22 at 22:55

Following on from the answer from @ric-s, using list to separate the string is slightly faster when applying it outside of pandas:

In [1]: %timeit df['Day1'].apply(lambda x: pd.Series(list(x)))
1.08 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit pd.DataFrame([list(x) for x in df['Day1']])
718 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Also, the following construction will create meaningful column names for the extracted features:

df[[f'Day1_{i}' for i in range(len(df['Day1'][0]))]] = pd.DataFrame([list(x) for x in df['Day1']])
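The one-liner above reads the first string by label (`df['Day1'][0]`) and builds the new frame on a default index, so it relies on `df` having a default 0..n-1 index. A sketch of a variant that does not depend on that, using a made-up frame with a non-default index (the data and the names `n_chars`, `new_cols` and `chars` are assumptions):

```python
import pandas as pd

# Illustrative frame with a non-default index (an assumption for this sketch).
df = pd.DataFrame({"Day1": ["DDBB", "AABB"]}, index=[10, 11])

# Positional access to the first string, independent of index labels.
n_chars = len(df["Day1"].iloc[0])
new_cols = [f"Day1_{i}" for i in range(n_chars)]

# Build the character table on the same index so rows line up when joined.
chars = pd.DataFrame(
    [list(s) for s in df["Day1"]], index=df.index, columns=new_cols
)
df = df.join(chars)
```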
