17

i have multiple dataframe columns which look like this:

                         Day1
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

What i want is that every character is seperated in a own column:

     012345678910111213....
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

So that "Day 1-Column" is splitted in 48 Columns and every Column has one of the Value A/B/C/D

i tried with split, but that didnt work.

Warry S.
  • 1,203
  • 2
  • 9
  • 9

5 Answers5

30

You can call apply and for each row call pd.Series on the the list of the values:

In [16]:

df['Day1'].apply(lambda x: pd.Series(list(x)))
Out[16]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]

It looks like you have trailing spaces, remove these using str.rstrip:

df['Day1'] = df['Day1'].str.rstrip()

then do the above.

Ric S
  • 9,073
  • 3
  • 25
  • 51
EdChum
  • 376,765
  • 198
  • 813
  • 562
6

use Series.str.extractall() method:

In [19]: df.Day1.str.extractall('(.)', flags=re.U)[0].unstack().rename_axis(None, 1)
Out[19]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
4

Try this:

df['Day1'].str.split(pat ="\s*", expand = True)

It will have empty 1st and last columns so you have to trim the dataframe using df['Day1'].iloc[:,1:-1]

arjepak
  • 153
  • 1
  • 6
1

The solution provided by @EdChum is effective for splitting strings, but it can be computationally expensive when dealing with large DataFrames. An alternative approach that offers improved performance is as follows:

df.Day1.str.split('', expand=True).iloc[:, 1:-1]

The use of .iloc[:, 1:-1] in this code is essential to remove the automatically added first and last columns that result from splitting a string with an empty delimiter ('').


To demonstrate the performance difference, consider the following benchmarking results:

python
import pandas as pd
df = pd.DataFrame(['asdf' + str(x) for x in range(1000)], columns=['Day1'])

%%timeit
df.Day1.apply(lambda x: pd.Series(list(x)))
# Result: 401 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.Day1.str.split(pat="\s*", expand=True).iloc[:, 1:-1]
# Result: 9.1 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.Day1.str.split('', expand=True).iloc[:, 1:-1]
# Result: 8.59 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can observe, the .apply(lambda x: pd.Series(list(x)) method is nearly 50 times slower (401 ms vs. 8.59 ms) compared to the .split() method. The inefficiency arises from the conversion of each value into a Pandas Series, which can be particularly taxing on large datasets.

By utilizing the .str.split() approach, you can significantly enhance the performance of string splitting operations in your DataFrame.

Hagai G.
  • 11
  • 1
0

Following on from the answer from @ric-s, using list to separate the string is slightly faster when applying it outside of pandas:

In [1]: %timeit df['Day1'].apply(lambda x: pd.Series(list(x)))
1.08 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit pd.DataFrame([list(x) for x in df['Day1']])
718 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Also, the following construction will create meaningful column names for the extracted features:

df[[f'Day1_{i}' for i in range(len(df['Day1'][0]))]] = pd.DataFrame([list(x) for x in df['Day1']])
njp
  • 620
  • 1
  • 3
  • 16