Select columns which contain a string in PySpark

Question

I have a PySpark DataFrame with a lot of columns, and I want to select the ones whose names contain a certain string, plus a few others. For example:

df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']

I want to select the ones which contain 'hello' and also the column named 'index', so the result will be:

['hello_world','hello_country','hello_everyone','index']

I want something like df.select('hello*','index')

Thanks in advance :)

EDIT:

I found a quick way to solve it, so I answered my own question, Q&A style. If someone sees my solution and can provide a better one, I would appreciate it.

score 16 · Accepted Answer · answered Nov 21 '18 at 09:49

I've found a quick and elegant way:

selected = [s for s in df.columns if 'hello' in s]+['index']
df.select(selected)

With this solution I can add more columns without editing the for loop that Ali AzG suggested.
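The same idea can be wrapped in a small helper. This is only a sketch: `pick_columns` and its parameters are hypothetical names, not part of PySpark. It works on a plain list of names, so it can be tried without a Spark session, and it skips an extra column that was already matched, since listing a name twice would select that column twice.

```python
def pick_columns(columns, substring, extras=()):
    """Return the names containing `substring`, then any `extras` not already picked."""
    picked = [name for name in columns if substring in name]
    for name in extras:
        # avoid listing a column twice if it already matched the substring
        if name not in picked:
            picked.append(name)
    return picked


columns = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']
selected = pick_columns(columns, 'hello', extras=['index'])
print(selected)
```

With a real DataFrame the call would be `df.select(pick_columns(df.columns, 'hello', extras=['index']))`.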

Manrique

Great solution. And you don't need `*` before `selected`? – Ali AzG Nov 21 '18 at 09:52
Thanks! I don't :) – Manrique Nov 21 '18 at 09:58
You can take it one step further and keep it all on one line, like this: `selected = df.select([s for s in df.columns if 'hello' in s]+['index'])`. – chrimaho Feb 13 '22 at 23:59

score 5 · Answer 2 · edited Jun 04 '19 at 15:04

You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.
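A minimal sketch of how that might look for the question's columns. The names `PATTERN` and `select_hello_and_index` are made up for illustration. `colRegex` expects the regex wrapped in backticks, and the sketch assumes the pattern has to match the whole column name; the `re.fullmatch` preview mirrors that assumption in plain Python and is not Spark itself.

```python
import re

# Hypothetical pattern: names starting with 'hello', or exactly 'index'.
PATTERN = "hello.*|index"


def select_hello_and_index(df):
    """Select the matching columns from a PySpark DataFrame via colRegex."""
    # colRegex takes the regex quoted in backticks
    return df.select(df.colRegex(f"`({PATTERN})`"))


# Plain-Python preview of which names the pattern is expected to keep,
# assuming whole-name matching.
names = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']
preview = [n for n in names if re.fullmatch(PATTERN, n)]
print(preview)
```

With a real DataFrame, `select_hello_and_index(df).columns` can be compared against the preview to check that assumption.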

cs95

answered Nov 21 '18 at 13:59

Neeraj Bhadani

The link for colRegex is https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.colRegex.html – Mr.J Apr 29 '22 at 18:02

score 1 · Answer 3 · edited Nov 21 '18 at 09:44

This sample code does what you want:

hello_cols = []

for name in df.columns:
    # keep any column whose name contains 'index' or 'hello'
    if 'index' in name or 'hello' in name:
        hello_cols.append(name)

df.select(*hello_cols)

Manrique

answered Nov 21 '18 at 09:39

Ali AzG

Thanks, I fixed an error in your code and it worked. – Manrique Nov 21 '18 at 09:45
@Antonio Manrique You're right. I wrote my code in Python 2.7. Please accept my answer if it was helpful. – Ali AzG Nov 21 '18 at 09:46
I will give it an upvote! But I've found a better option for what I am doing; I'll post it as an answer and accept it. Thank you so much! – Manrique Nov 21 '18 at 09:47

score 0 · Answer 4 · answered Mar 20 '23 at 10:10

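I used Manrique's answer and adapted it. This keeps every original column and adds a renamed copy of each column whose name starts with a given prefix: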

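from pyspark.sql import functions as F

sel_cols = [i for i in df.columns if i.startswith("colName")]
# '*' keeps all original columns; each selected column is also added under a new alias
df = df.select('*', *(F.col(x).alias('rename_text' + x) for x in sel_cols))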

Arun Mohan
