7

I have a PySpark DataFrame with many columns, and I want to select the columns whose names contain a certain string, plus a few other named columns. For example:

df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']

I want to select the columns whose names contain 'hello', as well as the column named 'index', so the result would be:

['hello_world','hello_country','hello_everyone','index']

I want something like df.select('hello*','index')

Thanks in advance :)

EDIT:

I found a quick way to solve it, so I answered my own question, Q&A style. If anyone sees my solution and can suggest a better one, I would appreciate it.

Manrique

4 Answers

16

I've found a quick and elegant way:

selected = [s for s in df.columns if 'hello' in s]+['index']
df.select(selected)

With this solution I can add more columns without editing the for loop that Ali AzG suggested.
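The list comprehension above can also be wrapped in a small helper. This is only a sketch: the name pick_columns and its parameters are hypothetical, not part of PySpark. It works on a plain list of column names, keeps the DataFrame's original column order, and avoids duplicates if an extra column also contains the substring.

```python
def pick_columns(columns, substring, extras=()):
    """Return the names that contain `substring` or are listed in `extras`.

    `columns` is a plain list of names such as df.columns; the helper
    itself is hypothetical and does not touch Spark.
    """
    extras = set(extras)
    # Keep the original column order; each name appears at most once.
    return [name for name in columns if substring in name or name in extras]


columns = ['hello_world', 'hello_country', 'hello_everyone',
           'byebye', 'ciao', 'index']
selected = pick_columns(columns, 'hello', extras=['index'])
print(selected)

# With a real DataFrame this would be used as:
# df.select(pick_columns(df.columns, 'hello', extras=['index']))
```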

Manrique
5

You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column names as a regular expression.
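A minimal sketch of how colRegex could be applied to the question's example. The wrapper name select_by_regex is hypothetical; only DataFrame.colRegex and DataFrame.select are PySpark calls. Note that colRegex expects the regular expression to be wrapped in backticks.

```python
def select_by_regex(df, pattern, *extra_cols):
    """Select columns whose names match `pattern`, plus any extra columns.

    `df` is assumed to be a PySpark DataFrame (Spark 2.3 or later);
    this wrapper is illustrative and not part of the PySpark API.
    """
    # colRegex takes the regex quoted in backticks, e.g. "`hello.*`".
    matched = df.colRegex(f"`{pattern}`")
    # select accepts a mix of Column objects and plain column names.
    return df.select(matched, *extra_cols)


# With the DataFrame from the question this would be used as:
# result = select_by_regex(df, "hello.*", "index")
```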

cs95
Neeraj Bhadani
  • the link for colRegex is https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.colRegex.html – Mr.J Apr 29 '22 at 18:02
1

This sample code does what you want:

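hello_cols = []

# Keep the column named 'index' and every column whose name contains 'hello'.
# The loop variable is not called `col`, to avoid shadowing pyspark.sql.functions.col.
for col_name in df.columns:
    if col_name == 'index' or 'hello' in col_name:
        hello_cols.append(col_name)

df.select(*hello_cols)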
Manrique
Ali AzG
  • Thanks, I fixed an error in your code and it worked. – Manrique Nov 21 '18 at 09:45
  • @Antonio Manrique You're right. I wrote my code in Python 2.7. Please accept my answer if it was helpful. – Ali AzG Nov 21 '18 at 09:46
  • I will give it an upvote! But I've found a better option for what I am doing, so I'll post it as an answer and accept it. Thank you so much! – Manrique Nov 21 '18 at 09:47
0

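I used Manrique's answer and built on it.

from pyspark.sql import functions as F

sel_cols = [i for i in df.columns if i.startswith("colName")]

# Keep every existing column and append a renamed copy of each selected one.
df = df.select('*', *(F.col(x).alias('rename_text' + x) for x in sel_cols))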

Arun Mohan