I scraped some Twitter data using twitterscraper. I know I can filter it fairly easily in Excel, and I did export the data to an xlsx file, but I want to filter it using Python. I scraped tweets containing "Hurricane Dorian". Now I want to filter out everything that does not include the word "Bahamas". How would I do this?

Thank you!

from twitterscraper import query_tweets
import datetime as dt
import pandas as pd

begin_date = dt.date(2019, 7, 1)
end_date = dt.date(2019, 9, 9)

limit = 1000
lang = 'english'

tweets = query_tweets('Hurricane Dorian', begindate = begin_date, enddate = end_date, limit = limit, lang = lang)

df = pd.DataFrame(t.__dict__ for t in tweets)

# to_excel writes the file and returns None, so there is nothing useful to assign
df.to_excel(r'C:\Users\victo\Desktop\HurricaneData.xlsx', index=False, header=True)
Hedgy
Victorb37
  • I think it's time for you to learn regex. It's a very versatile text filtering option and is needed often in python. https://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex https://regex101.com/ – KWx Sep 08 '19 at 23:47

1 Answer


You can use the .str string methods in pandas to build a boolean mask and filter the dataframe with it. See the pandas help on indexing. Here's the specific code for your posted question:

from twitterscraper import query_tweets 
import datetime as dt 
import pandas as pd

begin_date = dt.date(2019, 7, 1) 
end_date = dt.date(2019, 9, 9)

limit = 1000 
lang = 'english'

tweets = query_tweets(
    'Hurricane Dorian', 
    begindate = begin_date, 
    enddate = end_date, 
    limit = limit, 
    lang = lang
)

# Convert to dataframe
df = pd.DataFrame(t.__dict__ for t in tweets)

# Build a boolean mask: True for rows whose text contains 'Bahamas' (case-sensitive)
filt = df['text'].str.contains('Bahamas')

# Compare the shape of the full dataframe with the filtered one
print(df.shape)
print(df.loc[filt].shape)

You can see that the unfiltered df has 340 rows. Restricting it to rows whose text contains 'Bahamas' reduces it to 55 rows.

(340, 16)

(55, 16)
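One caveat about the mask above: str.contains is case-sensitive by default, and it returns a missing value instead of True/False for rows with no text. Here is a minimal sketch on made-up tweets (the sample dataframe is hypothetical, not the scraped data) showing the case and na parameters, in case you also want to catch "BAHAMAS" or "bahamas":

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped tweets
sample = pd.DataFrame({
    "text": [
        "Hurricane Dorian hits the Bahamas",
        "Praying for the BAHAMAS tonight",
        "Dorian is moving toward Florida",
        None,  # a row with missing text
    ]
})

# case=False ignores capitalization; na=False treats missing text as "no match"
mask = sample["text"].str.contains("Bahamas", case=False, na=False)

print(mask.tolist())
print(sample.loc[mask].shape)
```

Whether you want case-insensitive matching depends on your data; with it, the row count may come out higher than with the case-sensitive mask.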

To keep only the rows where the mask is True, reassign the dataframe using the filter:

df = df.loc[filt]

Or you could assign it to a new dataframe if you want to preserve the original raw data.
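A rough sketch of the second option, again on made-up data (the names raw and bahamas_df are just placeholders I chose). Adding .copy() is optional, but it makes the filtered dataframe independent of the original, so later edits to it don't trigger pandas warnings about modifying a slice:

```python
import pandas as pd

# Hypothetical stand-in for the raw scraped dataframe
raw = pd.DataFrame({
    "user": ["a", "b", "c"],
    "text": [
        "Dorian nears the Bahamas",
        "Dorian off Florida",
        "Bahamas relief efforts",
    ],
})

filt = raw["text"].str.contains("Bahamas")

# Keep the matching rows, with all their columns, in a separate dataframe;
# raw itself is left untouched
bahamas_df = raw.loc[filt].copy()

print(len(raw), len(bahamas_df))
```

From there, calling to_excel on bahamas_df, the same way the question does on df, should write only the filtered rows.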

Randall Goodwin
  • This did work to a certain extent. It was able to determine which rows contained the string. In the variable explorer, the "filt" tab shows every row and whether it is true or false. If I only wanted to keep the ones that were true, along with their text and all other data, how would I do that? – Victorb37 Sep 09 '19 at 01:30
  • Updated the answer to show how to keep the rows that were true. – Randall Goodwin Sep 09 '19 at 03:54