I have a CSV file with many tweets. I am trying to extract two specific pieces of text from each tweet and build a DataFrame with this information: date, hashtag

Created At,Text
Fri Jan 06 11:02:14 +0000 2017, #beta #betalab #mg Afiliada da Globo: Apresentador no AM é demitido após criticar governador

I would like to have this result:

Below is one of the many approaches I tried, but none of them gives the result I need.

This is exactly the code I ran:

import os
import re

import nltk
import pandas as pd

os.chdir(r'C:\Users\Documents')
dataset = pd.read_csv('Tweets_Mg.csv', encoding='utf-8')
dataset.drop_duplicates(['Text'], inplace=True)


def Preprocessing(instancia):

    stemmer = nltk.stem.RSLPStemmer()
    
    instancia = re.sub(r"http\S+", "", instancia).lower().replace('?','').replace('!','').replace('.','').replace(';','').replace('-','').replace(':','').replace(')','')
    
    #List of stopwords in portuguese language
    stopwords = set(nltk.corpus.stopwords.words('portuguese'))
    palavras = [stemmer.stem(i) for i in instancia.split() if not i in stopwords]
    
    return (" ".join(palavras))

tweets = [Preprocessing(i) for i in dataset.Text]



def procurar_hashtags(tweet):
    return re.findall(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', tweet)

hashtag_list = [procurar_hashtags(i) for i in tweets]


def hashtag_top(hashtag_list):
    hashtag_df = pd.DataFrame(hashtag_list)
    hashtag_df = pd.concat([hashtag_df[0],hashtag_df[1],hashtag_df[2],
                           hashtag_df[3],hashtag_df[4],hashtag_df[5],
                           hashtag_df[6],hashtag_df[7],
                           hashtag_df[8]], ignore_index=True)
    
    hashtag_df = hashtag_df.dropna()
    hashtag_df = pd.DataFrame(hashtag_df)
    hashags_unicas = hashtag_df[0].value_counts()
    
    return hashags_unicas

hashtag_dataframe = hashtag_top(hashtag_list)
hashtag_dataframe[hashtag_dataframe>=25]

The result is not what I want: no matter what I do, I can't capture the dates along with the hashtags. I only get counts like this:

 #timbet                  193
 #glob                    119
 #operacaobetalab         118
 #sigodevolt               77

I am doing something wrong...

Gizelly
  • 4
    Pure code-writing requests are off-topic on Stack Overflow — we expect questions here to relate to *specific* programming problems — but we will happily help you write it yourself! Tell us [what you've tried](https://stackoverflow.com/help/how-to-ask), and where you are stuck. This will also help us answer your question better. – Libra Jan 07 '20 at 22:21
  • 1
    Welcome to StackOverflow. [On topic](https://stackoverflow.com/help/on-topic), [how to ask](https://stackoverflow.com/help/how-to-ask), and ... [the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. StackOverflow is a knowledge base for *specific* programming problems -- not a design, coding, research, or tutorial resource. – Prune Jan 07 '20 at 22:24
  • I have no idea how to do this. I haven't tried anything yet because I don't know what to do. I didn't find anything on the internet about this. After extracting dates and hashtags I would group by hashtag and plot a time series. Excuse me. – Gizelly Jan 07 '20 at 22:37
  • 1
    Show what you've tried; then people will help you when you're stuck. That's StackOverflow. – abdoulsn Jan 07 '20 at 22:38
  • 1
    Just posted an answer to serve as a starting point but you are going to do better next time you ask something here. – accdias Jan 07 '20 at 22:49
  • 1
    Thank you! This helped me. I will work on that. – Gizelly Jan 07 '20 at 22:59

1 Answer

You can use this as a starting point:

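import csv
import re
from pathlib import Path

# Match a '#' followed by one or more word characters.
hashtag = re.compile(r'(#\w+)')

csvfile = Path('/path/to/your/file.csv')

tags_by_date = []

# newline='' is what the csv module expects for file objects.
with csvfile.open(newline='', encoding='utf-8') as f:
    for line in csv.reader(f):
        if len(line) < 2:
            continue  # skip blank or malformed rows
        # line[0] is the date string, line[1] is the tweet text.
        for tag in hashtag.findall(line[1]):
            tags_by_date.append([line[0], tag])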

And here is a small proof of concept (far from a complete solution since you didn't take the time to elaborate your question in a better way):

>>> line
['Fri Jan 06 11:02:14 +0000 2017', ' #beta #betalab #mg Afiliada da Globo: Apresentador no AM é demitido após criticar governador']
>>> hashtag.findall(line[1])
['#beta', '#betalab', '#mg']
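
To go one step further toward the date/hashtag table and the time series mentioned in the comments, here is a minimal sketch using only the standard library. It assumes the column names `Created At` and `Text` from the sample in the question, and that every date follows the layout of that sample row. The names `tags_by_date`, `DATE_FORMAT`, `sample`, and `daily_counts` are made up for illustration.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

HASHTAG = re.compile(r"#\w+")
# Assumed layout, taken from the sample row: "Fri Jan 06 11:02:14 +0000 2017".
# Note: %a and %b are locale-dependent; this assumes English day/month names.
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tags_by_date(stream):
    """Return a list of (datetime, hashtag) pairs, one per hashtag per tweet."""
    pairs = []
    for row in csv.DictReader(stream):
        created = datetime.strptime(row["Created At"].strip(), DATE_FORMAT)
        # Extract hashtags from the raw text, before any stemming or cleanup.
        for tag in HASHTAG.findall(row["Text"]):
            pairs.append((created, tag.lower()))
    return pairs


# In-memory stand-in for the real CSV file, built from the question's sample.
sample = io.StringIO(
    "Created At,Text\n"
    "Fri Jan 06 11:02:14 +0000 2017, #beta #betalab #mg Afiliada da Globo: "
    "Apresentador no AM é demitido após criticar governador\n"
)

pairs = tags_by_date(sample)

# Count how often each hashtag appears per calendar day.
daily_counts = Counter((created.date(), tag) for created, tag in pairs)
```

If you prefer pandas, the `pairs` list can be passed to `pd.DataFrame(pairs, columns=['date', 'hashtag'])`. Extracting the hashtags from the raw text rather than from the stemmed text should also avoid truncated tags such as `#glob`.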
accdias
  • I wrote my code following your recommendations, and it worked! I made some modifications and used a pandas DataFrame in the process. My question is blocked now. If it is unblocked in the future, I will post the complete code. I've already edited my question and I'm waiting. Thanks. – Gizelly Jan 10 '20 at 12:11
  • 1
    @Gizélly, no worries. You can reopen it if you want but I think it is not necessary. I'm glad it worked. – accdias Jan 10 '20 at 12:15