
I have a pandas dataframe in Python listing a bunch of tweets with their id, created time and the id of each tweet they interacted with. For example, 006 retweeted 004 and 003 quoted 999 (999 is an old tweet that is not listed here). The table is sorted by created time.

+----------+---------------------+-------------+--------------+-----------+
| tweet_id |     created_at      | reply_to_id | retweeted_id | quoted_id |
+----------+---------------------+-------------+--------------+-----------+
|      001 | 2020-02-24 15:51:17 | nan         | 000          | nan       |
|      002 | 2020-02-24 15:52:17 | nan         | nan          | nan       |
|      003 | 2020-02-24 15:53:17 | nan         | nan          | 999       |
|      004 | 2020-02-24 15:54:17 | 001         | nan          | nan       |
|      005 | 2020-02-24 15:55:17 | nan         | nan          | nan       |
|      006 | 2020-02-24 15:56:17 | nan         | 004          | 003       |
|      007 | 2020-02-24 15:57:17 | nan         | nan          | 003       |
|      008 | 2020-02-24 15:58:17 | nan         | nan          | 006       |
|      009 | 2020-02-24 15:59:17 | 006         | nan          | nan       |
|      010 | 2020-02-24 16:00:17 | nan         | 008          | nan       |
+----------+---------------------+-------------+--------------+-----------+

I am trying to write a function to find the interaction history of a single tweet. e.g. 010 retweeted 008, 008 quoted 006, 006 retweeted 004 and also quoted 003, 004 replied to 001, 003 quoted 999. I would like this function to return a list of tweets that traces back 010's history.

In other words, I would like:

input: '010'
output: ['008', '006', '004', '003', '001', '999']

code to generate this toy dataframe:

import numpy as np
import pandas as pd

# Pass the rows as a plain list: wrapping them in np.array would turn np.nan into the string 'nan'.
# Rows 004 and 006 are filled in to match the table above.
df = pd.DataFrame(
        [['001','2020-02-24 15:51:17',np.nan,'000',np.nan],
        ['002','2020-02-24 15:52:17',np.nan,np.nan,np.nan],
        ['003','2020-02-24 15:53:17',np.nan,np.nan,'999'],
        ['004','2020-02-24 15:54:17','001',np.nan,np.nan],
        ['005','2020-02-24 15:55:17',np.nan,np.nan,np.nan],
        ['006','2020-02-24 15:56:17',np.nan,'004','003'],
        ['007','2020-02-24 15:57:17',np.nan,np.nan,'003'],
        ['008','2020-02-24 15:58:17',np.nan,np.nan,'006'],
        ['009','2020-02-24 15:59:17','006',np.nan,np.nan],
        ['010','2020-02-24 16:00:17',np.nan,'008',np.nan]],
        columns = ['tweet_id', 'created_at', 'reply_to_id', 'retweeted_id', 'quoted_id'])

I guess it might involve some kind of recursive search? I can only handle the case where there is a single type of interaction (for example, if tweets can only reply to each other). I am not sure how to handle a tweet like 006, which interacted with two tweets and so creates two branches. Hope to get some help from you guys!
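One possible way to deal with the branching is a breadth-first traversal with a queue instead of plain recursion: every tweet a node points to is queued, so 006 simply contributes two entries. The following is only a sketch under some assumptions. The names `trace_history`, `is_missing` and `LINK_COLUMNS` are hypothetical, the rows are passed as a list of dicts (the shape that `df.to_dict("records")` produces), and missing values may be either real NaN or the string 'nan'.

```python
from collections import deque

# Columns that point at another tweet (names taken from the table above).
LINK_COLUMNS = ("reply_to_id", "retweeted_id", "quoted_id")


def is_missing(value):
    """True for None, float NaN, or the literal string 'nan'."""
    return value is None or value != value or value == "nan"


def trace_history(records, start_id):
    """Return every tweet id reachable from start_id, in breadth-first order.

    `records` is a list of dicts, one per row (hypothetical input shape).
    """
    # Map each tweet to the ids it interacted with, skipping missing values.
    links = {
        row["tweet_id"]: [row[c] for c in LINK_COLUMNS if not is_missing(row[c])]
        for row in records
    }
    history, seen, queue = [], {start_id}, deque([start_id])
    while queue:
        current = queue.popleft()
        # Tweets absent from the table (e.g. 999) have no outgoing links.
        for target in links.get(current, []):
            if target not in seen:  # guards against duplicates and cycles
                seen.add(target)
                history.append(target)
                queue.append(target)
    return history


# Toy data mirroring the table: (tweet_id, reply_to_id, retweeted_id, quoted_id).
nan = float("nan")
rows = [
    ("001", nan, "000", nan), ("002", nan, nan, nan), ("003", nan, nan, "999"),
    ("004", "001", nan, nan), ("005", nan, nan, nan), ("006", nan, "004", "003"),
    ("007", nan, nan, "003"), ("008", nan, nan, "006"), ("009", "006", nan, nan),
    ("010", nan, "008", nan),
]
records = [dict(zip(("tweet_id",) + LINK_COLUMNS, r)) for r in rows]

print(trace_history(records, "010"))
# ['008', '006', '004', '003', '001', '999', '000']
```

With the table as given, 001 retweeted 000, so '000' also shows up at the end of the list. If only some interaction types should be followed, `LINK_COLUMNS` could be narrowed accordingly.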

Joseph Yang
  • You might want to look into the NetworkX library, as this is clearly a graph problem. – AKX Mar 23 '20 at 21:05
  • "How do I implement this application" is not a Stack Overflow issue. Repeat the intro tour, especially the parts of focusing a question. At the moment, you need to research how to build and traverse a tree structure or graph. – Prune Mar 23 '20 at 21:05

0 Answers