How to download the content of an url in a pandas dataframe with python-twitter?

Question

I have an XML file like this:

<author ="twitter" lang="english" type="xx" age_misc="xx" url="https://twitter.com/Carmen_RRHH">
    <documents count="436">
        <document id="106259332342342348513" url="https://twitter.com/Carmen_RRHH/status/106259338234048513">       </document>
        <document id="232342342342323423" url="https://twitter.com/Carmen_RRHH/status/106260629999992832">      </document>
        <document id="107084815504908291" url="https://twitter.com/Carmen_RRHH/status/107084815504908291">      </document>
        <document id="108611036164276224" url="https://twitter.com/Carmen_RRHH/status/108611036164276224">      </document>
        <document id="23423423423423" url="https://twitter.com/Carmen_RRHH/status/108611275851956224">      </document>
        <document id="109283650823423480806912" url="https://twitter.com/Carmen_RRHH/status/109283650880806912">        </document>
        <document id="10951489623423290488320" url="https://twitter.com/Carmen_RRHH/status/109514896290488320">     </document>
        <document id="1095159513234234355080704" url="https://twitter.com/Carmen_RRHH/status/109515951355080704">       </document>
        <document id="96252622234239511966720" url="https://twitter.com/Carmen_RRHH/status/96252629511966720">      </document>
    </documents>
</author>

Is it possible to get the content of these links and place it into a pandas DataFrame? Any idea how to approach this task? Thanks in advance.

score 3 · Answer 1 · answered Feb 04 '15 at 07:12

Since you have access to Python, requests is a good choice:

import requests
r = requests.get("https://twitter.com/Carmen_RRHH/status/106259338234048513")

r.text  # the HTML as a str (r.content gives the raw bytes)

However, to get them into a pandas DataFrame, the content needs to be structured (like a table), which raw HTML generally is not...
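
One part of the input that is already table-shaped is the XML itself. As a rough sketch, the document ids and URLs can be pulled out with the standard library and then handed to pandas. The names `SAMPLE_XML`, `extract_documents` and `to_dataframe` are made up for illustration, and the sample drops the stray `="twitter"` attribute from the question's XML on the assumption that it is a copy-paste slip (the parser would reject it otherwise).

```python
import xml.etree.ElementTree as ET

# Trimmed sample of the XML from the question. The stray `="twitter"`
# attribute is dropped here (an assumption) so that the snippet parses.
SAMPLE_XML = """
<author lang="english" url="https://twitter.com/Carmen_RRHH">
    <documents count="2">
        <document id="107084815504908291" url="https://twitter.com/Carmen_RRHH/status/107084815504908291"></document>
        <document id="108611036164276224" url="https://twitter.com/Carmen_RRHH/status/108611036164276224"></document>
    </documents>
</author>
"""


def extract_documents(xml_text):
    """Return one dict per <document> element: its id, URL and status id."""
    root = ET.fromstring(xml_text)
    records = []
    for doc in root.iter("document"):
        url = doc.get("url")
        records.append({
            "document_id": doc.get("id"),
            "url": url,
            # The status id is taken from the last path segment of the URL.
            "status_id": url.rstrip("/").rsplit("/", 1)[-1],
        })
    return records


def to_dataframe(records):
    """Build a DataFrame with one row per document (requires pandas)."""
    import pandas as pd  # imported here so the parsing step works without pandas
    return pd.DataFrame(records)


records = extract_documents(SAMPLE_XML)
```

Calling `to_dataframe(records)` should then give one row per tweet with `document_id`, `url` and `status_id` columns. The status id is read from the URL rather than the `id` attribute because the two do not always agree in the question's data.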

I recommend looking into the Twitter API, or an existing Twitter client for Python, e.g. https://github.com/bear/python-twitter. That way you can extract the features you want cleanly (as columns) rather than munging them from HTML.
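
A minimal sketch of the client route, assuming python-twitter is installed and you have your own API credentials (whether your account's access level still permits these lookups is something to check). `fetch_statuses` and `make_api` are hypothetical helper names; `twitter.Api` and `GetStatus` come from python-twitter, and the three fields kept here are just an example selection.

```python
def make_api(consumer_key, consumer_secret, access_token_key, access_token_secret):
    """Create a python-twitter client from your own credentials."""
    import twitter  # the python-twitter package; imported lazily
    return twitter.Api(
        consumer_key=consumer_key,
        consumer_secret=consumer_secret,
        access_token_key=access_token_key,
        access_token_secret=access_token_secret,
    )


def fetch_statuses(api, status_ids):
    """Look up each status id through the client and return flat records."""
    rows = []
    for status_id in status_ids:
        status = api.GetStatus(status_id)  # one API request per tweet
        rows.append({
            "status_id": status.id,
            "created_at": status.created_at,
            "text": status.text,
        })
    return rows
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)`, giving one column per field. Lookups for deleted or protected tweets may raise an error, so a real script would probably want to catch and skip those.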

answered Feb 04 '15 at 07:12

Andy Hayden


Thanks for the feedback. Another thing that comes to mind: I would like to scrape several URLs. How can I avoid being banned by Twitter? Do you think that is possible (getting banned by Twitter)? – newWithPython Feb 05 '15 at 05:15
@newWithPython I think it depends on how much you're downloading; IIRC the limit is relatively high. That aspect specifically may be best answered as a separate question, since it'll get more eyes on it. – Andy Hayden Feb 05 '15 at 05:52
Thank you very much, I will update the status of this and ask another question. – newWithPython Feb 05 '15 at 06:00
