I am collecting tweets with tweepy from a file containing more than 700,000 tweet IDs. Locally I have an i7 processor and 16 GB of RAM, and I run a for loop over these IDs to fetch each tweet's full text, geo, coordinates, and place attributes.
Using the tqdm library, I measured about 2.5 iterations per second, which is very slow.
After that I tried two SageMaker instances:

1. ml.m5.4xlarge = 64 GB RAM & 16 vCPUs
2. ml.m5.12xlarge = 192 GB RAM & 48 vCPUs
Both instances gave me the same result as my local machine, about 2.5 iterations per second. How can that be?
Then I tried SageMaker Studio Lab, which is free and gives me 12 hours per day. It was the fastest, at about 6 iterations per second.
```python
import tweepy
import pandas as pd
from tqdm import tqdm

sample = anger['id'][:100000]

tweets = []
geos = []
coords = []
places = []
ids = []

for i in tqdm(sample.index):
    tweet_id = sample[i]  # renamed from `id` to avoid shadowing the built-in
    try:
        # fetch the tweet with its full (untruncated) text
        status = api.get_status(tweet_id, tweet_mode='extended')
        text = status._json['full_text']
        geo = status.geo
        coordinates = status.coordinates
        place = status.place
        ids.append(tweet_id)
        tweets.append(text)
        geos.append(geo)
        coords.append(coordinates)
        places.append(place)
    except tweepy.errors.TweepyException:
        # skip deleted, protected, or otherwise unavailable tweets
        pass
```

Creating a DataFrame and saving it:

```python
data = {
    'id': ids,
    'text': tweets,
    'geo': geos,
    'coordinates': coords,
    'places': places,
}
anger_dataframe = pd.DataFrame(data)
anger_dataframe.to_csv('anger_dataframe.csv', encoding='utf-8-sig')
```
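I suspect the loop is network-bound rather than CPU-bound, since `get_status` makes one HTTP request per tweet. I considered batching the lookups instead. Here is a rough, untested sketch using Tweepy's `lookup_statuses`, which accepts up to 100 IDs per request (so 100,000 IDs would become about 1,000 requests):

```python
# Rough sketch (untested): batch the lookups with lookup_statuses,
# which takes up to 100 tweet IDs per call. Note that it silently
# drops deleted/unavailable tweets unless map=True is passed.
id_list = list(sample)
rows = []
for start in tqdm(range(0, len(id_list), 100)):
    batch = id_list[start:start + 100]
    try:
        statuses = api.lookup_statuses(batch, tweet_mode='extended')
    except tweepy.errors.TweepyException:
        continue  # skip the whole batch on an API error
    for status in statuses:
        rows.append({
            'id': status.id,
            'text': status.full_text,
            'geo': status.geo,
            'coordinates': status.coordinates,
            'places': status.place,
        })

anger_dataframe = pd.DataFrame(rows)
```

Would this work around the per-request overhead, or is the bottleneck the API rate limit itself?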
My first question: why do the SageMaker instances, with all those resources, give me the same speed as my local machine? And although SageMaker Studio Lab gives me about 6 iterations per second, that is still not enough for me: I have 6 million IDs to collect.
Second question: is there any way to enhance the for loop so it runs faster?
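For example, I thought about overlapping the requests with a thread pool, since each iteration spends most of its time waiting on the network. A sketch of the idea (assuming the shared `api` object can safely be used from multiple threads, which I have not verified):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(tweet_id):
    """Fetch one tweet; return None on any API error."""
    try:
        status = api.get_status(tweet_id, tweet_mode='extended')
        return {
            'id': tweet_id,
            'text': status._json['full_text'],
            'geo': status.geo,
            'coordinates': status.coordinates,
            'places': status.place,
        }
    except tweepy.errors.TweepyException:
        return None

rows = []
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, tweet_id) for tweet_id in sample]
    for future in tqdm(as_completed(futures), total=len(futures)):
        result = future.result()
        if result is not None:
            rows.append(result)
```

Would something like this actually help, or would Twitter's rate limit cap the throughput no matter how many workers I use?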