
I have been mining tweets. The main problem I have been facing is that I have to encode the tweets to UTF-8 and then write them to a file.

My current method:

def on_data(self, data):
    f = open('new', 'w')
    dict1 = json.loads(data)
    val = dict1["text"]
    val = codecs.encode(val, "utf-8", "ignore")
    var.x += 1
    f.write(str(var.x) + "\t" + val + "\n")
    return True

Any way to speed up this process?

    I *highly* doubt that encoding and saving tweets is the bottleneck. Most likely the API is just slow. Profile to be sure! – nneonneo Jun 26 '14 at 07:44
  • What is the time now? What is the time you expect? How fast does Twitter reply with data? How fast is the medium you're writing to? etc., etc. (To conclude, there are too many variables involved.) – RvdK Jun 26 '14 at 07:47
  • This is unrelated, but since you only need the text field you might want to use the `on_status` method – user2963623 Jun 26 '14 at 07:47
  • @nneonneo: When I write it to standard output, it's pretty fast, like really fast. But it slows down drastically when I try to write it to a file. Does that still mean the API is slow? – sidchelseafan Jun 26 '14 at 07:57
  • Writing to a file might not appear on disk immediately if the write is buffered, so it might seem slower than stdout while in fact it's much faster (see the sketch after these comments). – Kien Truong Jun 26 '14 at 08:08
  • @Dikei: Any ideas to fix this problem? The difference between writing to a file and stdout is pretty significant. The thing is, when I write to stdout I don't have to encode the text to UTF-8, so I get hundreds of tweets every 5-10s; but when I write to a file I have to encode, otherwise I get a UnicodeEncodeError, and it slows down drastically. – sidchelseafan Jun 26 '14 at 08:16
  • Are you aware that you're creating a new empty file every time `on_data()` is called? And since you're never closing the file, you have to wait until Python's garbage collection comes around, figures out that the file handle is no longer needed and closes the file for you. Every time. – Tim Pietzcker Jun 26 '14 at 08:21
  • @TimPietzcker: My bad, that fixed the problem. Thanks for your help, and Dikei, you too. Thanks a lot! :) – sidchelseafan Jun 26 '14 at 08:33
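
A quick illustration of the buffering point raised above (a sketch; the file name is arbitrary): writes to a file land in an in-memory buffer first and may not be visible on disk until the buffer is flushed or the file is closed, even though the write call itself returns quickly.

import sys

with open('out.txt', 'w') as f:
    f.write('hello\n')  # goes into an in-memory buffer, not necessarily to disk yet
    f.flush()           # force the buffer out so the line is visible immediately
# leaving the with block closes the file, which also flushes anything left

sys.stdout.write('hello\n')  # stdout is typically line-buffered on a terminal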

1 Answer


You're not closing the file, which means you have to wait until Python figures out that the file handle is no longer in use and closes it for you.

Assuming that you actually want to create a new empty file every time `on_data()` is called, you can use a `with` statement to have Python close the file for you when the `with` block is exited:

def on_data(self, data):
    dict1 = json.loads(data)
    val = dict1["text"]
    val = codecs.encode(val, "utf-8", "ignore")
    var.x += 1
    with open('new', 'w') as f:
        f.write(str(var.x) + "\t" + val + "\n")
    return True
Tim Pietzcker
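
The fix above still truncates `new` on every call, as the answer's opening assumption notes. If the intent is instead to accumulate tweets in one file, a minimal variation (a sketch reusing the question's names; `var` is assumed to be the question's shared counter object) opens the file in append mode:

def on_data(self, data):
    dict1 = json.loads(data)
    val = codecs.encode(dict1["text"], "utf-8", "ignore")
    var.x += 1
    # 'a' appends to the file; 'w' would truncate it on every call
    with open('new', 'a') as f:
        f.write(str(var.x) + "\t" + val + "\n")
    return True

Opening the file once outside `on_data()` and reusing the handle would avoid the per-call open/close cost entirely, at the price of managing the handle's lifetime yourself.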