I have a collection of about 1.4 million tweets in a MongoDB collection. I want to find all that are NOT retweets, and am using Python. The structure of a document is as follows:

{
  '_id': ObjectId('59388c046b0c1901172555b9'), 
  'coordinates': None, 
  'created_at': datetime.datetime(2016, 8, 18, 17, 17, 12),
  'geo': None,
  'is_quote': False,
  'lang': 'en',
  'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s',
  'tw_id': 766323071976247296,
  'user_id': 2231233110,
  'user_lang': 'en',
  'user_loc': 'main; @Kan1shk3',
  'user_name': 'sheezy0',
  'user_timezone': 'Chennai'
}

I can write a query that works to find the particular tweet from above:

twitter_mongo_collection.find_one({
  'text': b'Adam Cole Praises Kevin Owens + A Preview For Next Week\xe2\x80\x99s'
})

But when I try to find retweets, my code doesn't work. For example, I try to find any tweets that start like this:

'text': b'RT some tweet'

Using this query:

find_one( {'text': {'$regex': "/^RT/" } }  )

It doesn't return an error, but it doesn't find anything. I suspect it has something to do with the 'b' prefix before the text. I know I also need to put '$not' in there somewhere, but I'm not sure where.

Thanks!

  • I forgot to mention that I also tried this: regx = re.compile("^RT") twitter_mongo_collection.find_one({"text": regx}) and it didn't return anything. – Emma Freeman Jun 08 '17 at 03:19
  • When using `$regex` with a "string" you don't include the slashes, so `{ '$regex': '^RT' }` is the correct syntax. Usage with `re.compile` should also be correct though. Can you at least identify a document you think you "should match"? Showing that might point to the problem. – Neil Lunn Jun 08 '17 at 03:26
  • Thanks! It still doesn't work with the slashes removed. A document that should match is : 'text': b'RT @hewittsprints: Professional, Personalised, Design and Print Service Cards available to order'. I'm not sure what the 'b' is, but since it was necessary for my other search query to work, I'm assuming it's important here too, I just don't know how to include it. – Emma Freeman Jun 08 '17 at 03:29
  • It means it's a byte string literal. I don't "think" it should have an effect. What about just testing for `"RT"` ( as a test ) in case it's not in fact the actual start, so drop the anchor from the regex. I also thought twitter feeds actually had a boolean value for "retweet", which maybe has been stripped from your data but really should be there since that would be better than a regex. – Neil Lunn Jun 08 '17 at 03:33
  • Maybe `re.compile(b'^RT')`? Taken from [Regular expression parsing a binary file?](https://stackoverflow.com/questions/5618988/regular-expression-parsing-a-binary-file) – Neil Lunn Jun 08 '17 at 03:44

2 Answers

0

It looks like your regex search is trying to match the string
b'RT'
but you want to match strings like
b'RT some text afterwards'

Try using this regex instead (when `$regex` is given as a string, the pattern is written without the surrounding slashes):
find_one({'text': {'$regex': '^RT.*'}})

0

I had to decode the 'text' field that was encoded as binary. Then I was able to use

import re
twitter_mongo_collection.find_one({'text': {'$not': re.compile("^RT.*")}})

to find a document whose text did not start with "RT". Swapping find_one for find returns all such documents.
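
For anyone hitting the same problem, here is a minimal sketch of the decode-then-query approach described above. It assumes the 'text' bytes are UTF-8 (the sample document's `\xe2\x80\x99` is consistent with that) and that `twitter_mongo_collection` is a PyMongo collection. The helper names `decode_text`, `non_retweet_query` and `migrate_text_to_str` are made up for illustration. My understanding is that Python 3 `bytes` are stored as BSON binary, which `$regex` does not match against, so the field has to become a string before the regex can work.

```python
import re

# Hypothetical helper names; only `twitter_mongo_collection` comes from the question.


def decode_text(doc):
    """Return a copy of `doc` whose 'text' is a str (assumes UTF-8 bytes)."""
    text = doc.get("text")
    if isinstance(text, (bytes, bytearray)):
        return dict(doc, text=bytes(text).decode("utf-8"))
    return doc


def non_retweet_query():
    """Filter for documents whose 'text' does not start with 'RT'."""
    # $not takes a compiled pattern here, not a plain string.
    return {"text": {"$not": re.compile("^RT")}}


def migrate_text_to_str(collection):
    """Rewrite every binary 'text' field in place as a decoded string."""
    binary_text = {"text": {"$type": "binData"}}
    for doc in collection.find(binary_text, {"text": 1}):
        decoded = decode_text(doc)["text"]
        collection.update_one({"_id": doc["_id"]}, {"$set": {"text": decoded}})


# Intended usage against a live database (not executed here):
#   migrate_text_to_str(twitter_mongo_collection)
#   originals = twitter_mongo_collection.find(non_retweet_query())
```

The `^RT` anchor is enough on its own, since a trailing `.*` does not change which documents match. Note that `$not` also matches documents that have no 'text' field at all.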