I am scraping some data from a website with Python.

I want to do two things:

  1. Skip the first two words, "UAE" and "Dubai", which appear in every scraping result.

  2. Save the last two words in two different variables, stripped of the extra spaces.

    try:
        area = soup.find('div', 'location')
        area_result = str(area.get_text().strip().encode("utf-8"))
        print "Area: ", area_result
    except StandardError as e:
        area_result = "Error was {0}".format(e)
        print area_result
    

`area_result` contains the following data:

'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Executive Towers \n            \n\n\n        \n\n\n\t    \n\t        \n\t    \n\t\n\n\n        \n        ;\n        \n            \n                \n                    1.4 km from Burj Khalifa Tower'

I want the above result to be displayed as follows (note the > between Executive Towers and 1.4 km):

Executive Towers > 1.4 km from Burj Khalifa Tower
user3265370
  • Could you possibly show us the string in its original format instead of a screenshot? Like so: `UAE >\n Dubai >\n ...`? Also, `strip()` only strips things at the beginning and end of strings. – Torxed Mar 28 '14 at 08:41
  • check the edited version – user3265370 Mar 28 '14 at 08:46
  • Your browser won't show you extraneous whitespace either. – Martijn Pieters Mar 28 '14 at 08:47
  • @user3265370 You're still not giving us a **STRING**, you're giving us screenshots, which we're not interested in. Can you please copy and paste the string **as is** instead of posting screenshots? I'm interested in why there are multiple `\n` and ` ` all over your data and what the raw data looks like, and I'd also like to copy and paste that data into my environment so I can work with it. I can't copy a string from a screenshot. **Note:** That last screenshot doesn't match the first one; there's no `Dubai Festival City;` in the first screenshot, even though that doesn't matter much. Consistency! – Torxed Mar 28 '14 at 08:49
  • check the edited version please – user3265370 Mar 28 '14 at 08:53
  • @user3265370 No that's not the correct string.. Obviously that was taken from the browser and not your own code. Do this instead `print([area_result])`! (A logical mind would think that the browser is doing something to the string, and if the problem is with the code then i should copy the string from the code and not from the end where it's working?) – Torxed Mar 28 '14 at 08:53
  • check the edited version again please – user3265370 Mar 28 '14 at 08:58
  • @user3265370 Please do `print([area_result])` and post **that** result... – Torxed Mar 28 '14 at 08:59
  • I am already posting that. i am using sublime and posting the result of my console screen – user3265370 Mar 28 '14 at 09:08
  • I'm using sublime and this is what my output looks like `['Area: UAE \u202a>\u202a\n\n Dubai \u202a>\u202a\n\n JLT Jumeirah Lake Towers \n\n\n\n\n\n\n\n\n\n\n\n\n\n ;\n\n\n\n 1.4 km from Marina Walk']`, The **difference** is that i'm printing with `[]` around the string and i leave the data as is, and if your sublime works differently run the code from a PROPER terminal... – Torxed Mar 28 '14 at 09:09
  • @user3265370 FINALLY!!!! – Torxed Mar 28 '14 at 09:13
  • yes. i got it. i am sorry for the mistake. please check the edit – user3265370 Mar 28 '14 at 09:13
  • @user3265370 Check my edit on my answer, solves your problem (verified) – Torxed Mar 28 '14 at 09:18
  • @user3265370 You're not running the latest version of my code.. because there's \t in there among other things. – Torxed Mar 28 '14 at 09:21
  • done! solves it perfectly! Thanks alot – user3265370 Mar 28 '14 at 09:27

2 Answers

area_result = area_result.replace("UAE", "")
area_result = area_result.replace("Dubai", "")
area_result =  area_result.strip()

Using a regular expression:

import re
area_result = re.sub(r'\s+', ' ', area_result)
area_result = area_result.replace("UAE ‪>‪ Dubai ‪>‪", "")
area_result =  area_result.strip()
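
A minimal sketch of the regular-expression approach above, assuming Python 3 and that the text is already a Unicode string (so the `\xe2\x80\xaa` bytes appear as the single character U+202A). The helper name `normalize_area` is hypothetical and the sample string is a shortened version of the one in the question:

```python
import re

def normalize_area(text):
    """Collapse whitespace and drop the leading 'UAE > Dubai >' prefix."""
    # Remove the U+202A (left-to-right embedding) marks around each '>'.
    text = text.replace('\u202a', '')
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r'\s+', ' ', text).strip()
    # Drop the prefix that is common to every result, if it is present.
    prefix = 'UAE > Dubai >'
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    return text

raw = ('UAE \u202a>\u202a\n            \n                Dubai \u202a>\u202a\n'
       '            \n                Business Bay \u202a>\u202a\n            \n'
       '                Executive Towers \n            \n\n\n        \n\t    \n'
       '        ;\n        \n            1.4 km from Burj Khalifa Tower')
print(normalize_area(raw))
# Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower
```

Checking for the prefix with `startswith` after normalising avoids having to paste the invisible U+202A characters into the pattern.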
salmanwahed
import string

def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Skip the special characters listed in `remove`.
        if c in remove:
            continue
        # Skip anything that is not printable (for instance \xe2).
        if c not in string.printable:
            continue
        # Skip a character that repeats the previous one (certain characters only).
        if newString and c == newString[-1] and c in ('\n', ' ', ',', '.'):
            continue
        newString += c
    return newString

Throw something like that in there to clean up your string.
The net result is:

>>> s = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Executive Towers \n            \n\n\n        \n\n\n\t    \n\t        \n\t    \n\t\n\n\n        \n        ;\n        \n            \n                \n                    1.4 km from Burj Khalifa Tower'
>>> cleanup(s)
'UAE > Dubai > Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower'

Here's a good SO reference to the string library.

Going back to the question, I see that the user doesn't want the leading blocks (the parts separated by >) to be present. To keep only the last block, simply do:

area_result = cleanup(area_result).split('>')[3].replace(';', '>').strip()
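
The question also asks for the last two parts in separate variables. A minimal sketch of that, assuming Python 3, a Unicode input string, and that the `;` always separates the area from the distance as it does in the sample. `split_area` is a hypothetical helper, not part of the answer above:

```python
def split_area(text):
    """Return (area, distance) from the scraped location text."""
    # str.split() with no arguments splits on any run of whitespace, so
    # re-joining the pieces collapses newlines, tabs and repeated spaces.
    flat = ' '.join(text.replace('\u202a', '').split())
    # Keep only what follows the last '>' breadcrumb separator.
    tail = flat.rsplit('>', 1)[-1]
    # The ';' separates the area name from the distance text.
    area, _, distance = tail.partition(';')
    return area.strip(), distance.strip()

sample = ('UAE \u202a>\u202a\n  Dubai \u202a>\u202a\n  Business Bay \u202a>\u202a\n'
          '  Executive Towers \n\n\t \n ;\n \n   1.4 km from Burj Khalifa Tower')
area, distance = split_area(sample)
print(area + ' > ' + distance)
# Executive Towers > 1.4 km from Burj Khalifa Tower
```

Using `rsplit` with a limit of 1 means the code does not depend on how many breadcrumb levels precede the last one, unlike a fixed index such as `[3]`.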
Community
Torxed
  • i want something which can be put in the line block of code which i already wrote – user3265370 Mar 28 '14 at 08:55
  • @user3265370 You put this at the top of your code, and all you have to do is: `print "Area: ",cleanup(area_result)` – Torxed Mar 28 '14 at 08:56
  • done. but the result is still not well formatted. please check the result in the edited version – user3265370 Mar 28 '14 at 09:02
  • @user3265370 Please post what I'm asking you to post and not your own version of the data... Because it looks like you have Unicode characters in there which need to be dealt with... – Torxed Mar 28 '14 at 09:03
  • I am sorry but what exactly you want me to post? – user3265370 Mar 28 '14 at 09:07
  • Save your code, open up a terminal, run the code where you **keep** the `[ ]` around `area_result` so that we can get a string that looks like this: `['Area: UAE \u202a>\u202a\n\n Dubai \u202a>\u202a\n\n JLT Jumeirah Lake Towers \n\n\n\n\n\n\n\n\n\n\n\n\n\n ;\n\n\n\n 1.4 km from Marina Walk']` – Torxed Mar 28 '14 at 09:11
  • @user3265370 also check my latest edit and see if that fixes it. – Torxed Mar 28 '14 at 09:11
  • check the result please. the 3rd and forth word still have some spaces – user3265370 Mar 28 '14 at 09:22
  • @user3265370 You didn't run the latest version of the code, refresh this page and copy my code again. Because there's `\t` (which is a tab) and it shouldn't be there if you run the latest code. – Torxed Mar 28 '14 at 09:23
  • @user3265370 You're welcome, next time post what we ask you to post.. Not some glorified version of it. The **truth** is always better than a made up story.. goes for data as well. – Torxed Mar 28 '14 at 09:27
  • sorry for that. i missed the [ ] – user3265370 Mar 28 '14 at 09:28
  • One more request. Can you add a delimiter between the 3rd and 4th word, so that later I can use Excel to separate them easily? – user3265370 Mar 28 '14 at 09:30
  • i want `>` between word 3 and word 4 – user3265370 Mar 28 '14 at 09:32
  • @user3265370 You mean between `'Business Bay > Executive Towers 1.4 km from Burj Khalifa Tower'`? Or which do you refer to as `3` and `4` word? Write the string as you want it to look like, and i'll make it happen. So write the string in `sublime` as you want it to look like and i'll code it... – Torxed Mar 28 '14 at 09:34
  • Sorry for the late reply. I want a `>` between Executive Towers and 1.4 km from Burj Khalifa Tower, so that it looks like: Executive Towers `>` 1.4 km from Burj Khalifa Tower – user3265370 Mar 28 '14 at 10:38
  • I have edited the question from you to make it more clear to understand – user3265370 Mar 28 '14 at 10:42
  • @user3265370 Fixed it. It's quite simple: I just made it so that ';' doesn't get removed, so you can use it as a "split" character. Or do `cleanup(area_result).replace(';', '>')`, or do the `replace` inside `cleanup`, up to you. – Torxed Mar 28 '14 at 10:44
  • can i ask one more question please? – user3265370 Mar 29 '14 at 08:32
  • http://stackoverflow.com/questions/22728966/removing-the-unwanted-spaces-from-string-via-python?noredirect=1#comment34640004_22728966 – user3265370 Mar 29 '14 at 08:57