How can we store the value of the tag <span class="sdr-full-width">हिन्दी</span>, that is "हिन्दी", in a variable? I tried to extract it with an XPath expression, but I get the escaped Unicode characters \u0939\u093f\u0928\u094d\u0926\u0940 instead.

Serge Ballesta
Pradeep Mishra

1 Answer

Then you got it right!

If your environment can display Devanagari characters, this code:

t = u"\u0939\u093f\u0928\u094d\u0926\u0940"
print(t)

should display

हिन्दी

With the help of the unicodedata module, I can even list it one character at a time:

>>> import unicodedata
>>> for c in t:
...     print(c, unicodedata.name(c))


ह DEVANAGARI LETTER HA
ि DEVANAGARI VOWEL SIGN I
न DEVANAGARI LETTER NA
् DEVANAGARI SIGN VIRAMA
द DEVANAGARI LETTER DA
ी DEVANAGARI VOWEL SIGN II
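
As a small check of the point above, the following sketch (assuming Python 3, where every `str` is Unicode) compares the escaped form with the literal word. The variable names `escaped` and `literal` are only illustrative.

```python
# Python 3 assumed: the \u escapes are just another way to write the same text.
escaped = "\u0939\u093f\u0928\u094d\u0926\u0940"
literal = "हिन्दी"

# Both spellings produce the same string of six code points.
assert escaped == literal
print(len(escaped))

# print() shows the characters; ascii() shows an ASCII-only form with \u escapes.
print(escaped)
print(ascii(escaped))
```

Seeing `\u0939...` in a repr or another ASCII-only display therefore does not mean the extraction went wrong; it is the same string shown in escaped notation.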

I cannot say more because I really do not understand the meaning of the word...

Serge Ballesta
  • It is the name of a language (हिन्दी) spoken in India. When I tried t = "\u0939\u093f\u0928\u094d\u0926\u0940" I got \u0939\u093f\u0928\u094d\u0926\u0940 back, but when I add the u prefix before the string, as in u"\u0939\u093f\u0928\u094d\u0926\u0940", it gives the proper result. The thing is, the value is already stored in a variable and I want to display it as हिन्दी. The actual problem is as follows: item['Languages'] = response.xpath('//p/b/text()').extract_first(), and I got \u0939\u093f\u0928\u094d\u0926\u0940 as item['language'] – Pradeep Mishra Aug 18 '17 at 09:19
  • @PradeepMishra: My bad, I forgot the u because I did my test with Python 3.6, which treats strings as unicode. You should try `print(item['Languages'])` and `print(repr(item['Languages']))` and tell me *exactly* what values are displayed. – Serge Ballesta Aug 18 '17 at 09:37
  • I am using the item in a spider, so I use it with yield item, and it gives me the same \u0939\u093f\u0928\u094d\u0926\u0940. I also tried it in the scrapy shell with print(repr(item['Language'])) and it gives me the same output. – Pradeep Mishra Aug 18 '17 at 09:48
  • @PradeepMishra: at least `print(repr(item['Language']))` should contain quotation marks. The problem could be caused by extra quotation marks, which is why I asked you for the *exact* output. – Serge Ballesta Aug 18 '17 at 09:53
  • print(repr(item['Language'])) gives u'\u0939\u093f\u0928\u094d\u0926\u0940' – Pradeep Mishra Aug 18 '17 at 09:56
  • @PradeepMishra: and what *exactly* does `print(item['Language'])` give? – Serge Ballesta Aug 18 '17 at 10:03
  • Oh great! It gives the output **हिन्दी**. But how can I write this to JSON or a database? In the spider I create a dictionary like item={} and store each value in it, like item['name'], item['language'], etc., and at the end I use the yield keyword to return the item. I run `scrapy crawl spiderName -o name.json` to get the item values in a JSON file. – Pradeep Mishra Aug 18 '17 at 10:12
  • @PradeepMishra: If print gives the correct output, it just means that you have extracted the correct unicode string. According to [this other SO post](https://stackoverflow.com/a/4908960/3545273), JSON strings are explicitly Unicode, so in JSON, "\u0939\u093f\u0928\u094d\u0926\u0940" is a correct representation of हिन्दी. IMHO, you really have correctly extracted what you want. – Serge Ballesta Aug 18 '17 at 10:24
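
To illustrate the last comment, here is a minimal sketch using only the standard `json` module (Python 3 assumed). The `item` dictionary is a hypothetical stand-in for the scraped item, not the asker's actual spider code.

```python
import json

# Hypothetical stand-in for the item yielded by the spider.
item = {"Language": "\u0939\u093f\u0928\u094d\u0926\u0940"}

# By default json.dumps escapes every non-ASCII character as \uXXXX.
escaped_json = json.dumps(item)

# ensure_ascii=False keeps the characters as they are.
readable_json = json.dumps(item, ensure_ascii=False)

print(escaped_json)
print(readable_json)

# Either document decodes back to the same Unicode string.
assert json.loads(escaped_json) == json.loads(readable_json) == item
```

For the `scrapy crawl spiderName -o name.json` export itself, recent Scrapy versions document a `FEED_EXPORT_ENCODING` setting; setting it to `'utf-8'` in the project settings should make the JSON feed contain the readable characters instead of the escapes, although I have not checked this against the asker's Scrapy version.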