How can we store the value of the tag <span class="sdr-full-width">हिन्दी</span>,
that is "हिन्दी", in a variable? I tried to extract it with an XPath expression, but I am getting the Unicode escapes
\u0939\u093f\u0928\u094d\u0926\u0940 instead.

Serge Ballesta

Pradeep Mishra
1 Answer
Then you got it right!
If your environment can display Devanagari characters, this code:
t = u"\u0939\u093f\u0928\u094d\u0926\u0940"
print(t)
should display
हिन्दी
With the help of the unicodedata module, I could even list it one character at a time (Python 3):
>>> import unicodedata
>>> for c in t:
...     print(c, unicodedata.name(c))
ह DEVANAGARI LETTER HA
ि DEVANAGARI VOWEL SIGN I
न DEVANAGARI LETTER NA
् DEVANAGARI SIGN VIRAMA
द DEVANAGARI LETTER DA
ी DEVANAGARI VOWEL SIGN II
I cannot say more because I really do not understand the meaning of the word...
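
To make the point concrete, here is a minimal sketch (Python 3 assumed; the variable names are only illustrative) showing that the escaped form and the displayed form are the same string. The escapes only show up when you ask for an ASCII-safe representation, which is roughly what `repr` does for unicode strings on Python 2.

```python
# The same string written with escapes; nothing is "wrong" with the value.
t = "\u0939\u093f\u0928\u094d\u0926\u0940"

# The escaped literal and the literal typed in Devanagari compare equal.
same = (t == "हिन्दी")

# Six code points, matching the six names listed above.
length = len(t)

# ascii() returns an ASCII-only representation, with quotes and \u escapes.
escaped = ascii(t)

print(same, length)
print(escaped)   # representation: escapes inside quotes
print(t)         # the text itself, if the terminal can render Devanagari
```

So seeing `\u0939...` in a `repr` or in a log line does not mean the extraction failed; it only tells you how the value was rendered.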

Serge Ballesta
-
It is the name of a language (हिन्दी) spoken in India. When I tried t = "\u0939\u093f\u0928\u094d\u0926\u0940" I got \u0939\u093f\u0928\u094d\u0926\u0940, but when I add the u prefix before the string, as in u"\u0939\u093f\u0928\u094d\u0926\u0940", it gives the proper result. The thing is, the value is already stored in a variable and I want to display it as हिन्दी. The actual problem is as follows: with item['Languages'] = response.xpath('//p/b/text()').extract_first() I get \u0939\u093f\u0928\u094d\u0926\u0940 as item['language'] – Pradeep Mishra Aug 18 '17 at 09:19
-
@PradeepMishra: My bad, I forgot the u because I did my test with Python 3.6, which treats strings as Unicode. You should try `print(item['Languages'])` and `print(repr(item['Languages']))` and report *exactly* what is displayed. – Serge Ballesta Aug 18 '17 at 09:37
-
I am using the item in a spider, so I use it with yield item, and it gives me the same \u0939\u093f\u0928\u094d\u0926\u0940. I also tried print(repr(item['Language'])) in the scrapy shell and it gives me the same output. – Pradeep Mishra Aug 18 '17 at 09:48
-
@PradeepMishra: at least `print(repr(item['Language']))` should contain quotation marks. The problem could be caused by extra quotation marks; that is why I asked you for the *exact* output. – Serge Ballesta Aug 18 '17 at 09:53
-
print(repr(item['Language'])) gives u'\u0939\u093f\u0928\u094d\u0926\u0940' – Pradeep Mishra Aug 18 '17 at 09:56
-
@PradeepMishra: and what *exactly* does `print(item['Language'])` give? – Serge Ballesta Aug 18 '17 at 10:03
-
Oh great! It gives the output **हिन्दी**. But how can I write this to JSON or a database? In the spider I create a dictionary like item = {}, store each value as item['name'], item['language'], etc., and at the end use the yield keyword to return the item. I run `scrapy crawl spiderName -o name.json` to get the item values in a JSON file. – Pradeep Mishra Aug 18 '17 at 10:12
-
@PradeepMishra: If print gives the correct output, it just means that you have extracted the correct Unicode string. According to [this other SO post](https://stackoverflow.com/a/4908960/3545273), JSON strings explicitly support Unicode escapes, so in JSON, "\u0939\u093f\u0928\u094d\u0926\u0940" is a correct representation of हिन्दी. IMHO, you really have extracted exactly what you want. – Serge Ballesta Aug 18 '17 at 10:24
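
As a hedged illustration of that last point, the sketch below uses only the standard `json` module (Python 3 assumed; the `item` dict is a stand-in for the spider's item, not the actual Scrapy code). It shows that the escaped and unescaped JSON texts decode to the same value, and that `ensure_ascii=False` is what switches `json.dumps` to the readable form.

```python
import json

# Stand-in for the scraped item; the key name is only illustrative.
item = {"Language": "\u0939\u093f\u0928\u094d\u0926\u0940"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are in the output string.
readable = json.dumps(item, ensure_ascii=False)

# Both texts are valid JSON and decode to exactly the same data.
round_trip_ok = json.loads(escaped) == json.loads(readable) == item

print(escaped)
print(readable)
print(round_trip_ok)
```

For the `-o name.json` feed export itself, Scrapy has a `FEED_EXPORT_ENCODING` setting; setting it to `'utf-8'` in the project settings should, as far as I know, produce the unescaped form, but check the documentation for your Scrapy version. Either way, any JSON reader will give back हिन्दी from both forms.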