Regular expression with Lao?

Question

In Python, I would like to extract only the Lao characters from this HTML, and only from inside the "textarea" tag:

<font color="Red">ພິມຄໍາສັບລາວ ຫຼື ອັງກິດແລ້ວກົດປຸ່ມຄົ້ນຫາ - Enter English or Lao Then Hit Search</font><br />
<center><table id='display' border='0' width='100%'>
  <tr>
    <td id='lao2' colspan='3' style='height: 18px; text-align: left'>
      <span style='color: #660033'><span style='font-size: 12pt'>&nbsp;&nbsp;&nbsp;</span></span>&nbsp;&nbsp;
    </td>
  </tr>
  <tr>
    <td style='width: 120px'>&nbsp;</td>
    <td style='width: 192px'>
      <textarea ID='lao' Font-Name='Phetsarath OT' Font-Size='12' rows='10' cols='84' readonly='readonly'>
    1.  (loved, loving)
      1. ຮັກ
      2. ມັກຫຼາຍ
      3. would love ຢາກໄດ້ຫຼາຍ, ຢາກເຮັດຫຼາຍ
      ປະເພດ: ຄໍາກໍາມະ
      ການອອກສຽງ: ເລັຟ

    2.
      1. ຄວາມຮັກ
      2. ຄົນຮັກ, ຄູ່ຮັກ, ສິ່ງທີ່ເຈົ້າຮັກ
      3. ທີ່ຮັກ, (ເທັນນິດ) ສູນ
      be in love with ຮັກຜູ້ໃດຜູ້ໜຶ່ງ
      make love ຮ່ວມປະເວນີ
      ປະເພດ: ຄຳນາມ
      ການອອກສຽງ: ເລັຟ
      </textarea>
    </td>
    <td style='width: 284px'>&nbsp;&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td>&nbsp;</td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td id='lao1' align='center'>ກະຊວງ ໄປສະນີ, ໂທລະຄົມມະນາຄົມ ແລະ ການສື່ສານ</td><td>&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td id='lao1' align='center'>ສູນບໍລິຫາລັດດ້ວຍເອເລັກໂຕຣນິກ</td><td>&nbsp;</td>
  </tr>
</table></center><br />

I just want the value in the "textarea". What should I do?

@dda: I don't think you should go around and reformat example HTML; especially inside the textarea, whitespace is significant and you changed the contents by adding indentation and newlines where there were none before. — Martijn Pieters, May 09 '13 at 14:47

Martijn Pieters · Accepted Answer · 2013-05-09T14:22:55.397

Don't use a regular expression to parse the HTML. Use an HTML parser instead. BeautifulSoup makes the task easy:

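from bs4 import BeautifulSoup

# htmltext holds the HTML source shown in the question
soup = BeautifulSoup(htmltext, 'html.parser')
# .string is the single text node inside <textarea id="lao">
text = soup.find('textarea', id='lao').string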

If you then need to limit the result to just Lao characters, you can further process the text variable.

However, the Python re module isn't that strong (yet) when it comes to Unicode; it has no support for script properties such as \p{Lao}. Your options are to use a regular expression that matches code points in the Lao block (U+0E80–U+0EFF), to use the unicodedata module and filter on each character's Unicode name, or to use the third-party regex library, which can match Lao characters by script.

Using a regular expression:

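import re

# Python 2 syntax: the ur'' prefix is not valid in Python 3
# Character class covering the Lao Unicode block, U+0E80 to U+0EFF
lao_codepoints = re.compile(ur'[\u0e80-\u0eff]', re.UNICODE)
lao_text = u''.join(lao_codepoints.findall(text))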

Demo:

>>> print u''.join(lao_codepoints.findall(text))
ຮັກມັກຫຼາຍຢາກໄດ້ຫຼາຍຢາກເຮັດຫຼາຍປະເພດຄໍາກໍາມະການອອກສຽງເລັຟຄວາມຮັກຄົນຮັກຄູ່ຮັກສິ່ງທີ່ເຈົ້າຮັກທີ່ຮັກເທັນນິດສູນຮັກຜູ້ໃດຜູ້ໜຶ່ງຮ່ວມປະເວນີປະເພດຄຳນາມການອອກສຽງເລັຟ
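
The snippet above is Python 2 code. As a minimal sketch of the same idea in Python 3 (an assumption about your environment, not part of the original answer), the character class can be written as an ordinary string pattern. The helper name lao_only and its sep parameter are hypothetical and only here for illustration; sep lets you keep separate runs of Lao text apart instead of gluing them together as in the demo.

```python
import re

# One or more consecutive code points from the Lao block (U+0E80 to U+0EFF).
# Python 3 strings are Unicode already, so no u/ur prefix is needed.
LAO_RUN = re.compile(r'[\u0e80-\u0eff]+')


def lao_only(text, sep=''):
    """Return only the Lao-block characters of text.

    Runs of Lao characters are joined with sep; the default ''
    gives the same result as joining single-character matches.
    """
    return sep.join(LAO_RUN.findall(text))


# Sample line taken from the textarea in the question.
sample = '3. would love ຢາກໄດ້ຫຼາຍ, ຢາກເຮັດຫຼາຍ'
joined = lao_only(sample)         # all Lao characters, concatenated
spaced = lao_only(sample, ' ')    # each run separated by a space
```

Passing a separator is a design choice rather than a requirement: it keeps the phrases from running together, which may be easier to read than one unbroken string.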

Using the unicodedata module:

import unicodedata

lao_text = u''.join([ch for ch in text if unicodedata.name(ch, '').startswith('LAO')])

Demo:

>>> print u''.join([ch for ch in text if unicodedata.name(ch, '').startswith('LAO')])
ຮັກມັກຫຼາຍຢາກໄດ້ຫຼາຍຢາກເຮັດຫຼາຍປະເພດຄໍາກໍາມະການອອກສຽງເລັຟຄວາມຮັກຄົນຮັກຄູ່ຮັກສິ່ງທີ່ເຈົ້າຮັກທີ່ຮັກເທັນນິດສູນຮັກຜູ້ໃດຜູ້ໜຶ່ງຮ່ວມປະເວນີປະເພດຄຳນາມການອອກສຽງເລັຟ
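
The name-based filter is not quite the same test as a code point range, because it only accepts characters that actually have a name starting with LAO. A rough Python 3 sketch of that filter follows; the function names is_lao and lao_by_name are made up for this example. To my knowledge U+0E80 lies inside the Lao block but is unassigned, so it has no name and is rejected here, while a plain 0E80–0EFF range would accept it.

```python
import unicodedata


def is_lao(ch):
    """True if the character's Unicode name starts with 'LAO'."""
    # The '' default avoids a ValueError for code points without a name.
    return unicodedata.name(ch, '').startswith('LAO')


def lao_by_name(text):
    """Keep only the characters whose Unicode name marks them as Lao."""
    return ''.join(ch for ch in text if is_lao(ch))


# Sample line taken from the textarea in the question.
sample = '1.  (loved, loving) ຮັກ'
lao_part = lao_by_name(sample)
```

For real text the two approaches should give the same result, as the demos suggest; the difference only shows up for unassigned code points.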

Using the regex module:

import regex

lao_codepoints = regex.compile(ur'\p{Lao}', regex.UNICODE)
lao_text = u''.join(lao_codepoints.findall(text))

Demo:

>>> print u''.join(lao_codepoints.findall(text))
ຮັກມັກຫຼາຍຢາກໄດ້ຫຼາຍຢາກເຮັດຫຼາຍປະເພດຄໍາກໍາມະການອອກສຽງເລັຟຄວາມຮັກຄົນຮັກຄູ່ຮັກສິ່ງທີ່ເຈົ້າຮັກທີ່ຮັກເທັນນິດສູນຮັກຜູ້ໃດຜູ້ໜຶ່ງຮ່ວມປະເວນີປະເພດຄຳນາມການອອກສຽງເລັຟ

nice, but he wants to have `only Lao Characters`. I think regex is needed? — Kent, May 09 '13 at 13:55
I don't really know what OP wants, but I think this answer is reasonable. The text wouldn't make sense with just Lao characters; it also contains spaces and other common characters. — nhahtdh, May 09 '13 at 13:58
@Kent: I am not 100% certain that he only wants the Lao unicode points from the text area; `I just want a value in "textarea" tag`. — Martijn Pieters, May 09 '13 at 13:58
@MartijnPieters Thank you in advance. If I need all the text in the "textarea", what should I do? — Frank Xayachack, May 09 '13 at 14:58
@FrankXayachack: That's what the first piece of code does. You can ignore the rest about extracting just the Lao codepoints in that case. `soup.find('textarea', id='lao').string` gives you all the text inside the textarea with the `id="lao"` attribute. — Martijn Pieters, May 09 '13 at 17:08
