
I need help with a couple of things. I am new to NLP and unstructured data cleaning. Can someone answer the following questions? Thanks.

  1. I need help with a regex that identifies words like `_male and female_`, or more generically `_word and word_` or `_something_something_something`, and gets rid of the underscore at the beginning or the end but not the ones in the middle.
  2. I would like to know the formal process for cleaning data: are there steps that we have to follow when cleaning unstructured data? I am asking because I am doing lemmatization (with POS) and replacing commonly co-occurring words like (something, something) with something_something. What steps should I follow? Right now I am doing the following: tokenize_clean > remove_numbers > remove_url > remove_slash > remove_cross > remove_garbage > replace_hypen_with_underscore > lemmatize_sentence > change_words_to_bigrams > remove_smaller_than_3 (words shorter than 3 characters) > remove_simlutaneous (words repeated consecutively many times, e.g. death death death) > remove_location > remove_bullets > remove_stop > remove_simlutaneous

Should I do something different in these steps?

  3. I also have words like `group'shealthplanbecauseeitheroneofthefollowingqualifyingeventshappens`, `whenyouuseanon_networkprovider`, `per\xad`, and `vlfldq\x10vxshuylvhg`. How should I handle them? Should I ignore them completely or try to repair them?

My final goal is to classify the documents into Yes and No classes. Any suggestions are welcome.

I will provide more examples and explanation if required.

Karan Kothari
  • There is no "formal" corpus clean-up process. You should check the corpora you have, compile a list of the "problematic things", then think of what you want to do: 1) fix (if the number of segments affected is large), 2) remove (if the quantity is not that big). It is easier to remove the "bad" data than fix unless there is a clear and safe automatic way. The rest of your question is too broad. Start working on the clean up, and then come back with more concrete questions. – Wiktor Stribiżew Dec 21 '16 at 15:21
  • @WiktorStribiżew Thanks. Can you help with the regex in the first bullet? – Karan Kothari Dec 21 '16 at 15:22
  • Maybe [`\b(?:(\w+)_+|_+(\w+))\b` -> `$1$2`](https://regex101.com/r/LceAN6/1)? Or [`\b_*(\w+?)_*\b` -> `$1`](https://regex101.com/r/LceAN6/2). – Wiktor Stribiżew Dec 21 '16 at 15:28
  • https://docs.python.org/2/library/re.html contains a complete list of regex classes you can use. For example, to remove underscores before words, as in `_word`, you could use `re.sub(r'(^|\s)_(\S)',r'\1\2','_word')`. For underscores after words, as in `word_`: `re.sub(r'(\S)_($|\s)',r'\1\2','word_')`. See also http://stackoverflow.com/questions/525635 – akraf Dec 21 '16 at 15:36
  • @WiktorStribiżew that site in your regex links is quite useful. Thanks. In Python, you would translate `\b_*(\w+?)_*\b` -> `$1` as `re.sub( r'\b_*(\w+?)_*\b' , r'\1' , THESTRINGTOCLEAN )`. – akraf Dec 21 '16 at 15:40
  • @WiktorStribiżew is right that there is no formal way, but there are tried-and-tested ways (Disclaimer: shameless plug ahead), e.g.: https://gist.github.com/alvations/55d78f627ac8bac0bf34 – alvas Dec 22 '16 at 01:14

1 Answer

  1. Must the regular expression also handle something like `__abc__`? If not: `(\b_[a-zA-Z]+\s)|(\s[a-zA-Z]+_\b)|(\s_[a-zA-Z]+_\b)`

  2. What problem are you solving? Are you preparing texts for classification, etc.?

  3. You have to distinguish mistakes from symbol sequences. There are established ways to do this, for example comparing tokens against words from corpora, annotated suffix trees, etc.
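
One way to read point 3 is as a vocabulary check. The following is a minimal sketch only, assuming a word list is available. `KNOWN_WORDS` and `is_clean_token` are hypothetical names, and the tiny set here stands in for a real corpus vocabulary. It rejects tokens that contain control characters or soft hyphens (such as `\x10` or `\xad`) or whose underscore-separated parts are not all known words.

```python
import re

# Hypothetical stand-in vocabulary; in practice this would be loaded
# from a word list or built from a reference corpus.
KNOWN_WORDS = {"group", "health", "plan", "network", "provider",
               "when", "you", "use"}

# Control characters (0x00-0x1f) and the soft hyphen (0xad).
DEBRIS = re.compile(r'[\x00-\x1f\xad]')


def is_clean_token(token, vocabulary=KNOWN_WORDS):
    """Return True if every underscore-separated part is a known word."""
    if DEBRIS.search(token):
        return False
    parts = token.lower().split('_')
    return all(part in vocabulary for part in parts)


tokens = ["health_plan", "per\xad", "vlfldq\x10vxshuylvhg",
          "whenyouuseanon_networkprovider", "provider"]
kept = [t for t in tokens if is_clean_token(t)]
print(kept)  # ['health_plan', 'provider']
```

Tokens that fail the check can then be removed or passed to a repair step, depending on how many of them there are.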

Dmitry
  • 1. I used this for the regular expression: `def remove_underscore(text): text = re.sub(r'(^|\s)_(\S)',r'\1\2',text); return re.sub(r'(\S)_($|\s)',r'\1\2',text)`. I am preparing text for classification and wanted to know whether I am losing features this way, or whether it's OK to let the bad data go. – Karan Kothari Dec 21 '16 at 16:16
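
Laid out over several lines, the `remove_underscore` function from the comment above looks like this. The second function, `strip_edge_underscores`, is a hypothetical variant (not from the thread) that also handles runs of underscores such as `__abc__`, using lookarounds on whitespace-delimited tokens. Treat it as a sketch, not a definitive solution.

```python
import re


def remove_underscore(text):
    # Drop one underscore at the start of a whitespace-delimited token.
    text = re.sub(r'(^|\s)_(\S)', r'\1\2', text)
    # Drop one underscore at the end of a whitespace-delimited token.
    return re.sub(r'(\S)_($|\s)', r'\1\2', text)


def strip_edge_underscores(text):
    # Hypothetical variant: remove any run of underscores that begins
    # a token (nothing but whitespace or string start before it)...
    text = re.sub(r'(?<!\S)_+', '', text)
    # ...or ends a token (nothing but whitespace or string end after it).
    return re.sub(r'_+(?!\S)', '', text)


print(remove_underscore("_male and female_"))       # male and female
print(remove_underscore("word_word"))               # word_word
print(remove_underscore("__abc__"))                 # _abc_
print(strip_edge_underscores("__abc__ word_word"))  # abc word_word
```

Both versions rely on whitespace boundaries, so an underscore next to punctuation, as in `(_word)`, is left untouched.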