I need help with a couple of things. I am new to NLP and to cleaning unstructured data. Can someone answer the following questions? Thanks.
- I need help with a regex to identify words like _male and female_, or more generically _word and word_ or _something_something_something, and remove an underscore that appears at the beginning or the end of a word, but not one in the middle.
- I would like to know whether there is a formal process for cleaning unstructured data, i.e. a standard set of steps to follow. I am asking because I am doing lemmatization (with POS tags) and replacing commonly co-occurring words like (something, something) with something_something. What steps should I follow? Right now my pipeline is: tokenize_clean > remove_numbers > remove_url > remove_slash > remove_cross > remove_garbage > replace_hypen_with_underscore > lemmatize_sentence > change_words_to_bigrams > remove_smaller_than_3 (words shorter than 3 characters) > remove_simlutaneous (words repeated back to back many times, e.g. death death death) > remove_location > remove_bullets > remove_stop > remove_simlutaneous
Should I do something different in these steps?
- I also have tokens like (group'shealthplanbecauseeitheroneofthefollowingqualifyingeventshappens), (whenyouuseanon_networkprovider), (per\xad), (vlfldq\x10vxshuylvhg). How should I handle them? Should I ignore them completely or try to repair them?
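To make the first question concrete, this is roughly the behaviour I am after. It is only a sketch: the name `strip_edge_underscores` is made up for illustration, and it assumes tokens are separated by whitespace (so an underscore followed by punctuation, as in `female_,`, would not be touched).

```python
import re

# Match underscores only at the edge of a whitespace-delimited token:
#   (?<!\S)_+  -> underscores right after whitespace or at the start of the text
#   _+(?!\S)   -> underscores right before whitespace or at the end of the text
# Underscores between two non-space characters are never matched.
EDGE_UNDERSCORES = re.compile(r"(?<!\S)_+|_+(?!\S)")


def strip_edge_underscores(text: str) -> str:
    """Remove leading/trailing underscores from each token, keep inner ones."""
    return EDGE_UNDERSCORES.sub("", text)


print(strip_edge_underscores("_male and female_ _something_something_something"))
```

If the text is already tokenized, calling `token.strip("_")` on each token should give the same result without a regex.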
My final goal is to classify the documents into Yes and No classes. Any suggestions are welcome.
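To illustrate the pipeline question, here is a minimal sketch of two of the listed steps chained in order. The implementations are assumptions written for this example, not the actual functions: `remove_smaller_than_3` is taken to drop tokens shorter than 3 characters, `remove_simultaneous` (the step spelled remove_simlutaneous in the list) is taken to collapse back-to-back repeats into one token, and `run_pipeline` is a hypothetical helper.

```python
from itertools import groupby


def remove_smaller_than_3(tokens):
    """Drop tokens shorter than 3 characters (assumed meaning of the step)."""
    return [t for t in tokens if len(t) >= 3]


def remove_simultaneous(tokens):
    """Collapse runs of the same token repeated back to back into one token."""
    return [key for key, _ in groupby(tokens)]


def run_pipeline(tokens, steps):
    """Apply each cleaning step in order; every step maps a token list to a token list."""
    for step in steps:
        tokens = step(tokens)
    return tokens


tokens = "death death death of a an insured person person".split()
print(run_pipeline(tokens, [remove_smaller_than_3, remove_simultaneous]))
```

The order of steps can change the result: removing short words or stop words may place two identical tokens next to each other (for example "death of death"), so a collapse step that runs earlier would miss them.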
I will provide more examples and explanation if required.
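As an example of the odd tokens mentioned above, this is a rough sketch of how they could at least be flagged before deciding whether to drop or repair them. The function name `looks_garbled` and the length cut-off of 25 characters are assumptions chosen for illustration, not tuned values.

```python
import unicodedata

MAX_TOKEN_LEN = 25  # assumed cut-off; ordinary words and bigrams are usually shorter


def looks_garbled(token: str, max_len: int = MAX_TOKEN_LEN) -> bool:
    """Flag tokens that are suspiciously long or contain invisible characters."""
    if len(token) > max_len:
        # Likely several words glued together with the spaces lost.
        return True
    # Unicode categories starting with "C" include control characters (Cc),
    # such as \x10, and format characters (Cf), such as the soft hyphen \xad.
    return any(unicodedata.category(ch).startswith("C") for ch in token)


samples = [
    "group'shealthplanbecauseeitheroneofthefollowingqualifyingeventshappens",
    "whenyouuseanon_networkprovider",
    "per\xad",
    "vlfldq\x10vxshuylvhg",
    "something_something",
]
for token in samples:
    print(repr(token), looks_garbled(token))
```

Flagged tokens could then be counted per document to see how common they are before choosing between ignoring them and trying to split or decode them.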