-2

I have around 300k unstructured data as below screen.I'm trying to use Google refine or OpenRefine to make this correct. However, I'm unable to find a proper way to do this. I'm new to this tool. Anyone's help would be greatly appreciated.Also, this tool is quite slow to process 300k records. If I am trying out something its taking lots of time to process and give an output.

OR Please suggest any other opensource tools and techniques do this?

enter image description here

AskMe
  • 2,495
  • 8
  • 49
  • 102
  • Could you copy-paste a dozen messy data, so that we can import them into Open Refine? Subsidiary question: Is it just to eliminate the part that looks like a telephone number? – Ettore Rizza Mar 26 '18 at 18:53
  • There are phone numbers along with driving licence, passport, ID, phone number with extension etc (in a very odd format). Please note that the name has the number also(which should not be removed). Refer screenshot please. – AskMe Mar 26 '18 at 19:09
  • Question is now updated with few data. – AskMe Mar 26 '18 at 19:25
  • Sorry, I'm afraid that if you don't know how to export 15 lines in CSV and copy-paste them, I'm not sure you'll be able to clean this ugly file. – Ettore Rizza Mar 26 '18 at 19:41
  • Ha Ha...Please see now.Im sorry.The file may be available for small duration. – AskMe Mar 26 '18 at 19:59
  • I wouldn't expect OpenRefine to be particularly slow for 300k records but OpenRefine scales with the amount of memory allocated so please see https://github.com/OpenRefine/OpenRefine/wiki/FAQ%3A-Allocate-More-Memory to increase the memory available To help with the overall problem I think more detail is needed. Especially what problems specifically you are trying to solve and what you have tried so far - I'm afraid "how can I clean this data" is too broad a question to answer. Recommend using OpenRefine discussion group for general Qs https://groups.google.com/forum/#!forum/openrefine – Owen Stephens Mar 27 '18 at 08:28

1 Answers1

0

As Owen said in comments, your question is probably too broad and cannot receive acceptable answer. We can just provide you with a general procedure to follow.

In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US". Not even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure is an organisation name).

On the basis of your fifteen lines, however, we seem to distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character suites without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:

value.trim().replace(/\b\w+#\w+\b$/,'')

Here is a screencast if it's not clear to you.

enter image description here

But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.

if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)

And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 numbers and one - between them)

Feel free to check out the Open Refine documentation in case of doubt.

Ettore Rizza
  • 2,800
  • 2
  • 11
  • 23