0

I have a large data set of Insurance Claims data with 2 columns. One column is a claim identifier. The other is a large string of notes that go with the claim.

My goal is to text mine the Claims Notes for a specific VIN number. Typically a VIN# is in a 17 digit format. See Below: https://www.autocheck.com/vehiclehistory/autocheck/en/vinbasics

However, with my data, some issues arise. Sometimes only the last 6 digits were input for a VIN#. I basically need a way to process my data and grab anything that looks like a 17 digit VIN Number and return it to that row of data. I am using Python 3 and am a rookie text miner but have some basic experience using regular expressions.

I am trying to create a function in python in which I can lambda apply it to the column of notes.

Attempt so far:

C_Notes['VIN#s'] = C_Notes['ClaimsNotes'].str.findall(r'[0-9]{1}[0-9a-zA-Z]{16}') 

I am trying to mimic the format of the VIN in the link I provided.

So something that looks for a string with following qualities:

EDIT: Changed code snippet. This code example works if I make some toy examples of VINs with made up text but I am not having any success iterating through my pandas column. Each row entry has a large paragraph of text I want the function to go through each row at a time.

Thank you.

Coldchain9
  • 1,373
  • 11
  • 31

2 Answers2

1

But with which VIN system exactly you are dealing?

Wikipedia has article describing 17-digit VIN number, describing three different systems: ISO 3779, European and North American.

Apparently there is not general formal rule dictating what (onlyletter/onlydigit/letterordigit) occupies which position.

First 3 characters depend on manufacturer country/region and first of them is digit for Oceania and both Americas but letter for everyone else.

For North American 9th, 13th, 14th, 15th, 16th and 17th positions are always digits and letters I,O,Q are never used.

Taking in account above considerations following pattern could be used:

[0-9][0-9A-Za-z^IiOoQq]{7}[0-9][0-9A-Za-z^IiOoQq]{3}[0-9]{5}

Which consist of digits ([0-9]) and letterordigits but which are not prohibited characters. ^ inside [] mean that blacklist of characters will follow.

As more general note I advise against guessing regular expression based on limited subset of legal strings.

Daweo
  • 31,313
  • 3
  • 12
  • 25
  • I am in North America. I have edited my regexp to be less strict in the following: '[0-9]{1}[0-9a-zA-Z]{16}' , however I am having issues applying it to a pandas data frame column. Im not getting any matches and I believe I should. Ive tried using re module and also str.extract on the column... not sure what I need. – Coldchain9 May 15 '19 at 18:44
  • Sadly I can not help you there as I do not know `pandas`, but adding `pandas` tag to your question should make it easier to spot for users more competent in this area. – Daweo May 15 '19 at 18:56
0

Well, your code regex doesn't work because violates your specifics. Try with:

^[0-9][a-zA-Z]{2}[0-9a-zA-Z]{5}[a-zA-Z]{3}[0-9]{6}
Neb
  • 2,270
  • 1
  • 12
  • 22