I want to scrape data from a W-2 form (PDF) so that I can save it to a database, but I am not able to get field-wise data.

  1. I have tried “Read PDF Text”, which reads the whole document and fetches all the text, but I want to find field-wise values such as:

    Employee’s social security number => 1234 56 7890

    Employer identification number => 11-22334455

  2. I have tried “Screen Scraping” and “Data Scraping”, but I am not able to get any specific element.

  3. I have tried the “Anchor Base” activity with “Find Image” and “Get Text”, but I am not able to select a specific element.

Please find the attached PDF document for your reference.

W2 Form pdf

Any help will be appreciated.

Thanks.

SK IRT

1 Answer

This is a fully readable .pdf file, so this shouldn't be a problem to achieve. You have to read the document text and then use a regex to find what you want. The Social security number and the Identification number are fairly structured data, so you can build a regular expression for them easily. https://regex101.com/ can be helpful for this.

You have to:

  1. Use the Read PDF Text activity to get the text of the .pdf.
  2. Add an Assign activity and create a new variable of type System.Text.RegularExpressions.Match.
  3. Import the namespace System.Text.RegularExpressions.
  4. On the right side of the Assign, use Regex.Match(readedText, "\d{2}-\d{8}"). The quoted string is the regular expression for the Employer identification number.
  5. If UiPath reports that 'Regex' is not declared, save the workflow, close it, reopen it, import the namespace again, then delete the Assign activity and create it again.
  6. That's all. You can find the second number in the same way.
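
For trying the patterns out away from UiPath, here is a minimal Python sketch of the same regex idea. The sample text, the SSN pattern (based on the 4-2-4 digit grouping written in the question) and the helper name `first_match` are assumptions for illustration only; the EIN pattern is the one from step 4.

```python
import re

# Hypothetical text, standing in for the output of Read PDF Text.
sample_text = (
    "a Employee's social security number 1234 56 7890\n"
    "b Employer identification number 11-22334455\n"
)

# Pattern from step 4: two digits, a hyphen, eight digits.
EIN_PATTERN = re.compile(r"\d{2}-\d{8}")
# Assumed pattern for the SSN as written in the question: 4-2-4 digit groups.
SSN_PATTERN = re.compile(r"\d{4} \d{2} \d{4}")


def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    found = pattern.search(text)
    return found.group(0) if found else None


ein = first_match(EIN_PATTERN, sample_text)
ssn = first_match(SSN_PATTERN, sample_text)
print(ein, ssn)
```

In UiPath the equivalent check is the `Success` property of the `Match` variable before reading its `Value`.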

Edit: example.xaml

p0tfur
  • Hello p0tfur, thanks for your valuable help, I really appreciate it. But you are pointing to the same pattern-matching technique, which will not be the right answer since our input is parsed PDF data that is going to vary. SSN and EIN are two fields with a fixed/standard pattern, but suppose both appear in the parsed PDF data as plain 9-digit numbers (without any characters/spacing): how can we differentiate between them? – SK IRT Jun 07 '19 at 03:46
  • Also, if you look at the PDF, fields 1 to 12 and the remaining fields also contain only numbers. We cannot easily detect them by regex pattern matching; we need a sound, foolproof technique (Anchor Base, which the video https://youtu.be/jncjBCY4Auw talks about) that deals with a specific element/anchor. Can you please suggest a way to use the Get Text method as the action for the Anchor Base activity? (In our case the Get Text method is not working since it treats the document as an image.) – SK IRT Jun 07 '19 at 03:46
  • The data in the table of this document does not seem to be selectable (the data below the table is OK), so this is probably the issue. You can try Recording -> Desktop -> Text -> Scrape -> Scrape Relative. It probably will not work with the 'Native' option, but it should with 'OCR'. The question is whether this is fine for you. – p0tfur Jun 07 '19 at 09:13
  • If you are sure that the size of the form (resolution and so on) will always be the same, you can use Screen Scraping without finding a relative image. – p0tfur Jun 07 '19 at 09:19
  • I have already tried "OCR"/"Full Text", but it did not work: it returns an empty string for any element selection. – SK IRT Jun 07 '19 at 11:01
  • In the post above I added an example for you, where I used Find relative image and Get OCR Text, and this works. – p0tfur Jun 07 '19 at 14:14
  • I have tried "Tesseract OCR" (since I don't have GoogleOCR), but I am getting the error "Find Image 'document': Activity timeout exceeded". – SK IRT Jun 10 '19 at 04:31
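
The concern raised in the comments (two bare 9-digit numbers cannot be told apart by pattern alone) can, in principle, be handled on the text side by anchoring on the field label instead of the number format. The Python sketch below is only an illustration under assumptions: it presumes the extracted or OCR text keeps each label followed by its value, and the sample text and the helper `value_after_label` are hypothetical, not part of UiPath.

```python
import re

# Hypothetical OCR output in which both numbers are bare 9-digit strings.
ocr_text = (
    "Employee's social security number\n"
    "123456789\n"
    "Employer identification number\n"
    "987654321\n"
)


def value_after_label(text, label, value_pattern=r"\d[\d\- ]*\d"):
    """Return the first value_pattern match that follows label, or None."""
    # Escape the label so it is matched literally, then skip any non-digit
    # characters (punctuation, spaces, newlines) between label and value.
    regex = re.compile(
        re.escape(label) + r"\D*?(" + value_pattern + r")",
        re.IGNORECASE,
    )
    found = regex.search(text)
    return found.group(1) if found else None


ssn = value_after_label(ocr_text, "social security number")
ein = value_after_label(ocr_text, "Employer identification number")
print(ssn, ein)
```

This only helps once some text is actually returned; it does not address the empty OCR result or the Find Image timeout reported above.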