OCR, tesseract.js: How do I match values to labels?

Question

I'm using tesseract.js to extract text from a W2 form. I'm having trouble figuring out how to match the values on the form to their labels. For example, how can I match the label 'employee social security number' with the actual social security number value?

If you know the W2 image will always have the same layout, you can split the image up and run tesseract on each piece. That's what I did in one of my side-projects: https://github.com/SidneyNemzer/siege-stats/blob/master/src/components/App.js#L82-L151. You'll have to manually determine the coordinates of each field, of course. I just did that by trial-and-error. — Sidney, Sep 13 '17 at 23:01
@Sidney Thanks for the response. Can you explain what the `subCanvas` and `subContext` are for? — jackjoesmith, Sep 14 '17 at 20:44
Are you familiar with [canvas](https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API)? In my app, canvas is used to grab pieces of an image, then pass them to Tesseract. There's a large canvas (`scoreboardCanvas`) that holds the main image. The `subCanvas` holds the sub-section of the image that's currently being worked on. It's reused for every piece (there are 10 'subsections' in my app). A `context`, with regard to canvas, is the object that can manipulate the pixels of the canvas. — Sidney, Sep 14 '17 at 21:02
@Sidney Thanks! I took your advice and cut up the image with canvas. Then I scanned each piece individually. — jackjoesmith, Sep 18 '17 at 20:11
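
The approach described in the comments above (crop each field out of the full image with a reused sub-canvas, OCR each crop, and key the result by label) might look roughly like the sketch below. It is not the code from the linked project. The `FIELDS` map, its coordinates, and the helper names `toPixels` and `readFields` are all hypothetical, and the coordinates are placeholders rather than real W2 positions. The OCR call is passed in as a `recognize` function so the sketch does not assume a particular tesseract.js version.

```javascript
// Hypothetical field map: label -> crop rectangle, expressed as fractions of
// the full image size. These numbers are placeholders, NOT real W2 positions;
// they would have to be found by trial and error for the actual layout.
const FIELDS = {
  employeeSsn: { x: 0.25, y: 0.05, w: 0.30, h: 0.06 },
  wages: { x: 0.55, y: 0.12, w: 0.20, h: 0.06 },
};

// Convert a fractional rectangle into whole-pixel coordinates.
function toPixels(rect, imageWidth, imageHeight) {
  return {
    x: Math.round(rect.x * imageWidth),
    y: Math.round(rect.y * imageHeight),
    w: Math.round(rect.w * imageWidth),
    h: Math.round(rect.h * imageHeight),
  };
}

// Crop every field out of `sourceCanvas` into one reused `subCanvas`,
// run OCR on the crop, and store the recognized text under the field's label.
// `recognize` is assumed to be any function (canvas) => Promise<string>.
async function readFields(sourceCanvas, fields, subCanvas, recognize) {
  const subContext = subCanvas.getContext('2d');
  const values = {};
  for (const [label, rect] of Object.entries(fields)) {
    const { x, y, w, h } = toPixels(rect, sourceCanvas.width, sourceCanvas.height);
    // Resizing a canvas also clears it, so the previous crop is discarded.
    subCanvas.width = w;
    subCanvas.height = h;
    // Copy the (x, y, w, h) region of the source to the top-left of subCanvas.
    subContext.drawImage(sourceCanvas, x, y, w, h, 0, 0, w, h);
    values[label] = (await recognize(subCanvas)).trim();
  }
  return values;
}
```

In a browser, `subCanvas` could come from `document.createElement('canvas')`, and `recognize` could be a thin wrapper around `Tesseract.recognize` that returns the recognized text. The shape of the result object differs between tesseract.js releases, so that wrapper should be checked against the installed version. Because each crop is tied to a known label, the matching problem goes away: the label is whichever key the crop was read for.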
