I'm trying to get the rows of an Excel document. This is what I have achieved so far:

1-. Retrieve .xls, .xlsx files
2-. Convert those files to TIFF images
3-. Enhance image for better text recognition
4-. Identify Pages
5-. Create the Documents
6-. Recognize Page and Fields
7-. Populate Fields (this is where my problem is)

For example, in a table like

Name   | Age | Size
Juan   | 26  | 1.90m
Max    | 25  | 1.85m
Victor | 26  | 1.65m

My project can find the keywords Name, Age and Size, and in the settings I can tell it that the value is one line down and that the leading and trailing words should be grouped. However, it only fills the fields Name, Age and Size with the first values below the headers and ignores the other rows, and Datacap does not seem to have an array field type.
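For illustration, the desired outcome can be sketched outside Datacap in plain Python: one record per table row instead of a single scalar value per header. This is not Datacap code; the function name `parse_rows` and the pipe-delimited input are assumptions taken from the example table above.

```python
# Illustrative only: plain Python, not a Datacap action or ruleset.
TABLE_TEXT = """\
Name   | Age | Size
Juan   | 26  | 1.90m
Max    | 25  | 1.85m
Victor | 26  | 1.65m
"""


def parse_rows(text):
    """Return one dict per data row, keyed by the header cells."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        # Pair each cell with the header of its column.
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_rows(TABLE_TEXT)
for row in rows:
    print(row)
```

Every row becomes its own record, which is the "array of values" behaviour that scalar fields cannot express.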

enter image description here

In the image, you can see that there is only one way to add fields, and they are scalar (just one value). "Add multiple" only adds several fields at once, not a field that holds multiple values.

This is how my fields get retrieved:

enter image description here

Another problem I face is that my Excel sheet gets split to fit a document page format. I was expecting the whole sheet to be converted into 1 document, not 4.

enter image description here

In the image, those 4 pages all come from the same sheet in the Excel file.

The IBM docs still lack information; some pages have only a title and no content at all.

YouHaveNoSkill

1 Answer

Agreed on point 1: Datacap does not support an array-like field type, which is a more advanced feature. It is really needed, and we may see something from IBM in the future.

Coming to the second point: Datacap converts the Excel file according to its print pages, the same way it would be split if you printed it. You have to add a ruleset to merge those pages into a single file. The most common way to do that is to use tiffmerge, which Datacap provides out of the box.

Krunal Barot
  • Hi, I managed to create a single Excel file using the SpireXLS C# library, which lets me create images from XLS files. I solved that issue with the library through Custom Actions in C#, then loaded the generated image into Datacap. You are right, it renders the Excel file as if it were going to be printed. The issue is that this only works for Excel files; PDFs are already in print format, which makes it difficult to know where a table ends, whether the second page is part of the table, and whether it holds leading rows or leading columns. – YouHaveNoSkill Feb 06 '19 at 19:11
  • I solved the first problem using FindTableValueRegex, which lets me find a header, then the value I want, then another header where the data will be extracted, using regex. That way I realised it is possible to create fields within fields, because it generated something like (Document > Page > row1 > name | age, row2 > name | age), where row, name and age are fields. – YouHaveNoSkill Feb 06 '19 at 19:11
  • Cool, the first problem is solved. For the PDF issue you mentioned, OCR has to be used, and there is a sample project (an app shipped with Datacap) that mentions using OCR/OMR for the case where a table is split across 2 pages. – Krunal Barot Feb 07 '19 at 09:33
  • If this answer is useful, please consider ticking it as the answer. Thanks – Krunal Barot Feb 07 '19 at 12:08
  • I will give it a try to see if the example app can solve that kind of problem. I think there should be some type of AI to solve ambiguous cases like this: how can we know whether the next page holds more rows or more columns for the previous page, or, when the next page has headers, whether it is a different table or part of the previous one? – YouHaveNoSkill Feb 07 '19 at 16:39
  • Yes, the latest version of Datacap with the cognitive add-on provides that feature, but it is way too expensive as it uses Watson skills. – Krunal Barot Feb 08 '19 at 10:52
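
The header-to-header regex idea from the comments, and the nested layout it produced (row1 > name | age, row2 > name | age), can be sketched in plain Python. This is only an illustration of the technique, not the FindTableValueRegex action or its parameters; the sample text, the closing "Totals" header, the row pattern and the function name `extract_rows` are all assumptions.

```python
import re

# Hypothetical recognized text; "Totals" stands in for the header that ends the table.
PAGE_TEXT = """\
Name Age Size
Juan 26 1.90m
Max 25 1.85m
Victor 26 1.65m
Totals
"""

# One data row: a word, an integer age, and a height such as 1.90m.
ROW_PATTERN = re.compile(
    r"^(?P<name>\w+)\s+(?P<age>\d+)\s+(?P<size>\d+\.\d+m)$",
    re.MULTILINE,
)


def extract_rows(text, start_header, end_header):
    """Collect rows found between two headers as nested row fields."""
    start = text.find(start_header)
    if start == -1:
        return {}
    start += len(start_header)
    end = text.find(end_header, start)
    # Without an end header, read to the end of the text.
    body = text[start:] if end == -1 else text[start:end]
    return {
        f"row{index}": match.groupdict()
        for index, match in enumerate(ROW_PATTERN.finditer(body), start=1)
    }


fields = extract_rows(PAGE_TEXT, "Name Age Size", "Totals")
for row_name, values in fields.items():
    print(row_name, values)
```

Keying the result by `row1`, `row2`, and so on mirrors the fields-within-fields structure, so each row keeps its own name, age and size values.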