How to create an OCR dataset?

Question

I'm just a beginner in machine learning. So far I've only learnt supervised learning, with some basic image classification and regression problems. I've just done an image classification problem with sklearn's load_digits(), which has about 1800 images of the digits 0-9 (description of the dataset). What I want to do is make my own dataset instead of loading it from sklearn like this:

from sklearn.datasets import load_digits

I want to use my own dataset. Can someone guide me on how to make my own dataset, in CSV or any other format, so that I can use it with supervised machine learning techniques?

score 1 · Answer 1 · answered Nov 17 '20 at 06:16

The first thing is to understand your use case. There is a difference between OCR and image classification tasks. Let's look at both scenarios.

Image Classification: The task is similar to the standard supervised tasks you might have seen in ML, only in this case we classify images instead of rows of data in a sheet. Data curation is one of the major tasks in image classification, and the accuracy you get depends heavily on how you prepare your data. Let's say that, given an image, you want to identify whether it shows a dog or a cat. This would require you to collect at least 500 images each of different types of dogs and cats. You can also create images artificially by taking an image of a dog, using the Python OpenCV library to add some noise or rotation, and saving the updated image. This way you can collect more images in a short span of time. Once you have images for all the categories you want to classify (dogs and cats), you can move on to model selection. CNNs (Convolutional Neural Networks) are considered to be the best fit for image classification tasks, but creating them from scratch and tuning them can take a long time. My advice would be to use the TensorFlow Object Detection API, which provides a good framework for beginners to build their own image classifier or object detector, with many pre-trained models to choose from. https://github.com/tensorflow/models/tree/master/research/object_detection
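The augmentation idea above can be sketched roughly as follows. This is a minimal illustration that uses NumPy only, so it runs without OpenCV; the function name `augment` and the `noise_std` parameter are made up for this example. It is limited to 90-degree rotations and a flip. For arbitrary angles, OpenCV's `cv2.getRotationMatrix2D` together with `cv2.warpAffine` would be the usual route.

```python
import numpy as np


def augment(image, rng, noise_std=10.0):
    """Return a few altered copies of one grayscale uint8 image.

    Hypothetical helper for illustration. It assumes the label is not
    changed by rotating or flipping (true for "dog" vs "cat", not
    necessarily for digits, where a rotated 6 looks like a 9).
    """
    copies = []

    # Rotations by 90, 180 and 270 degrees.
    for quarter_turns in (1, 2, 3):
        copies.append(np.rot90(image, quarter_turns))

    # Mirror image (left-right flip).
    copies.append(np.fliplr(image))

    # Gaussian noise, clipped back into the valid 0-255 pixel range.
    noisy = image.astype(np.float64) + rng.normal(0.0, noise_std, image.shape)
    copies.append(np.clip(noisy, 0, 255).astype(np.uint8))

    return copies
```

Calling `augment(img, np.random.default_rng(0))` on one image gives five extra training images. Each copy keeps the label of the original, so the dataset grows without any extra labelling work.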
OCR: OCR is one of the more complex applications of image classification, and it is not easy to build from scratch. The example you mention in your question looks like OCR, but it is really an image classification task, since you have a single image of each character that you are trying to classify. Real-world OCR would involve things like handwritten notes and extracting the text written in them into your system, which is a complicated process. There are prebuilt libraries such as Tesseract that specialize in OCR: they take an input image containing text and return that text as a string. However, these libraries tend to fail on handwritten text, which is much more difficult to read. If you are interested in building an OCR system from scratch, it will require a great deal of image processing. Let's say you have an image of a phone number written by someone. Your OCR system would first have to detect each digit separately by drawing detection boxes around each digit in the image (you can use the TensorFlow Object Detection API mentioned above). If the image contains letters, digits, and symbols together, the task becomes more complex, because you first have to collect individual images of every letter, digit, and symbol, which can be tough. My advice, again, would be to use APIs that are free and also quite accurate. I used the Microsoft Cognitive Vision API, which has an OCR function that detects any type of text in an image. This reduces your effort to just cleaning the image properly.
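To give a feel for the "detect each digit separately" step, here is a deliberately simple sketch. It does not use the TensorFlow Object Detection API. Instead it swaps in a much cruder technique, a column projection, which only works under the assumptions stated in the code: dark ink on a light background, and characters that do not touch or overlap horizontally. The function names and the `threshold` default are made up for this example.

```python
import numpy as np


def _box(ink, left, right):
    """Tight (left, top, right, bottom) box around the ink in columns left..right-1."""
    rows = np.flatnonzero(ink[:, left:right].any(axis=1))
    return (left, int(rows[0]), right, int(rows[-1]) + 1)


def find_character_boxes(image, threshold=128):
    """Return one box per character, ordered left to right.

    Assumptions (illustrative only): `image` is a 2-D grayscale array,
    pixels darker than `threshold` are ink, and neighbouring characters
    are separated by at least one ink-free column.
    Right and bottom coordinates are exclusive.
    """
    ink = image < threshold
    column_has_ink = ink.any(axis=0)

    boxes = []
    start = None
    for x, has_ink in enumerate(column_has_ink):
        if has_ink and start is None:
            start = x                      # a character begins here
        elif not has_ink and start is not None:
            boxes.append(_box(ink, start, x))  # a character just ended
            start = None
    if start is not None:                  # character touching the right edge
        boxes.append(_box(ink, start, len(column_has_ink)))
    return boxes
```

Each box can then be cropped with `image[top:bottom, left:right]`, resized to a fixed size, and passed to a single-character classifier like the one trained on load_digits(). Handwriting with touching or slanted characters breaks the assumptions above, which is part of why real OCR is hard.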

Thanks @Rohan for your explanation. I can clearly understand it. But I was eager to know: can I make a dataset in a CSV file for any kind of image classification task? If yes, then how? — Koushik, Nov 17 '20 at 06:54
@Koushik If you are working with grayscale images, you can open those images in Python, which gives you a NumPy array of pixel values for each image. You can then export those arrays to CSV. Let's say you have 10 images. You can start a loop in which the program reads one image at a time from the directory, stores it as a NumPy array, and appends that array (flattened) as a row in an overall dataframe. As the loop progresses, each image is read as a NumPy array and appended as a row in the dataframe. At the end, you can export the dataframe in CSV format. — Rohan Khurana, Nov 18 '20 at 08:03
Thanks Rohan for your answer. Thanks a lot, bro. One more thing: I'm actually a beginner in ML and I need some guidance from experts. I don't know how you will react, but can I have your email ID so that I can contact you if I get into any trouble? It would be a great help for me. — Koushik, Dec 07 '20 at 06:33
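The loop Rohan describes in the comments could look roughly like the sketch below. It is a minimal version under a few assumptions: all images are grayscale and have the same size, the images are already loaded as NumPy arrays (reading them from a directory could be done with, for example, `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`), and plain NumPy is used instead of a dataframe. The function names and the `label`/`pixelN` column names are made up for this example.

```python
import numpy as np


def images_to_rows(images, labels):
    """Flatten each image into one row: [label, pixel0, pixel1, ...]."""
    rows = []
    for image, label in zip(images, labels):
        pixels = np.asarray(image, dtype=np.uint8).ravel()
        rows.append(np.concatenate(([label], pixels)))
    # vstack requires every row to have the same length,
    # i.e. every image must have the same height and width.
    return np.vstack(rows)


def save_dataset(path, images, labels):
    """Write the images and their integer labels to a CSV file with a header."""
    table = images_to_rows(images, labels)
    n_pixels = table.shape[1] - 1
    header = ",".join(["label"] + [f"pixel{i}" for i in range(n_pixels)])
    np.savetxt(path, table, fmt="%d", delimiter=",", header=header, comments="")


def load_dataset(path):
    """Read the CSV back as (X, y), the same layout load_digits() gives as data/target."""
    table = np.loadtxt(path, delimiter=",", skiprows=1, dtype=np.int64, ndmin=2)
    return table[:, 1:], table[:, 0]
```

After `save_dataset("my_digits.csv", images, labels)`, calling `X, y = load_dataset("my_digits.csv")` gives a feature matrix with one row per image and a label vector, which can be passed to an sklearn estimator's `fit(X, y)` in the same way as the `data` and `target` arrays from load_digits().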
