
How can I train the EAST text detector on my custom data? I couldn't find any blog that shows a step-by-step procedure for doing this. Here is what I currently have.

I have a folder that contains all the images, plus a corresponding XML file for each image that tells where the text is located.

Example:

<annotation>
    <folder>Dataset</folder>
    <filename>FFDDAPMDD1.png</filename>
    <path>C:\Users\HPO2KOR\Desktop\Work\venv\Patent\Dataset\Dataset\FFDDAPMDD1.png</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>839</width>
        <height>1000</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>522</xmin>
            <ymin>29</ymin>
            <xmax>536</xmax>
            <ymax>52</ymax>
        </bndbox>
    </object>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>510</xmin>
            <ymin>258</ymin>
            <xmax>521</xmax>
            <ymax>281</ymax>
        </bndbox>
    </object>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>546</xmin>
            <ymin>528</ymin>
            <xmax>581</xmax>
            <ymax>555</ymax>
        </bndbox>
    </object>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>523</xmin>
            <ymin>646</ymin>
            <xmax>555</xmax>
            <ymax>674</ymax>
        </bndbox>
    </object>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>410</xmin>
            <ymin>748</ymin>
            <xmax>447</xmax>
            <ymax>776</ymax>
        </bndbox>
    </object>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>536</xmin>
            <ymin>826</ymin>
            <xmax>567</xmax>
            <ymax>851</ymax>
        </bndbox>
    </object>
    <object>
        <name>text</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>792</xmin>
            <ymin>918</ymin>
            <xmax>838</xmax>
            <ymax>945</ymax>
        </bndbox>
    </object>
</annotation>

I also have the parsed annotations for each of my images in the format used to train YOLO models.

Example:

C:\Users\HPO2KOR\...\text\FFDDAPMDD1.png 522,29,536,52,0 510,258,521,281,0 546,528,581,555,0 523,646,555,674,0 410,748,447,776,0 536,826,567,851,0 792,918,838,945,0 660,918,706,943,0 63,1,108,24,0 65,51,110,77,0 65,101,109,126,0 63,151,110,175,0 63,202,109,228,0 63,252,110,276,0 63,303,110,330,0 62,353,110,381,0 65,405,109,434,0 90,457,110,482,0 59,505,101,534,0 64,565,107,590,0 61,616,107,644,0 62,670,103,694,0 62,725,104,753,0 63,778,104,804,0 62,831,100,857,0 87,887,106,912,0 98,919,144,943,0 240,916,284,943,0 378,915,420,943,0 520,918,565,942,0
C:\Users\HPO2KOR\...\text\FFDDAPMDD2.png 91,145,109,171,0 68,192,106,218,0 92,239,111,265,0 69,286,108,311,0 92,333,107,357,0 66,379,110,405,0 90,424,111,451,0 69,472,107,497,0 91,518,109,545,0 66,564,109,590,0 90,613,110,637,0 121,644,140,670,0 279,643,322,671,0 446,645,490,668,0 615,642,661,669,0 786,643,831,667,0 954,643,997,672,0 820,22,866,50,0 823,73,866,103,0
C:\Users\HPO2KOR\...\text\FFDDAPMDD3.png 648,1,698,30,0 68,64,129,91,0 55,144,128,168,0 70,218,129,247,0 56,300,127,326,0 71,377,125,404,0 58,459,127,482,0 109,535,130,560,0 140,568,160,594,0 344,568,382,594,0 563,566,581,591,0 760,568,800,593,0 982,569,1000,591,0

What is the procedure to train the EAST text detector on my custom dataset? I am on Windows.

Himanshu Poddar

1 Answer


According to the documentation in the README, training the Keras implementation of EAST on custom data requires a folder of images with an accompanying text file for each image, named gt_IMAGENAME.txt (replace IMAGENAME with the name of the image it maps to).

For each image, "the ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word's bounding box and its transcription in a comma separated format." This quotation is from https://rrc.cvc.uab.es/?ch=4&com=tasks, which is linked from the README of the TensorFlow implementation of EAST at https://github.com/argman/EAST. The bounding box is expressed as the coordinates of its four corners.
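So the training folder ends up looking something like this (file names taken from the question's XML above; the folder name itself is arbitrary):

training_data/
    FFDDAPMDD1.png
    gt_FFDDAPMDD1.txt
    FFDDAPMDD2.png
    gt_FFDDAPMDD2.txt
    ...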

You seem to have all the information you need to construct training data in the right format. There may be a tool out there that converts everything for you, but a quick Python script will do the job just as well. Something like this (a rough sketch of the script follows the list):

  1. Loop over all the XML files
  2. For each XML file, create a text file named the way the documentation requires
  3. Use BeautifulSoup to parse the XML
  4. Use find_all to get all the object tags
  5. Use the xmin, ymin, xmax, and ymax values to compute the x,y coordinates of all four corners. In image coordinates (y grows downward), the top-left corner is (xmin, ymin), the top-right is (xmax, ymin), the bottom-right is (xmax, ymax), and the bottom-left is (xmin, ymax). The order, based on https://github.com/argman/EAST/blob/master/training_samples/img_1.txt, appears to be clockwise starting from the top-left: (xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)
  6. For each object tag, write a line in the text file with the format x1,y1,x2,y2,x3,y3,x4,y4,transcription (or x1,y1,x2,y2,x3,y3,x4,y4,### if there is no transcription to provide), followed by a \n for a newline
  7. Run python train.py with the command-line arguments set the way the "execution example" in the README shows, but change the value after --training_data_path= to your own path
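Here is a rough sketch of what steps 1 through 6 could look like in Python. The folder paths are placeholders for your own layout, and I write ### as the transcription because the XML does not contain one; see step 6 and the comments below on what ### implies.

import os
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

xml_dir = r"C:\path\to\xml_annotations"  # placeholder: folder containing the .xml files
out_dir = r"C:\path\to\training_data"    # placeholder: folder the gt_*.txt files are written to
os.makedirs(out_dir, exist_ok=True)

for name in os.listdir(xml_dir):                          # 1. loop over all XML files
    if not name.lower().endswith(".xml"):
        continue
    with open(os.path.join(xml_dir, name), encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "xml")             # 3. parse the XML ("xml" needs lxml;
                                                          #    "html.parser" also works here)

    image_name = soup.find("filename").text               # e.g. FFDDAPMDD1.png
    gt_path = os.path.join(out_dir, "gt_" + os.path.splitext(image_name)[0] + ".txt")

    lines = []
    for obj in soup.find_all("object"):                   # 4. one <object> per bounding box
        box = obj.find("bndbox")
        xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
        xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
        # 5. four corners, clockwise from the top-left (image coordinates, y grows downward)
        corners = [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax]
        # 6. no transcription is available in the XML, so mark the box with ###
        lines.append(",".join(str(c) for c in corners) + ",###")

    with open(gt_path, "w", encoding="utf-8") as f:       # 2. one gt_IMAGENAME.txt per image
        f.write("\n".join(lines) + "\n")

For the first box in the XML above (522, 29, 536, 52) this writes the line 522,29,536,29,536,52,522,52,### into gt_FFDDAPMDD1.txt. Once the folder contains the images and their gt_*.txt files, step 7 is just the training command from the README's execution example with --training_data_path= pointing at that folder.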
Matt L.
  • Can you explain why we have to pass ### if we don't want to provide the transcription? Why isn't an empty value OK? Thanks – LCMa Jun 08 '21 at 06:41
  • ### is meant to indicate: ignore the content (or any text, e.g. in fonts you don't care about) within this bounding box. There is something in the box, but you want the algorithm to ignore it. Leaving the transcription empty would instead train the model that such regions are empty. – everestial007 Oct 28 '22 at 13:46