
Let's say I have an image with a bar graph such as the one below:

source: statista.com

I want to extract the value of each bar and its label. Is there a way to do this other than training ML models?

I have a bunch of graph images that I generated, along with some descriptions. I am currently extracting information from the descriptions alone, which works, but I realised the information there is limited, so I would also like to extract information from the images. Is there a simple way to do this?

Some references to look through would be very helpful. My preferred language is Python. I have no experience with image manipulation.

Note: the images and descriptions are the ones I created.

Georgy
Koo
  • How did you create this plot? If, for example, you used matplotlib, then you could extract data like here: [Retrieve XY data from matplotlib figure](https://stackoverflow.com/questions/20130768/retrieve-xy-data-from-matplotlib-figure). – Georgy Jul 02 '18 at 12:22
  • I created the plots using Tableau. I don't have the original data anymore, as we had to delete it. – Koo Jul 02 '18 at 12:58
  • Maybe there is something helpful here: https://www.researchgate.net/post/How_can_I_extract_the_values_of_data_plotted_in_a_graph_which_is_available_in_pdf_form or here: https://stackoverflow.com/questions/1657941/plot-digitization-scraping-sample-values-from-an-image-of-a-graph – Georgy Jul 02 '18 at 13:38
  • If you just want to extract the statistical information from bar graphs, I think all you need is [plotdigitizer](https://plotdigitizer.com/). – Anonymous Apr 09 '21 at 11:36

1 Answer


If the original code that generated the plot is unavailable, you can read the text off the image with OCR. Install Tesseract, and then Pillow (PIL) and pytesseract.

sudo apt-get install tesseract-ocr

sudo -H pip3 install pillow pytesseract

Since the labels are in French, you'll probably also want to download the French data file (fra.traineddata) and put it in /usr/share/tesseract-ocr/tessdata.

I saved your image as chart.png and then I wrote the code below.

import pytesseract
from PIL import Image

img = Image.open('chart.png')
# lang='fra' selects the French model, since the labels are in French
print(pytesseract.image_to_string(img, lang='fra'))

This is the output.

Château d’AzayflefRideau

Château et musée de Blois

Château des Bau>«dæProvence
Crypte archéologique de NotræDame
Théâtre antique et musée d’Orange

Château d’Angers

Château des ducs de Bretagne, musée
d'histoire de Nantes

281

271

258

223

197

184

180

2 000

4 000 6 000 8 000
Number of V|s|tors ln thousands

10 000

12

If all your images follow exactly the same format, all that remains is to turn this output into something readable.

import re

import pytesseract
from PIL import Image

img = Image.open('chart.png')
s = pytesseract.image_to_string(img, lang='fra')

# Bar values: lines of the OCR output that consist only of digits
y_axis = [x for x in s.split('\n') if x.isdigit()]

# Labels: blank-line-separated chunks that start with a letter
x_axis = [x for x in s.split('\n\n') if x and x[0].isalpha()]
x_axis = '\n'.join(x_axis)
# Start a new label at each line that begins with a capital letter, so that
# continuation lines (e.g. "d'histoire de Nantes") stay with their label
x_axis = re.split('(\n[A-Z])', x_axis)
x_axis = [x_axis[0]] + [''.join(x) for x in zip(x_axis[1::2], x_axis[2::2])]
x_axis = [x.strip('\n').replace('\n', ' ') for x in x_axis]

# Keep one value per label; this drops trailing stray numbers
y_axis = y_axis[:len(x_axis)]
result = list(zip(x_axis, y_axis))
print(result)

And now you have:

[('Château d’AzayflefRideau', '281'), ('Château et musée de Blois', '271'), ('Château des Bau>«dæProvence', '258'), ('Crypte archéologique de NotræDame', '223'), ('Théâtre antique et musée d’Orange', '197'), ('Château d’Angers', '184'), ("Château des ducs de Bretagne, musée d'histoire de Nantes", '180')]
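The values in these pairs are still strings. As a minimal sketch of a possible next step, the snippet below converts them to integers and writes the rows to CSV using only the standard library. The helper names `to_rows` and `write_csv` and the column name `visitors_thousands` are assumptions made for illustration (the column name is taken from the chart's axis title), not part of the code above.

```python
import csv
import io


def to_rows(pairs):
    """Convert (label, value) string pairs into (label, int) rows.

    Values that cannot be parsed as integers (after removing spaces used as
    thousands separators) become None, so OCR misreads stay visible.
    """
    rows = []
    for label, value in pairs:
        try:
            number = int(value.replace(' ', ''))
        except ValueError:
            number = None
        rows.append((label, number))
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text file handle, with a header line."""
    writer = csv.writer(handle)
    # Hypothetical column names; adjust to your own charts.
    writer.writerow(['label', 'visitors_thousands'])
    writer.writerows(rows)


if __name__ == '__main__':
    sample = [('Château d’Angers', '184'), ('Château et musée de Blois', '271')]
    buffer = io.StringIO()
    write_csv(to_rows(sample), buffer)
    print(buffer.getvalue())
```

Labels that came out garbled from the OCR step (such as 'Château des Bau>«dæProvence') are passed through unchanged here and would still need to be corrected by hand or checked against your descriptions.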

This code can be made simpler if you split the image in two before passing it to pytesseract (one part for the labels on the left side, the other for the bars and numbers).
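A rough sketch of that split, assuming the labels occupy a fixed fraction of the image width: `split_boxes`, `ocr_halves` and the `split_ratio` parameter are hypothetical names introduced here, and the ratio would need to be tuned by eye for your charts. Only `Image.open`, `Image.crop` and `pytesseract.image_to_string` are actual library calls.

```python
def split_boxes(width, height, split_ratio=0.5):
    """Return PIL-style crop boxes (left, upper, right, lower) for both parts."""
    split_x = int(width * split_ratio)
    return (0, 0, split_x, height), (split_x, 0, width, height)


def ocr_halves(path, split_ratio=0.5, lang='fra'):
    """OCR the label side and the value side of a chart image separately."""
    # Imported here so split_boxes can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    img = Image.open(path)
    label_box, value_box = split_boxes(img.width, img.height, split_ratio)
    labels = pytesseract.image_to_string(img.crop(label_box), lang=lang)
    values = pytesseract.image_to_string(img.crop(value_box), lang=lang)
    return labels, values
```

Calling `ocr_halves('chart.png', split_ratio=0.4)` would then return two strings, one containing only label text and one containing only numbers, so each can be split into lines without the regular-expression step.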

Ruan