If the original code that generated the plot is unavailable, you can recover the data with OCR. Install tesseract first, then Pillow (PIL) and pytesseract:
sudo apt-get install tesseract-ocr
sudo -H pip3 install pillow pytesseract
You'll probably also want to download the French data file (fra.traineddata) and put it in /usr/share/tesseract-ocr/tessdata.
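Before running the OCR, it can help to confirm that the French data is actually in place. This is a minimal sketch: the helper name has_language_data is my own, and the default directory is simply the one mentioned above, which may differ on your installation.

```python
from pathlib import Path


def has_language_data(lang='fra', tessdata_dir='/usr/share/tesseract-ocr/tessdata'):
    """Return True if <lang>.traineddata exists in tessdata_dir.

    Hypothetical helper: the default directory is an assumption taken from
    the text above and may not match every tesseract installation.
    """
    return (Path(tessdata_dir) / f'{lang}.traineddata').is_file()


# Report whether the French data file is present at the assumed location
print(has_language_data('fra'))
```

If this prints False, either the file is missing or your tessdata directory is somewhere else.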
I saved your image as chart.png and then I wrote the code below.
import pytesseract
from PIL import Image
img = Image.open('chart.png')
print(pytesseract.image_to_string(img, lang='fra'))
This is the output.
Château d’AzayflefRideau
Château et musée de Blois
Château des Bau>«dæProvence
Crypte archéologique de NotræDame
Théâtre antique et musée d’Orange
Château d’Angers
Château des ducs de Bretagne, musée
d'histoire de Nantes
281
271
258
223
197
184
180
2 000
4 000 6 000 8 000
Number of V|s|tors ln thousands
10 000
12
If all your images follow exactly this format, all that is left is to parse this output into something readable.
import re
import pytesseract
from PIL import Image
img = Image.open('chart.png')
s = pytesseract.image_to_string(img, lang='fra')
# Bar values: lines that consist only of digits
y_axis = [x for x in s.split('\n') if x.isdigit()]
# Labels: blank-line-separated blocks that start with a letter
x_axis = [x for x in s.split('\n\n') if x and x[0].isalpha()]
x_axis = '\n'.join(x_axis)
# Start a new label at every newline followed by a capital letter, so a
# wrapped label stays together with its lowercase continuation line
x_axis = re.split('(\n[A-Z])', x_axis)
x_axis = [x_axis[0]] + [''.join(x) for x in zip(x_axis[1::2], x_axis[2::2])]
x_axis = [x.strip('\n').replace('\n', ' ') for x in x_axis]
# Keep at most one value per label
y_axis = y_axis[:len(x_axis)]
result = list(zip(x_axis, y_axis))
print(result)
And now you have:
[('Château d’AzayflefRideau', '281'), ('Château et musée de Blois',
'271'), ('Château des Bau>«dæProvence', '258'), ('Crypte archéologique
de NotræDame', '223'), ('Théâtre antique et musée d’Orange', '197'),
('Château d’Angers', '184'), ("Château des ducs de Bretagne, musée
d'histoire de Nantes", '180')]
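The pairing step can also be written line by line, which makes it easy to try on a plain string without running OCR. This is only a sketch under the same assumptions about the layout: every label starts with a capital letter, a wrapped label continues on a line that does not, and all labels come before the first number. The function name pair_labels_with_values and the sample text are mine.

```python
def pair_labels_with_values(ocr_text):
    """Pair chart labels with bar values from raw OCR text.

    Hypothetical helper. Assumes labels start with an uppercase letter,
    wrapped label lines do not, and labels precede the numbers.
    """
    labels, values = [], []
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.isdigit():
            values.append(line)
        elif values:
            # Once the numbers have started, ignore axis ticks and titles
            continue
        elif line[0].isupper() or not labels:
            labels.append(line)
        else:
            # Continuation of a label wrapped over two lines
            labels[-1] += ' ' + line
    # zip stops at the shorter list, dropping any surplus numbers
    return list(zip(labels, values))


# Made-up sample in the same shape as the OCR output
sample = (
    "Château d'Angers\n"
    "Château des ducs de Bretagne, musée\n"
    "d'histoire de Nantes\n"
    "\n"
    "184\n"
    "180\n"
    "2 000\n"
    "Number of visitors in thousands\n"
    "12\n"
)
print(pair_labels_with_values(sample))
```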
The post-processing gets simpler if you split the image in two before passing it to pytesseract (one part for the labels on the left side, the other for the bars and numbers).
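A rough sketch of that split, using Pillow's Image.crop. The helper names split_boxes and ocr_halves are mine, and the 0.4 default for where the labels end is a guess you would have to tune for your charts.

```python
def split_boxes(width, height, label_fraction=0.4):
    """Return (left_box, right_box) as (left, upper, right, lower) tuples.

    label_fraction is the assumed share of the image width taken up by
    the labels; 0.4 is a placeholder, not a measured value.
    """
    if not 0 < label_fraction < 1:
        raise ValueError('label_fraction must be between 0 and 1')
    cut = int(width * label_fraction)
    return (0, 0, cut, height), (cut, 0, width, height)


def ocr_halves(path, lang='fra', label_fraction=0.4):
    """OCR the label side and the bar side of the chart separately."""
    # Imported here so split_boxes works without the OCR stack installed
    from PIL import Image
    import pytesseract

    img = Image.open(path)
    left_box, right_box = split_boxes(img.width, img.height, label_fraction)
    labels_text = pytesseract.image_to_string(img.crop(left_box), lang=lang)
    values_text = pytesseract.image_to_string(img.crop(right_box), lang=lang)
    return labels_text, values_text
```

You would call ocr_halves('chart.png') and then parse the two strings separately, so the labels and the numbers no longer have to be untangled from one block of text.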