You can have Tesseract emit hOCR output, which retains positional information. Converting documents like these directly into text while preserving the layout is a hard problem. As an intermediate solution, you can build a data frame that holds each word together with its coordinates, and then use those coordinates to extract key-value pairs.

    import pytesseract

    # This saves the Tesseract output as "demo.hocr"
    pytesseract.pytesseract.run_tesseract(
        "demo.jpg", "demo",
        extension='.html', lang='eng', config="hocr")
hOCR is an HTML-like representation that carries metadata such as line boundaries, individual words, and their bounding-box coordinates.
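To make that structure concrete, here is a minimal sketch that pulls words and bounding boxes out of hOCR using only the Python standard library. It assumes the usual Tesseract layout, where each word is a `span` with class `ocrx_word` and a `title` attribute such as `bbox 36 92 96 116; x_wconf 95`. The class name `HocrWordParser` and the sample markup are illustrative and not part of any library.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every ocrx_word span (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = None
            # The title holds ';'-separated properties; only "bbox x0 y0 x1 y1" is used here
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if len(parts) == 5 and parts[0] == "bbox":
                    self._bbox = tuple(int(p) for p in parts[1:])
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span with a valid bbox
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, *self._bbox))
            self._bbox = None


# Hand-written sample in the shape Tesseract typically produces (assumed, not real output)
sample_hocr = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Invoice</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>No:</span>
  </span>
</div>
"""

parser = HocrWordParser()
parser.feed(sample_hocr)
words = parser.words
```

Each entry in `words` is a tuple of the word text followed by its left, top, right, and bottom pixel coordinates, which is enough to load into a data frame if you prefer.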
To make this easier to work with, I wrote a parser that reads the hOCR and returns a data frame of words and their coordinates.
It is published on PyPI as a package called tesseract2dict.
You can install it with pip:

    pip install tesseract2dict
Here is how to use it:

    import cv2
    from tesseract2dict import TessToDict

    td = TessToDict()
    inputImage = cv2.imread('path/to/image.jpg')
    # cv2.imread returns None instead of raising when the file cannot be read
    if inputImage is None:
        raise FileNotFoundError('path/to/image.jpg')

    # Function 1: get word-level information as a data frame
    word_dict = td.tess2dict(inputImage, 'outputName', 'outfolder')

    # Function 2: get the plain text inside a region given as (x, y, w, h);
    # here the region covers the whole image
    text_plain = td.word2text(word_dict, (0, 0, inputImage.shape[1], inputImage.shape[0]))
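Once you have words with coordinates, key-value extraction can be done geometrically: find the key, then collect the words that sit to its right on roughly the same line. The sketch below assumes each word is a dict with `text`, `x`, `y`, `w`, and `h` fields. Those field names, the function `value_right_of`, and the sample data are hypothetical and may not match the columns tesseract2dict returns, so adapt them to your data frame.

```python
def value_right_of(words, key, y_tolerance=10):
    """Return the text to the right of `key` on roughly the same line, or None."""
    anchor = next((w for w in words if w["text"] == key), None)
    if anchor is None:
        return None
    # Keep words whose top edge is close to the key's and that start past its right edge
    same_line = [
        w for w in words
        if abs(w["y"] - anchor["y"]) <= y_tolerance
        and w["x"] >= anchor["x"] + anchor["w"]
    ]
    same_line.sort(key=lambda w: w["x"])  # restore left-to-right reading order
    return " ".join(w["text"] for w in same_line)


# Made-up sample data; field names are an assumption for illustration
words = [
    {"text": "Name:", "x": 40, "y": 100, "w": 60, "h": 20},
    {"text": "Smith", "x": 170, "y": 101, "w": 55, "h": 20},
    {"text": "Jane", "x": 110, "y": 102, "w": 50, "h": 20},
    {"text": "Date:", "x": 40, "y": 140, "w": 55, "h": 20},
    {"text": "2021-03-04", "x": 110, "y": 141, "w": 110, "h": 20},
]

name = value_right_of(words, "Name:")
date = value_right_of(words, "Date:")
```

The `y_tolerance` parameter absorbs the small vertical jitter that OCR bounding boxes usually have; tune it to the text height in your scans.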
PS: This package is only compatible with Tesseract 5.0.0