0

I have a PDF file, I need to convert it into a CSV file this is my pdf file example as link https://online.flippingbook.com/view/352975479/ the code used is

import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple
file = "Battery Voltage.pdf"
lines = []
total_check = 0

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)

with the above script I am not getting proper output, For Time column "AM" is getting in the next line. The output I am getting is like this

[1]: https://i.stack.imgur.com/25Yxc.png

K J
  • 8,045
  • 3
  • 14
  • 36
  • PDF files don't contain information about how text is supposed to be ordered. All it has is the position of the text on the page. For a tool like pdfplumber, all it can do is guess what order it ought to go in by looking at the X and Y coordinates of the text on the page. – Nick ODell Nov 26 '22 at 04:28
  • In this specific case, the output has a pattern, so you could maybe fix it by looking for lines with the content "AM" or "PM", and adding them to the end of the third field of the previous line, after splitting by whitespace. – Nick ODell Nov 26 '22 at 04:29

2 Answers2

2

It may help you to see how the surface of a pdf is displayed to the screen. so that one string of plain text is placed part by part on the display. (Here I highlight where the first AM is to be placed.

enter image description here

As a side issue that first AM in the file is I think at first glance encoded as this block

BT
/F1 12 Tf
1 0 0 1 224.20265 754.6322 Tm
[<001D001E>] TJ
ET

Where in that area 1D = A and 1E = M

So If you wish to extract each LINE as it is displayed, by far the simplest way is to use a library such as pdftotext that especially outputs each row of text as seen on page.

Thus using an attack such as tabular comma separated you can expect each AM will be given its own row. Which should by logic be " ",AM," "," " but some extractors should say nan,AM,nan,nan

As text it looks like this from just one programmable line

pdftotext -layout "Battery Voltage.pdf"

That will output "Battery Voltage.txt" in the same work folder

enter image description here

Then placing that in a spreadsheet becomes enter image description here

Now we can export in a couple of clicks (no longer) as "proper output" csv along with all its oddities that csv entails.

,,Battery Vo,ltage,




Sr No,DateT,Ime,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off
,AM,,,
2,01/11/2022,00:23:10,47.15,Off
,AM,,,
3,01/11/2022,00:38:10,47.15,Off
,AM,,,
4,01/11/2022,00:58:10,47.15,Off
,AM,,,
5,01/11/2022,01:18:10,47.15,Off
,AM,,,
6,01/11/2022,01:33:10,47.15,Off
,AM,,,
7,01/11/2022,01:48:10,47.15,Off
,AM,,,
8,01/11/2022,02:03:10,47.15,Off
,AM,,,
9,01/11/2022,02:18:10,47.15,Off
,AM,,,
10,01/11/2022,02:37:12,47.15,Off
,AM,,,

So, if the edits were not done before csv generation it is simpler to post process in an editor, like this html page (no need for more apps)

,,Battery,Voltage,
Sr No,Date,Time,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off,AM,,,
2,01/11/2022,00:23:10,47.15,Off,AM,,,
3,01/11/2022,00:38:10,47.15,Off,AM,,,
4,01/11/2022,00:58:10,47.15,Off,AM,,,
5,01/11/2022,01:18:10,47.15,Off,AM,,,
6,01/11/2022,01:33:10,47.15,Off,AM,,,
7,01/11/2022,01:48:10,47.15,Off,AM,,,
8,01/11/2022,02:03:10,47.15,Off,AM,,,
9,01/11/2022,02:18:10,47.15,Off,AM,,,
10,01/11/2022,02:37:12,47.15,Off,AM,,,

Then on re-import it looks more human generated

enter image description here

In discussions it was confirmed all that's desired is a means to a structured list and first parse using
pdftotext -layout -nopgbrk -x 0 -y 60 -W 800 -H 800 -fixed 6 "Battery Voltage.pdf" &type "battery voltage.txt"|findstr "O">battery.txt

will output regulated data columns for framing, with a fixed headline or splitting or otherwise using cleaned data.

                 1            01-11-2022 00:08:10         47.15                 Off
                 2            01-11-2022 00:23:10         47.15                 Off
                 3            01-11-2022 00:38:10         47.15                 Off
                 4            01-11-2022 00:58:10         47.15                 Off
                 5            01-11-2022 01:18:10         47.15                 Off
...
               32357          24-11-2022 17:48:43         45.40                  On
               32358          24-11-2022 17:48:52         44.51                  On
               32359          24-11-2022 17:48:55         44.51                  On
               32360          24-11-2022 17:48:58         44.51                  On
               32361          24-11-2022 17:48:58         44.51                  On

At this stage we can use text handling such as csv or add json brackets

for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do echo csv is "%%a,%%b,%%c,%%d,%%e">output.txt
...
csv is "32357,24-11-2022,17:48:43,45.40,On"
csv is "32358,24-11-2022,17:48:52,44.51,On"
csv is "32359,24-11-2022,17:48:55,44.51,On"
csv is "32360,24-11-2022,17:48:58,44.51,On"
csv is "32361,24-11-2022,17:48:58,44.51,On"

So the request is for JSON (not my forte so you may need to improve on my code as I dont know what mongo expects)

here I drop a pdf onto a battery.bat

enter image description here

{"line_id":1,"created":{"date":"01-11-2022"},{"time":"00:08:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":2,"created":{"date":"01-11-2022"},{"time":"00:23:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":3,"created":{"date":"01-11-2022"},{"time":"00:38:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":4,"created":{"date":"01-11-2022"},{"time":"00:58:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":5,"created":{"date":"01-11-2022"},{"time":"01:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":6,"created":{"date":"01-11-2022"},{"time":"01:33:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":7,"created":{"date":"01-11-2022"},{"time":"01:48:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":8,"created":{"date":"01-11-2022"},{"time":"02:03:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":9,"created":{"date":"01-11-2022"},{"time":"02:18:10"},{"Voltage":"47.15"},{"State","Off"}}
{"line_id":10,"created":{"date":"01-11-2022"},{"time":"02:37:12"},{"Voltage":"47.15"},{"State","Off"}}

it is a bit slow as running in pure console so lets run it blinder by add @, it will still take time as we are working in plain text, so do expect a significant delay for 32,000+ lines = 2+1/2 minutes on my kit

pdftotext -layout -nopgbrk -x 0 -y 60 -W 700 -H 800 -fixed 8 "%~1" battery.txt

echo Heading however you wish it for json perhaps just opener [ but note only one redirect chevron >"%~dpn1.txt"

for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do @echo  "%%a": { "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },>>"%~dpn1.txt"
REM another json style could be  { "Line_Id": %%a, "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },
REM another for an array can simply be [%%a,"%%b","%%c",%%d,"%%e" ],

echo Tailing however you wish it for json perhaps just final closer ] but note double chevron >>"%~dpn1.txt"

enter image description here

To see progress change @echo { to @echo %%a&echo {

Thus, after a minute or so enter image description here however, it tends to add an extra minute for all that display activity! before the window closes as a sign of completion.

K J
  • 8,045
  • 3
  • 14
  • 36
1

For cases like these, build a parser that converts the unusable data into something you can use.

Logic below converts that exact file to a CSV, but will only work with that specific file contents.

Note that for this specific file you can ignore the AM/PM as the time is in 24h format.

import pdfplumber


file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]


with open("output.csv", "w") as outfile:
    header = "serialnumber;date;time;voltage;ignition\n"
    outfile.write(header)
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            for line in page.extract_text().split('\n'):
                if line.strip() in skiplines:
                    continue
                outfile.write(";".join(line.split())+"\n")

EDIT

So, JSON files in python are basically just a list of dict items (yes, that's oversimplification).

The only thing you need to change is the way you actually process the lines. The actual meat of the logic doesn't change...

import pdfplumber
import json


file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]
result = []


with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        for line in page.extract_text().split("\n"):
            if line.strip() in skiplines:
                continue
            serialnumber, date, time, voltage, ignition = line.split()
            result.append(
                {
                    "serialnumber": serialnumber,
                    "date": date,
                    "time": time,
                    "voltage": voltage,
                    "ignition": ignition,
                }
            )

with open("output.json", "w") as outfile:
    json.dump(result, outfile)
Edo Akse
  • 4,051
  • 2
  • 10
  • 21