
Raw data: a PDF file containing the student placement details of a university. The data is completely unstructured and needs to be cleaned up before processing.

The Expected CSV file output:

I tried importing the PDF from inside an Excel spreadsheet, and I also tried converting it to .xlsx and then cleaning it. Both attempts still produced unstructured data.

I have no prior experience with Power Query, web queries, or data scraping.

Please suggest possible methods to clean the data and write it to a CSV file. A step-by-step procedure, along with the tools and frameworks needed to obtain the desired result, would be ideal.

Ajeet Verma
  • PDF isn't a data format. It's a page-description format: essentially drawing commands derived from PostScript that place text and graphics on a page. It has no inherent tabular structure, so there are no real tables in it. Libraries like pypdf and tabula-py can try to extract text from pages by looking at adjacent text fragments and guessing rows and columns from grid lines. Without grid lines, you'll have to give the tool the coordinates of the tables and columns on each page. You could still end up with header and footer text in your data. – Panagiotis Kanavos May 17 '23 at 14:46
  • Worse, since PDF has no tables, nothing says that data should flow horizontally. I've seen PDF documents where trying to select text from a table would select the *column*, not the row, because the "table" was generated vertically instead of horizontally – Panagiotis Kanavos May 17 '23 at 14:48
  • If you could provide us with an example of the PDF file, then someone can take a look at it. You can also try something like [ChronoScan](https://www.chronoscan.org/), which uses templates to extract data with OCR and export it to different output formats. – DTNGNR May 17 '23 at 16:26
  • Take a look at https://pypi.org/project/pypdf/. Depending on the structure of the presentation you may be able to parse out the fields. – Stephen Boston May 18 '23 at 03:00
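
The approach suggested in the comments above (extract the page text with pypdf, then parse out the fields and discard header and footer lines) might look roughly like the sketch below. It is only an illustration under assumptions: the actual PDF was not shared, so the row layout (serial number, roll number, student name, package), the `ROW_PATTERN` regex, the column names, and the sample lines are all hypothetical and would need adapting to the real file. `extract_lines` requires the third-party pypdf package; the parsing and CSV steps use only the standard library.

```python
import csv
import io
import re

# Hypothetical row layout (an assumption, not taken from the real PDF):
# serial number, roll number, student name, package.
ROW_PATTERN = re.compile(
    r"^\s*(?P<serial>\d+)\s+(?P<roll_no>\S+)\s+(?P<name>.+?)"
    r"\s+(?P<package>\d+(?:\.\d+)?)\s*$"
)
FIELDS = ["serial", "roll_no", "name", "package"]


def extract_lines(pdf_path):
    """Return the text lines of every page, extracted with pypdf."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    lines = []
    for page in reader.pages:
        text = page.extract_text() or ""
        lines.extend(text.splitlines())
    return lines


def parse_rows(lines):
    """Keep only lines that match the row pattern.

    Titles, column headings, and page footers do not match, so they
    are dropped instead of ending up in the data.
    """
    rows = []
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match:
            rows.append([match.group(field) for field in FIELDS])
    return rows


def rows_to_csv(rows):
    """Serialize the parsed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up lines standing in for the output of extract_lines("placements.pdf").
sample_lines = [
    "University Placement Report 2023",
    "S.No Roll No Name Package",
    "1 CS101 Asha Rao 6.5",
    "2 CS102 Ravi Kumar 12",
    "Page 1 of 3",
]
sample_rows = parse_rows(sample_lines)
print(rows_to_csv(sample_rows))
```

A regex filter like this only works when each record comes out as a single line of text. If the PDF stores the table column by column, as described in the second comment, the extracted lines will not look like rows, and a coordinate-based tool such as tabula-py is probably a better starting point.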
