
I have extracted tabular data from a PDF into a pandas DataFrame using Camelot. Because of table indentation issues in the PDF, a string belonging to a single row sometimes gets split across two rows (especially strings inside bullet points). I want to merge these split rows back into a single row.

I have highlighted how a single row is split into two rows (for the "c)" bullet point and the "V" bullet point): [image]

I have also added the expected output: [image]

I am not able to come up with a general logic for this. Can anyone suggest some clever code to handle these cases?

Link to sample dataset : https://docs.google.com/spreadsheets/d/1xdhb1d5qWPhcF3mdS1F76FfMqgFLmZdonHmo9DKBUw0/edit#gid=0

Parth chokhra
  • Can you add a sample dataset as well? – Mehul Gupta Jun 24 '22 at 11:19
  • Otherwise, loop over the dataset using df.iterrows(); wherever you have a null Note/year value, merge that row with the next row (index + 1). – Mehul Gupta Jun 24 '22 at 11:20
  • Is there any way of identifying whether a string starts with some type of bullet point or not? – Parth chokhra Jun 24 '22 at 11:36
  • what if .pdf extraction code could be adjusted to extract data properly in the first place? – NoobVB Jun 24 '22 at 11:39
  • There doesn't seem to be an easy way to identify the unwanted row breaks? The EXPENSES row should not be merged but it looks the same as the Profit before interest row in terms of missing values? Find a working rule for merging and people will be able to answer that in pandas. – creanion Jun 24 '22 at 11:40
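
The two ideas from the comments (merge a row with empty Note/year values into the next row, and detect bullet points) could be combined roughly as below. This is only a sketch under assumptions that are not confirmed by the question: the column names `Particulars`, `Note` and `2021` are made up, the bullet regex only covers markers such as "a)", "(a)", "1." and upper-case Roman numerals like "V", and the merge rule (a bullet row with no values, followed by a row that does not start with a bullet) is a guess that would need checking against the real data. Section headers such as EXPENSES are left alone because they do not start with a bullet marker.

```python
import re
import pandas as pd

# Assumed bullet styles: "a)", "(a)", "12)", "1.", or a Roman numeral such as "V" / "IV."
BULLET_RE = re.compile(r"^\s*(?:\(?[a-z0-9]{1,3}\)|\d{1,2}\.|[IVX]{1,4}\b\.?)\s+")


def starts_with_bullet(text):
    """True if text is a string that begins with one of the assumed bullet markers."""
    return isinstance(text, str) and BULLET_RE.match(text) is not None


def is_blank(value):
    """Treat NaN/None and empty strings as 'no value' (Camelot often yields '')."""
    return bool(pd.isna(value)) or (isinstance(value, str) and not value.strip())


def merge_split_rows(df, text_col, value_cols):
    """Merge a bullet row that has no values with the row that follows it.

    text_col and value_cols are hypothetical names supplied by the caller.
    """
    records = df.to_dict("records")
    merged = []
    i = 0
    while i < len(records):
        current = records[i]
        nxt = records[i + 1] if i + 1 < len(records) else None
        is_fragment = (
            starts_with_bullet(current[text_col])
            and all(is_blank(current[c]) for c in value_cols)
            and nxt is not None
            and isinstance(nxt[text_col], str)
            and not starts_with_bullet(nxt[text_col])
        )
        if is_fragment:
            # Keep the values of the second row, prepend the text of the first.
            combined = dict(nxt)
            combined[text_col] = f"{current[text_col].strip()} {nxt[text_col].strip()}"
            merged.append(combined)
            i += 2
        else:
            merged.append(current)
            i += 1
    return pd.DataFrame(merged, columns=df.columns)


# Made-up sample data, only to illustrate the rule
sample = pd.DataFrame({
    "Particulars": [
        "EXPENSES",
        "a) Cost of materials",
        "c) Changes in inventories of",
        "finished goods",
        "V Profit before",
        "interest and tax",
    ],
    "Note": [None, "21", None, "22", None, None],
    "2021": [None, 100.0, None, 5.0, None, 50.0],
})
result = merge_split_rows(sample, "Particulars", ["Note", "2021"])
print(result)
```

If a string is split across more than two rows, or if a continuation line itself happens to start with something that looks like a bullet, this rule will not be enough and would need to be extended.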

0 Answers