
I have a table in PDF format.

I need to create a DataFrame from it. I use the pdfplumber module, and when I try to create the DataFrame I get:

       0                   1     2                     3   \
0   Oil Company                None  None                Target
1          None                None  None  2022-23 \n(Apr-Mar)*
2          None                None  None                  None
3          None                None  None                  None
4          None                None  None                  None
5          None                None  None                  None
6                              ONGC                    19869.61
7          None  (Nomination Block)  None                  None
8                               OIL                     3571.00
9          None  (Nomination Block)  None                  None
10                          Pvt/JVs                     7400.88
11         None    (PSC/RSC Regime)  None                  None
12        Total                None  None              30841.49

                   4        5        6       7     8     9   \
0   September (Month)     None     None    None  None  None
1             2022-23     None  2021-22             %
2                None     None     None    None  over  None
3             Target*   Prod.*    Prod.    None  None  None
4                None     None     None    None  last  None
5                None     None     None    None  year  None
6             1584.31  1599.47  1642.76   97.36  None  None
7                None     None     None    None  None  None
8              245.98   258.83   253.51  102.10  None  None
9                None     None     None    None  None  None
10             613.07   528.09   623.38   84.71  None  None
11               None     None     None    None  None  None
12            2443.36  2386.38  2519.65   94.71  None  None

                              10        11        12      13    14    15
0   April-September (Cumulative)      None      None    None  None  None
1                        2022-23      None   2021-22             %
2                           None      None      None    None  over  None
3                        Target*    Prod.*     Prod.    None  None  None
4                           None      None      None    None  last  None
5                           None      None      None    None  year  None
6                       10041.75   9831.23   9704.15  101.31  None  None
7                           None      None      None    None  None  None
8                        1676.64   1559.34   1495.09  104.30  None  None
9                           None      None      None    None  None  None
10                       3582.34   3343.65   3725.74   89.74  None  None
11                          None      None      None    None  None  None
12                      15300.73  14734.22  14924.98   98.72  None  None

The result is wrong. How can I merge rows 6 and 7, 8 and 9, and 10 and 11 of the second column and move the merged text into the first column? Please don't suggest contacting the publisher to ask for the data in another format. I need the data from this PDF file.

Tim Roberts
Trepetaky
  • That's pretty darned close, don't you think? You just need a little massaging of the data. Combine column 0 and 1, delete column 2, and data is all there. – Tim Roberts Nov 16 '22 at 20:57
  • @TimRoberts Maybe, bro. I'm quite new to pandas, so I don't even know how to phrase a search query to find the right answer – Trepetaky Nov 16 '22 at 21:01
  • @TimRoberts Please advise how I can merge rows 6 and 7, 8 and 9, and 10 and 11 – Trepetaky Nov 16 '22 at 21:07
  • It might be easier to do this with the data you get from pdfplumber before taking it into pandas. Have you looked at that data? – Tim Roberts Nov 16 '22 at 21:08
  • @TimRoberts Yes, it's a list of lists :) – Trepetaky Nov 16 '22 at 21:14
  • The values I need to merge are in different lists. I don't think that's the easiest or fastest way – Trepetaky Nov 16 '22 at 21:17
  • You can drop the odd rows using `df.drop(index=[7, 9, 11], inplace=True)`. That's a start. – Tim Roberts Nov 16 '22 at 21:21
  • If your table spans one page, you can try [`tabula`](https://tabula-py.readthedocs.io/en/latest/tabula.html#high-level-interfaces), which gives you the output directly as a `DataFrame` – Mr. Hobo Nov 17 '22 at 06:46
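
The suggestions in the comments (combine columns 0 and 1, drop column 2, fold the odd rows into the rows above them) might look roughly like the sketch below. It is not tested against the actual PDF. The `rows` list is a hypothetical, trimmed-down stand-in for rows 6 to 12 and columns 0 to 4 of the frame shown in the question, and it assumes the blank cells in the printout are empty strings. The column name `Oil Company` is taken from the header row.

```python
import pandas as pd

# Hypothetical stand-in for rows 6-12, columns 0-4 of the extracted table.
# Assumption: blank cells in the printed frame are empty strings.
rows = [
    ["", "ONGC", "", "19869.61", "1584.31"],
    [None, "(Nomination Block)", None, None, None],
    ["", "OIL", "", "3571.00", "245.98"],
    [None, "(Nomination Block)", None, None, None],
    ["", "Pvt/JVs", "", "7400.88", "613.07"],
    [None, "(PSC/RSC Regime)", None, None, None],
    ["Total", None, None, "30841.49", "2443.36"],
]
df = pd.DataFrame(rows, index=range(6, 13))

# Treat empty strings as missing so they behave like None.
df = df.mask(df.eq(""))

# Combine columns 0 and 1 into a single label column.
name = df[0].fillna(df[1])

# Everything after column 2 holds the numeric values.
values = df.drop(columns=[0, 1, 2])

# A continuation row carries only label text and no numbers.
is_continuation = values.isna().all(axis=1)

# Each data row starts a new group; continuation rows join the group above.
group = (~is_continuation).cumsum()
merged_name = name.groupby(group).agg(" ".join)

# Keep only the data rows, convert the strings to numbers, add the labels.
result = values[~is_continuation].apply(pd.to_numeric)
result.insert(0, "Oil Company", merged_name.to_numpy())
result = result.reset_index(drop=True)

print(result)
```

The header rows (0 to 5 in the question's frame) are not handled here. They would need to be sliced off first, for example with `df.iloc[6:]`, and the column names assigned by hand.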
