Merging cells, in the same column, in the same df- Python

Question

I am attempting to merge two cells together. The reason for this is due to the fact that every unit under 'Chassis' should be an alphanumeric (ABCD123456) however the PO provided occasionally shifts the last number to the next row (no other data on said row) making the data look like this Example I initially tried to create a statement that looked at the cell, confirmed it was less one number, then would look at the next cell, and merge the two. Never got that to even come close to manifesting any results. I then decided to replicate the data frame, shift the second data frame(so the missing number is on the same row), and merge them together. This is where I am now. Error Msg This is my first real bit of code in Python so I am fairly certain I am doing inefficient things so by all means let me know where I can improve.

Currently I have this...

Col1	Chassis	Other Columns...	Other Columns 2...
Nan	ABCD12345	ABC	123
Nan	6	Nan	Nan
Nan	WXYZ987654	GHI	456
Nan	QRSTU654987	Nan	789
Nan	MNOP999999	XYZ	Nan

End Goal is this...

Col1	Chassis	Other Columns...	Other Columns 2...
Nan	ABCD123456	ABC	123
Nan	WXYZ987654	GHI	456
Nan	QRSTU654987	Nan	789
Nan	MNOP999999	XYZ	Nan

import PyPDF2 as pdf2
import tabula as tb
import pandas as pd
import re
import csv
import os

os.listdir()
pd.set_option('display.max_columns', None)

#bring in pdf, remove first page, convert to csv
PO = 'PO.pdf'
pages = open(PO, 'rb')
readPDF = pdf2.PdfFileReader(pages)
totalpages = readPDF.numPages
x = '2-' + str(totalpages)
POCSV = tb.convert_into(PO, 'POCSV.csv', output_format = 'csv', pages = x)

#Convert column to string, create second data frame, shift said data frame up 1
df = pd.read_csv('POCSV.csv')
df['Chassis'] = df['Chassis'].astype(str)
dfshift = df.shift(-1)
dfshift.rename(columns=({'Chassis': 'Chassis Shifted'}), inplace = True,)
dfMerged = pd.concat([df, dfshift], axis=1)

#For each row combine rows, create new column
for ind, row in df.iterrows():
    dfMerged.loc[ind, 'Complete Chassis'] = row['Chassis'] + row["Chassis Shifted"]
    
print(dfMerged['Complete Chassis'])

Corralien · Accepted Answer · 2022-01-26T15:29:29.663

0

Create a virtual group and merge rows of this group for Chassis column:

# Convert 'NaN' string to pd.NA
df = df.replace('Nan', pd.NA)

cols = df.columns.difference(['Chassis'])
m = df[cols].any(1)
df = df.assign(Chassis=df.groupby(m.cumsum())['Chassis'] \
       .transform('sum')).loc[m].reset_index(drop=True)
print(df)

# Output
   Col1      Chassis Other Columns Other Columns 2
0  <NA>   ABCD123456           ABC             123
1  <NA>   WXYZ987654           GHI             456
2  <NA>  QRSTU654987          <NA>             789
3  <NA>   MNOP999999           XYZ            <NA>

Setup:

import pandas as pd
import numpy as np

data = {'Col1': ['Nan', 'Nan', 'Nan', 'Nan', 'Nan'],
        'Chassis': ['ABCD12345', '6', 'WXYZ987654', 'QRSTU654987', 'MNOP999999'],
         'Other Columns': ['ABC', 'Nan', 'GHI', 'Nan', 'XYZ'],
         'Other Columns 2': ['123', 'Nan', '456', '789', 'Nan']}
df = pd.DataFrame(data)
print(df)

# Output
  Col1      Chassis Other Columns Other Columns 2
0  Nan    ABCD12345           ABC             123
1  Nan            6           Nan             Nan
2  Nan   WXYZ987654           GHI             456
3  Nan  QRSTU654987           Nan             789
4  Nan   MNOP999999           XYZ             Nan

edited Jan 26 '22 at 15:29

answered Jan 26 '22 at 04:32

Corralien

109,409
8
28
52

This works for your example df but on my actual data it is giving an Empty DataFrame. I think it has to do with Nan always existing on Col1 and randomly throughout the data. I've updated my tables above to better reflect this visually. – WinterT Jan 26 '22 at 12:24
@WinterT. I updated my answer. I think 'Nan' is a string and not `np.nan`. Can you try it, please? – Corralien Jan 26 '22 at 13:01
1

That did it! I really do appreciate your assistance. I was doing all sorts of crazy things to try and get a similar output. – WinterT Jan 26 '22 at 13:38
1

Wanted to add that with this method the index is no longer in proper sequential order. To fix this add this line of code before your print or save...df.reset_index(drop=True, inplace=True) #reset index – WinterT Jan 26 '22 at 15:01
I updated my answer according your comment. – Corralien Jan 26 '22 at 15:29

Merging cells, in the same column, in the same df- Python

1 Answers1