0

I have data that looks like this

import pandas as pd
current_df = pd.DataFrame({ 'Col1':['Something (Something', 'Else) words', 'x', 'y', 'something (another', 'word) blah'], 
                    'Col2':['(some value)', '', 'some value', 'a', 'some value',''],
                    'Col3':['Something (Something', 'Else) words', '(x)', 'y', 'something (another', 'word) blah'], 
                    'Col4':['some value', '', 'some value', 'a', 'some value',''],
                    'Col5':['some value', '', 'some value', 'a', 'some value','']})

                   Col1          Col2                  Col3        Col4        Col5
0  Something (Something  (some value)  Something (Something  some value  some value
1           Else) words                         Else) words
2                     x    some value                   (x)  some value  some value
3                     y             a                     y           a           a
4    something (another    some value    something (another  some value  some value
5            word) blah  

I scraped this from a PDF and in some cases, there was a weird formatting issue where something would carry over to another line. It is always in the same two columns, and there is nothing else in those columns, just as I have shown. Is there a way that I could merge those values with the row above? Desired output is below.

In the example, the data I want to join is split by ( in one row and ) in the next. This is true for about 99% of my data, and I was thinking about trying to leverage that. But if there is a cleaner way to just merge the cells up, please let me know.

goal_df = pd.DataFrame({ 'Col1':['Something (Something Else) words', 'x', 'y', 'something (another word) blah'], 
                    'Col2':['(some value)', 'some value', 'a', 'some value'],
                    'Col3':['Something (Something Else) words', '(x)', 'y', 'something (another word) blah'], 
                    'Col4':['some value', 'some value', 'a', 'some value'],
                    'Col5':['some value', 'some value', 'a', 'some value'],})

                               Col1          Col2                              Col3        Col4        Col5
0  Something (Something Else) words  (some value)  Something (Something Else) words  some value  some value
1                                 x    some value                               (x)  some value  some value
2                                 y             a                                 y           a           a
3     something (another word) blah    some value     something (another word) blah  some value  some value

pkpto39
  • 545
  • 4
  • 11
  • Going back a step, did you look into another method of scraping? Maybe this method helps..?.. [convert from pdf to text: lines and words are broken](https://stackoverflow.com/questions/55220455/convert-from-pdf-to-text-lines-and-words-are-broken) – MDR Aug 06 '21 at 00:14
  • I'm scraping from a formatted table. Its about 200 pages, and each page is the same format, but I have a few hundred cases where I have this issue. The data in Col1 was just too large for the margin and carried over to the next line. I'm still learning python, but I'm not sure this would work well for me. – pkpto39 Aug 06 '21 at 00:19

2 Answers2

-1

This is probably a manual way of doing this, but it will do the job. You need to follow these steps:

Merge the top two rows (this can be changed for any other rows if you are getting the same issue elsewhere).
Drop the second row as you will not need it anymore.
Reset the index as dropping the second row will change the index to: 0 2 3.

Here is the code:

df.iloc[0,] = pd.DataFrame([df.loc[0,] + df.loc[1,]])
df = df.drop(df.index[1])
df.reset_index(drop=True)

enter image description here

Amir Charkhi
  • 768
  • 7
  • 23
-1

Try:

import pandas as pd
import re

current_df = pd.DataFrame({ 'Col1':['Something (Something', 'Else) words', 'x', 'y', 'something (another', 'word) blah'], 
                    'Col2':['(some value)', '', 'some value', 'a', 'some value',''],
                    'Col3':['Something (Something', 'Else) words', '(x)', 'y', 'something (another', 'word) blah'], 
                    'Col4':['some value', '', 'some value', 'a', 'some value',''],
                    'Col5':['some value', '', 'some value', 'a', 'some value','']})

print(current_df)

'''Shows:
    
    Col1                    Col2            Col3                    Col4            Col5
0   Something (Something    (some value)    Something (Something    some value      some value
1   Else) words                             Else) words     
2   x                       some value      (x)                     some value      some value
3   y                       a               y                       a               a
4   something (another      some value      something (another      some value      some value
5   word) blah                              word) blah      
'''
print('\n\n')

pattern_check = r'.*?\)(?:.(?!\)))+$'

cols = list(current_df.columns)

current_df['mergeUp'] = False
current_df['mergeUp'] = current_df.apply(lambda x: x.str.contains(pattern_check, regex=True).any(), axis=1)

for col in cols:
    for row in range(1, len(current_df)):
        if current_df.loc[row, 'mergeUp'] is False:
            continue
        elif re.search(pattern_check, current_df.loc[row, col]):
            current_df.loc[row-1, col] = current_df.loc[row-1, col] + ' ' + current_df.loc[row, col]
        else:
            continue

current_df = current_df.loc[~current_df.mergeUp, :]

del current_df['mergeUp']
current_df = current_df.reset_index(drop=True)

print(current_df)

'''
    Col1                                Col2            Col3                                Col4        Col5
0   Something (Something Else) words    (some value)    Something (Something Else) words    some value  some value
2   x                                   some value  (x)                                     some value  some value
3   y                                   a               y                                   a           a
4   something (another word) blah       some value      something (another word) blah       some value  some value
'''

print('\n')

Clearer view of output:

enter image description here

MDR
  • 2,610
  • 1
  • 8
  • 18