
I'm trying to do the following with an SRT (subtitles) file:

  • while a row does not appear on the screen for at least 5s
  • add text from the next row to current row with a space between AND replace current End_Time with next row End_Time
  • delete next row
  • go to next row

I have to do that on the dataframe dfClean, which holds the edited timestamp fields, and then do the same on dfSRTForm, which keeps the original SRT time format, so I can later export dfSRTForm as an SRT file.

My code to do that is this:

for i in dfClean.index:
    while dfClean.at[i, 'Difference'] < 5:
        dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
        dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']
    
        dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
        dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']
    
        dfClean = dfClean.drop(i+1)
        dfSRTForm = dfSRTForm.drop(i+1)

But I get this error:

KeyError: 3

UPDATE (keeping the previous version in case anyone else has the same issue): I found a way to reset the index to avoid KeyError: 3.

My current code is:

for i in dfClean.index:
    while dfClean.at[i, 'Difference'] < 5:
        dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
        dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']
    
        dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
        dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']
    
        dfClean = dfClean.drop(i+1)
        dfSRTForm = dfSRTForm.drop(i+1)
    
        dfClean = dfClean.reset_index()
        dfClean = dfClean.drop(columns='index')
    
        dfSRTForm = dfSRTForm.reset_index()
        dfSRTForm = dfSRTForm.drop(columns='index')
    
        dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).astype('timedelta64[s]')

But now I get KeyError: 267, and I'm pretty sure it's because the loop condenses the dataframe to 266 rows.

Is there a way to put "or end of index" or "or last row" in the while loop without hard-coding the 266 rows? I want to use it for other SRT files with a different number of rows.

  • Modifying a dataframe you are looping over can cause many unwanted side effects. A simple solution would be to create a new dataframe with the modified rows that you wish to keep. – oskros Jul 19 '22 at 11:31
  • Warm welcome to SO. Please read https://stackoverflow.com/help/how-to-ask and https://stackoverflow.com/help/minimal-reproducible-example and update your question. – buhtz Jul 19 '22 at 11:35
  • @oskros this sounds great, but how do I avoid modifying the current df if the whole point is to keep adding next row's text to current until it reaches 5s or more? I need to delete i+1 whenever I'm done copying text from it or I'll be adding the same subtitles again to current line. – Davidodocola Jul 19 '22 at 11:38
  • @Davidodocola just keep a temporary variable with the time, when it goes above 5s, you append a row to the new dataframe and reset the temp variable – oskros Jul 19 '22 at 11:40
  • @oskros I updated the original after I partially took care of the issue. Do you know of a way to fix what I have right now without hard coding the value for rows? – Davidodocola Jul 19 '22 at 12:08

3 Answers


You can define an empty list, then loop over your dataframe rows and, whenever a row doesn't fulfil your condition, save its index to that list.

After that do the following:

df = df.drop(index=your_indices)
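
A minimal sketch of that idea applied to the merging problem, assuming hypothetical sample data with Start_Time and End_Time stored as pandas Timedelta values and a 5-second threshold (`min_sec`, `to_drop` and `keep` are illustrative names, not from the question). The loop only changes cell values, records the labels of the absorbed rows, and drops them all in one call after the loop, so the index never changes while it is being iterated:

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "Start_Time": pd.to_timedelta([0, 2, 4, 10], unit="s"),
    "End_Time": pd.to_timedelta([2, 4, 9, 16], unit="s"),
    "Text": ["a", "b", "c", "d"],
})

min_sec = 5    # assumed minimum screen time in seconds
to_drop = []   # labels of rows that were absorbed into an earlier row
keep = None    # label of the row currently being extended

for idx in df.index:
    if keep is None:
        keep = idx  # start a new merged row
    else:
        # Fold this row into the row being extended and mark it for removal.
        df.at[keep, "Text"] = df.at[keep, "Text"] + " " + df.at[idx, "Text"]
        df.at[keep, "End_Time"] = df.at[idx, "End_Time"]
        to_drop.append(idx)
    shown = (df.at[keep, "End_Time"] - df.at[keep, "Start_Time"]).total_seconds()
    if shown >= min_sec:
        keep = None  # long enough; the next row starts a fresh merge

# Remove all absorbed rows at once, then renumber the index.
df = df.drop(index=to_drop).reset_index(drop=True)
```

With this sample data the first three rows collapse into one row with the text "a b c" and the fourth row is kept unchanged.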
buhtz
bpfrd
  • this is not exactly what I'm looking for. I'm trying to delete "next row" after I'm done copying the `Text` value to current row until the current row is 5s or longer. – Davidodocola Jul 19 '22 at 11:40
  • basically, you can do `df.drop(index=next_row_index, inplace=True)` but then your loop index might get out of bounds. – bpfrd Jul 19 '22 at 11:44
  • thank you, that gave me an idea on resetting the index and found a way to do that. I updated the original post after I partially took care of the issue. Do you know of a way to fix what I have right now without hard coding the value for rows? – Davidodocola Jul 19 '22 at 12:09

Without having a look at your data I cannot give a precise solution, but the code below should serve as an example of how to accomplish what you are trying to do:

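import pandas as pd

# Screen time of each row in seconds, as a float
dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).dt.total_seconds()

tmp_row = None  # merged row currently being built
new_data = []
for i, row in dfClean.iterrows():
    if tmp_row is None:
        # Start a new merged row from the current row
        tmp_row = dict(row)
    else:
        # Absorb the current row into the pending merged row
        tmp_row['Text'] = ' '.join([tmp_row['Text'], row['Text']])
        tmp_row['End_Time'] = row['End_Time']
    tmp_row['Difference'] = (tmp_row['End_Time'] - tmp_row['Start_Time']).total_seconds()
    if tmp_row['Difference'] >= 5:
        # Long enough: store it and start over with the next row
        new_data.append(tmp_row)
        tmp_row = None

# Keep a trailing row that never reached 5s
if tmp_row is not None:
    new_data.append(tmp_row)

new_df = pd.DataFrame(new_data)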
oskros

This is how I ended up fixing it:

indexKeep = len(dfClean.index)
minSec = 3 # min number of seconds of screen time per line of subtitles.

for i in range(0, indexKeep):
    try:
        while dfClean.at[i, 'Difference'] < minSec:
            dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i+1, 'Text']
            dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i+1, 'Text']
        
            dfClean.at[i, 'End_Time'] = dfClean.at[i+1, 'End_Time']
            dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i+1, 'End_Time']
        
            dfClean = dfClean.drop(i+1)
            dfSRTForm = dfSRTForm.drop(i+1)
        
            # drop=True discards the old index instead of adding an 'index' column
            dfClean = dfClean.reset_index(drop=True)
        
            dfSRTForm = dfSRTForm.reset_index(drop=True)
        
            dfClean['Difference'] = (dfClean['End_Time']-dfClean['Start_Time']).astype('timedelta64[s]')
            
            dfClean.at[i, 'ID'] = i+1
            dfSRTForm.at[i, 'ID'] = i+1
            indexKeep = len(dfClean.index)
    except KeyError: # Takes care of condensed number of rows
        pass

This deletes the next row, resets the index so you don't hit a KeyError in the middle, and then catches the KeyError at the end. That last KeyError occurs because the for loop is set up to run over the original 800+ rows, but the merging it does reduces the total to about 400 rows, so it eventually can't find row "401" when it gets there.
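
As a possible alternative to catching the KeyError, the "or last row" condition from the question can be written directly as a bound on a while loop that re-reads the current length on every pass. This is only a sketch under assumptions: `merge_short_rows` is a hypothetical helper name, both frames are assumed to have the same rows in the same order, and dfClean's Start_Time and End_Time are assumed to be pandas Timedelta (or Timestamp) values.

```python
import pandas as pd

def merge_short_rows(df_clean, df_srt, min_sec=5):
    """Merge each too-short row with its successor in both frames (hypothetical helper)."""
    # Work on renumbered copies so positions 0..n-1 match the index labels.
    df_clean = df_clean.reset_index(drop=True)
    df_srt = df_srt.reset_index(drop=True)
    i = 0
    while i < len(df_clean) - 1:  # stop before the last row: it has no next row
        shown = (df_clean.at[i, "End_Time"] - df_clean.at[i, "Start_Time"]).total_seconds()
        if shown >= min_sec:
            i += 1  # long enough, move on to the next row
            continue
        # Copy text and end time from row i+1 into row i in both frames.
        for df in (df_clean, df_srt):
            df.at[i, "Text"] = df.at[i, "Text"] + " " + df.at[i + 1, "Text"]
            df.at[i, "End_Time"] = df.at[i + 1, "End_Time"]
        # Delete row i+1 and renumber; i stays put so row i is checked again.
        df_clean = df_clean.drop(i + 1).reset_index(drop=True)
        df_srt = df_srt.drop(i + 1).reset_index(drop=True)
    return df_clean, df_srt

# Hypothetical sample data in the two formats described in the question.
clean = pd.DataFrame({
    "Start_Time": pd.to_timedelta([0, 2, 6, 7], unit="s"),
    "End_Time": pd.to_timedelta([2, 6, 7, 8], unit="s"),
    "Text": ["a", "b", "c", "d"],
})
srt = pd.DataFrame({
    "Start_Time": ["00:00:00,000", "00:00:02,000", "00:00:06,000", "00:00:07,000"],
    "End_Time": ["00:00:02,000", "00:00:06,000", "00:00:07,000", "00:00:08,000"],
    "Text": ["a", "b", "c", "d"],
})
clean, srt = merge_short_rows(clean, srt)
```

Because `len(df_clean)` is evaluated on every pass, no row count is hard-coded. The final row can still end up shorter than `min_sec` when nothing is left to merge into it, and an ID column, if present, would need renumbering afterwards.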