0

I have a txt file that I read in through python that comes like this:

Text File:
18|Male|66|180|Brown
23|Female|67|120|Brown
16|71|192|Brown
22|Male|68|185|Brown
24|Female|62|100|Blue

One of the rows has missing data and the problem is that when I read it into a dataframe it appears like this:

   Age  Gender  Height Weight   Eyes
0   18    Male      66    180  Brown
1   23  Female      67    120  Brown
2   16      71     192  Brown    NaN
3   22    Male      68    185  Brown
4   24  Female      62    100   Blue

I'm wondering if there is a way to shift the row that has missing data over without shifting all columns. Here is what I have so far:

import pandas as pd

df = pd.read_csv('C:/Documents/file.txt', sep='|', names=['Age','Gender', 'Height', 'Weight', 'Eyes'])

df_full = df.loc[df['Gender'].isin(['Male','Female'])]

df_missing = df.loc[~df['Gender'].isin(['Male','Female'])]

df_missing = df_missing.shift(1,axis=1)

df_final = pd.concat([df_full, df_missing])

I was hoping to just separate out the ones with missing data, shift the columns by one, and then put the dataframe back to the data that has no missing data. But I'm not sure how to go about shifting the columns at a certain point. This is the result I'm trying to get to:

   Age  Gender  Height Weight   Eyes
0   18    Male      66    180  Brown
1   23  Female      67    120  Brown
2   16     NaN      71    192  Brown
3   22    Male      68    185  Brown
4   24  Female      62    100   Blue

It doesn't really matter how I get it done, but the files I'm using have thousands of rows so I can not fix them individually. Any help is appreciated. Thanks!

  • Was able to figure out how to do a single row adjustment, which could be applied to all rows using `apply()`. @Steveiepete – Todd Mar 16 '20 at 23:07

2 Answers2

1

RegEx could help here.

Searching for ^(\d+\|)(\d) and making a replacement using $1|$2 (just added one vertical bar where Gender is missing "group 1 + | + group 2")

This could be done in almost every text editors (Notepad++, VSC, Sublime etc.)

See the example following the link: https://regexr.com/50gkh

P47 R1ck
  • 148
  • 1
  • 12
1

Selectively shift a portion of each of the rows that have missing values.

df.apply(lambda r: r[:1].append(r[1:].shift()) 
                   if r['Gender'] not in ['Male', 'Female'] 
                   else r, axis=1)

The misaligned column data for each affected record will be aligned with 'NaN' inserted where the missing value was in the input text.

   Age  Gender  Height Weight   Eyes              Age  Gender Height Weight   Eyes
1   23  Female      67    120  Brown           1   23  Female     67    120  Brown
2   16      71     192  Brown    NaN  ======>  2   16     NaN     71    192  Brown

For a single record, this'll do it:

df.loc[2] = df.loc[2][:1].append(df.loc[2][1:].shift())

Starting at the 'Gender' column, data is shifted right. The default fill is 'NaN'. The 'Age' column is preserved.

Todd
  • 4,669
  • 1
  • 22
  • 30