
I am new to Python and I need to reshape the following df a little. What I basically need is to move from this:

id  name  start  end    age1   age2   age3   age4   age5   age6
1   x     01:00  01:02  3_10   11_20  21_30  n/a    n/a    n/a
2   y     01:05  01:07  n/a    11_20  21_30  31_40  41_50  n/a
3   z     01:10  01:12  n/a    n/a    n/a    31_40  41_50  n/a

To this:

id  name  start  end    age
1   x     01:00  01:02  3_10
1   x     01:00  01:02  11_20
1   x     01:00  01:02  21_30
2   y     01:05  01:07  11_20
2   y     01:05  01:07  21_30
2   y     01:05  01:07  31_40
2   y     01:05  01:07  41_50

and so on for every ID.

As you may notice, I need to transform the initial df in order to write a nested .json, but I've never worked with pandas. I suppose there is some kind of join that could be useful, since the other methods I tried failed badly, but I'm stuck.

Can someone give me some hints? Thanks in advance.

EDIT: Since I didn't specify it before: the df I showed is only a small extract of the real one. In fact, I have to work with many more columns and rows.

SOLUTION FOUND

Here is the solution I found after writing this post. To achieve the goal I used the pandas function .melt(). It works great! Below is the code:

import pandas as pd

# unpivot the age columns: the id_vars are repeated for every age value,
# the melted values go into 'age' and the original column name into 'columnX'
df = pd.melt(df,
             id_vars=['ID', 'name1', 'name2', 'target', 'sexF', 'sexM', 'start', 'end'],
             value_vars=['age1', 'age2', 'age3', 'age4', 'age5', 'age6', 'age7'],
             value_name='age',
             var_name='columnX')
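
For reference, here is a minimal sketch of the same .melt() approach applied to the small sample df from the question (column names follow the sample, not my real data); the n/a rows and the helper columnX column are dropped afterwards:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['x', 'y', 'z'],
    'start': ['01:00', '01:05', '01:10'],
    'end': ['01:02', '01:07', '01:12'],
    'age1': ['3_10', 'n/a', 'n/a'],
    'age2': ['11_20', '11_20', 'n/a'],
    'age3': ['21_30', '21_30', 'n/a'],
    'age4': ['n/a', '31_40', '31_40'],
    'age5': ['n/a', '41_50', '41_50'],
    'age6': ['n/a', 'n/a', 'n/a'],
})

# unpivot age1..age6 into one 'age' column, keeping id/name/start/end
long_df = pd.melt(df,
                  id_vars=['id', 'name', 'start', 'end'],
                  value_vars=[f'age{i}' for i in range(1, 7)],
                  var_name='columnX',
                  value_name='age')

# remove the n/a placeholders, restore per-id order and drop the helper column
long_df = (long_df[long_df['age'] != 'n/a']
           .sort_values(['id', 'columnX'])
           .drop(columns='columnX')
           .reset_index(drop=True))

print(long_df)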
gtcl

1 Answer


If I understood the assignment correctly, here's a solution:


import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'name': ['x', 'y'],
    'start': ['01:00', '01:05'],
    'end': ['01:02', '01:07'],
    'age1': ['3_10', 'n/a'],
    'age2': ['11_20', '11_20'],
    'age3': ['21_30', '21_30'],
    'age4': ['n/a', '31_40'],
    'age5': ['n/a', '41_50'],
    'age6': ['n/a', 'n/a']
})

# for each age column create new records with the same id, name, start and end
# (DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate them)
pieces = []
for i in range(1, 7):
    col = 'age' + str(i)
    pieces.append(df[['id', 'name', 'start', 'end', col]].rename(columns={col: 'age'}))
df2 = pd.concat(pieces, ignore_index=True)

# drop the rows with n/a
df2 = df2[df2.age != 'n/a']

print(df2)

If it's not what you want, clarify more and I'll steer the solution for you.

Hannon qaoud
  • First of all, thank you @Hannon qaoud. I actually got the point of your strategy, but one doubt remains: can I use this procedure if I have, for example, more than 150 rows (IDs)? – gtcl Sep 23 '22 at 19:01
  • Moreover, I just tried to run your code but for some reason I receive an error: KeyError: "['age_break_15_241'] not in index". – gtcl Sep 23 '22 at 19:25
  • Sorry for the delay in the response. First of all, replying to your first comment: I am no expert, but eventually, whether you use a fancy function or loop over all the data manually, somewhere deep in the code an iterator is going to go over all the records and do the calculations, so in my opinion I see no harm in doing it manually. – Hannon qaoud Sep 23 '22 at 21:12
  • Now responding to your second comment: I can't tell whether the error comes from the data itself or from the logic. If you could provide more details on the error, the line that caused it, and the full error text, I might be able to help. – Hannon qaoud Sep 23 '22 at 21:16
  • The raised error seems to be caused by calling a certain row, which shouldn't occur with my code, so if you've edited my code I would start by going over those edits and making sure they're valid. Also, don't forget to mark the correct answer for future reference :) – Hannon qaoud Sep 23 '22 at 21:21
  • I actually found a way and I'm writing it here because it can be useful for others too: I used the .melt() function! But anyway, thank you for your help! :) – gtcl Sep 23 '22 at 21:22
  • Great idea, share it with others, happy coding. – Hannon qaoud Sep 23 '22 at 21:25
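
For completeness, a minimal sketch of how the melted frame could then be written out as a nested .json, one record per id; the nesting layout here is only an assumption, since the target structure was never shown in the thread (long_df is the melted frame from the sketch above):

import json

# group the long frame back by the identifying columns and nest the age ranges
records = []
for (id_, name, start, end), grp in long_df.groupby(['id', 'name', 'start', 'end']):
    records.append({
        'id': int(id_),        # cast numpy int64 so json.dump can serialize it
        'name': name,
        'start': start,
        'end': end,
        'ages': grp['age'].tolist(),
    })

with open('output.json', 'w') as f:
    json.dump(records, f, indent=2)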