2

I'm trying to create a dataframe in Pandas that has two variables ("date" and "time_of_day" where "date" is 120 observations long with 30 days (each day has four observations: 1,1,1,1; 2,2,2,2; etc.) and then the second variable "time_of_day) repeats 30 times with values of 1,2,3,4.

The closest I found to this question was here: How to create a series of numbers using Pandas in Python, which got me the below code, but I'm receiving an error that it must be a 1-dimensional array.

df = pd.DataFrame({'date': np.tile([pd.Series(range(1,31))],4), 'time_of_day': pd.Series(np.tile([1, 2, 3, 4],30 ))})

So the final dataframe would look something like

date time_of_day
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4

Thanks much!

James
  • 459
  • 2
  • 14

2 Answers2

6

you need once np.repeat and once np.tile

df = pd.DataFrame({'date': np.repeat(range(1,31),4), 
                   'time_of_day': np.tile([1, 2, 3, 4],30)})
print(df.head(10))
   date  time_of_day
0     1            1
1     1            2
2     1            3
3     1            4
4     2            1
5     2            2
6     2            3
7     2            4
8     3            1
9     3            2

or you could use pd.MultiIndex.from_product, same result.

df = (
    pd.MultiIndex.from_product([range(1,31), range(1,5)], 
                               names=['date','time_of_day'])
      .to_frame(index=False)
)

or product from itertools

from itertools import product
df = pd.DataFrame(product(range(1,31), range(1,5)), columns=['date','time_of_day'])
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • 1
    Just what I needed...thanks much! Thanks also for including alternative answers. – James Dec 01 '21 at 18:13
  • As a follow-up question (which maybe should just be a new question), let's say I had a vector of participant values, and I wanted the date and time_of_day columns to repeat (such that each participant received the 120 observations); how would I do that? Essentially, it would be repeating the data frame you created for each participant (either created as one data frame, or then concatenated into one dataframe). – James Dec 01 '21 at 20:00
  • 1
    @James the easiest is then to use the method `...from_product([list_of_participants, range(1,31), range(1,5)], names=['participant','day','time_of_day'])....` – Ben.T Dec 01 '21 at 20:50
  • 1
    @James if you want toreuse the df created, then `pd.concat([df.assign(participant=_id) for _id in list_of_participant], ignore_index=True)` should work too (check for typo) – Ben.T Dec 01 '21 at 20:53
  • Thanks; that's exactly what I want, but then I'm having issues joining it with my original dataset to create a full data frame representing missing values. I thought I could just join on participant but I end up with far more values than I should have. Any idea how I would solve that? I just created the question here to make it easier: https://stackoverflow.com/questions/70189766/creating-full-dataframe-representing-missing-values – James Dec 01 '21 at 21:46
  • I'm not sure about this but what is the result of putting the original data frame (with the actual data, let's call it data) into your code here ``` new = pd.concat([data.assign(participant=_id) for _id in list_of_participant], ignore_index=True) ``` It gives more values than I should have (19,544 vs something closer to 1,800), so trying to figure out what's going on. – James Dec 01 '21 at 21:49
  • @James so you have more rows than expected because you have duplicates rows for the subset ['pid','date','time_of_day'], in your other question, rows index 1 and 2 are actually in this case. I see thay have different datestamps, but same 'pid','date','time_of_day'], it is still duplicated values that create more rows than you expect – Ben.T Dec 01 '21 at 22:24
  • How would I create a data frame though where I don't have those duplicates (every row has pid, date, and appropriate time of day) but could then join the two datasets such that the time of day of missing values for each pid could be detected (from the dataset we created with all values, being joined to a dataset without the full 120 observations per participant)? – James Dec 02 '21 at 18:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239791/discussion-between-ben-t-and-james). – Ben.T Dec 02 '21 at 19:01
3

New feature in merge cross

out = pd.DataFrame(range(1,31)).merge(pd.DataFrame([1, 2, 3, 4]),how='cross')
BENY
  • 317,841
  • 20
  • 164
  • 234