2

I am taking a dataframe, breaking it into two dataframes, and then I need to change the index values so that no number is greater than the total number of rows.

Here's the code:

import math
import pandas as pd

dataset = pd.read_csv("dataset.csv", usecols=['row_id','x','y','time'], index_col=0)
splitvalue = math.floor(0.9 * 786239)
train = dataset[dataset.time < splitvalue]
test = dataset[dataset.time >= splitvalue]

Here's how I am currently changing the index of test:

test.index=range(test.shape[0])
test.index.rename('row_id',inplace=True)

Is there a better way to do this?

Larry Freeman

3 Answers

3

try:

test = test.reset_index(drop=True).rename_axis('row_id')
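
A minimal sketch of what this one-liner does, using a small made-up frame in place of the asker's dataset.csv (the values and the cutoff of 40 are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; the values are made up.
dataset = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [9.0, 8.0, 7.0, 6.0, 5.0],
        "time": [10, 20, 30, 40, 50],
    },
    index=pd.Index([100, 101, 102, 103, 104], name="row_id"),
)

# Boolean filtering keeps the original labels (103 and 104 here).
test = dataset[dataset.time >= 40]

# drop=True discards the old index instead of moving it into a column;
# rename_axis then names the fresh 0..n-1 index.
test = test.reset_index(drop=True).rename_axis("row_id")

print(test.index.tolist())  # [0, 1]
print(test.index.name)      # row_id
```

Without drop=True, the old row_id values would be kept as an ordinary column, which is probably not wanted here.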
piRSquared
2

You should shuffle your data before slicing:

dataset = dataset.reindex(np.random.permutation(dataset.index))

Otherwise you're biasing your test/train sets.
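
As a rough sketch of how the shuffle could feed a positional 90/10 split: the frame below is made up, and the seed, the 0.9 fraction, and the use of iloc are assumptions for illustration rather than part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data; the values are made up.
dataset = pd.DataFrame(
    {"x": range(10), "time": range(0, 100, 10)},
    index=pd.Index(range(100, 110), name="row_id"),
)

# Shuffle by reindexing on a random permutation of the existing labels.
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
shuffled = dataset.reindex(rng.permutation(dataset.index.to_numpy()))

# Positional split: first 90% of the shuffled rows for training.
n_train = int(0.9 * len(shuffled))
train = shuffled.iloc[:n_train]
test = shuffled.iloc[n_train:]

print(len(train), len(test))  # 9 1
```

Shuffling only changes the result when the split is positional, as here. A boolean filter such as dataset.time < splitvalue selects the same rows whatever their order.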

Merlin
  • Thanks for the suggestion. I didn't realize that the shuffling could be done through reindexing. Cool. – Larry Freeman Jun 10 '16 at 00:01
  • @LarryFreeman, don't check with head() on the new dataframe. Head sorts on the index, then displays. Drove me nuts for a while. – Merlin Jun 10 '16 at 00:04
  • If I don't check with head(), what's the alternative? – Larry Freeman Jun 10 '16 at 00:05
  • I wasn't clever; I just ran the above command in an IPython notebook cell and looked at the top five rows. You could try slicing with fancy indexing; I didn't try. – Merlin Jun 10 '16 at 00:12
2

You can assign a new Index object directly to overwrite the index:

test.index = pd.Index(np.arange(len(test)), name='row_id')
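
A small sketch of the direct assignment on a made-up two-row frame (the values are assumptions for illustration). The pd.RangeIndex variant is not part of this answer; it is shown only as a possible alternative that yields the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical test frame that still carries its old row_id labels.
test = pd.DataFrame(
    {"x": [4.0, 5.0], "time": [40, 50]},
    index=pd.Index([103, 104], name="row_id"),
)

# Overwrite the index with 0..n-1 and set its name in one step.
test.index = pd.Index(np.arange(len(test)), name="row_id")

print(test.index.tolist())  # [0, 1]
print(test.index.name)      # row_id

# Possible alternative (an assumption, not from the answer): a RangeIndex
# carries the same labels without materialising an array.
alt_index = pd.RangeIndex(len(test), name="row_id")
```

Unlike reset_index, this modifies test in place rather than returning a new frame.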
EdChum