
I have a DataFrame in Python that contains all of my data for binary classification. I ingest the data in two passes: first all of the data for one class, then all of the data for the other class. I then randomise the order of the rows. The problem is that every time I rerun the script the DataFrame is recreated and randomised again, which makes the results unreproducible.

Should I run the DataFrame creation and randomisation from an external file? Are there common practices for data ingestion in model building?

I haven't attempted anything in this regard yet. I was also wondering whether it makes sense to do that from a statistical point of view, or whether it is common practice. I would try something like:

import data_ingest
data_ingest.function_data_call()

But then again, every time I run the script it also calls the external script, which builds the DataFrame and randomises it, so that is not the solution I am looking for.

I can't really show an example; I am loading documents (text files) for binary document classification. The structure of the DataFrame is the following:

row|           content         | class
--------------------------------------
1  | the sky is blue           | 0
2  | the river runs deep purple| 0
3  | yellow fever              | 0
4  | red strawberries          | 1
5  | black orchids are nice    | 1

Ingestion Code:

import io
import os

import numpy as np
import pandas as pd

data1 = []
for f in [f for f in os.listdir(path1) if not f.startswith('.')]:
    with io.open(path1 + f, "r", encoding="utf-8") as myfile:
        # strip hyphens and newlines, then collapse repeated whitespace
        tmp1 = myfile.read().rstrip().replace('-', '').replace('\n', '')
        data1.append(" ".join(tmp1.split()))

df1 = pd.DataFrame(data1, columns=["content"])
df1["class"] = "1"

data2 = []
for f in [f for f in os.listdir(path2) if not f.startswith('.')]:
    with io.open(path2 + f, "r", encoding="utf-8") as myfile:
        tmp2 = myfile.read().rstrip().replace('-', '').replace('\n', '')
        data2.append(" ".join(tmp2.split()))

df2 = pd.DataFrame(data2, columns=["content"])
df2["class"] = "0"

### Concatenate the two DataFrames into one and re-index
emails = pd.concat([df1,df2], ignore_index=True)

## Randomize Rows 
emails = emails.reindex(np.random.permutation(emails.index))
OAK
  • Sounds like a pandas question. If so please tag it accordingly. Also please show what you have tried so far. – karlson Jan 16 '16 at 13:00
  • What does your input look like? Can you post a link to your data? If not, a sample of your data or an example of the format? Also, what is the purpose of reading the data (i.e. what kind of processing do you want to do with it)? What are the input types of your data? – alvas Jan 16 '16 at 13:17

1 Answer


If you want to reproduce the same result after (pseudo-)randomization, you can set the random seed. Each time you use the same seed, you get the same sequence of random numbers.
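For example, a minimal sketch of that idea applied to the shuffling step from the question (the small emails DataFrame and the seed value 42 here are just placeholders):

import numpy as np
import pandas as pd

# stand-in for the DataFrame built by the ingestion code in the question
emails = pd.DataFrame({"content": ["the sky is blue", "red strawberries"],
                       "class": ["0", "1"]})

np.random.seed(42)  # same seed -> same permutation on every run
emails = emails.reindex(np.random.permutation(emails.index))

# equivalently, pandas' own shuffle with a fixed random_state:
# emails = emails.sample(frac=1, random_state=42)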

Secondly, you can save the intermediate result to a file, for example as JSON or a pickle. On the next run, check whether it already exists and, if not, recreate it.
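A minimal sketch of that caching pattern; the cache filename and the build_emails() helper are placeholders for your own ingestion and shuffling code:

import os
import pandas as pd

CACHE_PATH = "emails.pkl"  # hypothetical location for the cached DataFrame

def build_emails():
    # placeholder for the ingestion + randomisation code from the question
    return pd.DataFrame({"content": ["the sky is blue"], "class": ["0"]})

if os.path.exists(CACHE_PATH):
    emails = pd.read_pickle(CACHE_PATH)   # reuse the already-randomised rows
else:
    emails = build_emails()
    emails.to_pickle(CACHE_PATH)          # cache them for future runs

This way the expensive (and random) part runs only once, and every later run reads back exactly the same rows in exactly the same order.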

JulienD