I have a dataframe in Python which contains all of my data for binary classification. I ingest the data in two passes: first all of the data for one class, then all of the data for the other class. I then randomise the order of the rows. The problem is that every time I rerun the script, the dataframe is recreated and reshuffled, so my results are not reproducible.
Should I run the dataframe creation and randomisation from an external file? Are there common practices for data ingestion in model building?
I haven't attempted anything in this regard yet. I was also wondering whether doing so makes sense from a statistical point of view, or whether it is common practice. I would try something such as:
import data_ingest
data_ingest.function_data_call()
But then again, every time I run the script it also calls the external script, which builds the data and randomises it again. So that is not the solution I am looking for.
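To illustrate the kind of alternative in question, here is a minimal sketch of building the dataframe once and reusing a saved copy on later runs, so the ingestion and shuffle are not repeated. The names `load_or_build`, `build_example` and the cache file path are hypothetical and not part of the script above; `build_example` only stands in for the real ingestion step.

```python
import os

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Load the dataframe from cache_path if present, else build and save it."""
    if os.path.exists(cache_path):
        # A previous run already built (and shuffled) the data: reuse it.
        return pd.read_pickle(cache_path)
    df = build_fn()
    df.to_pickle(cache_path)
    return df


def build_example():
    # Hypothetical stand-in for the real ingestion + randomisation code.
    df = pd.DataFrame(
        {
            "content": ["the sky is blue", "red strawberries", "yellow fever"],
            "class": [0, 1, 0],
        }
    )
    # Unseeded shuffle: differs between builds, but is only run once.
    return df.sample(frac=1).reset_index(drop=True)
```

With this layout the row order is whatever the first run produced, and it stays fixed until the cache file is deleted.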
I can't really share the data itself: I am loading documents (text files) for binary document classification. The structure of the dataframe is the following:
row | content                    | class
--------------------------------------
1   | the sky is blue            | 0
2   | the river runs deep purple | 0
3   | yellow fever               | 0
4   | red strawberries           | 1
5   | black orchids are nice     | 1
Ingestion Code:
import io
import os
import numpy as np
import pandas as pd

data1 = []
for f in [f for f in os.listdir(path1) if not f.startswith('.')]:
    with io.open(path1+f, "r", encoding="utf-8") as myfile:
        # data1.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', ''))
        tmp1 = myfile.read().rstrip().replace('-', '').replace('\n', '')
        data1.append(" ".join(tmp1.split()))
df1 = pd.DataFrame(data1, columns=["content"])
df1["class"] = "1"
data2 = []
for f in [f for f in os.listdir(path2) if not f.startswith('.')]:
    with io.open(path2+f, "r", encoding="utf-8") as myfile:
        # data2.append(myfile.read().rstrip().replace('-', '').replace('.', '').replace('\n', '').replace(' ', ''))
        tmp2 = myfile.read().rstrip().replace('-', '').replace('\n', '')
        data2.append(" ".join(tmp2.split()))
df2 = pd.DataFrame(data2, columns=["content"])
df2["class"] = "0"
### Concatenate the two DataFrame into One and Re-Index
emails = pd.concat([df1,df2], ignore_index=True)
## Randomize Rows
emails = emails.reindex(np.random.permutation(emails.index))
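For comparison, a minimal sketch of what a repeatable version of the shuffle step might look like, assuming a fixed seed is acceptable. The helper `shuffle_rows`, the seed value and the toy dataframe are hypothetical and only illustrate the idea. Note that a fixed seed gives the same result only if the rows arrive in the same order each run; `os.listdir` does not guarantee any particular order, so the filenames would presumably need sorting as well.

```python
import numpy as np
import pandas as pd

SEED = 42  # hypothetical, arbitrary value; any fixed integer would do


def shuffle_rows(df, seed=SEED):
    """Return df with its rows permuted in an order determined by seed."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))  # same permutation for the same seed
    return df.iloc[order]


# Toy data mirroring the structure shown above (not the real documents).
emails_demo = pd.DataFrame(
    {
        "content": [
            "the sky is blue",
            "the river runs deep purple",
            "yellow fever",
            "red strawberries",
            "black orchids are nice",
        ],
        "class": [0, 0, 0, 1, 1],
    }
)

first_run = shuffle_rows(emails_demo)
second_run = shuffle_rows(emails_demo)
```

Here `first_run` and `second_run` contain the same rows in the same order, which is the property missing from the unseeded `np.random.permutation` call.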