I was using file.write() to add numerical data to a text file. However, after 516159 characters, something interesting happens: about half of the time I run my code, it drops the last 7k characters. The other half, it works fine. Here is some code:
# Create or open file (it strangely couldn't create the file without using mode='x')
try:
    corpus_txt = open("corpus.txt", mode = "x")
except:
    corpus_txt = open("corpus.txt", mode = "w")
    corpus_txt.truncate(0)  # delete contents
content_length = 0
# X_train is a 2D array of integers
for sentence in X_train:
    for word in sentence:
        corpus_txt.write(str(word)+" ")
        content_length += len(str(word)+" ")
    corpus_txt.write("\n")
    content_length += 1
corpus_txt = open("corpus.txt")
content = corpus_txt.read()
corpus_txt.close()
print("FILE LENGTH (chars):", len(content))
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)
When I run this repeatedly with my data:
- "content_length" always equals 523379
- len(content) alternates between the values 516247 and 523379
Some other information:
- The missing text occurs at the end of the data (the last 7k characters)
- It's not the increment of content_length at the newline
- My data is not altered during this code process
- I am using Google Colab
- I get 516k slightly more often than 523k
- There's no particular pattern for the switches
- It shouldn't be something about the formatting of the read() method because, once again, it's only the last 7k characters that are missing
I would greatly appreciate any help/explanation here. Thanks!
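One thing that may be worth checking (an assumption, not a confirmed diagnosis): the code above reopens corpus.txt for reading without ever closing or flushing the handle it wrote with, so the tail of the data could still be sitting in Python's write buffer when read() runs. Below is a minimal sketch of the same write-then-read sequence with the write handle closed first. The names write_corpus, read_length, and sample_rows are hypothetical, and sample_rows is only a stand-in for X_train.

```python
import os
import tempfile


def write_corpus(rows, path):
    """Write each row as space-separated values, one row per line.

    Returns the number of characters passed to write().
    """
    written = 0
    # Leaving the with-block closes the file, which flushes any text
    # still held in the write buffer before anyone reads the file.
    with open(path, mode="w") as corpus_txt:
        for sentence in rows:
            line = "".join(str(word) + " " for word in sentence) + "\n"
            corpus_txt.write(line)
            written += len(line)
    return written


def read_length(path):
    """Return the number of characters read back from the file."""
    with open(path) as corpus_txt:
        return len(corpus_txt.read())


# Hypothetical stand-in for X_train: a 2D list of integers.
sample_rows = [[i, i + 1, i + 2] for i in range(5000)]

path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
content_length = write_corpus(sample_rows, path)
file_length = read_length(path)

print("FILE LENGTH (chars):", file_length)
print("TOTAL LENGTH OF TEXT ADDED TO FILE:", content_length)
```

If the two numbers agree on every run with this structure, that would point toward unflushed buffered output in the original version. Calling corpus_txt.close() (or corpus_txt.flush()) after the write loop in the original code should have the same effect as the with-block here.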