I have created a module that should replace runs of a repeating character with a specific replacement, depending on how many times the character repeats. For example, if "a" repeats 4 times, the run is replaced with "¤"; both values are equal to 1 byte. The problem I'm having is that once the file size gets above 30 KB or so, the output has somehow increased in byte size by the time the module finishes running. I have tried a few word-count programs, and apparently more characters are being added; I just haven't been able to fix my code. I've tried a few approaches and would like some assistance or ideas as to how it is adding bytes.

from itertools import groupby

with open("LICENSE.txt", "r", encoding='utf-8') as rf, open('TESTINGOnline.txt', 'w', encoding='utf-8') as wf:
    s = rf.read()
    ret = ''
    x = 'a'  # the character whose runs are replaced
    for k, v in groupby(s):
        chunk = list(v)  # one run of identical characters
        cnt = len(chunk)

        if k == x and cnt <= 1:
            el = 'ª'
        elif k == x and cnt == 2:
            el = '¨'
        elif k == x and cnt == 3:
            el = '\u00ad'  # soft hyphen, invisible when typed literally
        elif k == x and cnt == 4:
            el = '¤'
        elif k == x and cnt == 5:
            el = '¥'
        else:
            # any other character or run length is copied, minus newlines
            el = ''.join(chunk).rstrip('\n')
        ret += el
    wf.write(ret.rstrip('\n'))

1 Answer

The reason the file size grows is quite simple:

print(len(bytes("¥ª¤¨", 'utf-8')))

gives

8

Your assumption that you are replacing one byte with another single byte is wrong. You are replacing a character whose UTF-8 encoding is ONE byte long with a character whose UTF-8 encoding is TWO bytes long.
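To see this per character rather than for the whole string, a minimal sketch along the following lines can help. The helper name `utf8_len` is made up for illustration; it only wraps the standard `str.encode` call.

```python
# Compare each symbol's code point with its UTF-8 byte length.
# utf8_len is a hypothetical helper name, not part of any library.
symbols = ["a", "ª", "¨", "¤", "¥"]


def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in symbols:
    # Code points up to U+007F take one byte in UTF-8;
    # U+0080 through U+07FF take two bytes.
    print(repr(ch), "U+%04X" % ord(ch), utf8_len(ch), "byte(s)")
```

Only "a" lies in the one-byte range (up to U+007F); each replacement symbol lies above it and therefore costs two bytes when written with `encoding='utf-8'`.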

No need to fix your code - just fix your assumptions :)
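As a rough way to check the assumption against the replacement table from the question, the sketch below computes how many bytes each replacement gains or saves. The names `REPLACEMENTS` and `size_change` are invented for this example, and the Latin-1 column is only a comparison added here, not something the question uses; Latin-1 works only if every character in the file can be represented in it.

```python
# Replacement table taken from the question: run length of "a" -> symbol.
# U+00AD (soft hyphen) is written as an escape because it is invisible.
REPLACEMENTS = {1: "ª", 2: "¨", 3: "\u00ad", 4: "¤", 5: "¥"}


def size_change(run_length: int, encoding: str = "utf-8") -> int:
    """Bytes gained (positive) or saved (negative) by replacing a run of "a"."""
    original = "a" * run_length
    replacement = REPLACEMENTS[run_length]
    return len(replacement.encode(encoding)) - len(original.encode(encoding))


for n in sorted(REPLACEMENTS):
    # Columns: run length, change in UTF-8, change in Latin-1 (assumed alternative)
    print(n, size_change(n, "utf-8"), size_change(n, "latin-1"))
```

Under UTF-8, a single "a" grows by one byte and a run of two breaks even, so a text in which most occurrences of "a" are not repeated can end up larger than the input.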

Maybe checking out my answers to the following two questions will help you gain a better understanding of what a character is and what a byte is:

Converting UTF-8 (in literal) to Umlaute

In Python 3, how can I convert ascii to string, *without encoding/decoding*

Claudio