
I have a txt file that looks like this:

...
|J150|DRE.16.2|T|2|DRE.16|PROVISAO P  CSLL|6779,24|D|D||
|J150|DRE.16.2.001|D|3|DRE.16.2|CSLL|6779,24|D|D||
|J150|DRE.17|T|1||LUCRO DO EXERCICIO|55797,1|C|R||
|J005|01012018|31122018|1||
|J100|BP.01|T|1||A|ATIVO|5540527,48|D|8656252,32|D||
|J100|BP.01.1|T|2|BP.01|A|ATIVO CIRCULANTE|5030370,68|D|7881200,94|D||
|J100|BP.01.1.1|T|3|BP.01.1|A|DISPONIBILIDADES|380741,7|D|777224,63|D||
|J100|BP.01.1.1.01|T|4|BP.01.1.1|A|CAIXA|96786,62|D|69935,41|D||
|J100|BP.01.1.1.01.001|D|5|BP.01.1.1.01|A|Caixa|96786,62|D|69935,41|D||
...

It is quite long. I want to copy only the lines that start with "|J100|" into a new file. I've tried some of the answers here, but they didn't work in my case. My attempts are below:

path="file.txt"
open('newfile','w').writelines([ line for line in open(path) if '|J100|' in line])

This didn't work; I got UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 255: invalid start byte

Then I tried this one:

with open(path,'rb') as f,open('new.txt','wb') as g:
    g.writelines(filter(lambda line: '|J100|' in line, f))

And got this in response: TypeError: a bytes-like object is required, not 'str'

Any ideas?

snakecharmerb
aabujamra
  • Try `b'|J100|' in line` or `'|J100|' in str(line)` – rassar Nov 02 '19 at 16:21
  • str(line) worked perfectly! thanks! – aabujamra Nov 02 '19 at 16:23
  • The proper solution is to figure out the actual encoding, or clean out data which cannot be properly interpreted. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Nov 03 '19 at 09:27

1 Answer


If

path="file.txt"
open('newfile','w').writelines([ line for line in open(path) if '|J100|' in line])

raises a UnicodeDecodeError then the contents of file.txt are not encoded as UTF-8.
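If the real encoding is known, the text-mode approach can be kept by passing it to open. The sketch below is only an illustration: the helper name filter_lines is made up, and encoding="latin-1" is an assumption (byte 0xc1 is "Á" in Latin-1, which is consistent with Portuguese text, but the actual encoding of the file has to be confirmed). Latin-1 maps every byte to a character, so it never raises a decode error, but it will silently produce wrong characters if the file uses a different encoding.

```python
def filter_lines(src, dst, prefix="|J100|", encoding="latin-1"):
    """Copy the lines of src that start with prefix into dst.

    filter_lines is a hypothetical helper; encoding="latin-1" is an
    assumption and should be replaced with the file's real encoding.
    Returns the number of lines written.
    """
    count = 0
    # newline="" leaves the original line endings untouched on read and write
    with open(src, encoding=encoding, newline="") as fin, \
         open(dst, "w", encoding=encoding, newline="") as fout:
        for line in fin:
            if line.startswith(prefix):
                fout.write(line)
                count += 1
    return count
```

Calling filter_lines("file.txt", "new.txt") would then write only the J100 records, re-encoded with the same encoding they were read with.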

This code

with open(path,'rb') as f,open('new.txt','wb') as g:
    g.writelines(filter(lambda line: '|J100|' in line, f))

raises a TypeError because you are reading the file in binary mode, so each line is read as bytes, but the lambda tests whether a string value ('|J100|') is contained in those bytes. The best approach is to compare bytes with bytes (b'|J100|'). Also, if you only want lines that begin with a specific value, use bytes.startswith, which filters out lines that contain |J100| only after the start:

with open(path,'rb') as f,open('new.txt','wb') as g:
    g.writelines(filter(lambda line: line.startswith(b'|J100|'), f))
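To make the difference between these checks concrete, here is a small sketch using a made-up sample line in which the marker appears only in the middle. It also shows why the str(line) suggestion from the comments only works for a substring test: str() on a bytes object returns its repr, including the leading b' prefix, so it cannot be used with startswith.

```python
# Hypothetical sample line: the marker is present, but not at the start.
line = b"|J005|01012018|J100|\n"

# Testing a str against bytes raises the TypeError from the question.
try:
    "|J100|" in line
    error = None
except TypeError as exc:
    error = str(exc)

# Bytes against bytes works, but a substring test also matches mid-line markers.
contains = b"|J100|" in line

# startswith only accepts lines that really begin with the marker.
starts = line.startswith(b"|J100|")

# str() on bytes gives the repr, e.g. "b'|J005|...'", not the decoded text.
as_text = str(line)
starts_as_text = str(b"|J100|BP.01|\n").startswith("|J100|")

print(error, contains, starts, as_text, starts_as_text)
```

Here contains is true while starts is false, and starts_as_text is false even though the underlying bytes do begin with the marker.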
snakecharmerb