I wrote a script that takes the company number field ('cnpj') from a file of company partners and compares it against a file listing the companies of one specific Brazilian state, writing the matches to a third file. The end result should be the partners of all the companies in that state. The code is simple:

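import pandas as pd

# Read the partners file lazily, 262144 rows at a time
dsocio = pd.read_csv('D:/CNPJ-full-master/cnpj-csv/socios.csv', chunksize=262144, low_memory=False)
# Only the 'cnpj' column of the state's companies file is needed
duf = pd.read_csv('D:/pyData/receita/empES.csv', usecols=['cnpj'], low_memory=False)

for chunk in dsocio:
    # Keep the partner rows whose company number appears in the state list
    result = chunk[chunk['cnpj'].isin(duf.cnpj)]
    # Append the matches to the output file
    result.to_csv('D:/CNPJ-full-master/cnpj-csv/UFs/socioES.csv', index=False, header=True, mode='a')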

The problem is that I have two versions of the "empES.csv" file. They have a different number of columns, but both have the 'cnpj' field as the first column, and that is the only field I need. When I run the code with the version 1 file, it runs perfectly. But when I use the version 2 file instead, my output file gets populated with nothing but the header. Many lines of header!

Here are some snippets of the first lines of:

  1. The partners file (socios.csv, the one I will copy the matching lines from):

'''
"cnpj","tipo_socio","nome_socio","cnpj_cpf_socio","cod_qualificacao","perc_capital","data_entrada","cod_pais_ext","nome_pais_ext","cpf_repres","nome_repres","cod_qualif_repres"
"00000000000191","2","MARCIO HAMILTON FERREIRA","*923641","10",0.0,"20101117","","","","","00"
"00000000000191","2","NILSON MARTINIANO MOREIRA","*491386","10",0.0,"20101117","","","","","00"
"00000000002135","2","DEBORA CRISTINA FONSECA","*314628","08",0.0,"20200312","","","","","00"
"00000000002216","2","WALDERY RODRIGUES JUNIOR","*025913","08",0.0,"20200312","","","","","00"
"00000000002216","2","ERIK DA COSTA BREYER","*093217","10",0.0,"20191209","","","","","00"
"00000000002216","2","THOMPSON SOARES PEREIRA CESAR","*503187","10",0.0,"20191209","","","","","00"
"00000000002569","2","WALTER MALIENI JUNIOR","*718468","10",0.0,"20101117","","","","","00"
"00000000002569","2","NILSON MARTINIANO MOREIRA","*491386","10",0.0,"20101117","","","","","00"
"00000000002640","2","WALDERY RODRIGUES JUNIOR","*025913","08",0.0,"20200312","","","","","00"
'''

  2. The working companies file (empES.csv), from which I read only the 'cnpj' field:

'''
cnpj,identificador_matriz_filial,razao_social,nome_fantasia,situacao_cadastral,data_situacao_cadastral,motivo_situacao_cadastral,nome_cidade_exterior,codigo_natureza_juridica,data_inicio_atividade,cnae_fiscal,descricao_tipo_logradouro,logradouro,numero,complemento,bairro,cep,uf,codigo_municipio,municipio,ddd_telefone_1,ddd_telefone_2,ddd_fax,qualificacao_do_responsavel,capital_social,porte,opcao_pelo_simples,data_opcao_pelo_simples,data_exclusao_do_simples,opcao_pelo_mei,situacao_especial,data_situacao_especial
2135,2,BANCO DO BRASIL SA,VITORIA - ES,2,2005-11-03,0,,2038,1966-08-01,6421200,PRACA,PIO XII,30,,CENTRO,29010340.0,ES,5705,VITORIA,,,,10,0.0,5,0,,,0,,
8338,2,BANCO DO BRASIL SA,CACHOEIRO DE ITAPEMIRIM-ES-EST UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,PRACA,JERONIMO MONTEIRO,26,,CENTRO,29300902.0,ES,5623,CACHOEIRO DE ITAPEMIRIM,,,,10,0.0,5,0,,,0,,
11207,2,BANCO DO BRASIL SA,COLATINA-ES-EST.UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,EXPED ABILIO DOS SANTOS,124,,CENTRO,29700070.0,ES,5629,COLATINA,,,,10,0.0,5,0,,,0,,
18643,2,BANCO DO BRASIL SA,,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,PRESIDENTE VARGAS,29,,CENTRO,29400000.0,ES,5667,MIMOSO DO SUL,,,,10,0.0,5,0,,,0,,
19615,2,BANCO DO BRASIL SA,,2,2005-11-03,0,,2038,1982-05-04,6421200,AVENIDA,SENADOR EURICO RESENDE,994,,CENTRO,29845000.0,ES,5619,BOA ESPERANCA,,,,10,0.0,5,0,,,0,,
20974,2,BANCO DO BRASIL SA,SANTA TERESA ES-EST UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,JERONIMO VERVLOET,178,,CENTRO,29650000.0,ES,5691,SANTA TERESA,,,,10,0.0,5,0,,,0,,
'''

  3. The new companies file (empES.csv), which gives me the weird behavior:

'''
cnpj,matriz_filial,razao_social,nome_fantasia,situacao,data_situacao,motivo_situacao,nm_cidade_exterior,cod_pais,nome_pais,cod_nat_juridica,data_inicio_ativ,cnae_fiscal,tipo_logradouro,logradouro,numero,complemento,bairro,cep,uf,cod_municipio,municipio,ddd_1,telefone_1,ddd_2,telefone_2,ddd_fax,num_fax,email,qualif_resp,capital_social,porte,opc_simples,data_opc_simples,data_exc_simples,opc_mei,sit_especial,data_sit_especial
2135,2,BANCO DO BRASIL SA,VITORIA - ES,2,20051103,0,,,,2038,19660801,6421200,PRACA,PIO XII,30,,CENTRO,29010340.0,ES,5705,VITORIA,,,,,,,AGE0021@BB.COM.BR,10,0.0,5,0,,,,,
8338,2,BANCO DO BRASIL SA,CACHOEIRO DE ITAPEMIRIM-ES-EST UNIF,2,20051103,0,,,,2038,19660801,6421200,PRACA,JERONIMO MONTEIRO,26,,CENTRO,29300902.0,ES,5623,CACHOEIRO DE ITAPEMIRIM,,,,,,,,10,0.0,5,0,,,,,
11207,2,BANCO DO BRASIL SA,COLATINA-ES-EST.UNIF,2,20051103,0,,,,2038,19660801,6421200,RUA,EXPED ABILIO DOS SANTOS,124,,CENTRO,29700070.0,ES,5629,COLATINA,,,,,,,,10,0.0,5,0,,,,,
18643,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,RUA,PRESIDENTE VARGAS,29,,CENTRO,29400000.0,ES,5667,MIMOSO DO SUL,,,,,,,,10,0.0,5,0,,,,,
19615,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19820504,6421200,AVENIDA,SENADOR EURICO RESENDE,994,,CENTRO,29845000.0,ES,5619,BOA ESPERANCA,,,,,,,,10,0.0,5,0,,,,,
20974,2,BANCO DO BRASIL SA,SANTA TERESA ES-EST UNIF,2,20051103,0,,,,2038,19660801,6421200,RUA,JERONIMO VERVLOET,178,,CENTRO,29650000.0,ES,5691,SANTA TERESA,,,,,,,,10,0.0,5,0,,,,,
22241,2,BANCO DO BRASIL SA,SAO MATEUS ES EST UNIF,2,20051103,0,,,,2038,19660801,6421200,AVENIDA,JONES DOS SANTOS NEVES,324,,CENTRO,29930010.0,ES,5697,SAO MATEUS,,,,,,,,10,0.0,5,0,,,,,
28100,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,AVENIDA,JERONIMO MONTEIRO,38/46,,CENTRO,29500000.0,ES,5603,ALEGRE,,,,,,,,10,0.0,5,0,,,,,
37001,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,RUA,DEMERVAL AMARAL,35,,CENTRO,29560000.0,ES,5645,GUACUI,,,,,,,,10,0.0,5,0,,,,,
'''

Here's a sample of the output when I pass the first empES.csv file:

'''
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
2135,2,WALDERY RODRIGUES JUNIOR,*025913,8,0.0,20200312,,,,,0
2135,2,ERIK DA COSTA BREYER,*093217,10,0.0,20191209,,,,,0
2135,2,THOMPSON SOARES PEREIRA CESAR,*503187,10,0.0,20191209,,,,,0
2135,2,MAURICIO NOGUEIRA,*894537,10,0.0,20191209,,,,,0
2135,2,DANIEL ANDRE STIELER,*145110,10,0.0,20190910,,,,,0
2135,2,ENIO MATHIAS FERREIRA,*078106,10,0.0,20181107,,,,,0
2135,2,RONALDO SIMON FERREIRA,*685018,10,0.0,20190729,,,,,0
2135,2,IVANDRE MONTIEL DA SILVA,*975660,10,0.0,20190403,,,,,0
2135,2,FABIO AUGUSTO CANTIZANI BARBOSA,*379967,10,0.0,20190403,,,,,0
2135,2,CARLOS MOTTA DOS SANTOS,*876287,10,0.0,20190403,,,,,0
2135,2,CAMILO BUZZI,*569178,10,0.0,20190403,,,,,0
'''

And here's what happens when I try to use the other "empES.csv" file:

'''
j,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
'''

...and goes like this forever.

I have no clue why the first file goes through the code fine while the second gives that output. It's as if .isin were not iterating in that case!

Any thoughts?

ps: All the data presented here is public domain from Brazil's government.

  • Can you be more precise and simple with your question and issue? – Narendra Prasath Jul 03 '20 at 05:26
  • Thank you for the reply! I already found the answer: the csv file had the column name repeated in some lines. I didn't understand why that led the code to that output, though. But removing these lines did the trick! – Valter Carvalho Jul 03 '20 at 12:36

1 Answer

Well, in the end it was a column with bad values. Basically, I exported a file with only the 'cnpj' column:

import pandas as pd
duf = pd.read_csv('D:/CNPJ-full-master/cnpj-csv/UFs/empES.csv', usecols=['cnpj'], low_memory=False)
duf.to_csv('D:/CNPJ-full-master/cnpj-csv/UFs/empES-cnpj.csv', index=False)

Then I opened it in Notepad++ and saw a row in the middle of the file containing 'cnpj' again instead of a value. I searched for it and found about 200 more lines with 'cnpj' in place of a value. Out of approximately 900,000 lines, 200 is not a lot, so I just removed them, and now it finally works. Although the problem is fixed, I still don't know why a non-numeric value broke the code that way. It must have something to do with the string value being the same as the column name.
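
A likely explanation, sketched here on made-up miniature data (the two in-memory CSVs and the names `broken`, `clean` and `fixed` are only illustrative, not the real files): a single non-numeric entry such as the stray 'cnpj' seems to make pandas read the whole column as strings rather than integers, while the 'cnpj' column of socios.csv is still read as integers. An integer never equals a string, so `.isin` matches nothing, every chunk's result is empty, and writing an empty frame with `header=True` produces just the header line. If that is what happened, any non-numeric value would have had the same effect, and the text being equal to the column name is a coincidence.

```python
import io
import pandas as pd

# Hypothetical miniature of socios.csv: quoted company numbers, as in the question.
socios_csv = io.StringIO(
    '"cnpj","nome_socio"\n'
    '"00000000002135","DEBORA CRISTINA FONSECA"\n'
    '"00000000002216","ERIK DA COSTA BREYER"\n'
)
# Hypothetical miniature of the bad empES.csv: the header row repeats among the data.
emp_csv = io.StringIO(
    'cnpj,uf\n'
    '2135,ES\n'
    'cnpj,uf\n'
    '8338,ES\n'
)

socios = pd.read_csv(socios_csv)
duf = pd.read_csv(emp_csv, usecols=['cnpj'], low_memory=False)

print(socios['cnpj'].dtype)  # an integer dtype: the quoted digits are parsed as numbers
print(duf['cnpj'].dtype)     # not an integer dtype: the stray 'cnpj' forces text

# Integers compared against strings: nothing matches, so the result is empty.
broken = socios[socios['cnpj'].isin(duf['cnpj'])]
print(broken.to_csv(index=False))  # only the header line

# One possible repair: coerce to numbers, drop what cannot be parsed, go back to integers.
clean = pd.to_numeric(duf['cnpj'], errors='coerce').dropna().astype('int64')
fixed = socios[socios['cnpj'].isin(clean)]
print(fixed.to_csv(index=False))   # header plus the row for company 2135
```

Coercing with `pd.to_numeric(..., errors='coerce')` avoids editing the file by hand, at the cost of silently discarding every unparseable line, so it is worth counting how many were dropped. Separately, because the loop passes `header=True` on every `to_csv(..., mode='a')` call, a header line is appended once per chunk even when rows do match; passing `header=True` only for the first chunk should leave a single header in the output.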