I wrote a script that takes the company number field ('cnpj') from a company partners file and compares it against a file listing the companies of a specific state of the country, writing the matching rows to a third file (the end result being the partners of all the companies in that state). The code is simple:
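import pandas as pd

# Partners file, read in chunks because it is large
dsocio = pd.read_csv('D:/CNPJ-full-master/cnpj-csv/socios.csv', chunksize=262144, low_memory=False)
# Companies of the state; only the 'cnpj' column is needed
duf = pd.read_csv('D:/pyData/receita/empES.csv', usecols = ['cnpj'], low_memory=False)

for chunk in dsocio:
    # Keep only the partner rows whose cnpj appears in the state file
    result = chunk[chunk['cnpj'].isin(duf.cnpj)]
    # Append this chunk's matches to the output file
    result.to_csv('D:/CNPJ-full-master/cnpj-csv/UFs/socioES.csv', index=False, header=True, mode='a')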
The problem is that I have two versions of the "empES.csv" file. They have different numbers of columns, but both have 'cnpj' as the first column, which is the only field I need. When I run the code with the version 1 file, it works perfectly. But when I use the version 2 file instead, the output file is populated with nothing but the header, repeated over many lines!
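To see what a header-only output implies, here is a minimal sketch with made-up data (the values and the `lookup` name are illustrative, not taken from my files). It suggests that each repeated header corresponds to a chunk whose filtered result was empty, since `to_csv` with `header=True` still writes the header line for an empty DataFrame:

```python
import io

import pandas as pd

# Toy stand-in for one chunk of socios.csv (hypothetical values).
chunk = pd.DataFrame({"cnpj": [191, 2135], "nome_socio": ["A", "B"]})

# Toy lookup column that shares no value with the chunk.
lookup = pd.Series([999], name="cnpj")

# Same filter as in the loop: nothing matches, so the result is empty.
result = chunk[chunk["cnpj"].isin(lookup)]

# Write the empty result twice to an in-memory buffer, mimicking two
# loop iterations that append to the same file.
buffer = io.StringIO()
result.to_csv(buffer, index=False, header=True)
result.to_csv(buffer, index=False, header=True)

output = buffer.getvalue()
print(output)
```

If that reading is right, the repeated headers mean that no 'cnpj' value matched in any chunk when the version 2 file was used.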
Here are some snippets of the first lines of:
- The partners file (socios.csv, the one I will copy the matching lines from):
'''
"cnpj","tipo_socio","nome_socio","cnpj_cpf_socio","cod_qualificacao","perc_capital","data_entrada","cod_pais_ext","nome_pais_ext","cpf_repres","nome_repres","cod_qualif_repres"
"00000000000191","2","MARCIO HAMILTON FERREIRA","*923641","10",0.0,"20101117","","","","","00"
"00000000000191","2","NILSON MARTINIANO MOREIRA","*491386","10",0.0,"20101117","","","","","00"
"00000000002135","2","DEBORA CRISTINA FONSECA","*314628","08",0.0,"20200312","","","","","00"
"00000000002216","2","WALDERY RODRIGUES JUNIOR","*025913","08",0.0,"20200312","","","","","00"
"00000000002216","2","ERIK DA COSTA BREYER","*093217","10",0.0,"20191209","","","","","00"
"00000000002216","2","THOMPSON SOARES PEREIRA CESAR","*503187","10",0.0,"20191209","","","","","00"
"00000000002569","2","WALTER MALIENI JUNIOR","*718468","10",0.0,"20101117","","","","","00"
"00000000002569","2","NILSON MARTINIANO MOREIRA","*491386","10",0.0,"20101117","","","","","00"
"00000000002640","2","WALDERY RODRIGUES JUNIOR","*025913","08",0.0,"20200312","","","","","00"
'''
- The working companies file (empES.csv), from which I read only the 'cnpj' field:
'''
cnpj,identificador_matriz_filial,razao_social,nome_fantasia,situacao_cadastral,data_situacao_cadastral,motivo_situacao_cadastral,nome_cidade_exterior,codigo_natureza_juridica,data_inicio_atividade,cnae_fiscal,descricao_tipo_logradouro,logradouro,numero,complemento,bairro,cep,uf,codigo_municipio,municipio,ddd_telefone_1,ddd_telefone_2,ddd_fax,qualificacao_do_responsavel,capital_social,porte,opcao_pelo_simples,data_opcao_pelo_simples,data_exclusao_do_simples,opcao_pelo_mei,situacao_especial,data_situacao_especial
2135,2,BANCO DO BRASIL SA,VITORIA - ES,2,2005-11-03,0,,2038,1966-08-01,6421200,PRACA,PIO XII,30,,CENTRO,29010340.0,ES,5705,VITORIA,,,,10,0.0,5,0,,,0,,
8338,2,BANCO DO BRASIL SA,CACHOEIRO DE ITAPEMIRIM-ES-EST UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,PRACA,JERONIMO MONTEIRO,26,,CENTRO,29300902.0,ES,5623,CACHOEIRO DE ITAPEMIRIM,,,,10,0.0,5,0,,,0,,
11207,2,BANCO DO BRASIL SA,COLATINA-ES-EST.UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,EXPED ABILIO DOS SANTOS,124,,CENTRO,29700070.0,ES,5629,COLATINA,,,,10,0.0,5,0,,,0,,
18643,2,BANCO DO BRASIL SA,,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,PRESIDENTE VARGAS,29,,CENTRO,29400000.0,ES,5667,MIMOSO DO SUL,,,,10,0.0,5,0,,,0,,
19615,2,BANCO DO BRASIL SA,,2,2005-11-03,0,,2038,1982-05-04,6421200,AVENIDA,SENADOR EURICO RESENDE,994,,CENTRO,29845000.0,ES,5619,BOA ESPERANCA,,,,10,0.0,5,0,,,0,,
20974,2,BANCO DO BRASIL SA,SANTA TERESA ES-EST UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,JERONIMO VERVLOET,178,,CENTRO,29650000.0,ES,5691,SANTA TERESA,,,,10,0.0,5,0,,,0,,
'''
- The new companies file (empES.csv), which gives me the weird behavior:
'''
cnpj,matriz_filial,razao_social,nome_fantasia,situacao,data_situacao,motivo_situacao,nm_cidade_exterior,cod_pais,nome_pais,cod_nat_juridica,data_inicio_ativ,cnae_fiscal,tipo_logradouro,logradouro,numero,complemento,bairro,cep,uf,cod_municipio,municipio,ddd_1,telefone_1,ddd_2,telefone_2,ddd_fax,num_fax,email,qualif_resp,capital_social,porte,opc_simples,data_opc_simples,data_exc_simples,opc_mei,sit_especial,data_sit_especial
2135,2,BANCO DO BRASIL SA,VITORIA - ES,2,20051103,0,,,,2038,19660801,6421200,PRACA,PIO XII,30,,CENTRO,29010340.0,ES,5705,VITORIA,,,,,,,AGE0021@BB.COM.BR,10,0.0,5,0,,,,,
8338,2,BANCO DO BRASIL SA,CACHOEIRO DE ITAPEMIRIM-ES-EST UNIF,2,20051103,0,,,,2038,19660801,6421200,PRACA,JERONIMO MONTEIRO,26,,CENTRO,29300902.0,ES,5623,CACHOEIRO DE ITAPEMIRIM,,,,,,,,10,0.0,5,0,,,,,
11207,2,BANCO DO BRASIL SA,COLATINA-ES-EST.UNIF,2,20051103,0,,,,2038,19660801,6421200,RUA,EXPED ABILIO DOS SANTOS,124,,CENTRO,29700070.0,ES,5629,COLATINA,,,,,,,,10,0.0,5,0,,,,,
18643,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,RUA,PRESIDENTE VARGAS,29,,CENTRO,29400000.0,ES,5667,MIMOSO DO SUL,,,,,,,,10,0.0,5,0,,,,,
19615,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19820504,6421200,AVENIDA,SENADOR EURICO RESENDE,994,,CENTRO,29845000.0,ES,5619,BOA ESPERANCA,,,,,,,,10,0.0,5,0,,,,,
20974,2,BANCO DO BRASIL SA,SANTA TERESA ES-EST UNIF,2,20051103,0,,,,2038,19660801,6421200,RUA,JERONIMO VERVLOET,178,,CENTRO,29650000.0,ES,5691,SANTA TERESA,,,,,,,,10,0.0,5,0,,,,,
22241,2,BANCO DO BRASIL SA,SAO MATEUS ES EST UNIF,2,20051103,0,,,,2038,19660801,6421200,AVENIDA,JONES DOS SANTOS NEVES,324,,CENTRO,29930010.0,ES,5697,SAO MATEUS,,,,,,,,10,0.0,5,0,,,,,
28100,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,AVENIDA,JERONIMO MONTEIRO,38/46,,CENTRO,29500000.0,ES,5603,ALEGRE,,,,,,,,10,0.0,5,0,,,,,
37001,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,RUA,DEMERVAL AMARAL,35,,CENTRO,29560000.0,ES,5645,GUACUI,,,,,,,,10,0.0,5,0,,,,,
'''
Here's a sample of the output when I pass the first empES.csv file:
'''
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
2135,2,WALDERY RODRIGUES JUNIOR,*025913,8,0.0,20200312,,,,,0
2135,2,ERIK DA COSTA BREYER,*093217,10,0.0,20191209,,,,,0
2135,2,THOMPSON SOARES PEREIRA CESAR,*503187,10,0.0,20191209,,,,,0
2135,2,MAURICIO NOGUEIRA,*894537,10,0.0,20191209,,,,,0
2135,2,DANIEL ANDRE STIELER,*145110,10,0.0,20190910,,,,,0
2135,2,ENIO MATHIAS FERREIRA,*078106,10,0.0,20181107,,,,,0
2135,2,RONALDO SIMON FERREIRA,*685018,10,0.0,20190729,,,,,0
2135,2,IVANDRE MONTIEL DA SILVA,*975660,10,0.0,20190403,,,,,0
2135,2,FABIO AUGUSTO CANTIZANI BARBOSA,*379967,10,0.0,20190403,,,,,0
2135,2,CARLOS MOTTA DOS SANTOS,*876287,10,0.0,20190403,,,,,0
2135,2,CAMILO BUZZI,*569178,10,0.0,20190403,,,,,0
'''
And here's what happens when I try to use the other "empES.csv" file:
'''
j,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
'''
...and it goes on like this forever.
I have no clue why the first file goes through the code fine while the second gives that output. It's as if .isin were not finding any match in that case!
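One thing that could be checked (an assumption on my part, not a confirmed cause) is whether 'cnpj' ends up with the same dtype and the same formatting in both inputs, because `.isin` only matches values that compare equal. The sketch below uses tiny in-memory stand-ins for the two files; the helper `normalize_cnpj` is hypothetical and simply zero-pads each value to the 14 digits of a CNPJ after reading the column as text:

```python
import io

import pandas as pd

# Hypothetical miniature versions of the two files (not the real data).
socios_csv = io.StringIO(
    '"cnpj","nome_socio"\n"00000000002135","DEBORA CRISTINA FONSECA"\n'
)
emp_csv = io.StringIO("cnpj,uf\n2135,ES\n")

# Read cnpj as text in both files so leading zeros are not dropped.
socios = pd.read_csv(socios_csv, dtype={"cnpj": str})
emp = pd.read_csv(emp_csv, usecols=["cnpj"], dtype={"cnpj": str})

print(socios["cnpj"].dtype, emp["cnpj"].dtype)


def normalize_cnpj(series):
    """Hypothetical helper: strip whitespace and zero-pad to 14 characters."""
    return series.str.strip().str.zfill(14)


# Without normalization, "00000000002135" and "2135" are different strings.
raw_matches = int(socios["cnpj"].isin(emp["cnpj"]).sum())

# After normalization both sides use the same 14-character form.
normalized_matches = int(
    normalize_cnpj(socios["cnpj"]).isin(normalize_cnpj(emp["cnpj"])).sum()
)

print(raw_matches, normalized_matches)
```

Printing the dtypes of `chunk['cnpj']` and `duf.cnpj` for both versions of "empES.csv" in the real script would show whether this kind of mismatch is actually what happens here.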
Any thoughts?
PS: All the data presented here is public-domain data from the Brazilian government.