
I have a ".csv" file with multiple rows. The information is set like this:

GS3;724330300294409;50;BRABT;00147;44504942;01;669063000;25600;0
GS3;724330300294409;50;BRABT;00147;44504943;01;669063000;25600;0
GS3;724330300294409;50;BRABT;00147;44504944;01;669063000;25600;00004

I already receive information in rows (each file has almost 300000 rows). I'm sending this data to Kafka but I need to see the lines split into columns. For example:

Column1 Column2         Column3 Column4 Column5 Column6  Column7 Column8    Column9 Column10
GS3     724330300294409 50      BRABT   00147   44504942 01      669063000  25600   0
GS3     724330300294409 50      BRABT   00147   44504943 01      669063000  25600   0
GS3     724330300294409 50      BRABT   00147   44504944 01      669063000  25600   00004

I know the size for each value. For example:

3 (GS3)
15 (724330300294409)
2 (50)
5 (BRABT)
5 (00147)
8 (44504943)
2 (01)
10 (669063000)
5 (25600)
5 (0    )

I'm trying to do this through ksql on my Kafka platform, but I'm struggling. I'm new to Python, but it seems like an easier way to do this before I send the data to Kafka.

I've been using the Spooldir CSV Connector to send data to Kafka, but each whole row ends up as a single column in the topic.

I've used this to add ";" between data:

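i = True
for line in arquivo:
    # Skip the first line of the file
    if i:
        i = False
        continue
    # Slice each fixed-width field, strip its padding, and join the fields with the separator
    result = result + line[0:3].strip()+commatype+line[3:18].strip()+commatype+line[18:20].strip()+commatype+line[20:25].strip()+ ...

arquivo.close()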
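
The slicing above could also be driven by the list of field widths instead of hand-written offsets. The following is only a minimal sketch: the name `split_fixed_width` and the constant `WIDTHS` are hypothetical, and the widths are assumed to be exactly the sizes listed in the question (3, 15, 2, 5, 5, 8, 2, 10, 5, 5).

```python
from itertools import accumulate

# Field widths taken from the sizes listed in the question (assumption:
# they apply to every data line in the file).
WIDTHS = [3, 15, 2, 5, 5, 8, 2, 10, 5, 5]


def split_fixed_width(line, widths=WIDTHS, sep=";"):
    """Cut a fixed-width line into fields, strip padding, and join with sep."""
    ends = list(accumulate(widths))   # running end offset of each field
    starts = [0] + ends[:-1]          # each field starts where the previous one ended
    fields = [line[start:end].strip() for start, end in zip(starts, ends)]
    return sep.join(fields)
```

With this helper, the loop body would reduce to something like `result = result + split_fixed_width(line) + "\n"`.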
  • Why do you need the sizes? You have `;` between columns. Maybe [stream-csv-data-in-kafka-python](https://stackoverflow.com/questions/62425642/stream-csv-data-in-kafka-python) helps? If it's a duplicate, mark it. – Patrick Artner Jan 22 '21 at 13:58
  • @PatrickArtner I've added the ";" between the values, but they're still in the same row. I've tried using the Spooldir CSV Connector for Kafka, but it treats every row as a single column – GABRIEL ANDRADE QUEIROZ Jan 22 '21 at 14:03
  • Have you just added the `;` for this question or is it present in the file? If it's in the file why not `line.split(";")`? – Alex Jan 22 '21 at 14:08
  • @Alex I've edited the question to show how I used Python to add ";" between the values. So line.split(";") would split all the data into the appropriate columns? – GABRIEL ANDRADE QUEIROZ Jan 22 '21 at 14:10
  • If you've already written files that are separated by `;` then I would use the answer posted to read the files into a pandas dataframe. – Alex Jan 22 '21 at 14:18
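
To illustrate the `line.split(";")` suggestion from the comments: the sketch below splits two of the sample rows into lists of ten strings, once with `str.split` and once with the standard-library `csv` module. The in-memory `sample_text` stands in for the real file and is an assumption made only for this example.

```python
import csv
import io

# Two rows copied from the question; in practice these would come from the file.
sample_text = (
    "GS3;724330300294409;50;BRABT;00147;44504942;01;669063000;25600;0\n"
    "GS3;724330300294409;50;BRABT;00147;44504944;01;669063000;25600;00004\n"
)

# Option 1: plain string splitting, one list of fields per line.
rows_split = [line.split(";") for line in sample_text.splitlines()]

# Option 2: csv.reader with ';' as the delimiter (also copes with quoted fields).
rows_csv = list(csv.reader(io.StringIO(sample_text), delimiter=";"))
```

Both approaches keep every field as a string, so leading zeroes such as `00147` are preserved.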

1 Answer


If you accept that your column names start from Column0 (not Column1), you can call read_csv with sep=';' and a suitable prefix:

result = pd.read_csv('Input.csv', sep=';', header=None, prefix='Column', dtype='str')

Note that I passed dtype='str' because some columns of your input have leading zeroes which otherwise would be stripped.

This solution works regardless of the number of input columns, but the downside is that all columns are now of object (string) type, so you may want to convert some of them to other types afterwards.

The result is:

  Column0          Column1 Column2 Column3 Column4   Column5 Column6    Column7 Column8 Column9
0     GS3  724330300294409      50   BRABT   00147  44504942      01  669063000   25600       0 
1     GS3  724330300294409      50   BRABT   00147  44504943      01  669063000   25600       0 
2     GS3  724330300294409      50   BRABT   00147  44504944      01  669063000   25600   00004
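
One caveat: the `prefix` argument of `read_csv` was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above fails on current versions. A sketch that should give the same column names there reads the file without a prefix and then applies `DataFrame.add_prefix`. The in-memory `sample` below is an assumption standing in for 'Input.csv'.

```python
import io

import pandas as pd

# Stand-in for 'Input.csv' (assumption: same ';'-separated layout as the question).
sample = io.StringIO(
    "GS3;724330300294409;50;BRABT;00147;44504942;01;669063000;25600;0\n"
    "GS3;724330300294409;50;BRABT;00147;44504944;01;669063000;25600;00004\n"
)

# header=None yields integer column labels 0..9; add_prefix turns them into
# 'Column0'..'Column9'. dtype=str keeps leading zeroes intact.
result = pd.read_csv(sample, sep=";", header=None, dtype=str).add_prefix("Column")
```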

Another option, which creates the column names exactly as you wish (starting from Column1) but requires knowing the number of columns in advance, is:

# Create the list of column names
names = [ f'Column{i}' for i in range(1, 11) ]
# Read passing the above column names
result = pd.read_csv('Input.csv', sep=';', names=names, dtype='str')
Valdi_Bo
  • Sorry if it's a dumb question but do I have to pass "result = pd.read_csv('Input.csv', sep=';', header=None, prefix='Column', dtype='str')" into a ".py" or is it a command line? – GABRIEL ANDRADE QUEIROZ Jan 22 '21 at 14:17
  • It is a *Python* command and I tested it using *Jupyter notebook*. But it is also possible to invoke a *Python* interpreter from a command prompt, passing a source file. It must be a **file**, not a single command, because first you have to *import pandas as pd* and only then invoke *Pandas* methods. – Valdi_Bo Jan 22 '21 at 14:25
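
As a rough illustration of the last comment, a source file could look like the sketch below. The function name `load_columns` and its `n_columns` parameter are hypothetical; the read itself follows the `names=` variant from the answer.

```python
import sys

import pandas as pd


def load_columns(path, n_columns=10):
    """Read a ';'-separated file into a DataFrame with columns Column1..ColumnN."""
    names = [f"Column{i}" for i in range(1, n_columns + 1)]
    # dtype=str keeps leading zeroes such as '00147'
    return pd.read_csv(path, sep=";", names=names, dtype=str)


# Runs only when the file is started as a script with exactly one argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(load_columns(sys.argv[1]))
```

Saved under a name such as `split_columns.py` (a made-up name), it could be started from a command prompt with `python split_columns.py Input.csv`.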