-1

I have a CSV file from which I am trying to load data into a SQL table with 2 columns. The fields are separated by commas, but the second column contains text that itself includes commas. Because of these extra commas, the rows look like they have extra columns and the data will not load into my SQL table. I have millions of rows of data. How can I remove these extra commas?

Data:

Number Address
"12345" , "123 abc street, Unit 345"
"67893" , "567 xyz lane"
"65432" , "789 unit, mno street"

I would like to remove the extra commas in the addresses in random rows.

Sara
  • Possible duplicate of [Read CSV file with comma within fields in Python](https://stackoverflow.com/questions/8311900/read-csv-file-with-comma-within-fields-in-python) – pault Oct 23 '19 at 18:51
  • I don't need to read the commas, I want to replace the extra commas in the second column with a space. – Sara Oct 23 '19 at 19:08
  • If you use a csv library, it will read that line as only having two columns. Are you using a csv library, or are you just splitting on a comma? – Bryan Oakley Oct 23 '19 at 19:30
  • I am splitting on a comma. Not using a library. – Sara Oct 23 '19 at 19:40
  • Read it with the comma, then replace after. – pault Oct 23 '19 at 19:53
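
A minimal sketch of what the commenters describe: read the file with the csv module so the quoted addresses parse as a single field, then replace the leftover commas. The file name data.csv and the replace-with-a-space behaviour are assumptions taken from the comments above:

    import csv

    # Sketch: let the csv module handle the quoted fields, then replace the
    # commas that remain inside the address text ('data.csv' is an assumed name).
    with open('data.csv', newline='') as f:
        for row in csv.reader(f, skipinitialspace=True):
            # Replace commas with a space and collapse any extra whitespace.
            cleaned = [' '.join(field.replace(',', ' ').split()) for field in row]
            print(cleaned)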

2 Answers

0

If all of your data is in the same format, as in Number Address "000" , "000 abc street, Unit 000", you can split the string into a list, remove the comma, and join the list back together into a string. For example, using the data you gave:

ori_addr = "Number Address \"12345\" , \"123 abc street, Unit 345\""
addr = ori_addr.split()
addr[6] = addr[6].replace(",", "")
together_addr = " ".join(addr)

together_addr is equal to Number Address "12345" , "123 abc street Unit 345"; note that there is no longer a comma between "street" and "Unit".

unkn0wn.dev
  • Not all the addresses are in the same format. They have commas in the address at random places. – Sara Oct 23 '19 at 19:39
  • Things would be quite a lot easier if somehow the addresses were all in the same format. Could you show some other examples of the format that the addresses could be in? – unkn0wn.dev Oct 23 '19 at 20:32
  • Not all the addresses have commas in them. Some don't, and some examples of addresses with commas are: Unit 123, 456 Street or Street 789 , abc road etc. – Sara Oct 23 '19 at 20:37
  • Sorry, but you would need quite complex code to perform this task without a static format. Maybe try to save them all in a certain format so you can easily apply operations to them. – unkn0wn.dev Oct 23 '19 at 22:10
-1

Edits:

  • Following the user's comments, added the failing address to this test. This address loads into the database without issue.
  • Added code to store CSV addresses into MySQL.

Answer:

The code below performs the following actions:

  • MySQL database engine (connection) created.
  • Address data (number, address) read from CSV file.
  • Commas that are not field separators removed from the source data, and extra whitespace stripped.
  • Edited data fed into a DataFrame.
  • DataFrame used to store data into MySQL.
    import csv
    import pandas as pd
    from sqlalchemy import create_engine

    # Set database credentials.
    creds = {'usr': 'admin',
             'pwd': '1tsaSecr3t',
             'hst': '127.0.0.1',
             'prt': 3306,
             'dbn': 'playground'}
    # MySQL connection string.
    connstr = 'mysql+mysqlconnector://{usr}:{pwd}@{hst}:{prt}/{dbn}'
    # Create sqlalchemy engine for MySQL connection.
    engine = create_engine(connstr.format(**creds))

    # Read addresses from the CSV file.
    text = list(csv.reader(open('comma_test.csv'), skipinitialspace=True))

    # Replace all commas which are not used as field separators.
    # Remove additional whitespace.
    for idx, row in enumerate(text):
        text[idx] = [i.strip().replace(',', '') for i in row]

    # Store data into a DataFrame.
    df = pd.DataFrame(data=text, columns=['number', 'address'])
    # Write DataFrame to MySQL using the engine (connection) created above.
    df.to_sql(name='commatest', con=engine, if_exists='append', index=False)
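
The "Unedited Data" and "Edited Data" listings below can be reproduced by printing the rows in text, once before and once after the replacement loop. For example (a sketch, not part of the original code):

    # Sketch: print each row to inspect it before loading into MySQL.
    for row in text:
        print(row)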

Source File (comma_test.csv):

"12345" , "123 abc street, Unit 345"
"10101" , "111 abc street, Unit 111"
"20202" , "222 abc street, Unit 222"
"30303" , "333 abc street, Unit 333"
"40404" , "444 abc street, Unit 444"
"50505" , "abc DR, UNIT# 123 UNIT 123"

Unedited Data:

['12345 ', '123 abc street, Unit 345']
['10101 ', '111 abc street, Unit 111']
['20202 ', '222 abc street, Unit 222']
['30303 ', '333 abc street, Unit 333']
['40404 ', '444 abc street, Unit 444']
['50505 ', 'abc DR, UNIT# 123 UNIT 123']

Edited Data:

['12345', '123 abc street Unit 345']
['10101', '111 abc street Unit 111']
['20202', '222 abc street Unit 222']
['30303', '333 abc street Unit 333']
['40404', '444 abc street Unit 444']
['50505', 'abc DR UNIT# 123 UNIT 123']

Queried from MySQL:

number  address
12345   123 abc street Unit 345
10101   111 abc street Unit 111
20202   222 abc street Unit 222
30303   333 abc street Unit 333
40404   444 abc street Unit 444
50505   abc DR UNIT# 123 UNIT 123
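
For reference, the table contents above could be read back from Python with a query like this (a sketch, reusing the engine created earlier and the commatest table name):

    # Sketch: query the table back into a DataFrame for verification.
    check = pd.read_sql('SELECT number, address FROM commatest', con=engine)
    print(check)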

Acknowledgement:

This is a long-winded approach. However, each step has been broken down intentionally to clearly show the steps involved.

S3DEV
  • This didn't work for me. I am still getting the error when I load the data into my SQL that this particular row contains more data than there are input columns. – Sara Oct 23 '19 at 20:39
  • It is saying more data because, for addresses with a comma in them, it is reading the address as multiple columns instead of one column. – Sara Oct 23 '19 at 20:40
  • @Sara - Check the offending rows of data. This code replaces all commas, so I'm not convinced that's the issue. Please post the offending rows in your original question, then we can write the code to consider these cases as well. To be fair, it works on the example you have given. – S3DEV Oct 23 '19 at 21:04
  • This is one of the lines it is giving an error at: "abc DR, UNIT# 123 UNIT 123". I think it is not saving the result back to the CSV after replacing it. – Sara Oct 23 '19 at 21:18
  • @Sara - Comment acknowledged. I have added the failing address to the test and included code for storing your data into MySQL. Additionally, this address likely fails as there is no 'number', it's only the 'address' field. This is an excellent example of why it's **very important** to sanitise your source data before starting any work on it. – S3DEV Oct 24 '19 at 08:28