How to keep other columns when using panda groupby and lambda function?

Question

Trying to convert a csv to json, and group objects by ID, there are multiple items need to be grouped into a nested objects, for example a company can be under multiple categories or sub categories, Longitude and Latitude should be grouped into geolocation object etc. Here is sample of the csv file:

id,name,address,Latitude,Longitude,category,Sub_Category,description,Phone ,Website 50001,Fifth Avenue Company,"123 Address rd, city, state, country, postal",34.0522,118.2437,Consultants,Project Marketing," description text here, description text here description text here",111-111-1111,https://example.com 50001,Fifth Avenue Company,"123 Address rd, city, state, country, postal",34.0522,118.2437,Consultants,Project Manager,"description text here, description text here description text here",111-111-1111,https://example.com 50003,Metrics LTD,"123 Address rd, city, state, country, postal",47.6062,122.3321,Vendor,Adhesive,"description text here, description text here description text here",111-111-1111,https://example.com 50004,Metrics LTD,"123 Address rd, city, state, country, postal",47.6062,122.3321,Vendor,Sealant,"description text here, description text here description text here",111-111-1111,https://example.com 50005,International Company,"123 Address rd, city, state, country, postal",37.7749,122.4194,Manager,Construction Manager,"description text here, description text here description text here",111-111-1111,https://example.com

and here is what I am trying to accomplish :

  [{
 "id":50001,
 "name":"Fifth Avenue Company",
 "city":"Seattle",
 "address":[{
            "street":"123 Address rd", 
            "city": "cityName", 
            "state/province": "stateName", 
            "country": "countryName", 
            "Zip/postal": "ZipValue"
             }],
 "category": [
        
     {
            "cat_name": Consultant,
           
        }
    ],
   "Subcategory": [
        
     {
            "Subcat_name or id": Project Marketing,
            "Subcat2_name or id": Project Manager,
           
        }
    ],
 "_geolocation": [
        
     {
            "Latitude": 34.0522,
            "Longitude": 118.2437
        }
    ],

 "type":"multi-family new",
 "description":"description text here",
 // otherColumns ..88
 ]

What I've tried so far in a google colab notebook:

[0] from google.colab import files
     uploaded = files.upload()

[3] import pandas as pd
import io
[4]df = pd.read_csv(io.StringIO(uploaded['companyfile.csv'].decode('utf-8')))
    df
//simplified code
[5] j = (df.groupby(['id']).groups
.apply(lambda x:x[['Latitude','Longitude']].to_dict('records'))
       .reset_index()
       .rename(columns={ 0:'_geolocation'})
       .to_json(orient='records'))

This code works but only return the ids, except the fact there are about 100 column of data for each id need to be displayed after grouping, so if I do the following:

j = (df.groupby(['id', 'name', 'address','Phone','Website', 
     'the rest of the columns'])
    .apply(lambda 
           x:x[['Latitude','Longitude']].to_dict('records'))
           .reset_index()
           .rename(columns={ 0:'_geolocation'})
           .to_json(orient='records'))

I get an error :

ValueError: cannot insert applicant, already exists

There are no duplicate column heads, however there are name driven columns for example applicant, applicant_id, applicant_fullName, applicant_phone

I tried other suggestions but Im stuck on this one hope anyone can help

How to keep other columns when using panda groupby and lambda function?

0 Answers0