-1

my project uses AWS Chalice framework and sqlalchemy.

Here is a sample table from the project.

Person Table

I am required to get data from table in two formats. I require two separate queries which should give the data in different format as both will have separate apis. Please do not combine the query to give data in both formats together as dataset is huge & there will be performance issues and as already mentioned I need them separately.

First Format:

I want to get count of person for different jobs grouped by city and if the job field is empty/none, then it should be treated as No Job.

{
   "Bangalore":{
      "Engineer": 2,
      "No Job": 2,
      "Paramedic": 1,
   },
   "Delhi":{
      "Architect": 2,
      "Doctor": 1,
   },
   "Mumbai":{
      "Doctor": 2,
      "Engineer": 1,
      "No Job": 1,
   }
}

Second Format:

I want to get count of unique job titles and person grouped by city.

{
   "Bangalore":{
      "job_titles": 3,
      "persons": 5,
   },
   "Delhi":{
      "job_titles": 2,
      "persons": 3,
   },
   "Mumbai":{
      "job_titles": 3,
      "persons": 4,
   }
}

Explanation:

  • Bangalore has 2 Enginners, 1 Paramedic and 2 person with no job, so there are 3 unique job_titles & 5 Persons.
  • Delhi has 2 Architect and 1 Doctor, so there are 2 unique job_titles & 3 person.
  • Mumbai has 2 Doctors, 1 Enginner and 1 person with no job, so there are 3 unique job_titles & 4 Persons.

Currently, I am achieving this by running a simple Select * from table type query, and iterating over it to form nested dictionaries (in Python). And the issue is that this processing takes a long time.

I am not sure if data in this type of formats even achievable. So, any query whose data can be converted easily and quickly(wrt time) is appreciated.

I have been stuck on this issue for over a month now, so any type of help is appreciated. But please try to provide answer in sqlalchemy query.

Thanks

Ashutosh
  • 209
  • 3
  • 9

1 Answers1

1

This is like basics of SQL - count and group by. Looks like a job interview question.

select city, job, count(*) as count from person_jobs group by city, job;

And in Python:

cities_jobs = collections.defaultdict(dict)
for city, job, count in query_result:
  cities_jobs[city][job] = count
Tometzky
  • 22,573
  • 5
  • 59
  • 73
  • thanks for your answer. Actually I am also using a similar query and python code in my project. And this approach takes a significant amount of time (~9 secs). Is there any way to make this query faster ? Also is it possible to get the result in the required formats mentioned above without having to iterate over the query result ? – Ashutosh Feb 05 '22 at 17:55
  • 1
    If it "takes a significant amount of time" then you must have tens of millions of records. The only way to make it faster I can think of is to maintain a separate table with summaries using trigger. As for "is it possible to get the result in the required formats" - yes it is, but it is way more complicated, so it does not make sense to try. – Tometzky Feb 06 '22 at 09:56