I have an array of students who have enrolled on courses. It contains many duplicates, and there should be only one record per student per course.
Example array:
'item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'
'item_id'=> 3, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=>''
'item_id'=> 4, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'
'item_id'=> 6, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>''
The data is accessed via an API (otherwise this whole thing would be a simple SQL query!).
The raw data looks like below:
object(PodioItemCollection)#287 (5) { ["filtered"]=> int(45639) ["total"]=> int(45639) ["items"]=> NULL ["__items":"PodioCollection":private]=> array(10) { [0]=> object(PodioItem)#3 (5) { ["__attributes":"PodioObject":private]=> array(16) { ["item_id"]=> int(319357433) ["external_id"]=> NULL ["title"]=> string(12) "Foo Bar" ["link"]=> string(71) "https://podio.com/foo/enrolments/apps/applications/items/123" ["rights"]=> array(11) ...
The challenge is that I can't just use array_unique or similar, because I need to:
- Find all the duplicates for a student + course
- Evaluate the found duplicates against each other and retain the item with the most supplementary information (or merge them)
- Obtain the "item_id" of each unneeded duplicate and use the API to delete those items.
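The three steps above could be sketched roughly as follows. This is only a minimal sketch, assuming each record has already been flattened to a plain array with the keys from the example. The function names `infoScore` and `dedupe` are hypothetical, and "most supplementary information" is interpreted here as "most non-empty fields other than item_id, student and course", which is an assumption rather than a requirement stated above. Ties keep the record seen first.

```php
<?php
// Hypothetical scoring rule: count the non-empty supplementary fields.
function infoScore(array $record): int
{
    $score = 0;
    foreach ($record as $key => $value) {
        if (in_array($key, ['item_id', 'student', 'course'], true)) {
            continue; // identifying fields do not count as supplementary
        }
        if ($value !== '' && $value !== null) {
            $score++;
        }
    }
    return $score;
}

// Single pass: keep the richest record per student + course and
// collect the item_ids of every other record for deletion.
function dedupe(iterable $records): array
{
    $keep = [];
    $delete = [];
    foreach ($records as $record) {
        // "\0" separator avoids accidental collisions between name and course
        $key = $record['student'] . "\0" . $record['course'];
        if (!isset($keep[$key])) {
            $keep[$key] = $record;
            continue;
        }
        if (infoScore($record) > infoScore($keep[$key])) {
            $delete[] = $keep[$key]['item_id']; // previous best is now a duplicate
            $keep[$key] = $record;
        } else {
            $delete[] = $record['item_id'];
        }
    }
    return ['keep' => array_values($keep), 'delete' => $delete];
}

$records = [
    ['item_id' => 1, 'student' => 'Bob',  'course' => 'Learn Piano',  'address' => ''],
    ['item_id' => 2, 'student' => 'Sam',  'course' => 'Learn Piano',  'address' => 'foo street'],
    ['item_id' => 3, 'student' => 'Bob',  'course' => 'Learn Guitar', 'address' => ''],
    ['item_id' => 4, 'student' => 'Sam',  'course' => 'Learn Piano',  'address' => ''],
    ['item_id' => 5, 'student' => 'Bob',  'course' => 'Learn Guitar', 'address' => 'bla bla street'],
    ['item_id' => 6, 'student' => 'Sam',  'course' => 'Learn Piano',  'address' => ''],
    ['item_id' => 7, 'student' => 'John', 'course' => 'Learn Guitar', 'address' => ''],
];

$result = dedupe($records);
```

Because only one record per student + course is held at any time, the working set grows with the number of unique pairs rather than with the number of duplicates. `$result['delete']` holds the ids that would be passed to the delete routine.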
Further constraints:
- I have no control over the API.
- There are 44,000 records
- There could be as many as 100 duplicates per person + course
- The API returns a nested hierarchy of objects, so 44,000 records use 27GB of RAM (the server has 144GB to play with). And yes, PHP's memory_limit is set to a ridiculous level! This is a one-off project, and the server settings will be corrected afterwards.
- Because of the large RAM usage, functions such as array_intersect are going to be a less attractive choice
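One way the memory constraint might be eased is to reduce each API object to a small array as soon as it arrives, so the nested objects can be released page by page. The dump above shows 10 items out of a total of 45639, which suggests the results already arrive in pages. The sketch below assumes that; `compactRecords`, `$fetchPage` and `$flatten` are hypothetical names, and the stand-in data only imitates the API rather than calling it.

```php
<?php
// Lazily yield compact records, one page at a time.
// $fetchPage(int $offset, int $limit): array  hypothetical wrapper around the API call
// $flatten($item): array                      hypothetical reducer to a small plain array
function compactRecords(callable $fetchPage, callable $flatten, int $pageSize = 500): Generator
{
    $offset = 0;
    do {
        $page = $fetchPage($offset, $pageSize);
        $count = count($page);
        foreach ($page as $item) {
            yield $flatten($item);
        }
        unset($page, $item); // drop references so the heavy objects can be freed
        $offset += $pageSize;
    } while ($count === $pageSize); // a short page means the end was reached
}

// Stand-in data imitating nested API objects (not the real API).
$fake = [];
for ($i = 1; $i <= 5; $i++) {
    $fake[] = (object) [
        'item_id' => $i,
        'student' => 'Bob',
        'course'  => 'Learn Piano',
        'address' => '',
        'rights'  => array_fill(0, 11, 'x'), // bulky data that is not needed
    ];
}

$fetchPage = function (int $offset, int $limit) use ($fake): array {
    return array_slice($fake, $offset, $limit);
};
$flatten = function (object $item): array {
    return [
        'item_id' => $item->item_id,
        'student' => $item->student,
        'course'  => $item->course,
        'address' => $item->address,
    ];
};

$compact = iterator_to_array(compactRecords($fetchPage, $flatten, 2), false);
```

Because the generator is lazy, it could be fed straight into a single-pass de-duplication loop instead of being collected with `iterator_to_array`, which is used here only to make the result easy to inspect.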
The final output should be:
'item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'
'item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'
'item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>''
But I also need access to the item_ids 3, 4 and 6 so that I can call a delete routine via the API.
Any ideas on how to tackle this multi-duplicate mess?