
I have an array of students who have enrolled on courses. There are multiple duplicates, and there should be only one entry per student per course.

Example array:

'item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'
'item_id'=> 3, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=>''
'item_id'=> 4, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'
'item_id'=> 6, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''
'item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>''

Data is accessed via an API (otherwise this whole thing would be a simple SQL query!).

The raw data looks like this:

object(PodioItemCollection)#287 (5) { ["filtered"]=> int(45639) ["total"]=> int(45639) ["items"]=> NULL ["__items":"PodioCollection":private]=> array(10) { [0]=> object(PodioItem)#3 (5) { ["__attributes":"PodioObject":private]=> array(16) { ["item_id"]=> int(319357433) ["external_id"]=> NULL ["title"]=> string(12) "Foo Bar" ["link"]=> string(71) "https://podio.com/foo/enrolments/apps/applications/items/123" ["rights"]=> array(11) ...

The challenge is that I can't just use array_unique or similar because I need to:

  1. Find all the duplicates for a given student + course
  2. Evaluate the duplicates against each other and retain the item with the most supplementary information (or merge them)
  3. Obtain the unneeded item_ids of the duplicates and use the API to delete those items.

Further constraints:

  • I have no control over the API.
  • There are 44,000 records.
  • There could be as many as 100 duplicates per student + course.
  • The API returns a nested hierarchy of objects, so 44,000 records use 27GB of RAM (the server has 144GB to play with) and yes, PHP's memory_limit is set to a ridiculous level! This is a one-off project and the server variables will be corrected afterwards.
  • Because of the large RAM usage, functions such as array_intersect are a less attractive choice.

The final output should be:

    'item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''
    'item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'
    'item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'
    'item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>''

But I also need access to item_ids 3, 4 and 6 so I can call a delete routine via the API.
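
The delete step itself should be straightforward, presumably something along these lines (PodioItem::delete() is an assumption about the podio-php client here, and the IDs are just the example ones):

// Presumed sketch: loop over the surplus item_ids and delete each via the API.
foreach (array(3, 4, 6) as $itemId) {
    PodioItem::delete($itemId);   // remove the surplus enrolment
    usleep(250000);               // ease off a little for the API rate limit
}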

Any ideas how to tackle this multi-duplicate mess?

Rucia
  • Sorry, but that won't work. I need to find the duplicates with the most information and then generate a list of the items I don't want, in order to delete them. – Rucia Sep 10 '15 at 00:20
  • Use `if` statements – aldrin27 Sep 10 '15 at 00:24
  • The student property serves as an id per course, i.e. in your example it's absolutely clear that the entries with item_id=2 and item_id=4 refer to the same `Sam`? Can there be conflicting information, e.g. two `Sam/Learning Guitar` records both having an address but the addresses differ? What is the actual response format of the server/api? (Maybe it still can be processed sequentially or at least in chunks.) – VolkerK Sep 10 '15 at 00:37
  • Regarding "challenge #2 ... (or merge them)": So, merging and therefore updating one of the records via the api is an option? – VolkerK Sep 10 '15 at 00:40
  • And another one: are a) all properties always present and "missing" just means the value is `===''`, or are b) properties "really" missing, i.e. not present at all? (Your example suggests a.) – VolkerK Sep 10 '15 at 00:45
  • Is the example array in the right format? It isn't valid PHP code. Or do you mean one array per line? But on the other hand you say the API returns a nested hierarchy of objects. Can you give a sample of the actual input data format? – kenorb Sep 10 '15 at 00:46
  • @VolkerK thankfully conflicting information is not a concern in this case; the records with the most information are the ones to keep. I know that from a purist standpoint the address, for example, ought to be validated, but there's enough to contend with here! Keys are always present but sometimes empty. – Rucia Sep 10 '15 at 01:08
  • @kenorb the server returns a large array of objects. There is an enforced limit of 100 items per filtered API call, so to obtain all 44,000 records I have to loop, pulling just enough information from the returned array of objects to perform my task, as the 44,000 records consume 27GB of server RAM. The sample from the server is huge and unwieldy; also I hit the API rate limit and have to wait for an hour before I can run another query (8 minutes left). I will provide a link to a var_dump for you – Rucia Sep 10 '15 at 01:13
  • A var_dump of the _raw data_? Has the format a name? Is it e.g. ....json? – VolkerK Sep 10 '15 at 01:15
  • @VolkerK here is a link to the example var_dump of the raw data from the server. This is an array of objects and there are 10 items in the array. http://52.19.141.252/example_var_dump.txt – Rucia Sep 10 '15 at 01:37
  • That's actually not the raw data (yes, apparently it would be JSON) but a dump of the object model/representation of the response. Podio, never heard of it. And the unformatted dump is way too massive ...for now. – VolkerK Sep 10 '15 at 02:36
  • @VolkerK, yes the var_dump is huge, and it's only for 10 records. It weighs in at around a 66kb txt file for just that many records :( Podio is a form-based project management and collaboration tool. – Rucia Sep 10 '15 at 02:37

2 Answers


It looks like you're dealing with an XY problem by parsing all the items at the same time, when actually you just need to read about the Podio API and how to work with PodioObjects through it.

So instead of traversing the whole array, traverse the object (PodioItemCollection in this case), sorting it by a key field (such as student), e.g.:

PodioItem::filter(31060, array('limit' => 20, 'offset' => 20, 'sort_by' => 'student'));

Then (given that sort order) you know that any duplicates (student + course) appear consecutively. So as long as the next row has the same student name, treat it as a duplicate and ignore it (unless you want to process it as part of the current student object), then move on to the next item.
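
A rough outline of that idea, paging through the filter in 100-item chunks (getFieldValue() is a hypothetical helper, since real PodioItem fields are nested objects, and the grouping is tracked per student + course so two courses for the same student aren't mixed up):

// Rough outline only: PodioItem::filter() paging per the call above;
// getFieldValue() is a hypothetical helper that returns a plain string
// for a named field of a PodioItem.
$appId       = 31060;
$limit       = 100;          // enforced maximum per filtered call
$offset      = 0;
$deleteIds   = array();
$courses     = array();      // course => best record, for the current student
$lastStudent = null;

do {
    $collection = PodioItem::filter($appId, array(
        'limit'   => $limit,
        'offset'  => $offset,
        'sort_by' => 'student',   // duplicates arrive consecutively
    ));
    $total = $collection->filtered;   // total matching items (see the var_dump)

    foreach ($collection as $item) {
        $student = getFieldValue($item, 'student');   // hypothetical helper
        $course  = getFieldValue($item, 'course');
        $address = getFieldValue($item, 'address');

        if ($student !== $lastStudent) {
            $courses     = array();   // a new student block starts here
            $lastStudent = $student;
        }

        if (!isset($courses[$course])) {
            $courses[$course] = array('item_id' => $item->item_id, 'address' => $address);
        } elseif ($courses[$course]['address'] === '' && $address !== '') {
            // This row carries more information: keep it, mark the old one.
            $deleteIds[] = $courses[$course]['item_id'];
            $courses[$course] = array('item_id' => $item->item_id, 'address' => $address);
        } else {
            $deleteIds[] = $item->item_id;
        }
    }

    $offset += $limit;
    unset($collection);   // free the page before fetching the next one
} while ($offset < $total);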

This way you avoid working with all the records at the same time, so you won't have any memory issues.

If memory is still a problem, use a NoSQL solution such as memcached or Redis: store the parsed items there, free up the variables in PHP, move on to the next batch, and at the end check the consistency (e.g. count the items). Also consider disabling any extra PHP extensions that could slow down the processing (such as Xdebug and the like).
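
For example, with the phpredis extension the slimmed-down records could be parked in a Redis hash instead of a PHP array (a rough sketch; the key layout and field names are just assumptions):

// Rough sketch using the phpredis extension: keep only the minimal fields per
// student + course in a Redis hash, so PHP can free each page after parsing it.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// $record is the slimmed-down array built while parsing one item of a page
$field    = $record['student'] . ':' . $record['course'];
$existing = $redis->hGet('enrolments', $field);

// Whichever record loses out, its item_id would be pushed onto a delete list
// (omitted here for brevity).
if ($existing === false || json_decode($existing, true)['address'] === '') {
    $redis->hSet('enrolments', $field, json_encode($record));
}

// After all pages have been processed:
echo $redis->hLen('enrolments') . " unique enrolments kept\n";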

kenorb
  • "If the memory issue would be still a problem [...]" - and if it still is: The underlying transport format seems to be json and there are streaming json parsers out there, the complete document doesn't _have to_ be in memory and parsed in one big step. But granted, depending on the complexity of the response this might be overshooting it for a one-off job ;-) – VolkerK Sep 10 '15 at 11:23
  • @VolkerK, this particular cleanse is a one-off, though I know there will be upcoming projects which will again need to address the full list of enrolments, but not have to deduplicate them. I will investigate the possibility of introducing a JSON middleware layer to stream the data. Thanks for the additional tip! – Rucia Sep 10 '15 at 21:27

The following function will do the job for you:

$apiData = array(
   array('item_id'=> 1, 'student'=> 'Bob', 'course'=> 'Learn Piano', 'address'=>''),
   array('item_id'=> 2, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=> 'foo street'),
   array('item_id'=> 3, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=>''),
   array('item_id'=> 4, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''),
   array('item_id'=> 5, 'student'=> 'Bob', 'course'=> 'Learn Guitar', 'address'=> 'bla bla street'),
   array('item_id'=> 6, 'student'=> 'Sam', 'course'=> 'Learn Piano', 'address'=>''),
   array('item_id'=> 7, 'student'=> 'John', 'course'=> 'Learn Guitar', 'address'=>'')
);

function resolveDuplicate($apiData = null)
{
  if(!$apiData) return false;

  $newArr  = array();  // best record seen so far, keyed by student + course
  $itemIds = array();  // item_ids of the surplus duplicates

  foreach ($apiData as $arr) {
    $key = $arr['student'] . ':' . $arr['course'];

    if(empty($newArr[$key]['address'])){
       // Nothing kept yet for this student + course, or the kept record has an
       // empty address: the kept one (if any) becomes a duplicate and is replaced.
       if(isset($newArr[$key])) $itemIds[] = $newArr[$key]['item_id'];
       $newArr[$key] = $arr;
    }
    else{
       // The kept record already has an address, so this one is the duplicate.
       $itemIds[] = $arr['item_id'];
    }
  }

  return array(
    'student'    => array_values($newArr),
    'duplicates' => $itemIds,
  );
}

$res = resolveDuplicate($apiData);
echo '<pre>';
print_r($res);

Output

Array
(
    [student] => Array
        (
            [0] => Array
                (
                    [item_id] => 1
                    [student] => Bob
                    [course] => Learn Piano
                    [address] => 
                )

            [1] => Array
                (
                    [item_id] => 2
                    [student] => Sam
                    [course] => Learn Piano
                    [address] => foo street
                )

            [2] => Array
                (
                    [item_id] => 5
                    [student] => Bob
                    [course] => Learn Guitar
                    [address] => bla bla street
                )

            [3] => Array
                (
                    [item_id] => 7
                    [student] => John
                    [course] => Learn Guitar
                    [address] => 
                )

        )

    [duplicates] => Array
        (
            [0] => 4
            [1] => 3
            [2] => 6
        )

)    
Beroza Paul
  • Awesome! If there were multiple fields to check for supplementary information (let's say 6 fields, to ensure we keep the record with the most information), would you recommend just extending that address check? – Rucia Sep 10 '15 at 02:00
  • Yes, you can add more conditions in there. – Beroza Paul Sep 10 '15 at 02:58
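
One way to extend that to several fields, following up on the comments above, is to score each record by how many supplementary fields are non-empty and keep whichever duplicate scores higher (a hedged sketch; the field names beyond address are only placeholders):

// Hedged sketch: count the filled-in supplementary fields and keep the
// duplicate with the higher score. The field names are only examples.
function infoScore(array $record)
{
  $fields = array('address', 'phone', 'email', 'guardian', 'dob', 'notes');
  $score  = 0;
  foreach ($fields as $field) {
    if (isset($record[$field]) && $record[$field] !== '') $score++;
  }
  return $score;
}

// Inside resolveDuplicate()'s loop, in place of the single address check:
if (!isset($newArr[$key]) || infoScore($arr) > infoScore($newArr[$key])) {
  if (isset($newArr[$key])) $itemIds[] = $newArr[$key]['item_id'];
  $newArr[$key] = $arr;
} else {
  $itemIds[] = $arr['item_id'];
}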