
I have the following problem: I want to make a boxplot (with dc.js) per service (A, B, C, D) representing the quartiles (q1, q2, q3, q4) and outliers of the time each one is delayed.

My data contains an id, a category, the time it takes, and other fields. The problem is that I have repeated rows, caused by the additional fields, which are important for other charts.

For example:

    Id  / category / time  / other data
    1   / B        / 2     / ...
    155 / A        / 51    / ...
    155 / A        / 51    / ...
    156 / A        / "NaN" / ...
    157 / C        / 10    / ...
    etc.

Before adding the additional data I had no repeated rows, and the following code worked fine:

var categorydim = ndx.dimension(function(d) { return d["category"]; });
var categorygroup = categorydim.group().reduce(
    // add: push each valid time onto the group's array
    function(p, v) {
        if (v["time"] > 0) {
            p.push(v["time"]);
        }
        return p;
    },
    // remove: take one matching time back out
    function(p, v) {
        if (v["time"] > 0) {
            p.splice(p.indexOf(v["time"]), 1);
        }
        return p;
    },
    // initial: each group starts as an empty array
    function() {
        return [];
    }
);

But now I need to count each id (for example, 155) only once. Is there a way to do this in crossfilter, or with reductio.js?

How to exclude repeated data?

aey
  • So the problem is that when you only had one `155`, you got a group `{ key: 'A', value: [51] }` but with the duplicates you get `{ key: 'A', value: [51, 51] }`? – Ethan Jewett May 11 '17 at 14:49
  • In my case I fixed the issue by converting `value` to a string. I guess this means that at some point you need to do a `value.toString()`. – ivan quintero Jun 12 '22 at 14:23

1 Answer


Assuming I've understood the problem, you need to track the unique IDs you've already seen. Reductio does this for exception aggregation with sum and count, but not for your scenario, I believe. This or something like it should work. If you can put together a working example, I'll be happy to verify this code:

var categorydim = ndx.dimension(function(d) { return d["category"]; });
var categorygroup = categorydim.group().reduce(
    function(p, v) {
        if (v["time"] > 0 && !p.keys[v["Id"]]) {
            // First time this Id is seen with a valid time: record it.
            p.values.push(v["time"]);
            p.keys[v["Id"]] = 1;
        } else if (v["time"] > 0) {
            // Time is valid but the Id has already been counted,
            // so just increment its count.
            p.keys[v["Id"]]++;
        }
        return p;
    },
    function(p, v) {
        if (v["time"] > 0 && p.keys[v["Id"]] === 1) {
            // The last in-scope record with this Id is leaving the
            // filter, so remove its time value.
            p.values.splice(p.values.indexOf(v["time"]), 1);
            p.keys[v["Id"]] = 0;
        } else if (v["time"] > 0) {
            // More records with this Id remain in scope; decrement.
            p.keys[v["Id"]]--;
        }
        return p;
    },
    function() {
        return {
            keys: {},
            values: []
        };
    }
);
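To see the de-duplication at work without a full crossfilter setup, here is a minimal sketch (plain Node.js, no crossfilter or dc.js) that calls the same three reduce callbacks by hand, in the order crossfilter would. The record shape (`Id`, `time`) matches the question's data; everything else is illustrative:

```javascript
function reduceInitial() {
  return { keys: {}, values: [] };
}

function reduceAdd(p, v) {
  if (v.time > 0 && !p.keys[v.Id]) {
    p.values.push(v.time);   // first sighting of this Id: record the time
    p.keys[v.Id] = 1;
  } else if (v.time > 0) {
    p.keys[v.Id]++;          // duplicate: only bump the count
  }
  return p;
}

function reduceRemove(p, v) {
  if (v.time > 0 && p.keys[v.Id] === 1) {
    p.values.splice(p.values.indexOf(v.time), 1);  // last one out: drop the time
    p.keys[v.Id] = 0;
  } else if (v.time > 0) {
    p.keys[v.Id]--;          // others with this Id remain: only decrement
  }
  return p;
}

// Two duplicate rows for Id 155: the time 51 is pushed only once.
var p = reduceInitial();
reduceAdd(p, { Id: 155, time: 51 });
reduceAdd(p, { Id: 155, time: 51 });
console.log(p.values);  // [ 51 ]

// Removing one duplicate keeps the value; removing the last drops it.
reduceRemove(p, { Id: 155, time: 51 });
console.log(p.values);  // [ 51 ]
reduceRemove(p, { Id: 155, time: 51 });
console.log(p.values);  // []
```

The invalid-time guard also means the `156 / A / "NaN"` row from the example never touches `values` at all.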
Ethan Jewett
  • Many thanks, it worked perfectly with the dc.js boxplot. I had to add `.valueAccessor(function (d) { return d.value.values; })`. I thought I would have to sort the time values to create the boxplot, but it is not necessary. Again, thank you very much – aey May 11 '17 at 20:21
  • Hmmm. Actually, this is going to blow up on filtering because it is too quick to remove records. You actually need to keep a count of each key. I will try to update. – Ethan Jewett May 11 '17 at 20:28
  • I need the data not to be repeated, that is to say, a single time value for each id – aey May 11 '17 at 20:33
  • Just updated. And yes, this will still do that. The important thing is that it needs to add the time to p.values only the *first* time it sees a key, and remove it only when the *last* record with that key leaves the filter. So we have to keep a count of how many times we have seen each key. – Ethan Jewett May 11 '17 at 20:36
  • If you can try that out and validate that the answer is still correct, I would appreciate it. Don't want to make people frustrated by trying to implement code I haven't actually tested :) – Ethan Jewett May 11 '17 at 20:37
  • The code continues to work without problems, but I did not understand why the previous version was deficient. – aey May 11 '17 at 20:48
  • In the old version, if the records were not really identical and a filter was applied that left one 155 record in the filter and the other one out, then the time for 155 would have been removed from p.values even though it should have stayed, because there was still a 155 record in the filter scope. The new version tracks exactly how many records with each key are in scope and only removes the time value when the last record exits the filter. Probably an invisible problem to you because your records are actually identical, but it will trip up others. – Ethan Jewett May 11 '17 at 20:52
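The failure mode described in that last comment can be made concrete with a small illustrative sketch (plain Node.js; the function names are made up). This version keeps only a boolean "seen" flag instead of a count, as the earlier revision of the answer effectively did:

```javascript
// "Seen flag" variant: no per-Id count, just true/false.
function seenOnlyAdd(p, v) {
  if (v.time > 0 && !p.keys[v.Id]) {
    p.values.push(v.time);
    p.keys[v.Id] = true;
  }
  return p;
}

function seenOnlyRemove(p, v) {
  if (v.time > 0 && p.keys[v.Id]) {
    p.values.splice(p.values.indexOf(v.time), 1);
    p.keys[v.Id] = false;
  }
  return p;
}

var q = { keys: {}, values: [] };
seenOnlyAdd(q, { Id: 155, time: 51 });
seenOnlyAdd(q, { Id: 155, time: 51 });

// A filter now excludes just ONE of the two 155 records:
seenOnlyRemove(q, { Id: 155, time: 51 });
console.log(q.values);  // [] -- wrong: a 155 record is still in scope
```

With the counted version from the answer, the same sequence leaves `[51]` in place, which is why the count is needed.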