3

I'm trying to sort a set of data so that it looks like a histogram of a probability distribution function (I'm assuming normally distributed for the moment).

I have a list of entries:

private static final class SortableDatasetEntry{
    Number value;
    Comparable key;
    public SortableDatasetEntry(Number value, Comparable key){
      this.value = value;
      this.key = key;
    }
}

An example: I have the items : {1,2,3,4,5,6,7,8,9}

EDIT: The sorted list I would like: {1,3,5,7,9,8,6,4,2} (or something similar). The numbers will not always be so neat (i.e. simply sorting by odd/even won't work either). I have a partial solution: sort in regular order (lowest to highest), then copy that list to another by inserting into the middle each time, so that the last item inserted (into the middle) is the largest. I'd still like to find a way of doing this with a comparator.
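
A minimal sketch of the partial solution described above, assuming the elements are mutually comparable. The class and method names (`MiddleInsertSort`, `peakInMiddle`) are made up for illustration and are not part of the original code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MiddleInsertSort {
    // Sorts ascending, then inserts each element into the middle of the result,
    // so the largest element is inserted last and lands at the center.
    static <T extends Comparable<? super T>> List<T> peakInMiddle(List<T> input) {
        List<T> sorted = new ArrayList<>(input);   // copy: leave the caller's list untouched
        Collections.sort(sorted);
        List<T> result = new ArrayList<>(sorted.size());
        for (T item : sorted) {
            // Inserting into the middle of an ArrayList shifts elements: O(n^2) overall.
            result.add(result.size() / 2, item);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(peakInMiddle(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9)));
        // prints [2, 4, 6, 8, 9, 7, 5, 3, 1]
    }
}
```

This gives the mirror image of {1,3,5,7,9,8,6,4,2}, which still has the smallest values at the edges and the largest in the center.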

This is quite tricky because it's not being sorted by the absolute value of value but by the distance from Mean(value) within the set, and then somehow arranged so that the values closest to the mean are centered. I know that the compareTo function must be "reversible" (I forget the correct term).

Bonus points: How do I determine the correct distribution for the data (i.e. if it isn't normal, as assumed)?

Zack Newsham
  • Can you give a by-hand example with like 10-15 entries? – Mshnik Apr 24 '15 at 15:54
  • Do you mean reflexive? Also, what you have shown is a constructor that initializes the fields of a class. Is that all the code you want to share? – Chetan Kinger Apr 24 '15 at 15:54
  • @AndersonVieira You are correct, it's not the mean I want - ignore the second part of the question. The first part is correct, I want the PDF of the list. – Zack Newsham Apr 24 '15 at 16:18
  • @Mshnik, will make an edit now – Zack Newsham Apr 24 '15 at 16:18
  • @ZackNewsham - Do you want equal values to be grouped together in the result? E.g., if fed {1,2,2,2,2,2,9}, would you expect {1,9,2,2,2,2,2}? Or are you simply looking for the left side to be the values at ascending odd indexes, and the right side to be the values at descending even indexes - {1,2,2,9,2,2,2}? – Andy Thomas Apr 24 '15 at 17:47
  • @AndyThomas open to your suggestions here, I think either or is valid. Though I would prefer it if they were split evenly on either side. – Zack Newsham Apr 24 '15 at 20:42
  • @ZackNewsham - Okay, I've added an answer reflecting that preference and your edit. – Andy Thomas Apr 24 '15 at 22:06

6 Answers

1

First, calculate the mean and store it in a variable called, say, mean. Next, when you insert the entries into your SortableDatasetEntry, use value - mean as the actual value for each entry rather than value.

reservoirman
  • Close, but this will sort them by their order from mean, this won't put mean in the middle. Though I suppose removing the abs would do it – Zack Newsham Apr 24 '15 at 16:07
  • 2
    @ZackNewsham Wouldn't removing the `Math.abs` be equivalent to sorting by value? I believe that `Double.compare(v1 - mean, v2 - mean) == Double.compare(v1, v2)`. – Anderson Vieira Apr 24 '15 at 16:12
  • no because now any value that = mean (or close to it) will be centered. I ended up cheating, and just sorting by absolute order, then copying the array by inserting into the middle each time, so the last item inserted is the value with the highest probability of being selected. – Zack Newsham Apr 24 '15 at 16:16
  • Ok but now you need double the amount of memory :) If your dataset is not large to begin with not a big deal I guess. – reservoirman Apr 24 '15 at 16:21
  • @reservoirman I agree, it's a poor man's solution - still hoping someone will come up with something better, see edit – Zack Newsham Apr 24 '15 at 16:22
  • @ZackNewsham I think it is impossible using only a comparator. How would you determine whether an element should go to the left or the right part of the distribution from its value alone? – Alex Salauyou Apr 24 '15 at 16:50
  • @SashaSalauyou I'm ok with randomness, as long as the middle value is the highest, and outside edge values are lowest. I think something like having an additional field would work. – Zack Newsham Apr 24 '15 at 16:56
  • @reservoirman, just realised it doesn't double memory, as I remove one item from the list as I insert it into the other. Also, it's only references anyway, so it wouldn't be doubling the memory – Zack Newsham Apr 24 '15 at 16:56
  • @ZackNewsham yes, just one additional field (say, `r`) would help. Randomly assign it to -1 or 1 when obtaining elements from the source. In comparator, compare by `r` first, if they are equal, compare by `distance * r`. – Alex Salauyou Apr 24 '15 at 16:59
0

From what I see, you probably want to build a list of (mean distance, value) tuples and sort it by the first entry, the mean distance.
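
A rough sketch of that idea, assuming "mean distance" means the absolute difference between a value and the arithmetic mean. Instead of building explicit tuples it computes the distance inside the comparator; the names `MeanDistanceSort` and `sortByDistanceFromMean` are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MeanDistanceSort {
    // Orders values by |value - mean|, closest to the mean first.
    static List<Double> sortByDistanceFromMean(List<Double> values) {
        double mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        List<Double> result = new ArrayList<>(values);   // copy: do not reorder the input
        result.sort(Comparator.comparingDouble((Double v) -> Math.abs(v - mean)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortByDistanceFromMean(
                Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0)));
        // prints [5.0, 4.0, 6.0, 3.0, 7.0, 2.0, 8.0, 1.0, 9.0]
    }
}
```

Note that this puts the values nearest the mean first rather than in the center, so a further rearrangement step would still be needed to get the shape the question asks for.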

JFPicard
0

You would find it much easier to build your histogram as a Map.

// Requires: import java.util.*; import java.util.stream.Collectors;
public static Map<Integer, List<Number>> histogram(List<Number> values, int nBuckets) {
    // Get stats on the values.
    DoubleSummaryStatistics stats = values.stream().mapToDouble(Number::doubleValue).summaryStatistics();
    // How wide must each bucket be? Keep it as a double: integer division could truncate the width to 0.
    double bucketSize = (stats.getMax() - stats.getMin()) / nBuckets;
    // Roll them all into buckets; the clamp keeps the maximum value in the last bucket.
    return values.stream().collect(Collectors.groupingBy(
            (n) -> bucketSize == 0 ? 0 : Math.min(nBuckets - 1, (int) ((n.doubleValue() - stats.getMin()) / bucketSize))));
}

Note the intent of a histogram:

To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of small intervals—and then count how many values fall into each interval.
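
The counting step in that definition could be sketched as follows. This is only an illustration: `BinCounts`, `countPerBin`, and the `min` and `binWidth` parameters are assumed names, and the caller is expected to supply a positive bin width:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class BinCounts {
    // Maps each bin index to the number of values that fall into that bin.
    // Bin i covers [min + i * binWidth, min + (i + 1) * binWidth).
    static Map<Integer, Long> countPerBin(List<Double> values, double min, double binWidth) {
        return values.stream().collect(Collectors.groupingBy(
                v -> (int) Math.floor((v - min) / binWidth),
                TreeMap::new,              // keep bins in ascending order
                Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countPerBin(
                Arrays.asList(1.0, 2.0, 2.0, 3.0, 7.0, 8.0, 9.0), 1.0, 3.0));
        // prints {0=4, 2=3}
    }
}
```

Empty bins (bin 1 in this example) do not appear in the map, so a plot would need to treat missing keys as zero.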

OldCurmudgeon
  • It looks like you're building a histogram of the values. But the OP appears to be trying to use the values as the counts for the histogram. – Andy Thomas Apr 24 '15 at 18:01
  • @AndyThomas - My concern is that if OP has bucket counts then they should not need sorting, they are in order already - if OP has values then they should be bucketed first. Sorting them assuming some Gaussian distribution is probably wrong. A histogram is not a value plot it is a count-per-bucket plot. – OldCurmudgeon Apr 24 '15 at 19:46
  • Good points if the OP is trying to build a histogram. The explicit request was for a small/large/small sort that *looks like a histogram.* Not clear why they want this order. – Andy Thomas Apr 24 '15 at 20:17
0

Would something like:

 public List<Integer> customSort(List<Integer> list) {
    Collections.sort(list);
    List<Integer> newList = new ArrayList<Integer>();
    for (int i = 0; i < list.size(); i += 2) {
        newList.add(list.get(i));
    }
    if (list.size() % 2 == 0) {
        for (int i = 1; i < list.size(); i += 2) {
            newList.add(list.get(list.size() - i));
        }
    } else {
        for (int i = 1; i < list.size(); i += 2) {
            newList.add(list.get(list.size() - i - 1));
        }
    }
    return newList;
}

work? I put in {1,2,3,4,5,6,7,8,9} and get {1,3,5,7,9,8,6,4,2}, and {1,2,3,4,5,6,7,8} gives {1,3,5,7,8,6,4,2}.

M. Shaw
0

You cannot accomplish this in a single sort merely with a custom Comparator.

However, it is still feasible to do it in-place, without an additional collection of references.

Your current approach is not in-place, but is probably the easiest to implement and understand. Unless the size of the collection in memory is a concern, consider staying with your current approach.

Custom comparator in a single sort

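Your desired order depends on the ascending order: where an element ends up is determined by its rank among all elements. A Comparator sees only the two elements being compared, so given unsorted data it doesn't have the ascending order while the first sort is occurring.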

In-place approaches

You could create your desired order in-place.

What follows presumes 0-based indices.

One approach would use two sorts. First, sort in ascending order. Mark each object with its index. In the Comparator for the second sort, all objects with even indices will be less than all objects with odd indices. Objects with even indices will be ordered in ascending order. Objects with odd indices will be ordered in descending order.
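
The two-sort approach might look roughly like the sketch below. The `Ranked` wrapper and the method name are hypothetical, and for brevity the sketch copies into helper lists instead of marking the objects themselves, so it is not in-place as written:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class TwoPassSort {
    // Hypothetical wrapper pairing a value with its 0-based index after the first sort.
    static final class Ranked<T> {
        final T value;
        final int rank;
        Ranked(T value, int rank) { this.value = value; this.rank = rank; }
    }

    static <T extends Comparable<? super T>> List<T> evenUpOddDown(List<T> input) {
        List<T> sorted = new ArrayList<>(input);
        Collections.sort(sorted);                        // first sort: ascending

        List<Ranked<T>> ranked = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            ranked.add(new Ranked<>(sorted.get(i), i));  // mark each element with its index
        }

        // Second sort: even indices first (ascending), then odd indices (descending).
        Comparator<Ranked<T>> byParityThenRank = (a, b) -> {
            boolean aEven = a.rank % 2 == 0;
            boolean bEven = b.rank % 2 == 0;
            if (aEven != bEven) return aEven ? -1 : 1;
            return aEven ? Integer.compare(a.rank, b.rank) : Integer.compare(b.rank, a.rank);
        };
        ranked.sort(byParityThenRank);

        List<T> result = new ArrayList<>();
        for (Ranked<T> r : ranked) result.add(r.value);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(evenUpOddDown(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9)));
        // prints [1, 3, 5, 7, 9, 8, 6, 4, 2]
    }
}
```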

Another approach would be a custom sorting algorithm that supported mapping from virtual to physical indices. The sorting algorithm would create an ascending order in the virtual index space. Your index mapping would lay it out in physical memory in the order you desire. Here's an untested sketch of the index mapping:

private int mapVirtualToPhysical( int virtualIndex, int countElements ) {
    // Even virtual indices fill from the left; odd ones fill from the right.
    boolean isEvenIndex = ( 0 == (virtualIndex % 2));
    int physicalIndex = isEvenIndex ? (virtualIndex / 2) : (countElements - (virtualIndex/2) - 1);
    return physicalIndex;
}
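
One way to sanity-check such a mapping is to copy an already sorted array through it. The sketch below does that with a second array, so it only demonstrates the layout and is not the custom in-place sorting algorithm described above; `VirtualIndexLayout` and `layOut` are illustrative names:

```java
import java.util.Arrays;

final class VirtualIndexLayout {
    // Even virtual indices map to the left half, odd ones to the right half (from the end).
    static int mapVirtualToPhysical(int virtualIndex, int countElements) {
        boolean isEvenIndex = (virtualIndex % 2 == 0);
        return isEvenIndex ? (virtualIndex / 2) : (countElements - (virtualIndex / 2) - 1);
    }

    // Sorts a copy ascending, then writes each element to its mapped physical slot.
    static int[] layOut(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int[] physical = new int[sorted.length];
        for (int v = 0; v < sorted.length; v++) {
            physical[mapVirtualToPhysical(v, sorted.length)] = sorted[v];
        }
        return physical;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(layOut(new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9})));
        // prints [1, 3, 5, 7, 9, 8, 6, 4, 2]
    }
}
```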

Preferable to either of these would be an initial sort followed by an O(n) series of swaps. However, I haven't yet determined the sequence of swaps. The best I've come up with so far gets the left tail in order, but the right tail either requires a subsequent sort or a stack.

Andy Thomas
0

For large sets of data, you can have the SortableEntry constructor decide, using a random number generator, which side of the chart (left or right of the peak) each entry will occupy:

static final class SortableEntry<T>{

    Number value;
    Comparable<T> key;
    int hr;
    static Random rnd = new Random();

    public SortableEntry(Number value, Comparable<T> key){
        this.value = value;
        this.key = key;
        this.hr = rnd.nextInt(2) == 0 ? -1 : 1;  // here
    }
}

The point of the additional hr variable is to make every "right" entry greater than every "left" entry. If the hr values of two compared entries are the same, compare by the actual key, taking the sign of hr into account:

static final class SortableEntryComparator<T> implements Comparator<SortableEntry<T>> {

    @Override
    public int compare(SortableEntry<T> e1, SortableEntry<T> e2) {
        if (e1.hr == e2.hr) 
            return e1.hr < 0 ? e1.key.compareTo((T) e2.key) : e2.key.compareTo((T) e1.key);
        else 
            return e1.hr - e2.hr;
    }
}

Now a small test:

@Test
public void testSort() {
    List<Integer> keys = Arrays.asList(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 
                                       12, 25, 31, 33, 34, 36, 39, 41, 26, 49,
                                       52, 52, 58, 61, 63, 69, 74, 83, 92, 98);
    List<SortableEntry<Integer>> entries = new ArrayList<>();
    for (Integer k : keys) {
        entries.add(new SortableEntry<Integer>(0, k)); 
    }
    entries.sort(new SortableEntryComparator<Integer>());
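    // SortableEntry has no toString(), so print the keys (needs java.util.stream.Collectors).
    System.out.println(entries.stream().map(e -> e.key).collect(Collectors.toList()));
}
// example output (varies between runs, because sides are assigned randomly):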
// [12, 26, 33, 36, 39, 40, 49, 50, 52, 60, 61, 63, 80, 90, 98, 100, 92, 83, 74, 70, 69, 58, 52, 41, 34, 31, 30, 25, 20, 10]
// the highest key (100) is not precisely in the center,
// but it will tend to occur in the center when dataset is large.
Alex Salauyou
  • Close, but sadly it's possible to randomly assign all the low values (or possibly all the values) to one side of the graph, and then this falls apart. – Zack Newsham Apr 26 '15 at 19:11
  • @zack **For large sets of data**. I should repeat it again. For large sets of data. For large sets of data. First you say it's OK to have some randomness, now you want it to be strictly symmetrical. – Alex Salauyou Apr 26 '15 at 20:00
  • Randomness is OK, I don't care which side of the graph each element is on as long as it is roughly symmetrical. However, assigning all the large values to one side of the graph is not OK. I guess this is a misunderstanding of what randomness means. – Zack Newsham Apr 26 '15 at 20:14