Is there a better way to hash a permutation of a string that consists of any of the 128 legal ASCII characters?

Question

Any two strings that are permutations of each other are considered the same (e.g. (ABACD, BDCAA) and (ABACD, DBACA) should each be hashed to the same bucket of a HashMap). The strings consist only of the 128 legal ASCII characters. Is there a better hash function that minimises collisions while keeping the HashMap small?

Also, is there any way to optimise the code even further? The main goal is to reduce the run time as much as possible.

The method takes a file containing lines of text, each of which represents one entry. The first line of the file gives the total number of entries. The method calculates the total number of pairs of entries that contain an identical multiset of characters.

An example of what the input file contains:

7
BCDEFGH
ABACD
BDCEF
BDCAA
DBACA
DABACA
DABAC

It should output: 6

The six pairs are: (ABACD, BDCAA), (ABACD, DBACA), (ABACD, DABAC), (BDCAA, DBACA), (BDCAA, DABAC), (DBACA, DABAC)
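
The count of 6 comes from the four entries ABACD, BDCAA, DBACA and DABAC sharing one multiset, which gives 4 * 3 / 2 = 6 pairs. As a cross-check for any faster hashed version, here is a minimal sketch of a slow but exact baseline. It uses the sorted characters as the key rather than the prime-product hash, and the class and method names (`PairCountSketch`, `canonicalKey`, `countPairs`) are made up for this illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class PairCountSketch {
    // Canonical key: the characters of s in sorted order,
    // so every permutation of the same multiset yields the same key.
    static String canonicalKey(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Each new entry forms one pair with every earlier entry that has the same key.
    static long countPairs(String[] entries) {
        Map<String, Integer> seen = new HashMap<>();
        long total = 0;
        for (String entry : entries) {
            String key = canonicalKey(entry);
            int earlier = seen.getOrDefault(key, 0);
            total += earlier;
            seen.put(key, earlier + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] sample = {"BCDEFGH", "ABACD", "BDCEF", "BDCAA", "DBACA", "DABACA", "DABAC"};
        System.out.println(countPairs(sample)); // prints 6
    }
}
```

The running total uses a `long` here because a large group of identical entries can produce more pairs than fit in an `int`.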

The part where the hashing takes place:

long hash = 1;
while (c != 10) {
    hash *= PRIMES[c];
    c = reader.read();
}

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;

public class Speed {
    private static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
            73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173,
            179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
            283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409,
            419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541,
            547, 557, 563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659,
            661, 673, 677, 683, 691, 701, 709, 719};

    public int processData(String filename){
        try {
            Reader reader = new Reader(filename);
            int size = Integer.parseInt(reader.readLine());
            HashMap<Long, Integer> hm = new HashMap<>(size / 2);
            int total = 0;
            while (true) {
                int c = reader.read();
                if (c == -1)
                    break;
                long hash = 1;
                while (c != 10) {
                    hash *= PRIMES[c];
                    c = reader.read();
                }
                if (hm.get(hash) == null) {
                    hm.put(hash, 1);
                } else {
                    int value = hm.get(hash);
                    total += value;
                    hm.put(hash, value + 1);
                }
            }
            return total;
        } catch (Exception e) {
            System.out.println(e);
        }
        return 0;
    }

    static class Reader
    {
        final private int BUFFER_SIZE = 1 << 16;
        private DataInputStream din;
        private byte[] buffer;
        private int bufferPointer, bytesRead;

        public Reader(String file_name) throws IOException
        {
            din = new DataInputStream(new FileInputStream(file_name));
            buffer = new byte[BUFFER_SIZE];
            bufferPointer = bytesRead = 0;
        }

        public String readLine() throws IOException
        {
            byte[] buf = new byte[64]; // line length
            int cnt = 0, c;
            while ((c = read()) != -1)
            {
                if (c == '\n')
                    break;
                buf[cnt++] = (byte) c;
            }
            return new String(buf, 0, cnt);
        }

        private void fillBuffer() throws IOException
        {
            bytesRead = din.read(buffer, bufferPointer = 0, BUFFER_SIZE);
            if (bytesRead == -1)
                buffer[0] = -1;
        }

        private byte read() throws IOException
        {
            if (bufferPointer == bytesRead)
                fillBuffer();
            return buffer[bufferPointer++];
        }
    }

    public static void main(String[] args) {
        Speed dataProcessor = new Speed();
        int answer = dataProcessor.processData(args[0]);
        System.out.println(answer);
    }
}

Sort the characters in the `String`. Every permutation sorts to the same sequence of characters. - Elliott Frisch, Mar 29 '20 at 16:39
@ElliottFrisch The number of entries is huge (e.g. up to a million) and each string can be as long as a thousand characters, so storing strings in the HashMap would take too much space, and sorting every string is too slow. My current method is faster. However, I am looking to optimise it further, even if that only saves milliseconds. Thanks for the answer though! - whwh, Mar 29 '20 at 16:44

Joni · Answer 1 · 2020-03-29T16:59:48.697


Your list of primes should start with 3, not 2. A `long` product wraps modulo 2^64, so every multiplication by 2 is a left shift that throws away the top bit of the hash, which increases collisions. After 64 such characters the hash code is 0, regardless of what other characters the string contains.
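
To make the bit loss concrete, here is a small sketch of the effect. In the posted table the factor 2 belongs to character code 0, so it only matters when that character occurs. The names `OddMultiplierSketch`, `power` and `oddPrimes` are invented for this illustration, and the trial-division generator is only one assumed way to build a table that starts at 3:

```java
final class OddMultiplierSketch {
    // Multiplies 1 by factor the given number of times; the long product
    // silently wraps modulo 2^64, exactly like `hash *= PRIMES[c]`.
    static long power(long factor, int times) {
        long hash = 1;
        for (int i = 0; i < times; i++) {
            hash *= factor;
        }
        return hash;
    }

    // Returns the first `count` odd primes (3, 5, 7, ...) by trial division.
    static long[] oddPrimes(int count) {
        long[] primes = new long[count];
        int found = 0;
        for (long candidate = 3; found < count; candidate += 2) {
            boolean isPrime = true;
            for (int i = 0; i < found && primes[i] * primes[i] <= candidate; i++) {
                if (candidate % primes[i] == 0) {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) {
                primes[found++] = candidate;
            }
        }
        return primes;
    }

    public static void main(String[] args) {
        System.out.println(power(2, 63));        // -9223372036854775808, only the top bit is set
        System.out.println(power(2, 64));        // 0
        System.out.println(power(2, 64) * 719);  // 0, later factors cannot recover the lost bits
        System.out.println(power(3, 64) != 0);   // true, a product of odd numbers stays odd
        long[] table = oddPrimes(128);
        System.out.println(table[0] + " .. " + table[127]); // 3 .. 727
    }
}
```

Odd multipliers avoid this particular failure because a product of odd numbers is always odd and therefore never 0. The wrapped product can still collide for long strings, so it remains a hash rather than an exact key.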

As for the rest of the code, it's a lot easier to read a file line by line with BufferedReader. The overhead it adds is minimal.
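
A rough sketch of what the line-by-line version could look like follows. It keeps the prime-product idea but swaps the hand-written `Reader` for `BufferedReader.readLine()`, which also copes with a missing final newline and with `\r\n` line endings. The class name, the helper methods and the generated `MULTIPLIERS` table are assumptions made for this example, and the code assumes every character is below 128:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class BufferedReaderSketch {
    // Hypothetical table: one odd prime per ASCII code, starting at 3.
    static final long[] MULTIPLIERS = firstOddPrimes(128);

    static long[] firstOddPrimes(int count) {
        long[] primes = new long[count];
        int found = 0;
        for (long candidate = 3; found < count; candidate += 2) {
            boolean isPrime = true;
            for (int i = 0; i < found && primes[i] * primes[i] <= candidate; i++) {
                if (candidate % primes[i] == 0) {
                    isPrime = false;
                    break;
                }
            }
            if (isPrime) {
                primes[found++] = candidate;
            }
        }
        return primes;
    }

    // Order-independent hash: the product wraps modulo 2^64, so collisions remain possible.
    static long hashLine(String line) {
        long hash = 1;
        for (int i = 0; i < line.length(); i++) {
            hash *= MULTIPLIERS[line.charAt(i)]; // throws if a character is >= 128
        }
        return hash;
    }

    static long countPairs(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String header = in.readLine(); // first line: number of entries
            if (header == null) {
                return 0;
            }
            int size = Integer.parseInt(header.trim());
            Map<Long, Integer> seen = new HashMap<>(Math.max(16, size));
            long total = 0;
            String line;
            while ((line = in.readLine()) != null) {
                long hash = hashLine(line);
                int earlier = seen.getOrDefault(hash, 0);
                total += earlier; // one new pair per earlier entry with the same hash
                seen.put(hash, earlier + 1);
            }
            return total;
        }
    }

    // Convenience wrapper for in-memory input, used here for the demonstration.
    static long countPairsInText(String text) {
        try {
            return countPairs(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String input = "7\nBCDEFGH\nABACD\nBDCEF\nBDCAA\nDBACA\nDABACA\nDABAC\n";
        System.out.println(countPairsInText(input)); // prints 6
    }
}
```

For a real file, `countPairs` could be given a `java.io.FileReader` or the reader returned by `java.nio.file.Files.newBufferedReader`. Whether this is fast enough for a million long lines would need to be measured against the original byte-level reader.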

edited Mar 29 '20 at 16:59

answered Mar 29 '20 at 16:53
