Given that any 2 strings which are a permutation of each other are considered to be the same (eg. (ABACD, BDCAA) and (ABACD, DBACA) should be hashed to the same bucket of a HashMap). The strings only consists of any of the 128 legal ASCII characters. Is there a better hash function to minimise collision while keeping the HashMap small?
Also, is there any way to optimise the code even further? The main goal is to reduce the run time as much as possible.
The method takes in a file that contains a set of lines of text, each of which represents one entry. The first line in the file represents the total number of entries. It will calculate the total number of pairs of entries that contain an identical multiset.
An example of what the input file contain: 7 BCDEFGH ABACD BDCEF BDCAA DBACA DABACA DABAC
It should output: 6
The six pairs are: (ABACD, BDCAA) (ABACD, DBACA) (ABACD, DABAC) (BDCAA, DBACA) (BDCAA, DABAC) (DBACA, DABAC)
Part where hashing takes place:
long hash = 1;
while (c != 10) {
hash *= PRIMES[c];
c = reader.read();
}
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
public class Speed {
private static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173,
179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409,
419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541,
547, 557, 563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659,
661, 673, 677, 683, 691, 701, 709, 719};
public int processData(String filename){
try {
Reader reader = new Reader(filename);
int size = Integer.parseInt(reader.readLine());
HashMap<Long, Integer> hm = new HashMap<>(size / 2);
int total = 0;
while (true) {
int c = reader.read();
if (c == -1)
break;
long hash = 1;
while (c != 10) {
hash *= PRIMES[c];
c = reader.read();
}
if (hm.get(hash) == null) {
hm.put(hash, 1);
} else {
int value = hm.get(hash);
total += value;
hm.put(hash, value + 1);
}
}
return total;
} catch (Exception e) {
System.out.println(e);
}
return 0;
}
static class Reader
{
final private int BUFFER_SIZE = 1 << 16;
private DataInputStream din;
private byte[] buffer;
private int bufferPointer, bytesRead;
public Reader(String file_name) throws IOException
{
din = new DataInputStream(new FileInputStream(file_name));
buffer = new byte[BUFFER_SIZE];
bufferPointer = bytesRead = 0;
}
public String readLine() throws IOException
{
byte[] buf = new byte[64]; // line length
int cnt = 0, c;
while ((c = read()) != -1)
{
if (c == '\n')
break;
buf[cnt++] = (byte) c;
}
return new String(buf, 0, cnt);
}
private void fillBuffer() throws IOException
{
bytesRead = din.read(buffer, bufferPointer = 0, BUFFER_SIZE);
if (bytesRead == -1)
buffer[0] = -1;
}
private byte read() throws IOException
{
if (bufferPointer == bytesRead)
fillBuffer();
return buffer[bufferPointer++];
}
}
public static void main(String[] args) {
Speed dataProcessor = new Speed();
int answer = dataProcessor.processData(args[0]);
System.out.println(answer);
}
}