1

I "glue" together (with a help of other SO users) a small C program that maps string labels to integer labels in a edgelist data structure. For instance, for input file

Mike Andrew
Mike Jane
John Jane

the program outputs

1 2
1 3
4 3

However, I mapped huge edgelist files and unfortunately the program runs very slow in comparison to Python alternative. Below I pasted both programs, in C and Python. I kindly ask for a pointers how to improve speed of the C program.

#include <stdio.h>
#include <stdlib.h>

// Initial number of maximal lines in a file
enum { MAXL = 200};

typedef struct {
    unsigned int first;
    unsigned int second;
} edge;

typedef struct {
    unsigned int hashed;
     char **map;
} hash;


int insertInMap(hash *map, char *entry)
{
  int i =0;
  for (i=0;i<map->hashed;i++)
  {
    if (strcmp(map->map[i],entry) == 0)
    return i+1;
  }
  /* Warning no boundary check is added */
  map->map[map->hashed++] = strdup(entry);   
  return map->hashed;
}

int main() {
  FILE *fp = NULL;
  char node1[30];
  char node2[30];
  int idx = 0;
  int i, n = 0, maxl = MAXL;

  edge *edges;
  hash map;

  edges = malloc(MAXL * sizeof(edge));
  map.map = malloc(MAXL * sizeof(char*));
  map.hashed = 0;

  fp = fopen("./test.txt", "r");

  while (fscanf(fp, "%s %s", &node1, &node2) == 2) {
    if (++n == maxl) { /* if limit reached, realloc lines  */
      void *tmp = realloc (edges, (maxl + 40) * sizeof *edges);
      void *tmp1 = realloc (map.map, (maxl + 80) * sizeof(char*));
      if (!tmp) {     /* validate realloc succeeded */
        fprintf (stderr, "error: realloc - virtual memory exhausted.\n");
        break;      /* on failure, exit with existing data */
      }
      edges = tmp;    /* assign reallocated block to lines */

      map.map = tmp1;
      maxl += 40;     /* update maxl to reflect new size */
    }
    edges[idx].first = insertInMap(&map,node1);
    edges[idx].second = insertInMap(&map,node2);
    idx++;
  }

  fclose(fp);

  for (int i = 0; i < idx; i++) {
    printf("%d -- %d\n", edges[i].first, edges[i].second);
  }


  free(edges);

  return 0;
}

The corresponding Python alternative:

import fileinput

i = 0
cui2int = {}

for line in fileinput.input():    
    (cui1, cui2) = line.split()
    if cui1 in cui2int:
        int1 = cui2int[cui1]
    else:
        i += 1
        cui2int[cui1] = i
        int1 = i

    if cui2 in cui2int:
        int2 = cui2int[cui2]
    else:
        i += 1
        cui2int[cui2] = i
        int2 = i

    print(int1, int2)

EDITED AND ADDED

Below is modified code using the GLib hash implementation. I improved performance, but unfortunately still have troubles with the output which should be

1 2
1 3
4 3

instead of

0 0
0 1
1 1

Could somebody please take a look.

#include <stdio.h>
#include <stdlib.h>
#include <glib.h>
#include <stdint.h>

int main() {
  GHashTable *table;
  table = g_hash_table_new(g_int_hash, g_int_equal);

  FILE *fp = NULL;
  char node1[30];
  char node2[30];

  fp = fopen("./test.txt", "r");
  int i = 0;
  while (fscanf(fp, "%s %s", &node1, &node2) == 2) {
    char *key1 = malloc(sizeof(char)*1024);
    char *key2 = malloc(sizeof(char)*1024);
    uint32_t* value = (uint32_t *)malloc(sizeof(uint32_t));
    key1 = g_strdup(node1);
    key2 = g_strdup(node2);
    *value = i;

    uint32_t *x;
    if (g_hash_table_contains(table, key1)) {
      x = (uint32_t *)g_hash_table_lookup(table, key1);
    } else {
      i++;
      g_hash_table_insert(table, (gpointer)key1, (gpointer)value);
      x = (uint32_t *)value;
    }

    uint32_t *y;
    if (g_hash_table_contains(table, key2)) {
      y = (uint32_t *)g_hash_table_lookup(table, key2);
    } else {
      g_hash_table_insert(table, (gpointer)key2, (gpointer)value);
      y = (uint32_t *)value;
    }
    printf("%d -- %d\n", *x, *y);
  }

  fclose(fp);

  g_hash_table_destroy(table);
  table = NULL;
  return 0;
}
Andrej
  • 3,719
  • 11
  • 44
  • 73
  • 7
    Because the Python version doesn’t do a linear search for each entry like the C version does. Even if there’s words like “hash” and “map” the C version is just an array of strings with linear search. It will be slow. – Sami Kuhmonen May 13 '19 at 04:57
  • 4
    You are using the word "hash" in your program but there is no actual hashing going on, just a linear search which is very slow. You need to find out what "hashing" means and implement a hash table. – n. m. could be an AI May 13 '19 at 04:57
  • Maybe time to try a [trie](https://stackoverflow.com/questions/56074081/map-edgelist-with-string-node-labels-to-integer-labels/56074600#56074600), as I said in my answer a few days ago. Easier to code, easier to maintain, easier to reuse :-). – m.raynal May 13 '19 at 08:43
  • 1
    @m.raynal Sure, but first with hash. – Andrej May 13 '19 at 17:08
  • I hope that I made an effort and re-implemented the initial program using the GLib hash table. Could somebody please take a look why the counting doesn't work correctly. – Andrej May 14 '19 at 07:32
  • 1
    the code posted for the first C example results in LOTS of critical warnings, When compiling, always enable the warnings, then fix those warnings. ( for `gcc`, at a minimum use: `-Wall -Wextra -Wconversion -pedantic -std=gnu11` ) Note: other compilers use different options to produce the same results. – user3629249 May 14 '19 at 13:57
  • the first C example is missing the statement: `#include ` for the functions: `strdup()` and `strcmp()` – user3629249 May 14 '19 at 13:59
  • 1
    OT: regarding: `uint32_t* value = (uint32_t *)malloc(sizeof(uint32_t));` 1) In C, The heap allocation functions: `malloc()`, `calloc()`, `realloc()` return type `void*` which can be assigned to any pointer. Casting just clutters the code, making it more difficult to understand, debug, etc. 2) Always check (!=NULL) the returned value to assure the operation was successful. – user3629249 May 14 '19 at 14:11
  • OT: regarding: `fp = fopen("./test.txt", "r");` Always check (!=NULL) the returned value to assure the operation was successful. – user3629249 May 14 '19 at 14:12
  • 1
    OT: regarding: `while (fscanf(fp, "%s %s", &node1, &node2) == 2) {` When using the input format specifiers '%s' and/or '%[...]' always include a MAX CHARACTERS modifier that is 1 less than the length of the input buffer because these specifiers always append a NUL byte to the input. This also assures there will not be a buffer overflow with the attendant undefined behavior – user3629249 May 14 '19 at 14:14
  • 1
    OT: regarding: `char *key1 = malloc(sizeof(char)*1024);` The expression: `sizeof(char)` is defined in the C standard as 1. Multiplying anything by 1 has no effect. Suggest removing the `sizeof(char)` expression from every call to a heap allocation function. – user3629249 May 14 '19 at 14:17

3 Answers3

6

Your "hash" in C operates more like a linked list, with linear insertion and lookup. On the other hand, Python's dictionary is industrial-strength with O(1) average insertions and lookups (the in operator). If you're writing a hashmap from scratch in C, there's a good deal of theory you're going to need to put into practice in order to begin approaching Python's implementation in terms of performance.

In my opinion, the best bet is to write your code in C++ if possible and use an unordered_map. This is the best of both worlds: all the work has already been done for you, yet you don't need to make performance compromises.

If you are set on (or stuck with) C, there are quite a few resources on the internet, but I'm hesitant to post any links here since I can't vouch for their quality. It should be an educational endeavor.

ggorlen
  • 44,755
  • 7
  • 76
  • 106
6

The two programs are using fundamentally different data structures with different time complexity. The python program is using a dictionary which is a highly tuned hash- table with O(1) amortized performance for lookup and deletion.

So the python program is running with O(number of words) asymptotic complexity.

Now, talking about your C program what you have tried to create is essentially just an array of key-value pairs. Inserting or retrieving the keys takes O(size of the array) here since you can potentially loop through the array till the end to find a match.

If you do some math, it turns out to be O((number of words)2).

C++ has built-in Hash-table implementation named unordered_map, you can use that if you have no trouble switching to C++. Or check out this question on SO to learn writing your own hash table in C. What is a hash table and how do you make it in C?

1

The problem with your code is that despite the name, it isn't a working hash table. You chew through the map with a linear search which is very slow. What you should do instead:

  • Set the hash table size to a fixed size. Avoid any solution based on realloc.
  • Come up with a hashing function to determine table index. There should be plenty of code examples on the net using strings.
  • Implement a method for storing/checking an index. This could be storing in the next table index available, or by implementing "chaining", where each index is a linked list, etc.
Lundin
  • 195,001
  • 40
  • 254
  • 396