
I'm trying to parse a set of files that all have the same format. Here's an example instance:

NAME:  br17
TYPE: ATSP
COMMENT: 17 city problem (Repetto)
DIMENSION:  17
EDGE_WEIGHT_TYPE: EXPLICIT
EDGE_WEIGHT_FORMAT: FULL_MATRIX 
EDGE_WEIGHT_SECTION
 9999    3    5   48   48    8    8    5    5    3    3    0    3    5    8    8
    5
    3 9999    3   48   48    8    8    5    5    0    0    3    0    3    8    8
    5
    5    3 9999   72   72   48   48   24   24    3    3    5    3    0   48   48
   24
   48   48   74 9999    0    6    6   12   12   48   48   48   48   74    6    6
   12
   48   48   74    0 9999    6    6   12   12   48   48   48   48   74    6    6
   12
    8    8   50    6    6 9999    0    8    8    8    8    8    8   50    0    0
    8
    8    8   50    6    6    0 9999    8    8    8    8    8    8   50    0    0
    8
    5    5   26   12   12    8    8 9999    0    5    5    5    5   26    8    8
    0
    5    5   26   12   12    8    8    0 9999    5    5    5    5   26    8    8
    0
    3    0    3   48   48    8    8    5    5 9999    0    3    0    3    8    8
    5
    3    0    3   48   48    8    8    5    5    0 9999    3    0    3    8    8
    5
    0    3    5   48   48    8    8    5    5    3    3 9999    3    5    8    8
    5
    3    0    3   48   48    8    8    5    5    0    0    3 9999    3    8    8
    5
    5    3    0   72   72   48   48   24   24    3    3    5    3 9999   48   48
   24
    8    8   50    6    6    0    0    8    8    8    8    8    8   50 9999    0
    8
    8    8   50    6    6    0    0    8    8    8    8    8    8   50    0 9999
    8
    5    5   26   12   12    8    8    0    0    5    5    5    5   26    8    8
 9999
EOF

I want to extract the dimension of the matrix and the matrix itself; everything else can be discarded. This is the code I'm currently using to try to parse it:

fp = fopen(argv[1] , "r");

for (i = 0; i < 3; ++i)
{
    fscanf(fp, "\n");
}
fscanf(fp, "%d", &size);
for (i = 0; i < 3; ++i)
{
    fscanf(fp, "\n");
}
cost = (double**) calloc(size, sizeof(double*));
for(i = 0 ; i < size; ++i){
    cost[i] = (double*) calloc(size, sizeof(double));
}

for(i = 0 ; i < size; ++i)
{
    for(j = 0 ; j < size; ++j)
    {
        fscanf(fp, "%lf", &(cost[i][j]));
    }
    cost[i][i] = 0;
}

fclose(fp);

(The file does seem to have newlines when I open it in a text editor - though not in Notepad - I don't know why they've vanished here. NAME, TYPE, COMMENT, DIMENSION, EDGE_WEIGHT_TYPE, EDGE_WEIGHT_FORMAT, and EDGE_WEIGHT_SECTION all appear starting new lines. EDIT: Ah, thanks, Yossarian. I'm a Stack Overflow newbie!)

Anyway, my code isn't working. Specifically, I've noticed through using a debugger that it isn't lifting the dimension of the matrix, which means that the attempt to read the matrix properly is doomed from the start. All variables are declared, that's not the problem. It's just not reading the number after dimension and assigning it to size. What am I doing wrong?

EDIT: I've tried Vicky's suggestion of fscanf(fp, "%s\n", buf); - which also has the advantage of letting me see where it is in the file by watching the value of buf - and discovered that it's taking one word at a time, not one line. Trouble with that approach is that the COMMENT: line isn't consistent in the number of words it has. Using "%*s" and "%*s\n" didn't write anything at all to buf.

EDIT 2: while((c = getchar()) != '\n' && c != 'EOF') ; just hangs the program. No idea what it's doing.

EDIT 3: while((c = getc(fp)) != '\n' && c != 'EOF') ; is going through the file line by line, but fscanf(fp, "%d", &size); still isn't picking up the number.

EDIT 4: Aha! Got it working with

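int c;   /* int rather than char, so EOF can be told apart from real characters */
for (i = 0; i < 3; ++i)     /* skip the NAME, TYPE and COMMENT lines */
{
    while((c = getc(fp)) != '\n' && c != EOF) ;   /* EOF is a macro, not the character constant 'EOF' */
}
fscanf(fp, "%*s");          /* discard the word "DIMENSION:" */
fscanf(fp, "%d", &size);    /* %d always reads decimal; %i would read 017 as octal */

for (i = 0; i < 4; ++i)     /* rest of the DIMENSION line plus the three remaining header lines */
{
    while((c = getc(fp)) != '\n' && c != EOF) ;
}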
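
The line-skipping loop can also be wrapped in a small helper, as Mats Petersson suggests in the comments. This is only a sketch under the assumption that every file has exactly the seven header lines shown in the example; the names skip_line and read_dimension are made up for illustration.

```c
#include <stdio.h>

/* Discards the rest of the current line. Returns 1 if a newline was
   consumed, 0 if end of file was reached first. */
int skip_line(FILE *fp)
{
    int c;  /* int, not char, so that EOF stays distinguishable */

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            return 1;
    }
    return 0;
}

/* Skips the three lines before DIMENSION, reads the number, then skips
   ahead to the first matrix row. Returns the dimension, or -1 on failure. */
int read_dimension(FILE *fp)
{
    int size;

    for (int i = 0; i < 3; ++i)
        if (!skip_line(fp))
            return -1;
    if (fscanf(fp, "%*s %d", &size) != 1)   /* "DIMENSION:" then the number */
        return -1;
    for (int i = 0; i < 4; ++i)             /* rest of this line + 3 header lines */
        if (!skip_line(fp))
            return -1;
    return size;
}
```

After read_dimension returns, the stream should be positioned at the first matrix entry, so the fscanf(fp, "%lf", ...) loop from the question can follow directly.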

Thanks for the help, all!

Tam Coton
  • I don't think you can do `fscanf(fp, "\n");` to consume a line like that, can you? I think you need `fscanf(fp, "%s\n", buf);` (having first declared buf appropriately). – Vicky May 09 '13 at 08:39
  • Also, C is a terrible language for this type of text processing - do you have any option to use python, perl or another scripting language? – Vicky May 09 '13 at 08:39
  • If you want to "skip a line", why not use `int c; while((c = getchar()) != '\n' && c != EOF) ;` [make it a function called "skipline" or some such]. – Mats Petersson May 09 '13 at 08:48
  • kind of overkill. `fscanf(fp, "%*s");` will do better. (without any buffer) – nothrow May 09 '13 at 08:51
  • I'd much rather be using Python - I'm actually competent at it - but I need the better speed of C for the calculations I'm doing with the data, and I have no idea how to go about interfacing C and Python. Thanks for the pointers though, will try both approaches! – Tam Coton May 09 '13 at 08:52
  • @TamCoton, (just an idea) why don't you pre-process the file in python (removing everything unnecessary, doing output only with numbers), resulting in file with 'dimension' 'data', so you can only fscanf("%d") everything? – nothrow May 09 '13 at 08:56
  • @Yossarian Because I'd ideally like a program that can just take the data straight from TSPLIB and use it, without having to run a preprocessing step on every file. If all else fails I'll have to do that, but I'd prefer to get this working. – Tam Coton May 09 '13 at 09:06

2 Answers


I've always found that the scanf functions introduce more problems than they solve.
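
To illustrate the kind of problem meant here: a whitespace directive such as "\n" in a scanf format consumes any run of whitespace and then stops; it does not skip a line. In the question's code the three fscanf(fp, "\n") calls therefore consume nothing, and the following %d meets the N of NAME: and fails. A minimal sketch that shows the same behaviour on a string (the helper name try_read_dimension is hypothetical):

```c
#include <stdio.h>

/* Mimics the question's approach on an in-memory header.
   Returns what sscanf returns: the number of items converted. */
int try_read_dimension(const char *header, int *size)
{
    /* "\n" matches zero or more whitespace characters, not a line,
       so %d is applied to whatever non-blank text comes first. */
    return sscanf(header, "\n\n\n%d", size);
}
```

Called on a header that starts with "NAME:", this should return 0 and leave size untouched. Checking the return value of every fscanf call against the number of conversions requested would expose this kind of failure immediately.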

Personally, I prefer to use fgets:-

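#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { FindingLineType, GettingDimension, ParsingMatrix, SkippingLine } ParseState;

/* Returns a dimension x dimension matrix stored row by row in one block
   (entry [i][j] is at i * dimension + j), or NULL on error. */
double *parse_file (const char *filename, int *dimension)
{
  char buffer [1024];
  const char *separators = " \t\r\n";   /* also strips the line ending */
  double *matrix = NULL;
  long total = 0, count = 0;             /* entries expected / stored so far */
  FILE *file = fopen (filename, "r");

  *dimension = 0;
  if (!file)
    return NULL;

  while (fgets (buffer, sizeof buffer, file))
  {
    ParseState state = FindingLineType;

    for (char *token = strtok (buffer, separators) ; token ; token = strtok (NULL, separators))
    {
      // parse the token!
      switch (state)
      {
      case FindingLineType:
        if (strcmp (token, "DIMENSION:") == 0)   /* stricmp is not standard C */
        {
          state = GettingDimension;
        }
        else if (!isdigit ((unsigned char) *token))
        {
          state = SkippingLine;   /* other header: ignore the rest, e.g. the 17 in COMMENT */
        }
        else if (matrix)
        {
          state = ParsingMatrix;
          if (count < total) matrix [count++] = atof (token);   /* first number on the line */
        }
        else
        {
          fprintf (stderr, "error - got matrix row before dimension\n");
          fclose (file);
          return NULL;
        }
        break;

      case GettingDimension:
        *dimension = atoi (token);
        if (*dimension > 0 && !matrix)
        {
          total = (long) *dimension * *dimension;
          matrix = calloc ((size_t) total, sizeof *matrix);
        }
        state = SkippingLine;
        break;

      case ParsingMatrix:
        if (count < total)
          matrix [count++] = atof (token);
        break;

      case SkippingLine:
        break;
      }
    }
  }

  fclose (file);
  return matrix;
}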

That should give you some ideas; you can certainly add more error checking.
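
If splitting into tokens feels heavy, each line read by fgets can also be handed to sscanf. A minimal sketch, assuming the header line looks like "DIMENSION: 17" or "DIMENSION : 17" (the helper name parse_dimension is hypothetical):

```c
#include <stdio.h>

/* Returns 1 and stores the value if `line` is a DIMENSION header,
   otherwise returns 0 and leaves *dimension untouched. */
int parse_dimension(const char *line, int *dimension)
{
    int value;

    /* A space in the format matches any amount of whitespace,
       including none, so "DIMENSION:" and "DIMENSION :" both match. */
    if (sscanf(line, " DIMENSION : %d", &value) == 1) {
        *dimension = value;
        return 1;
    }
    return 0;
}
```

Because the literal text in the format has to match first, a line such as the COMMENT one is rejected even though it contains a number.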

Nagaraju
Skizz
  • I'm sure it works great, but that's way above my C competence level! I have no idea what that code means! – Tam Coton May 09 '13 at 11:50
  • @TamCoton: Don't worry, you'll get there in the end. There is a fair amount missing (enum/variable definitions), and it's not all real C, there's a bit of pseudo code in there too. – Skizz May 09 '13 at 12:55

Relying on \n alone to detect the end of a line is not reliable: what if your file uses \r as the line terminator? In general you need to treat \n, \r and \r\n as line-ending sequences.
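
When lines are read with fgets, one way to cope with the different terminators is to cut the buffer at the first \r or \n. A minimal sketch (the helper name strip_line_ending is hypothetical). It only cleans up a buffer that has already been read: fgets itself splits on \n only, so a file that uses bare \r would still arrive as one long line.

```c
#include <string.h>

/* Cuts `line` at the first '\r' or '\n', so a trailing "\n", "\r\n"
   or lone "\r" is removed. Returns the new length. */
size_t strip_line_ending(char *line)
{
    size_t len = strcspn(line, "\r\n");

    line[len] = '\0';
    return len;
}
```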

Windows Notepad is a kind of "Hello World" editor. It is very basic and only understands \r\n as a line ending, which is why it shows a single row when lines end with \r or \n alone.

Reading lines in C is not a trivial task. You can use fgets if you know the maximum length a line can have, so that you can pass a large enough buffer. Otherwise you have to deal with unknown line lengths, which is your case with the COMMENT line. I personally prefer to have a function which deals with all of that and returns one line from the file per call; see an example of such a read_line function.
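
The linked example is not reproduced here. As a rough sketch of what such a function could look like, assuming the signature char *read_line(FILE *) used in the snippets of this answer and a caller that frees the result:

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads one line of any length from `fp`. Accepts "\n", "\r" and "\r\n"
   as terminators and does not store them. Returns a malloc'ed string
   that the caller must free, or NULL at end of file or on allocation
   failure. */
char *read_line(FILE *fp)
{
    size_t cap = 64, len = 0;
    char *line, *tmp;
    int c = getc(fp);

    if (c == EOF)
        return NULL;                    /* nothing left to read */
    if ((line = malloc(cap)) == NULL)
        return NULL;

    while (c != EOF && c != '\n') {
        if (c == '\r') {                /* "\r" alone or "\r\n" */
            c = getc(fp);
            if (c != '\n' && c != EOF)
                ungetc(c, fp);          /* belongs to the next line */
            break;
        }
        if (len + 1 == cap) {           /* keep room for the '\0' */
            cap *= 2;
            if ((tmp = realloc(line, cap)) == NULL) {
                free(line);
                return NULL;
            }
            line = tmp;
        }
        line[len++] = (char)c;
        c = getc(fp);
    }
    line[len] = '\0';
    return line;
}
```

The buffer doubles whenever it fills up, so the length of the COMMENT line does not need to be known in advance.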

In the file format you have presented, the size of the matrix is on line 4, so you have to skip the first three lines. To do that, just call the following three times:

read_line(pf);

On the 4th read_line you have the line containing the size. You need to extract and save it for later.

// DIMENSION:  17
line = read_line(pf);
printf("%s\n", line);
// extract the size of the matrix
tmp = strtok(line, " ");     // "DIMENSION:"
tmp = strtok(NULL, " ");     // "17"
size = tmp ? atoi(tmp) : 0;  // stays 0 if the number is missing
printf("Parsed size = %d\n", size);
free(line);
line_number++;

Now there are three more lines before the matrix starts. Skip them just like the first ones.
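
Counting header lines works for this file, but it breaks as soon as a file carries an extra header line. A slightly more tolerant sketch, assuming every file marks the start of the matrix with an EDGE_WEIGHT_SECTION line as the example does. The helper name skip_to_matrix and the 1024-byte line limit are assumptions, and it uses plain fgets rather than read_line to stay self-contained:

```c
#include <stdio.h>
#include <string.h>

/* Reads and discards lines until the EDGE_WEIGHT_SECTION marker has been
   consumed. Returns 1 if the marker was found, 0 if the file ended first.
   Assumes header lines are shorter than the buffer. */
int skip_to_matrix(FILE *fp)
{
    char line[1024];
    const char *marker = "EDGE_WEIGHT_SECTION";

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, marker, strlen(marker)) == 0)
            return 1;
    }
    return 0;
}
```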

Now your matrix starts, and here you can use fscanf. You can start reading from the file right after you allocate each matrix row; this way you save some unnecessary iterations.

Here is how you could build your loops:

if(size > 0) {
    printf("Start reading matrix of size %dx%d\n\n", size, size);
    matrix = malloc(sizeof(double*) * size);
    for(i = 0; i < size; i++) {
        matrix[i] = malloc(sizeof(double) * size);

        n_read = 0;
        for(j = 0; j < size; j++) {
            // fscanf returns EOF (a negative value) at end of file,
            // so count successful reads instead of summing return values
            if(fscanf(pf, "%lf", &matrix[i][j]) != 1) {
                break;
            }
            n_read++;
            printf("%.2lf\t", matrix[i][j]);
        }
        printf("\n");
        line_number++;
        if(n_read != size) {
            printf("invalid data at line %d, expected to read %d but got %d\n", line_number, size, n_read);
        }
    }
}
Community
A4L