How to remove comments on the same lines as instructions?

Question

I am trying to remove comments from an assembly instructions file, then print clean instructions in another file.

For example: This

TESTD       TD      STDIN
            JEQ     TESTD       . Loop until ready
            RD      STDIN       . Get input to see how many times to loop
            STA     NUMLOOP     . Save the user's input into NUMLOOP
STLOOP      STX     LOOPCNT     . Save how many times we've loops so far

Becomes this:

TESTD       TD      STDIN
            JEQ     TESTD
            RD      STDIN
            STA     NUMLOOP
STLOOP      STX     LOOPCNT

==========================================================================

I wrote this program to remove any thing after the point that mark comments including the dot, but it did not work; output file have the same line containing the comments as the input file.

#include <stdio.h>
#include <stdlib.h>


int main(int argc, char* argv[])
{

    FILE *fp, *fp2;                         // input and output file pointers
    char ch[1000];
    fp = fopen(argv[1], "r");
    fp2 = fopen(argv[2],"w");
    while( fgets(ch, sizeof(ch), fp))            // read line into ch array
    {
        int i = 0 ;
        for( ch[i] ; ch[i]<=ch[1000] ; ch[i++])   // loop to check all characters in the line array
        {
            if(ch[i] == '.')
            {

                for( ch[i] ; ch[i] <= ch[1000] ; ch[i++])  
                {
                    ch[i] = '\0';  //making all characters after the point NULL
                }
            }
            continue;
        }

        fputs(ch, fp2);
    }
    fclose(fp);
    fclose(fp2);
    return(0);
}

the loop condition is strange. you compared ch[i] with ch[1000]. the string ch is null terminated. so for (i=0; i — Junhee Shin, Mar 22 '18 at 08:00
Nulling all unwanted characters is not necessary - if you make the first one Nul, the string will end there, so that's all you need. — Arndt Jonasson, Mar 22 '18 at 08:04
Debug your code! [What is a debugger and how can it help me diagnose problems?](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) — Max Vollmer, Mar 22 '18 at 08:06
sometimes writing a program is overkill, simple way to solve this with nawk: `cat with_comments.txt | nawk ' { x= index($0,"."); if (x) print substr($0,1,x-1); else print $0; endif }'` and i bet even this is too long — Tommylee2k, Mar 22 '18 at 08:36
@Tommy, awk is overkill, too; it’s just as easy with sed s/\\..*// — prl, Mar 22 '18 at 09:13
The cleanest solution: use a state machine. (you'll need two states, or some more if you want to maintain whitespace before the '.') — joop, Mar 22 '18 at 10:25

Max Vollmer · Accepted Answer · 2018-03-22T08:22:41.430

Your loop makes no sense:

for( ch[i] ; ch[i] <= ch[1000] ; ch[i++])

because ch[i] <= ch[1000] compares the value of the characters with each other. Also ch[1000] is invalid, because the size of your array is 1000, so the highest valid index is 999.

Since your loop will end as soon as ch[i] <= ch[1000] is false, and since we don't know what the value of ch[1000] might be, there is a great chance that your loop won't ever be executed, thus no lines are modified at all. (There is also a chance that your loop will loop forever, if ch[1000] evaluates to some value that is always bigger than any character in your array.)

Correct loop: for(i = 0; i < 1000; i++)

Even better would be to check for the actual line length in your loop.

Also you don't need to set all characters to 0 after the dot. What for? Just set ch[i] to 0 and you're done:

if(ch[i] == '.')
{
    ch[i] = '\0';
}

On a side note, the continue; at the end of your loop is redundant. You probably want to place it inside the if-statement after setting the dot to 0.

Scheff's Cat · Answer 2 · 2018-03-22T08:34:49.953

2

I consider the design of OP's code a bit unlucky. Due to the internal buffer char ch[1000];, there is a constraint which could be easily prevented:

#include <stdio.h>

int main(void)
{
  FILE *fIn = stdin, *fOut = stdout;
  /* loop until end of input */
  for (int c = getc(fIn); c >= 0;) {
    /* read and write until '.' is detected */
    for (; c >= 0 && c != '.'; c = getc(fIn)) {
      if (putc(c, fOut) < 0) {
        fprintf(stderr, "Write failed!\n"); return -1;
      }
    }
    /* read only until '\n' is detected */
    for (; c >= 0 && c != '\n'; c = getc(fIn)) ;
  }
  /* done */
  return 0;
}

Instead of buffering a whole line, my approach just interleaves character reading and writing.

There are two read loops: one with output, one without:

At character . the first is left for second.
At character \n the second is left for first.
Everything ends when end of input is detected. (Hence all the checks for c >= 0.)

Input:

TESTD       TD      STDIN
            JEQ     TESTD       . Loop until ready
            RD      STDIN       . Get input to see how many times to loop
            STA     NUMLOOP     . Save the user's input into NUMLOOP
STLOOP      STX     LOOPCNT     . Save how many times we've loops so far

Output:

TESTD       TD      STDIN
            JEQ     TESTD       
            RD      STDIN       
            STA     NUMLOOP     
STLOOP      STX     LOOPCNT

Life demo on ideone

edited Mar 22 '18 at 08:34

answered Mar 22 '18 at 08:31

Scheff's Cat

19,528
6
28
56

Isn't an explanation of why the comments end up in the output and how to fix that problem more what OP wants than an improved design? Please explain the source of OPs problem and how your answer solves it. – Yunnosch Mar 22 '18 at 08:33
@Yunnosch I guess the other answer elaborated this sufficiently. So, are alternative approaches (to the problem as whole) not welcome? – Scheff's Cat Mar 22 '18 at 08:36
Kind of an altruistic approach to write an answer which shows alternatives to a separate answer which gets the upvotes. My respect for that. Since your code seems quite good, I in your place would probably race for the upvotes and try to make my answer standalone good, too. For that I would include the explanation of the root cause in my own answer (even if it is then visible twice). The "honorable" reason would be that the other answer might get deleted (stranger things have happened). It also seems possible to write a better, easier understandable root cause explanation. – Yunnosch Mar 22 '18 at 10:32
2

@Yunnosch: The reason you'd write an answer like this is that the whole design of the code in the question is pretty bad, and any future reader who's having the same problem may have the same design flaws in their code. Or just so such bad code can't go without comment, to stop beginners from using it as an example. A general code review of the code in the question is not a bad thing to have on SO. Unfortunately `getc` and `putc` aren't super fast (they have to lock/unlock the `FILE*`), but this won't matter for small files. It does make for clean code. – Peter Cordes Mar 22 '18 at 11:33
1

@Yunnosch: TL:DR: I write answers to be correct and useful. This is why people upvote them. I don't write them specifically to get upvotes, I write them because people are wrong on the Internet and this needs to be fixed for the benefit of future readers. I care about SO as a collection of useful knowledge. (Although I probably am addicted to my Internet nerd points, too). Debugging questions are probably the least useful/interesting, with the least future value, because you can't usually search for them until you're already debugged your problem. – Peter Cordes Mar 22 '18 at 11:39

score 0 · Answer 3 · answered Mar 22 '18 at 12:14

0

A simple state machine with two states (basically do/dont echo the character):

#include <stdio.h>
int main(void)
{
int state;

for(state=0; ; ) {
        int ch;
        ch = getc(stdin);
        if (ch == EOF) break;
        switch (state) {
        case 0: /* initial */
                if (ch == '.') { state = 1; continue; }
                break;
        case 1:
                if (ch == '\n') { state = 0; break; }
                continue;
                }
        putc(ch, stdout);
        }
return 0;
}

answered Mar 22 '18 at 12:14

joop

4,330
1
15
26

This would be useful if you expanded it read a whole buffer of input at a time and loop over the buffer, while still avoiding arbitrary line-length limits like the OP's code has. If you don't mind [the performance hit of using getc / putc](http://pubs.opengroup.org/onlinepubs/009696799/functions/getc_unlocked.html) inside the inner loop, Scheff's answer looks easier to verify as correct. No break vs. continue to mentally sort out. Having a separate loop for each state is also what you want the compiler to do anyway, instead of branching inside one bigger loop. – Peter Cordes Mar 22 '18 at 13:45
getc/putc is actually faster(since getc is a macro in most cases it can be inlined. This is also the reason to only call it at one place inside the loop) . And it doesn't require a buffer. break/continue is part of the language, and using it is basically a matter of taste and/or experience. [personally, I find nested loops harder to read] – joop Mar 22 '18 at 13:53
Did you read that link? In modern C implementations, `getc` has to use locking because it has to be thread-safe. Use POSIX `getc_unlocked` if you want somewhat-fast single-char stdio, or read into a buffer that easily fits in L1d cache, e.g. 1024 bytes and loop over that. That can let the compiler auto-vectorize. I looked at compiler output from your version vs. Scheffs on the Godbolt compiler explorer (https://godbolt.org/g/frSfM6), and you can see that `getc` compiles into a `call _IO_getc`. So while the standard allows it to be a macro, glibc doesn't make it a big macro. – Peter Cordes Mar 22 '18 at 15:37
But fortunately gcc is able to sort out your `switch` and turn it back into two separate simple loops with their own `call _IO_getc`, which jump to the other loop when detecting their break condition. Very much like putting two simple loops inside an outer infinite loop. – Peter Cordes Mar 22 '18 at 15:39
BTW: I don't care about threads. The OP didn't either. My gcc happily retains the single loop, with two unconditional jumps (loop back) and four conditional (2 fwd, 2 rev) – joop Mar 22 '18 at 15:48
I don't care about threads either, but the same `getc` library function is used by single-threaded and multi-threaded code. glibc actually does avoid locking in the single-threaded case, by keeping a flag in the `FILE` object which says whether it needs locking or not. https://code.woboq.org/userspace/glibc/libio/getc.c.html#31. And `_IO_getc_unlocked` does inline into that, so `getc` is pretty fast when there are chars in the buffer, but it still updates a counter in memory so it's slower than looping over an array of char. The overhead of `fgets` often pays back with a faster loop. – Peter Cordes Mar 22 '18 at 16:18

How to remove comments on the same lines as instructions?

3 Answers3