
I'm currently writing a C program that takes three arguments: two file names (one input, one output) and an int (the maximum length of output lines, call it x). I want to read every line of the input file and write its first x characters to the output file (effectively "trimming" the file).

Here is my code:

int main(int argc, char *argv[]) {

  const char endOfLine = '\n';

  if (argc < 4) {
    printf("Program takes 4 params\n");
    exit(1);
  } else {
    // Convert character argument [3] (line length) to an int
    int maxLen = atoi(argv[3]);

    char str[maxLen];
    char *inputName;
    char *outputName;

    inputName = argv[1];
    outputName = argv[2];

    // Open files to be read and written to
    FILE *inFile = fopen(inputName, "r");
    FILE *outFile = fopen(outputName, "w");

    int count = 0;
    char ch = getc(inFile);
    while (ch != EOF) {
        if (ch == '\n') {
          str[count] = (char)ch;
          printf("Adding %s to output\n", str);
          fputs(str, outFile);
          count = 0;
        } else if (count < maxLen) {
          str[count] = ch;
          printf("Adding %c to str\n", ch);
          count++;
        } else if (count == maxLen) {
          str[count] = '\n';
        }
        ch = getc(inFile);
    }

  }

  return 0;
}

The only problem is that if the last character of a trimmed line is a single quote mark, the program prints bytes that are not valid UTF-8, like so:

For Whom t
John Donne
No man is 
Entire of 
Each is a 
A part of 
If a clod 
Europe is 
As well as
As well as
Or of thin
Each man��
For I am i
Therefore,
For whom t
BLUEPIXY
rafro4

1 Answer


You could check whether the next byte is a UTF-8 continuation byte (bit pattern 10xxxxxx) and, if so, keep outputting bytes until the character is complete.

// bits match 10xxxxxx
int is_utf_continue_byte(int ch){
    return ch & 0x80 && ~ch & 0x40;
}

//...
while (is_utf_continue_byte(ch))
    putchar(ch), ch = getchar();
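
To make the idea concrete for the file-based program in the question, here is a minimal sketch. It assumes the input is valid UTF-8 and that the quote mark in the sample text is presumably a multi-byte character (for example U+2019, encoded as the three bytes E2 80 99), so cutting after a fixed number of bytes can split it. The names `trim_lines` and `is_utf8_continuation` are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* True when ch is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(int ch) {
    return (ch & 0xC0) == 0x80;
}

/* Hypothetical helper: copy `in` to `out`, keeping at most max_len bytes
   of each line, but finishing a multi-byte character that straddles the
   limit instead of cutting it in half. */
void trim_lines(FILE *in, FILE *out, size_t max_len) {
    assert(in != NULL && out != NULL);
    size_t count = 0;   /* bytes counted against the limit on this line */
    int truncated = 0;  /* set once the rest of the line is being dropped */
    int ch;             /* int, not char, so EOF can be told apart */

    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            putc(ch, out);          /* always keep the line terminator */
            count = 0;
            truncated = 0;
        } else if (!truncated && count < max_len) {
            putc(ch, out);
            count++;
        } else if (!truncated && is_utf8_continuation(ch)) {
            putc(ch, out);          /* limit hit mid-character: finish it */
        } else {
            truncated = 1;          /* drop everything up to the newline */
        }
    }
}
```

Writing each byte straight to the output file also avoids the fixed-size buffer and its missing terminator. Note that this still counts bytes, not characters, so a line may run up to three bytes past the limit when the cut lands inside a multi-byte character.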
luser droog
  • How would I go about doing so? – rafro4 Dec 09 '16 at 04:05
  • First, make `ch` an `int` so the `EOF` comparison is correct, then `while (ch & 0x80 && ~ch & 0x40) putchar(ch), ch = getchar();`. This checks that bit 7 is 1 (`ch & 0x80`) and bit 6 is 0 (`~ch & 0x40`). In the UTF-8 format, only continuation bytes fit this pattern. – luser droog Dec 09 '16 at 04:09
  • Why write it as the more confusing `(ch & 0x80 && ~ch & 0x40)` and not `(ch & 0xC0) == 0x80`? And why avoid braces in loops? – user253751 Dec 09 '16 at 04:43
  • For the first, it seemed conceptually simpler to test the bits separately, but yours is the way I usually do it. For the second, just a style choice. – luser droog Dec 09 '16 at 05:37
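
The two spellings of the test discussed in the comments should be interchangeable. As a quick sanity check, the sketch below compares them over every possible byte value; the function names are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Test the two bits separately: bit 7 set and bit 6 clear. */
int is_cont_two_tests(int ch) {
    return (ch & 0x80) && (~ch & 0x40);
}

/* Mask the top two bits and compare against 10xxxxxx in one step. */
int is_cont_masked(int ch) {
    return (ch & 0xC0) == 0x80;
}

/* Assert that both predicates agree for all byte values 0x00..0xFF. */
void check_equivalence(void) {
    for (int b = 0; b <= 0xFF; b++) {
        /* !! normalizes any nonzero result to 1 before comparing */
        assert(!!is_cont_two_tests(b) == !!is_cont_masked(b));
    }
}
```

Calling `check_equivalence()` returns silently when the two predicates agree on every byte, and aborts on the first mismatch otherwise (provided assertions are not disabled with `NDEBUG`).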