
I have a project in which I need to sort multiple lines of text based on the second, third, or some later word in each line, not the first word. For example, if you have

this line is first

but this line is second

finally there is this line

and you choose to sort by the second word, it would turn into

this line is first

finally there is this line

but this line is second

(since "line" sorts before "there", which sorts before "this")

I have a pointer to a char array that contains each line. So far what I've done is use strtok() to split each line up to the second word, but that changes the entire string to just that word and stores it in my array. My code for the tokenize bit looks like this:

 for (i = 0; i < numLines; i++) {
   char* token = strtok(labels[i], " ");
   token = strtok(NULL, " ");
   labels[i] = token;
 }

This gives me the second word in each line, since I called strtok twice. Then I sort those words (line, this, there). However, I need to put each string back together in its original form. I'm aware that strtok replaces the delimiters with '\0', but I've yet to find a way to get the original string back.

I'm sure the answer lies in using pointers, but I'm confused what exactly I need to do next.

I should mention I'm reading in the lines from an input file as shown:

for (i = 0; i < numLines && fgets(buffer, sizeof(buffer), fp) != 0; i++) {
  labels[i] = strdup(buffer);
}

Edit: my find_offset method

size_t find_offset(const char *s, int n) {
  size_t len;
  while (n > 0) {
     len = strspn(s, " ");
     s += len;
  }

  return len;
} 

Edit 2: The relevant code used to sort

//Getting the line and offset
for (i = 0; i < numLines && fgets(buffer, sizeof(buffer), fp) != 0; i++) {
   labels[i].line = strdup(buffer);
   labels[i].offset = find_offset(labels[i].line, nth);
}


int n = sizeof(labels) / sizeof(labels[0]);
qsort(labels, n, sizeof(*labels), myCompare);
for (i = 0; i < numLines; i++)
  printf("%d: %s", i, labels[i].line); //Print the sorted lines


int myCompare(const void* a, const void* b) { //Compare function
  xline *xlineA = (xline *)a;
  xline *xlineB = (xline *)b;

  return strcmp(xlineA->line + xlineA->offset, xlineB->line + xlineB->offset);
}
nhlyoung
  • The easiest thing to do is make a copy of the string first. – paddy Jan 30 '19 at 21:55
  • Warning: If you put the string back together, then `labels[i]` will no longer point to nice substrings. Sure you want this? – chux - Reinstate Monica Jan 30 '19 at 21:56
  • If I make a copy of the string then how can I get it into its new order? – nhlyoung Jan 30 '19 at 22:17
  • I note that you're using a single delimiter, a space, in your calls to `strtok()`. You can provide a list of delimiters, but you can't later tell which of those delimiters was the one that was zapped by the null byte. So, in the general case, you can't tell what `strtok()` did. If you keep a record of the original length of the data and you only use a single delimiter character, then you could arrange to replace the null bytes added by `strtok()` with the delimiter, reinstating the string. In the general case, you can't do that. Copy the string before tokenizing, or **avoid `strtok()`**. – Jonathan Leffler Jan 30 '19 at 22:50
  • `while (n > 0) { len = strspn(s, " "); s += len; }` is an infinite loop. – chux - Reinstate Monica Jan 31 '19 at 16:48
  • Oh my god how could I have missed that!! – nhlyoung Jan 31 '19 at 16:49
  • strdup is the only viable solution, and probably the fastest, especially if there are more delimiters than just the space – 0___________ Jan 31 '19 at 17:31
  • nhlyoung, Why `n` in `qsort(labels, n, sizeof(*labels), myCompare);` instead of `i`? – chux - Reinstate Monica Jan 31 '19 at 17:53
  • Do you mean the i in the for loop? If I stick the qsort in the loop it gives me erroneous results (lines are out of order) – nhlyoung Jan 31 '19 at 17:58

2 Answers


Rather than wrestle with strtok(), consider using strspn() and strcspn() to parse the string for tokens. The original string is never modified, so it can even be const.

#include <stdio.h>
#include <string.h>

int main(void) {
  const char str[] = "this line is first";
  const char *s = str;
  while (*(s += strspn(s, " ")) != '\0') {
    size_t len = strcspn(s, " ");

    // Instead of printing, use the nth parsed token for key sorting
    printf("<%.*s>\n", (int) len, s);

    s += len;
  }
}

Output

<this>
<line>
<is>
<first>

Or

Do not sort the lines themselves.

Sort structures that record each line together with the offset of its sort key:

typedef struct {
  char *line;
  size_t offset;
} xline;

Pseudo code

int fcmp(a, b) {
  return strcmp(a->line + a->offset, b->line + b->offset);
}

size_t find_offset_of_nth_word(const char *s, n) {
  while (n > 0) {
    use strspn(), strcspn() like above
  }
}

main() {
  int nth = ...;
  xline labels[numLines];
  for (i = 0; i < numLines && fgets(buffer, sizeof(buffer), fp) != 0; i++) {
     labels[i].line = strdup(buffer);
     labels[i].offset = find_offset_of_nth_word(labels[i].line, nth);
  }

  qsort(labels, i, sizeof *labels, fcmp);

}

Or

After reading each line, find the nth token with strspn(), strcspn() and then re-form the line from "aaa bbb ccc ddd \n" to "ccc ddd \naaa bbb ", sort, and later restore each line to its original order.


In all cases, do not use strtok(): too much information is lost.

chux - Reinstate Monica
  • This is a great answer! I'm currently trying to work it out via structs, but I run into a problem with the find_offset method. Could you elaborate? – nhlyoung Jan 31 '19 at 16:38
  • @nhlyoung Post your troublesome `find_offset()` code and I'll see. Also what exactly is the "problem"? – chux - Reinstate Monica Jan 31 '19 at 16:40
  • It doesn't print anything. I have it set up to print the original first three lines after they're read. Reading each line like labels[i].line = strdup(buffer) works fine, and then I can print those lines. But if I try to find the offset, the program freezes and prints nothing. If I comment that line out it prints, so I think it's a problem with the find offset method – nhlyoung Jan 31 '19 at 16:46
  • @nhlyoung [infinite loop](https://stackoverflow.com/questions/54450118/how-to-restore-string-after-using-strtok/54450265?noredirect=1#comment95738813_54450118) – chux - Reinstate Monica Jan 31 '19 at 16:48
  • Thank you so much, I feel like I'm so close but there's just one more thing. No matter what I change "nth" to it still sorts and prints the lines based on the first word in each line. I need it to sort based on the "nth" word if that makes sense. I'll post the relevant code snippets – nhlyoung Jan 31 '19 at 16:53
  • @nhlyoung Notice this answer uses 2 alternating function: `strspn(s, " ")`, `strcspn(s, " ")` to walk the string. BTW, you can post your own answer. – chux - Reinstate Monica Jan 31 '19 at 16:56
  • So I should use a while loop within a while loop? Like `while (*(s += strspn(s, " ")) != '\0')` comes right after the `while (n > 0)`? – nhlyoung Jan 31 '19 at 17:10
  • @nhlyoung Just one loop. 2 conditions to stop: reached the nth token, reached the end of the line. something like `while (not at end of line && i < n)` – chux - Reinstate Monica Jan 31 '19 at 17:36
  • I still have a bug if I choose to sort based on the greatest word (the 4th word in this case since the shortest sentence has 4 words). It's not in correct order, I posted an answer with my find_offset method and the problem! – nhlyoung Jan 31 '19 at 17:50

I need to put the string back together in its original form. I'm aware that strtok turns the tokens into '\0', but I've yet to find a way to get the original string back.

Far better would be to avoid damaging the original strings in the first place if you want to keep them, and especially to avoid losing the pointers to them. Provided that it is safe to assume that there are at least three words in each line and that the second is separated from the first and third by exactly one space on each side, you could undo strtok()'s replacement of delimiters with string terminators. However, there is no safe or reliable way to recover the start of the overall string once you lose it.

I suggest creating an auxiliary array in which you record information about the second word of each sentence -- obtained without damaging the original sentences -- and then co-sorting the auxiliary array and the sentence array. The information recorded in the auxiliary array could be a copy of the second word of each sentence, its offset and length, or something similar.

John Bollinger