0

Using: Unix 2.6.18-194.el5

I am having an issue where this join statement is omitting values/indexes from the match. I found out the values are between 11-90 (out of about 3.5 Million entries) and I have tried to look for foreign characters but I may be overlooking something (Tried cat -v to see hidden characters).

Here is the join statement I am using (only simplified the output columns for security):

join -t "|" -j 1 -o 1.1 2.1 file1 file2> fileJoined

file1 contents (first 20 values):

1 3 7 11 12 16 17 19 20 21 27 28 31 33 34 37 39 40 41 42

file2 contents (first 50 values so you can see where it would match):

1|US 2|US 3|US 4|US 5|US 6|US 7|US 8|US 9|US 10|US 11|US 12|US 13|US 14|US 15|US 16|US 17|US 18|US 19|US 20|US 21|US 22|US 23|US 24|US 25|US 26|US 27|US 28|US 29|US 30|US 31|US 32|US 33|US 34|US 35|US 36|US 37|US 38|US 39|US 40|US 41|US 42|US 43|US 44|US 45|US 46|US 47|US 48|US 49|US 50|US

From my initial testing it appears that file2 is the culprit. Because when I create a new file with values 1-100 I am able to get the join statement to match completely against file1; however the same file will not match against file2.

Another strange thing is that the file is 3.5 million records long and at value 90 they start matching again. For example, the output of fileJoined looks like this (first 20 values only):

1|1 3|3 7|7 90|90 91|91 92|92 93|93 95|95 96|96 97|97 98|98 99|99 106|106 109|109 111|111 112|112 115|115 116|116 117|117 118|118

Other things I have tried are:

  1. Using vi to manually enter a new line 11 (still doesnt match on the join statement)
  2. copying the code into notepad, deleting the lines in vi and then copying them back in (same result, no matching 11-90)
  3. Removing lines 11-90 to see if the problem then shifts to 90-170 and it does not shift

I think that there may be some hidden values that I am missing, or that the 11 - 90 from file1 is not the same binary equivalent as the 11 - 90 in file2?

I am lost here, any help would be greatly appreciated.

Thomas Dickey
  • 51,086
  • 7
  • 70
  • 105
fullern14
  • 3
  • 1
  • As a side comment, I am able to get all the values extracted using: grep -f file1New file2 where file1New has all the same values but a ^ in front and a | in the end. For example: ^1| ^3| ^7| ^11| However, this will take FOREVER to run on a large file like the one I am running this against – fullern14 Jun 17 '15 at 19:29

1 Answers1

0

I tried this out, and I noticed a couple things.

First: this is minor, but I think you're missing a comma in your -o specifier. I changed it to -o 1.1,2.1.

But then, running it on just the fragments you posted, I got only three lines of output:

1|1
3|3
7|7

I think this is because join assumes alphabetical sorting, while your input files look like they're numerically sorted.

Rule #1 of join(1) is to make sure your inputs are sorted, and the same way join expects them to be!

When I ran the two input files through sort and then joined again, I got 18 rows of output. (Sorting was easy, since you're joining on the first column; I didn't have to muck around with sort's column specifiers.)

Beware that, these days, sort doesn't always sort the way you expect, due to locale issues. I tend to set LC_ALL=C to make sure I get the old-fashioned behavior I'm used to.

Steve Summit
  • 45,437
  • 7
  • 70
  • 103
  • Are you using sort -g to sort the columns? – fullern14 Jun 17 '15 at 18:48
  • @fullern14: No, plain sort. join is always going to assume a default sort (not a numeric one or anything), so in general you have to sort your inputs that way, too. – Steve Summit Jun 17 '15 at 18:56
  • So I am able to get more matches using only the sort without the numeric -g; however, it still will not match all of the values in the first dataset "file1" which is still the underlying problem. All of the numbers (20 total) listed in file1 are indeed in file2 in the first column and theoretically they should all match. – fullern14 Jun 17 '15 at 19:20
  • Are you sorting *both* files? – Steve Summit Jun 17 '15 at 21:12
  • yes both files are being sorted. But even in your run you are coming up 2 records short as well if you are only finding 18 values right? This is a very strange occurrence for me because another dataset runs just fine – fullern14 Jun 18 '15 at 00:21
  • You're right. I didn't notice that there were two missing even in the small problem you presented and I was running. I see the problem. 1 should sort before 10, but due to the '|' character after it it's sorting after. I thought it would work to run it through `sort -t'|'`, but it's not helping, and I'm not sure why. – Steve Summit Jun 18 '15 at 00:31
  • I'm increasingly convinced that this is a bug, or at least a serious misfeature, in sort. It's vital when using join that the file be sorted properly, but sort with the `-t'|'` flag is simply not sorting in the same way that join with that same flag expects. I've asked a new question about this at http://stackoverflow.com/questions/30905992/sort-on-pipe-delimited-fields-not-behaving-as-expected . – Steve Summit Jun 18 '15 at 03:31
  • Okay, got it. You can thank @JohnKugelman for this: when you sort your file2 prior to joining, you need to also use `-k 1,1` to force it to consider *only* the first column. (Strikes me as odd, but `sort` is an old-school tool with more than its share of odd quirks.) – Steve Summit Jun 18 '15 at 03:46