0

I have a file which contains multiple rows of item codes, as follows. There are 1 million rows similar to these:

  1.  123,134,256,345,789.....
  2.  123,256,345,678,789......
   .
   .  

I would like to find the count of all pairs of words/items per row in the file, using q in kdb+. That is, any two words that occur in the same row count as a word pair. For example:

(123,134), (123,256), (134,256), (123,345), (123,789), (134,789) are some of the word pairs in row 1.

(123,256), (123,345), (678,789), (345,789) are some of the word pairs in row 2.

  word/item pair   count
  123,134          1
  123,256          2
  345,789          2

I am reading the file using read0, and I have been able to convert each line into a list using vs and to count the individual words using count each group. Now I want to find the count of all the word pairs per row in the file.

Thanks in advance for your help

Thomas Smyth - Treliant
Abhinav Choudhury

3 Answers

2

I'm not 100% sure I understand your definition of a word pair. Perhaps you could expand a little if my logic doesn't match what you were looking for.

In the example below, I've created a 5x5 matrix of symbols for testing, selected the distinct pairs of values from each row, and then counted how many rows each of these pairs appears in.

Please double check with your own results.

q)test:5 cut`$string 25?5

q)test
2 0 1 0 0
2 4 4 2 0
1 0 0 3 4
2 1 1 4 4
3 0 3 4 0

q)count each group raze {l[where(count'[l:distinct distinct each asc'[x cross x:distinct x]])>1]} each test
0 2| 2
1 2| 2
0 1| 2
2 4| 2
0 4| 3
1 3| 1
1 4| 2
0 3| 2
3 4| 2
Thomas Smyth - Treliant
1

To add some other cases to Matthew's answer above, if what you want is to break the list down into pairs in this way:

l:"a,b,c,d,e,f,g"

becomes

"a,b"
"b,c"
"c,d"
"d,e"
"e,f"
"f,g"

so only taking valid pairs, you could use something like this:

f:{count each group b flip 0 1+\:til count[b:","vs x]-1}

q)f l
,"a" ,"b"| 1
,"b" ,"c"| 1
,"c" ,"d"| 1
,"d" ,"e"| 1
,"e" ,"f"| 1
,"f" ,"g"| 1

Here we split the input list on ",", then use indexing to pair each element with the element directly to its right, then group the resulting list of pairs to count the distinct pairs. If you want to split it so l becomes

"a,b"
"c,d"
"e,f"  

then you could use this:

g:{count each group b flip 0 1+\:2*til count[b:","vs x]div 2}

q)g l
,"a" ,"b"| 1
,"c" ,"d"| 1
,"e" ,"f"| 1

This uses a similar approach, starting with the elements at even indices and pairing each with the element to its right, then grouping and counting as above. You can easily apply these to the rows read with read0:

r:read0`:file.txt
f each r

This outputs a dictionary of the counts of each pair for each row, and these can be summed to give the total count of each word pair throughout the file, with either method.
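As a rough sketch of that summing step, assuming r and f are defined as above: adding dictionaries in q unions their keys, so the per-row results can be collapsed into one dictionary of totals. The name totals is only illustrative.

```q
/ r and f as defined above; totals is a hypothetical name
/ adding dictionaries unions their keys, so this gives one count per distinct pair
totals:sum f each r
```

The same works with g in place of f.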

Hope this helps. It's still not clear what you mean by pairs, so if neither my answer nor Matthew's is of use, you could edit in a more complete explanation of what you'd like and we can help with that.

Ryan McCarron
1

If you want to consider all possible pairs (combinations of 2 items) in each row, then this may be of help. The following function gives the distinct combinations of indices, where x is the size of the list and y is the length of the combination:

q)comb:{$[x=y;enlist til x;1=y;flip enlist til x;.z.s[x;y],.z.s[x;y-1],'x-:1]}
q)comb[3;2]
0 1
0 2
1 2

From here we can index into each list to get the pairs, then raze to give a single list of all pairs, group to get the indices where each pair occurs and then count the number of indices in each group:

q)a
123 134 256 345 789
123 256 345 678 789
q)count each group raze{x comb[count x;2]}'[a]
123 134| 1
123 256| 2
134 256| 1
...
345 789| 2
...
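Putting this together with the file from the question, a minimal end-to-end sketch might look like the following. It assumes comb is defined as above, that items.txt (a hypothetical file name) holds comma-separated integer codes, and that every row has at least two items, since comb is not written to handle shorter rows. Each row is sorted first so that a pair gets the same key whatever order its items appear in.

```q
/ items.txt is a hypothetical file name; every row is assumed to hold at least two items
rows:{"J"$","vs x}each read0`:items.txt
/ sort each row so that a pair gets the same key regardless of item order in the file
pairs:raze{x comb[count x;2]}each asc each rows
count each group pairs
```

If an item can repeat within a row, applying distinct to each row before asc would avoid counting an item paired with itself.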
Thomas Smyth - Treliant