1

I have a table that contains several rows of terms, and I want to filter the rows of a second table whose sentences contain these terms. Does anyone have an idea how this can be done? Thank you.

I did exactly what you showed, but I guess I have a problem with the Rule-based Row Splitter. See the error I get when I try to run it.

Regina
  • 115
  • 4
  • 13

2 Answers

3

(Disclaimer: I am not familiar with the text processing extension. If the terms and sentences come from it and are not compatible with strings, I hope someone else can help you.)

You can create rules from the terms (I am assuming none of those contain " symbols) using the String Manipulator node like the following:

join("$yourSentenceColumn$ MATCHES \".*?\\Q", $yourTermColumn$, "\\E.*\" => TRUE")

In case your terms contain quote symbols (but you do not want them in matching):

join("$yourSentenceColumn$ MATCHES \".*?\\Q", removeChars($yourTermColumn$, "\""), "\\E.*\" => TRUE")

This is similar to my answer to your previous question; the only addition is the \Q and \E quoting patterns, which make the term match literally.

After this, you can use the resulting column as the rule column in the Rule-based Row Filter (Dictionary) or Rule-based Row Splitter (Dictionary) nodes. (I have not tried it this time, but it should work.)
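To make clearer what the generated rules do, here is a minimal Python sketch, not KNIME code. It assumes that MATCHES tests the whole sentence against the pattern (which is why the term is wrapped in `.*?` and `.*`), and it uses `re.escape` as a stand-in for the Java `\Q...\E` quoting. The function names `build_rule` and `sentence_matches` and the sample data are made up for illustration.

```python
import re
from typing import Iterable


def build_rule(term: str, sentence_column: str = "yourSentenceColumn") -> str:
    """Return the rule text the String Manipulator expression would produce."""
    cleaned = term.replace('"', "")  # same idea as removeChars(term, "\"")
    return f'${sentence_column}$ MATCHES ".*?\\Q{cleaned}\\E.*" => TRUE'


def sentence_matches(sentence: str, terms: Iterable[str]) -> bool:
    """Emulate the rule set: True if any term occurs literally in the sentence."""
    for term in terms:
        cleaned = term.replace('"', "")
        # re.escape stands in for \Q...\E: the term is treated as literal text.
        pattern = ".*?" + re.escape(cleaned) + ".*"
        if re.fullmatch(pattern, sentence):
            return True
    return False


# Hypothetical sample data, based on the example in the comments.
terms = ['"Adam Gober"', "Tall group"]
sentences = [
    "Once upon a time Adam Gober went to school",
    "Nothing relevant in this sentence",
]
kept = [s for s in sentences if sentence_matches(s, terms)]
rule = build_rule("Adam Gober")
```

Here `kept` holds only the first sentence, and `rule` is the text of one row of the rules table that goes into the second input of the dictionary node.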

Everything together as a KNIME workflow

Gábor Bakos
  • 8,982
  • 52
  • 35
  • 52
  • Thank you for your fast reply. How can I join them if I have my terms in one table and table with strings to filter in a separate table and the number of rows in two tables doesn't match? – Regina Oct 05 '16 at 15:18
  • You do not have to join them. The rules table is the second input of the dictionary nodes. – Gábor Bakos Oct 05 '16 at 15:46
  • Actually, my terms contain the " symbol because they are of Document type. Can I remove it somehow? – Regina Oct 05 '16 at 16:27
  • Sorry, it looks like I got lost. No matter what I try, it is not working. Let's say my term table has (just as an example) "Adam Gober" in one row and "Tall group" in a second row, etc. In my table I have sentences like "Once upon a time Adam Gober went to school". I can't filter these sentences. I tried what you suggested but keep getting errors. Thank you for your help – Regina Oct 05 '16 at 16:54
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/125018/discussion-between-gabor-bakos-and-regina). – Gábor Bakos Oct 05 '16 at 16:56
  • @Regina I have updated the answer with an example. It works pretty well for me. Could you [edit](http://stackoverflow.com/posts/39876564/edit) your question to show what does not work? – Gábor Bakos Oct 05 '16 at 17:24
  • The issue is resolved. I just replaced the node with the new one and it worked. Thank you for your help. – Regina Oct 05 '16 at 21:21
  • @Regina Thanks for the update. Yes, the outcome column (no column) should be empty in your case. Sorry for not including that in the screenshot. I am glad you managed to fix it. – Gábor Bakos Oct 06 '16 at 12:43
  • how do you make your expression case insensitive? Specifically in the solution you gave me. – Regina Oct 12 '16 at 21:46
0

Assuming that each sentence is a row in your table, here is an approach using KNIME's text processing nodes:

  1. Use Strings to Document to convert the text into documents, assigning your text column to Title. Before that, use the Constant Value node to create two empty string columns (one to provide as Authors and another for Full Text), and apply RowID to create a column holding the row IDs, which you can then conveniently provide as Source to the Strings to Document node.
  2. Convert the table using Bag of Words Creator.
  3. Connect your table of search terms to the bottom port of Dictionary Tagger and the bag of words to the upper one. Here it is important that you set named entities to unmodifiable. If you wish, you can also make the search case insensitive. As for the tags, just set them to NE (named entities).
  4. Follow the previous node with Modifiable Term Filter: modifiable terms should be filtered out, which leaves you with a term list corresponding exactly to your dictionary. There is one difference, however: each term is now associated with each Document in which it was found.
  5. Use Constant Value to create an integer column containing the number 1 and name it e.g. TermOccurs.
  6. Convert the bag of words back into a document vector using Document Vector, assigning TermOccurs as the vector value and using the As collection cell option. You should now have a table with only the documents that contain any of your terms.
  7. Fetch the row ID of each document using the Document Data Extractor (choose the Source) and assign it using RowID.
  8. Use Reference Row Splitter to split your table into two based on the row ID:
    • one containing none of the documents matching any of your dictionary terms,
    • the other containing those documents that do match on at least one term.

If you want to have string columns again, you can always join the tables with the original one before step 1.
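The logic of the steps above can be sketched in plain Python. This is only an illustration of the idea, not of the KNIME nodes themselves: the tokenizer is a simple word split (an assumption, KNIME's tokenizer behaves differently), lowercasing stands in for the optional case-insensitive search, and the names `tokenize`, `contains_term`, and `split_rows` as well as the row IDs are hypothetical.

```python
import re


def tokenize(text):
    """Very rough tokenizer: lowercase word tokens (assumption, not KNIME's)."""
    return re.findall(r"\w+", text.lower())


def contains_term(tokens, term_tokens):
    """True if term_tokens appears as a contiguous run inside tokens."""
    n = len(term_tokens)
    return any(tokens[i:i + n] == term_tokens
               for i in range(len(tokens) - n + 1))


def split_rows(rows, terms):
    """Split {row_id: sentence} into (matching, non_matching) dictionaries."""
    term_token_lists = [tokenize(t) for t in terms]
    matching, rest = {}, {}
    for row_id, sentence in rows.items():
        tokens = tokenize(sentence)
        # Skip empty terms, then look for any dictionary term in the sentence.
        hit = any(contains_term(tokens, tt) for tt in term_token_lists if tt)
        (matching if hit else rest)[row_id] = sentence
    return matching, rest


# Hypothetical sample data keyed by row ID, as in steps 1 and 7.
rows = {
    "Row0": "Once upon a time Adam Gober went to school",
    "Row1": "The tall group left early",
    "Row2": "Nothing to see here",
}
matching, rest = split_rows(rows, ["Adam Gober", "Tall group"])
```

Keeping the row ID alongside each sentence plays the same role as the Source field in the workflow: it is what lets the final split refer back to the original table.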

I haven't tested the above workflow, so keep me posted if it does not work. Also, you may run into trouble with multi-word terms because of the tokenizer, which is the main challenge when working with the text processing nodes.

g3o2
  • 271
  • 3
  • 7