I installed the Ingest Attachment Processor plugin in Elasticsearch.

Then I created a very simple PDF file.

I am trying to ingest the content of this file into Elasticsearch (see the commands below).

But searching for a word from the file fails (see the third answer, near the end of the commands).

What is wrong, or which step is missing?

Do I need to add a pipeline?

Is the PUT of the PDF correct, and do I need to put the PDF content into the content field of the PUT command?

Console commands:

1 console:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}

1 answer:

{
  "acknowledged" : true
}

2 console:

PUT my_index/_doc/001?pipeline=attachment
{
       "filename": "C:\\ELK-Stack\\Test.pdf",
       "data": "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA==",
       "attachment": {
          "content_type": "application/rtf",
          "language": "ro",
          "content": "Test Test Dokument umgewandelt von word zu pdf. Hier wird getestet. Das ist der Test."
       },
       "title": "Quick"
}

2 answer:

{
  "_index" : "my_index",
  "_id" : "001",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

3 console:

GET /my_index/_search 
{
  "query": {
    "match": {
      "content": "Test"
    }
  }
}

3 answer:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

4 console:

GET /_search
{
    "query": {
        "match_all": {}
    }
}

4 answer:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "filename" : """C:\ELK-Stack\Test.pdf""",
          "data" :       "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRlc3RldC4gRGFzIGlzdCBkZXIgVGVzdA==",
          "attachment" : {
            "content_type" : "text/plain; charset=windows-1252",
            "language" : "et",
            "content" : """Test
Test Dokument umgewandelt von wo
Hier wird getestet. Das ist der Test""",
            "content_length" : 77
          },
          "title" : "Quick"
        }
      }
    ]
  }
}
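
As a side check, the `data` value from the PUT request can be decoded locally to see what the attachment processor actually received. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and not part of the original post.

```python
import base64

# The base64 string sent in the "data" field of the PUT request above.
DATA = (
    "VGVzdA0KVGVzdCBEb2t1bWVudCB1bWdld2FuZGVsdCB2b24gd28NCkhpZXIgd2lyZCBnZXRl"
    "c3RldC4gRGFzIGlzdCBkZXIgVGVzdA=="
)

decoded = base64.b64decode(DATA)

# A real PDF file starts with the magic bytes b"%PDF".
is_pdf = decoded.startswith(b"%PDF")

# Decode with the charset reported in attachment.content_type.
text = decoded.decode("windows-1252")

print(is_pdf)
print(text)
```

The payload decodes to the same three lines shown in `attachment.content` and does not start with `%PDF`, so what was indexed here is plain text rather than PDF bytes. That is consistent with the `text/plain` content type in the response.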
Frank Mehlhop
  • Can you share the mappings of your index? – LeBigCat May 04 '22 at 09:01
  • @LeBigCat I did not use any mapping, just what you see above. (I like to keep it simple) – Frank Mehlhop May 04 '22 at 09:13
  • Can you share a match_all then, to check what ES indexed? – LeBigCat May 04 '22 at 09:23
  • @LeBigCat I added the match_all request to the original post (at the end). – Frank Mehlhop May 04 '22 at 09:34
  • It must be a POST, not a GET, for the third request. I ran the request locally and it seems Elastic maps all fields as keyword, so they can't be matched except by exact value. Also, you have to specify the full path of subfields ("match": { "attachment.language": "ro" } or "match": { "attachment.content": "Test" }) for your example. – LeBigCat May 04 '22 at 09:54

1 Answer

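Thanks to LeBigCat, I found the solution.

The attachment processor stores the extracted text under the attachment object (as the match_all response in the question shows), so the query must use the full path to the field:

"attachment.content": "Test"

(instead of "content": "Test")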

GET /my_index/_search 
{
  "query": {
    "match": {
      "attachment.content": "Test"
    }
  }
}
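
For anyone scripting these requests, the two JSON bodies can be built as in the following rough Python sketch. The helper names `build_attachment_doc` and `build_content_query` are hypothetical and only illustrate the shape of the bodies; sending them to Elasticsearch is left out. It assumes the `attachment` pipeline from the question. The `attachment` object is deliberately not included in the document, because the match_all response in the question shows the processor replacing the values that were sent by hand.

```python
import base64
import json


def build_attachment_doc(filename, raw_bytes, title):
    """Body for: PUT my_index/_doc/<id>?pipeline=attachment (hypothetical helper)."""
    return {
        "filename": filename,
        # The attachment processor reads base64-encoded file bytes from "data".
        "data": base64.b64encode(raw_bytes).decode("ascii"),
        "title": title,
    }


def build_content_query(text):
    """Body for: GET my_index/_search, using the full path to the extracted text."""
    return {"query": {"match": {"attachment.content": text}}}


if __name__ == "__main__":
    print(json.dumps(build_content_query("Test"), indent=2))
```

To index a real file, the bytes could be read with `pathlib.Path("Test.pdf").read_bytes()` and passed as `raw_bytes`.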
Frank Mehlhop