I have this JSON stored in an S3 file. It is actually the output of an AWS Comprehend EntitiesDetection job, uploaded to S3 by the job itself, so unfortunately I have no control over how this JSON is organized and cannot modify its structure:

{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 11, "Score": 0.9815415143966675, "Text": "5 start-up", "Type": "QUANTITY"}, {"BeginOffset": 61, "EndOffset": 183, "Score": 0.8883988261222839, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 212, "EndOffset": 327, "Score": 0.8162660002708435, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 337, "EndOffset": 339, "Score": 0.7018660306930542, "Text": "Trump, "Type": "PERSON"}, {"BeginOffset": 364, "EndOffset": 484, "Score": 0.8932908177375793, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 513, "EndOffset": 626, "Score": 0.8157837986946106, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 636, "EndOffset": 638, "Score": 0.6977631449699402, "Text": "Oprah Winfrey", "Type": "PERSON"}, {"BeginOffset": 963, "EndOffset": 971, "Score": 0.4658013880252838, "Text": "facebook", "Type": "ORGANIZATION"}, {"BeginOffset": 972, "EndOffset": 979, "Score": 0.6886632442474365, "Text": "twitter", "Type": "TITLE"}, {"BeginOffset": 985, "EndOffset": 993, "Score": 0.7970104813575745, "Text": "linkedin", "Type": "ORGANIZATION"}, {"BeginOffset": 994, "EndOffset": 998, "Score": 0.36566048860549927, "Text": "Menu", "Type": "TITLE"}
  ],
  "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"},


{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 13, "Score": 0.9995881915092468, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 27, "EndOffset": 69, "Score": 0.8302029371261597, "Text": "Constitution \u00e9conomique\" - African Manager", "Type": "TITLE"}, {"BeginOffset": 94, "EndOffset": 126, "Score": 0.48702114820480347, "Text": ".wpb_animate_when_almost_visible", "Type": "OTHER"}, {"BeginOffset": 290, "EndOffset": 298, "Score": 0.47538018226623535, "Text": "Fran\u00e7ais", "Type": "OTHER"}, {"BeginOffset": 299, "EndOffset": 306, "Score": 0.6746407747268677, "Text": "English", "Type": "OTHER"}, {"BeginOffset": 464, "EndOffset": 476, "Score": 0.9992197155952454, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 515, "EndOffset": 527, "Score": 0.9994662404060364, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 581, "EndOffset": 596, "Score": 0.6652442812919617, "Text": "African Manager", "Type": "ORGANIZATION"}, {"BeginOffset": 599, "EndOffset": 615, "Score": 0.8012278079986572, "Text": "09/09/2019 08:45", "Type": "DATE"}, {"BeginOffset": 674, "EndOffset": 685, "Score": 0.8724801540374756, "Text": "tunisiennes", "Type": "OTHER"}, {"BeginOffset": 689, "EndOffset": 701, "Score": 0.9975908398628235, "Text": "15 septembre", "Type": "DATE"}, {"BeginOffset": 753, "EndOffset": 781, "Score": 0.9481445550918579, "Text": "certain nombre d\u2019initiatives", "Type": "QUANTITY"}
  ],
  "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"}

//and so on ...

I need to find all files that contain an entity with Type = PERSON and Score > 0.7, and retrieve two pieces of data for each match: the person's name and the file.

Today my query expression is:

select s.Text from s3object[*].Entities[*] s where s.Type = 'PERSON' AND s.Score > 0.7;

This outputs:

[
    {
        "Text": "Trump"
    },
    {
        "Text": "Oprah Winfrey"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    }
]

This is partly what I need, but I also have to associate each "Text" (the person's name) with the File it comes from. So the query output I expect is:

[
    {
        "Text": "Trump",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Oprah Winfrey",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    }
]

How can I retrieve this? I have tried many possibilities using https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html, but none of them worked.

Mathieu

1 Answer

There's a note on the page you shared:

Note: Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins.

I'd set up Athena for more complex queries against S3 directly (there's an example in the official doc); a sketch is below. Another alternative is to restructure your JSON so that joins aren't needed, e.g. by duplicating "File" inside each entity, as shown after the Athena sketch. Of course, you can also index this JSON in many other tools and formats to make the data searchable/queryable.
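Here's a minimal Athena sketch, not a drop-in solution: the table name and the s3://your-bucket/comprehend-output/ location are placeholders, and it assumes the output files contain one JSON document per line, as in your sample. The DDL maps your structure, and the query flattens the Entities array so each row pairs a person with its File:

CREATE EXTERNAL TABLE comprehend_entities (
  Entities ARRAY<STRUCT<BeginOffset: INT, EndOffset: INT, Score: DOUBLE, Text: STRING, Type: STRING>>,
  `File` STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/comprehend-output/';

-- Unnest the array so "Text" and "File" land in the same row
SELECT e.Text, t.File
FROM comprehend_entities t
CROSS JOIN UNNEST(t.Entities) AS u(e)
WHERE e.Type = 'PERSON' AND e.Score > 0.7;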

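If restructuring the JSON is feasible (e.g. a small post-processing step that copies "File" into every entity object before you run S3 Select), your existing query only needs one extra column. A sketch, assuming each entity then carries its own File field:

select s.Text, s.File from s3object[*].Entities[*] s where s.Type = 'PERSON' AND s.Score > 0.7;
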
Fabio Manzano
  • I see, that's what I feared. Setting up Athena programmatically and making it analyze all my S3 inputs is over-engineering in my case, I think. It's a pity to be stuck just because I can't get both "File" and "Text", which sit at different nesting levels of the objects. Thanks for your answer. – Mathieu Sep 12 '19 at 07:42
  • It's also a cost calculation. Option 1: I can change the S3 file ("restructure the JSON") before doing the S3 Select analysis, but that requires accessing the file, changing it, and recreating another one (and maybe deleting the old one)... expensive; – Mathieu Sep 12 '19 at 07:56
  • Option 2: I can leave it as is and, instead of taking only the data I need (Text and File), take everything and then parse and filter it with JS in my app to keep only Text and File. That's also expensive, since S3 Select pricing is partly based on the data you return, so outputting everything costs much more than outputting a smaller JSON with only Text and File... hard to decide. – Mathieu Sep 12 '19 at 07:57
  • Yeah. It seems that Athena starts making sense if you're scanning at least 10MB per query: https://aws.amazon.com/athena/pricing/ On the other hand, if the dataset is not large, I'd consider the in-memory option you mentioned. Or even a simple ETL to flatten your dataset and/or index it in a database. – Fabio Manzano Sep 12 '19 at 08:10
  • Good to know, thanks. I'm just frustrated because the input format I have to deal with is dictated/chosen by AWS... they could have chosen one that their own S3 Select tool can process. Sort of a bad decision on their side, or maybe a slightly sneaky way to push people toward Athena, even though my files will be more like 800KB than 10MB :) – Mathieu Sep 12 '19 at 08:16