Is there any way to have the analyzeSyntax API endpoint omit partOfSpeech sub-attributes from the response JSON when their value is *_UNKNOWN? Looking through the details of the document input, I can't find any way to limit the contents of partOfSpeech in the response.
Is this something that can only be handled post-response, when cleaning the data?
Example query, per the API docs, in a file called request.json:
{
  "encodingType": "UTF8",
  "document": {
    "type": "PLAIN_TEXT",
    "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones."
  }
}
Command executed:
curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" \
  -s \
  -X POST \
  -H "Content-Type: application/json" \
  --data-binary @request.json > response.json
Sample of response:
{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    {
      "text": {
        "content": ",",
        "beginOffset": 6
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 0,
        "label": "P"
      },
      "lemma": ","
    },
...
...
This response JSON is 819 lines, of which 314 (nearly 40% of the response!) are *_UNKNOWN values for partOfSpeech attributes. These values carry no information, yet they add significantly to the size of the response.
The documentation doesn't seem to provide any parameter that would help with this. Am I missing something, or does this API not support an argument for dropping those keys when they are *_UNKNOWN? Is this something that can only be managed post-response with data cleaning?
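For reference, the post-response cleanup I have in mind would look roughly like the sketch below. It is only an illustration of the fallback, not anything the API provides: the helper name strip_unknown_pos is my own, and it assumes every placeholder value is a string ending in _UNKNOWN, as in the sample above.

```python
import json


def strip_unknown_pos(response: dict) -> dict:
    """Return a copy of an analyzeSyntax response with *_UNKNOWN
    partOfSpeech values removed. (Hypothetical helper, not part of the API.)

    Assumes placeholder values are strings ending in "_UNKNOWN".
    Tokens without a partOfSpeech key end up with an empty one.
    """
    cleaned = dict(response)  # shallow copy; the input dict is not modified
    cleaned["tokens"] = [
        {
            **token,
            "partOfSpeech": {
                key: value
                for key, value in token.get("partOfSpeech", {}).items()
                if not (isinstance(value, str) and value.endswith("_UNKNOWN"))
            },
        }
        for token in response.get("tokens", [])
    ]
    return cleaned


# Small inline sample shaped like the first token of the response above.
sample = {
    "tokens": [
        {
            "text": {"content": "Google", "beginOffset": 0},
            "partOfSpeech": {
                "tag": "NOUN",
                "aspect": "ASPECT_UNKNOWN",
                "number": "SINGULAR",
                "proper": "PROPER",
                "voice": "VOICE_UNKNOWN",
            },
            "lemma": "Google",
        }
    ]
}

cleaned = strip_unknown_pos(sample)
print(json.dumps(cleaned["tokens"][0]["partOfSpeech"]))
# prints {"tag": "NOUN", "number": "SINGULAR", "proper": "PROPER"}
```

In practice the input would come from json.load on response.json instead of the inline sample. This works, but the full payload still has to be transferred and parsed first, which is why I'd prefer a request-side option if one exists.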