How to parse a valid JSON file that contains an invalid field?

Question

I know this may seem a strange question, but the input of my algorithm is a stream of JSON strings made up of syntactically correct JSON blocks, with the exception of the one described further down. A typical block in the stream has this structure:

{
  "comment",
  {
    "author":"X",
    "body":"Hello world",
    "json_metadata":"{\"tags\":[\"hello, world\"],\"community\":\"programming\",\"app\":\"application_for_publish\"}",
    "parent_author":"waggy6",
    "parent_permlink":"programming_in_c",
    "permlink":"re-author-programming_in_c-20180916t035418244z",
    "title":"some_title"
  }
}

Everything works fine until I reach the following block, which I don't know how to parse. The field that gives me trouble is "json_metadata":

{
  "comment", 
  {
    "author": "Y",
    "body": "Hello another world!",
    "json_metadata": "\"{\\\"tags\\\":[\\\"hello\\\",\\\"world\\\"],\\\"app\\\":\\\"application_for_publish_content\\\",\\\"format\\\":\\\"markdown+html\\\",\\\"pollid\\\":\\\"p_id\\\",\\\"image\\\":[\\\"https://un.useful.url/path/image.png\\\"]}\"",
    "parent_author": "",
    "parent_permlink": "helloworld",
    "permlink": "hello_world_programming_in_c-2017319t94958596z",
    "title": "Hello World in C!"
  }
}

It looks as if this field was encoded twice when the data was acquired. I'm using rapidjson as the parsing tool, in C++. The code related to this problem is the following:

static std::string parseNode(const Value &node){
   string toret = "";
   if (node.IsBool())          toret = toret + to_string(node.GetBool());
   else if (node.IsInt())      toret = toret + to_string(node.GetInt());
   else if (node.IsUint())     toret = toret + to_string(node.GetUint());
   else if (node.IsInt64())    toret = toret + to_string(node.GetInt64());
   else if (node.IsUint64())   toret = toret + to_string(node.GetUint64());
   else if (node.IsDouble())   toret = toret + to_string(node.GetDouble());
   else if (node.IsString())   toret = toret + node.GetString();
   else if (node.IsArray())    toret = toret + parseArray(node); // parse the given array
   else if (node.IsObject())   toret = toret + parseObject(node); // parse the given object
   return toret;
}

...

std::string search_member(Value& js, std::string member){
   Value::ConstMemberIterator itr = js.FindMember(member.c_str());
   std::string els = "";
   if(itr != js.MemberEnd())
      els = parseNode(itr->value) + " ";
   return els;
}

...

// op_struct type is Value*; it is the Value* that refers to all the fields of the block
std::string json_m = (*op_struct)["json_metadata"].GetString(); // a Value does not convert to std::string implicitly
std::string elements = "";
if((json_m.compare("") != 0) && (json_m.compare("{}") != 0) && (json_m.compare("\"\"") != 0)){
   Document js;
   js.Parse<0>(json_m.c_str());
   elements = elements + search_member(js, "community") + search_member(js, "tags") + search_member(js, "app");
}
Comment * comment = new Comment(title + " " + body + " " + elements, auth);

...

The problem occurs at the js.FindMember(member.c_str()); line in the search_member() function. js.Parse<0>(json_m.c_str()); accepts the input as valid JSON, and indeed it is valid; the input is:

"\"{\\\"tags\\\":[\\\"hello\\\",\\\"world\\\"],\\\"app\\\":\\\"application_for_publish_content\\\",\\\"format\\\":\\\"markdown+html\\\",\\\"pollid\\\":\\\"p_id\\\",\\\"image\\\":[\\\"https://un.useful.url/path/image.png\\\"]}\""

But the result of parsing it is the string:

"{\"tags\":[\"hello\",\"world\"],\"app\":\"application_for_publish_content\",\"format\":\"markdown+html\",\"pollid\":\"p_id\",\"image\"

For this reason, FindMember() cannot find any tags, community or app field, because the parsed document is a string rather than an object.

My question is: is there any way (other than just skipping this block) to recognize such special cases?
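
One possible way to recognize this case, sketched here rather than taken from the thread: the symptom described above is that the metadata text, once parsed, is itself a JSON string instead of an object. With rapidjson that should show up as js.IsString() being true (and js.IsObject() being false) after Parse, so the check can be made before FindMember() is called. The standard-library-only sketch below illustrates the same idea without any JSON library. The names unwrapJsonString and normalizeMetadata are hypothetical helpers, not part of rapidjson, and the decoder handles only the simple escapes (not \uXXXX), an assumption that may not hold for the whole stream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a rapidjson function): if `text` is a single JSON
// string literal, decode one layer of escaping into `out` and return true.
// Returns false, leaving `out` untouched, for anything else.
// Assumption: only the simple escapes are decoded; \uXXXX is rejected.
bool unwrapJsonString(const std::string& text, std::string& out) {
    const char* ws = " \t\r\n";
    const std::size_t first = text.find_first_not_of(ws);
    if (first == std::string::npos) return false;
    const std::size_t last = text.find_last_not_of(ws);
    if (last == first || text[first] != '"' || text[last] != '"') return false;

    std::string decoded;
    for (std::size_t i = first + 1; i < last; ++i) {
        const char c = text[i];
        if (c == '"') return false;            // unescaped quote: not one literal
        if (c != '\\') { decoded += c; continue; }
        if (++i >= last) return false;         // the closing quote was escaped
        switch (text[i]) {
            case '"':  decoded += '"';  break;
            case '\\': decoded += '\\'; break;
            case '/':  decoded += '/';  break;
            case 'b':  decoded += '\b'; break;
            case 'f':  decoded += '\f'; break;
            case 'n':  decoded += '\n'; break;
            case 'r':  decoded += '\r'; break;
            case 't':  decoded += '\t'; break;
            default:   return false;           // \uXXXX or unknown escape
        }
    }
    out = decoded;
    return true;
}

// Hypothetical wrapper: strip up to `maxLayers` layers of string encoding.
// Text that is not a string literal (e.g. already an object) is returned as-is.
std::string normalizeMetadata(std::string text, int maxLayers = 4) {
    std::string inner;
    for (int i = 0; i < maxLayers && unwrapJsonString(text, inner); ++i) {
        text = inner;
    }
    return text;
}
```

Under these assumptions, json_m could be passed through normalizeMetadata() before js.Parse(), and records whose result still does not parse as an object could be skipped or logged. The layer limit is an arbitrary guard. Note that a metadata field that is legitimately a plain string would be unwrapped into non-JSON text, so the parse result still needs to be checked.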

Manually modify your input file. This looks like an erroneous record. Whoever encoded it double-dipped into the backslash Kool-Aid. — JohnFilleau, Apr 07 '20 at 16:57
The problem is that I cannot do that, because it is just a tiny portion of a massive data stream of more than 440 GB. Actually, the only thing I've thought of was to skip the problem and go on. If it is too difficult a problem to deal with and, furthermore, the proposed solution might slow down execution, then it's not worth it. — Carmine, Apr 07 '20 at 17:05
Where do you get this data stream from? Who has authority over it? Get them to fix it. Or modify your code to do input validation and discard the record or attempt to correct the record if you can't parse the metadata. — JohnFilleau, Apr 07 '20 at 17:29
It's kind of my thesis, and the data belongs to a blockchain. There is nothing I can do about the parsing procedure because it is not part of my work. — Carmine, Apr 07 '20 at 17:32
If none of the desired tags are found in the metadata of a given record, can you just output the raw json record to some "malformed input" file, and then manually process the (hopefully) small number of records that exist in that file at the end? I assume all valid records have at least one of `tags`, `community`, or `app`? At the end of the day you don't want to spend more time on a clever solution than it would take to just brute force it. Does your thesis merely *depend* on the results of these records, or is your thesis the actual program that processes these records? — JohnFilleau, Apr 07 '20 at 17:46
It is the basis for subsequent analysis, but I think that for now, I'll just skip this record. — Carmine, Apr 07 '20 at 18:19

