-4

I download a webpage that has JSON embedded in JavaScript. I need to decode it, but the JSON is malformed: it mixes single and double quotes, which makes the decode subroutine fail.

NOTE: The JSON is extracted as a block into a string variable. The DATA block below represents the kinds of malformed JSON I encounter (the problem is mostly in the parts that hold input from website visitors). The JSON is deeply nested.

So far I have not found a better solution than the code below, which is still incorrect.

Is there a better way to repair the received JSON? [Maybe with (??{ code }) in a regex]

use strict;
use warnings;
use diagnostics;

while( <DATA> ) {
    chomp;
    print "IN:  $_\n";
    s/"/'/g;
    print "OUT: $_\n" if s/'(.*?)'\s*:\s*'(.*?)'(,|\s*\})/"$1": "$2"$3/g;
}

__DATA__
{ "d1": "some data here", "d2":"some "data" here", "d3": "some "data" here "year"", "d4": { "x1": "some "data" here" } }
{ "d2": "some data here", "d2":"some "data" here", "d3": "some "data" here "year"" }
{ 'd3': 'some data here', "d2":"some "data" here", "d3": "some "data" here "year"" }
{ "d4": 'some data here', "d2":"some "data" here", "d3": "some "data" here "year"", "d4": { "x1": "some "data" here" } }
{ 'd5': "some data here", "d2":"some "data" here", "d3": "some "data" here "year"" }

output

IN:  { "d1": "some data here", "d2":"some "data" here", "d3": "some "data" here "year"", "d4": { "x1": "some "data" here" } }
OUT: { "d1": "some data here", "d2": "some 'data' here", "d3": "some 'data' here 'year'", "d4': { 'x1": "some 'data' here" } }
IN:  { "d2": "some data here", "d2":"some "data" here", "d3": "some "data" here "year"" }
OUT: { "d2": "some data here", "d2": "some 'data' here", "d3": "some 'data' here 'year'" }
IN:  { 'd3': 'some data here', "d2":"some "data" here", "d3": "some "data" here "year"" }
OUT: { "d3": "some data here", "d2": "some 'data' here", "d3": "some 'data' here 'year'" }
IN:  { "d4": 'some data here', "d2":"some "data" here", "d3": "some "data" here "year"", "d4": { "x1": "some "data" here" } }
OUT: { "d4": "some data here", "d2": "some 'data' here", "d3": "some 'data' here 'year'", "d4': { 'x1": "some 'data' here" } }
IN:  { 'd5': "some data here", "d2":"some "data" here", "d3": "some "data" here "year"" }
OUT: { "d5": "some data here", "d2": "some 'data' here", "d3": "some 'data' here 'year'" }
Polar Bear
  • If you are certain that the data you are after (content wise) does not contain a `'`, would it be easier to simply replace all single quotes with double quotes? Or else, simply replace `'` with `['"]`, throw that in a group and then use that. – npinti Jan 02 '20 at 08:06
  • The JSON may contain many strings like `I'm`, `name's`, `name"s`, `"2019"`, `they've`, `It's`... There is no consistency: where there should be a `'`, users sometimes type `"`, for example `I"m`. Some input is in Cyrillic, for example `{"title": "описание наименование фильма "новый 2019" сериал"}`. – Polar Bear Jan 02 '20 at 08:10
  • In the general case, no, there is no way to predict which quotes are well-formed. If the keys are as monotonous as your example suggests, a simple regex should be able to reach acceptable accuracy, though you probably still want to perform a manual review. – tripleee Jan 02 '20 at 08:55
  • See some ideas/regex for fixing slightly invalid JSON in the second part of [this post](https://stackoverflow.com/a/37537992/4653379) (and comments) – zdim Jan 02 '20 at 09:30
  • @zdim - thank you for the reference. It is to the point and leads to information about a similar problem [mine is YouTube related]. I found that solution quite informative and took a few ideas from it. For now I have found a slightly different approach which, at least on a few _samples_, produced JSON that was decoded without an error. I will test further to see whether any abnormalities show up. – Polar Bear Jan 02 '20 at 21:54
  • Based on comments by the OP, the question is wrong. The OP doesn't actually encounter data such as that found in their question, but valid JavaScript. Voting to close until this is fixed. – ikegami Jan 03 '20 at 04:43

2 Answers

2

I'm not going to write a parser for your broken JSON. Teaching how to write a parser is beyond the scope of this site. Besides, you can easily base yours off an existing JSON parser (such as JSON::PP).

What I can do is help you address the only hard part: determining whether a quote ends the literal or needs to be escaped. For example, determining that the second and third quote in "some "data" here" don't end the literal, while the fourth one does.

It turns out it's quite easy to make a reliable guess: Just look ahead! If the quote is followed by optional whitespace and then whichever of `:`, `,`, `}` and `]` would be valid if the literal ended there, the quote probably legitimately ends the literal. Otherwise, it's part of the literal and needs escaping.
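
A minimal sketch of that look-ahead heuristic, for illustration only. The name `fix_quotes` is hypothetical, and the sketch makes a simplifying assumption: it accepts any of `:`, `,`, `}`, `]` (or end of input) after a quote instead of tracking which one would be valid at that point. It will still guess wrong on text such as `"he said "yes", then left"`, where an inner quote happens to be followed by a comma.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, used only for the demonstration at the end

# Hypothetical helper (the name is not from the answer): rewrite every string
# literal with double quotes, escaping inner quotes that do not look like
# terminators.
sub fix_quotes {
    my ($in) = @_;
    my $out = '';
    pos($in) = 0;
    while (pos($in) < length($in)) {
        if ($in =~ /\G(["'])/gc) {              # a literal opens here
            my $delim = $1;
            $out .= '"';
            while (1) {
                if ($in =~ /\G(\\.)/gcs) {      # keep existing escapes (\' becomes ')
                    $out .= $1 eq "\\'" ? "'" : $1;
                }
                elsif ($in =~ /\G\Q$delim\E(?=\s*(?:[,:}\]]|\z))/gc) {
                    $out .= '"';                # look-ahead says the literal ends
                    last;
                }
                elsif ($in =~ /\G"/gc) {        # inner double quote: escape it
                    $out .= '\\"';
                }
                elsif ($in =~ /\G([^"'\\]+|')/gc) {
                    $out .= $1;                 # ordinary text or an apostrophe
                }
                else {
                    last;                       # unterminated literal: give up
                }
            }
        }
        elsif ($in =~ /\G([^"']+)/gc) {         # structure between literals
            $out .= $1;
        }
    }
    return $out;
}

# Demonstration on one of the sample lines from the question.
my $fixed = fix_quotes(q({ 'd3': 'some data here', "d2":"some "data" here" }));
print "$fixed\n";
print JSON::PP->new->decode($fixed)->{d2}, "\n";
```

The scanner keeps its position with `\G` and the `/gc` flags, so each branch consumes exactly one token and a failed match never resets the position. The result should still be passed through a real decoder, as in the last two lines, because the heuristic is only a guess.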

brian d foy
ikegami
  • NOTE: The JSON I am trying to process is returned by the YouTube website. – Polar Bear Jan 02 '20 at 10:10
  • 1) What you posted isn't JSON. It's not even JavaScript. 2) I doubt that what you posted (e.g. `"some "data" here"`) came from Google. 3) So what? – ikegami Jan 02 '20 at 10:29
  • I simplified the data and left only a 'demonstrative' portion. I assume this site does not allow dumping 161 KB of JSON extracted from JavaScript, and it would be very difficult to point to the place where `json_decoder` **stumbles** and produces an error. – Polar Bear Jan 02 '20 at 10:43
  • Re "*I do not see that you have asked for more.*", Then why did you mention the data was 161 KB in size? – ikegami Jan 02 '20 at 10:50
  • Re "*Now please be so kind to look at the source code of following webpage*", To what end? (I see valid JavaScript, unlike what you posted.) – ikegami Jan 02 '20 at 10:50
  • And what does any of this have to do with my answer? – ikegami Jan 02 '20 at 10:53
0

It looks like you want to make the following corrections:

1) change any json field names from single-quoted to double-quoted.

2) change any json string field values from single-quoted to double-quoted

3) change any nested double-quotes in json string fields to single-quotes inside the double-quoted string

4) ensure there is at least one space after the colon after a json field name.

Can you double-check the ends of examples 1 and 4 in the output section? Surely you don't want to change the json field name to have a colon and a single-quote and a curly-brace in it. Also, if you did mean to do that, then the curly braces are now unbalanced on those lines.

All that said ... regex may not be the right tool for the job. You may need a context-sensitive parser for this (to keep track of all the levels of nesting).

If you have any control over the page that you are downloading, make them fix it. Otherwise you pretty much have to be able to handle anything, so you need the context-sensitive parser, and you should be prepared to throw away some of the input if it gets too messy, instead of crashing or going into an infinite loop.
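
As a rough sketch of the "throw away instead of crashing" part only (not of the parser itself), each extracted block can be decoded inside `eval` so that one bad block does not stop the run. The helper name `decode_or_skip` and the shape of its return value are assumptions made for this example; `JSON::PP` is the core decoder.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: decode each candidate text, keep what decodes,
# and record the error for what does not, instead of dying.
sub decode_or_skip {
    my @texts = @_;
    my $json  = JSON::PP->new;
    my (@ok, @bad);
    for my $text (@texts) {
        my $data;
        # decode() croaks on malformed input; eval turns that into a false value
        if (eval { $data = $json->decode($text); 1 }) {
            push @ok, $data;
        }
        else {
            push @bad, { text => $text, error => $@ };
        }
    }
    return (\@ok, \@bad);
}

my ($ok, $bad) = decode_or_skip('{"a": 1}', '{"a": "x "y" z"}');
printf "decoded %d, skipped %d\n", scalar @$ok, scalar @$bad;
```

Keeping the rejected text together with the decoder's error message makes it possible to review the skipped blocks by hand later.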

Brenda J. Butler
  • Brenda, you are quite right about which corrections are needed to make the JSON parsable. But I knew that already; my question was whether it is possible to correct this kind of abnormality with a **regex**. The issue is that JSON has a _recursive_ structure, so the **regex** would have to parse recursively. Looking at the [regex](https://perldoc.perl.org/perlre.html) documentation, I see that possibility only with the `(??{ code })` feature. I took this approach, but my lack of a good understanding of how it works did not allow me to achieve the desired result. – Polar Bear Jan 02 '20 at 22:03
  • I would gladly accept an answer along the lines of _JSON with this structure is very unlikely to come from a website; please check whether the extraction and manipulation of the JSON snippet **injected** the abnormalities. Otherwise this JSON can only be processed with a parser written **to handle** all possible cases_. I still hoped that somebody more experienced with **regex** could clarify how to use the `(??{ code })` feature in similar cases. – Polar Bear Jan 02 '20 at 22:09
  • Although JSON stands for _JavaScript Object Notation_, JavaScript and JSON represent data slightly differently. JavaScript permits `{ 'key': "value"}` but a JSON parser expects `{"key":"value"}`. An issue of this kind is easily corrected with a **regex** because the **key** never has a recursive structure. The situation is quite different for the **value** part because by definition it can include **recursion**. At this point I have arrived at another solution which produced the desired result; it is in the _testing_ stage to see whether any other _anomalies_ show up. – Polar Bear Jan 02 '20 at 22:20