1

I have a large CSV file in which a line looks like the one below:

id_85,
{
    "link": "some link",
    "icon": "hello.gif",
    "name": "Wall Photos",
    "comments": {
        "count": 0
    },
    "updated_time": "2012-03-12",
    "object_id": "400",
    "is_published": true,
    "properties": [
        {
            "text": "University",
            "name": "By",
            "href": "some link"
        }
    ],
    "from": {
        "id": "7778",
        "name": "Let"
    },
    "message": "Hello World! :D",
    "id": "id_85",
    "created_time": "2012-03-12",
    "to": {
        "data": [
            {
                "id": "100",
                "name": "March"
            }
        ]
    },
    "message_tags": {
        "0": [
            {
                "id": "100",
                "type": "user",
                "name": "Marcelo",
                "length": 7,
                "offset": 0
            }
        ]
    },
    "type": "photo",
    "caption": "Hello world!"
}

I am trying to extract just the JSON part of it, i.e. everything between the first opening and the last closing curly bracket.

Below is my Python regex code so far:

import re
str = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.match(r'.*,({.*}$)', str)
if m:
    print(m.group(1))

In some cases the match does not run from the first to the last curly bracket, i.e. { ... }. How do I ensure that only the text between the first and last curly brackets is captured, and nothing else?

The desired output is something that looks like this:

{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"}

Thanks!

pravi

3 Answers

0

This will match the entire json part after the first comma. Not sure if this is what you wanted though. An example of desired output would be helpful.

re.match(r'[^,]*,(.*)', s).group(1)
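
As a regex-free variant of the same idea (split at the first comma), here is a minimal sketch. It assumes each line is an id, one comma, and then a single JSON object; the sample line and the variable names are made up for illustration. Parsing the result with json.loads also confirms that what was extracted really is valid JSON.

```python
import json

# Hypothetical sample line: id, comma, JSON object, trailing space.
line = 'id_85,{"name": "Wall Photos", "comments": {"count": 0}} '

# partition() splits at the FIRST comma only, so commas inside the JSON are untouched.
record_id, _, payload = line.partition(',')

# strip() removes the trailing space before parsing.
data = json.loads(payload.strip())

print(record_id)                  # id_85
print(data["comments"]["count"])  # 0
```

If the id itself can contain commas, this assumption breaks and splitting at the first { is safer.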
Andrew Johnson
0

I believe that this works because .* is "greedy" in this context:

import re
str = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.search('({.*})', str)
if m:
    print(m.group(0))

This will probably grab too much if a line contains more than one JSON string, i.e. it will be too greedy, because the final } in the pattern is matched by the last occurrence of } in str.
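
A small sketch of the greedy behaviour, using shortened, made-up lines rather than the full record from the question. It compares the greedy .* with the non-greedy .*? and shows the "grabs too much" case when two JSON objects share a line.

```python
import re

# Hypothetical shortened line with one nested JSON object and a trailing space.
line = 'id_85,{"comments": {"count": 0}, "type": "photo"} '

greedy = re.search(r'\{.*\}', line).group(0)   # runs to the LAST }
lazy = re.search(r'\{.*?\}', line).group(0)    # stops at the FIRST }

print(greedy)  # {"comments": {"count": 0}, "type": "photo"}
print(lazy)    # {"comments": {"count": 0}

# Hypothetical line with two JSON objects: greedy matching spans both of them.
two = 'a,{"x": 1},b,{"y": 2}'
both = re.search(r'\{.*\}', two).group(0)
print(both)    # {"x": 1},b,{"y": 2}
```

So for nested JSON the greedy form is the one that reaches the outermost closing bracket, at the cost of over-matching when a line holds several objects.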

Note that the notation re.search(r'someregex', string), i.e. the addition of an r before your regex, is called "raw string notation". It stops Python from processing backslash escapes in the string literal, so backslashes reach the regex engine unchanged. E.g. r'\n' is the two characters \ and n, whereas '\n' is a single newline character. As regex patterns both happen to match a newline, but for escapes such as \b the difference matters.
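
To illustrate the raw-string point, here is a short sketch; the test strings are invented for the example. It shows that the r prefix changes the string Python builds, and that this only changes regex behaviour for escapes Python itself interprets, such as \b.

```python
import re

raw_len = len(r'\n')   # 2: a backslash followed by the letter n
plain_len = len('\n')  # 1: a single newline character

# Both patterns match a newline, because the re module understands \n itself.
raw_newline = re.search(r'\n', 'a\nb') is not None
plain_newline = re.search('\n', 'a\nb') is not None

# With \b the two differ: r'\b' is a word boundary for the regex engine,
# while '\b' is turned into a backspace character by Python first.
raw_boundary = re.search(r'\bcat\b', 'a cat sat') is not None
plain_boundary = re.search('\bcat\b', 'a cat sat') is not None

print(raw_len, plain_len)            # 2 1
print(raw_newline, plain_newline)    # True True
print(raw_boundary, plain_boundary)  # True False
```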

danyamachine
0

Assuming (as originally posted) that each line in the CSV has 1 JSON element, then

re.match(r'^[^{]*({.*})[^}]*$',str).group(1)

should do the trick. That is: discard everything that isn't a { until you find the first one, put everything that follows until you hit a } with no other }'s after it into a group.
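
A minimal sketch of this pattern applied to a shortened, made-up line with text before the first { and a trailing space after the last }. The braces are escaped here for readability, which should be equivalent to the pattern above; the json.loads call is an added check, not part of the original answer.

```python
import json
import re

# Hypothetical sample line: id prefix, nested JSON object, trailing space.
line = 'id_85,{"comments": {"count": 0}, "type": "photo"} '

# [^{]* skips everything before the first {, the group captures up to the
# last }, and [^}]* allows trailing text that contains no further }.
m = re.match(r'^[^{]*(\{.*\})[^}]*$', line)
extracted = m.group(1)

# Parsing confirms the captured text is a complete JSON object.
data = json.loads(extracted)

print(extracted)     # {"comments": {"count": 0}, "type": "photo"}
print(data["type"])  # photo
```

Unlike a pattern ending in }$, this one still matches when whitespace follows the closing bracket.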

Scott Hunter