1

A shell script that I run in IPython returns the following object:

results = ['{"url": "https://url.com", "date": "2020-10-02T21:25:20+00:00", "content": "mycontent\nmorecontent\nmorecontent", "renderedContent": "myrenderedcontent", "id": 123, "username": "somename", "user": {"username": "somename", "displayname": "some name", "id": 123, "description": "my description", "rawDescription": "my description", "descriptionUrls": [], "verified": false, "created": "2020-02-00T02:00:00+00:00", "followersCount": 1, "friendsCount": 1, "statusesCount": 1, "favouritesCount": 1, "listedCount": 1, "mediaCount": 1, "location": "", "protected": false, "linkUrl": null, "linkTcourl": null, "profileImageUrl": "https://myprofile.com/mypic.jpg", "profileBannerUrl": "https://myprofile.com/mypic.jpg"}, "outlinks": [], "outlinks2": "", "outlinks3": [], "outlinks4": "", "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 123, "lang": "en", "source": "<a href=\\"mysource.com" rel=\\"something\\">Sometext</a>", "media": [{"previewUrl": "smallpic.jpg", "fullUrl": "largepic.jpg", "type": "photo"}], "forwarded": null, "quoted": null, "mentionedUsers": [{"username": "name1", "displayname": "name 1", "id": 345, "description": null, "rawDescription": null, "descriptionUrls": null, "verified": null, "created": null, "followersCount": null, "friendsCount": null, "statusesCount": null, "favouritesCount": null, "listedCount": null, "mediaCount": null, "location": null, "protected": null, "linkUrl": null, "link2url": null, "profileImageUrl": null, "profileBannerUrl": null}]}', ...]

whereas the ... indicates more entries akin to the previous one. According to type(), this is an slist. According to the documentation of the aforementioned shell script, this is a jsonlines file.

Ultimately, I would like to convert this into a csv object where the keys are the columns and the values are the values, where each entry (like the one shown above) is a row. So something like:

url              date                       content   ...
https://url.com  2020-10-02T21:25:20+00:00  mycontent ...

I have tried the solution proposed here but I receive a data frame with key-value pairs like thus:

import pandas as pd
df = pd.DataFrame(data=results)
df = df[0].str.split(',',expand=True)
df = df.rename(columns=df.iloc[0]) 
Tea Tree
  • 882
  • 11
  • 26
  • 1
    There's `DataFrame.to_csv` if you're getting a DataFrame whose results look good. see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html – Nathan Pierson Oct 02 '20 at 21:56
  • Your list contains strings that seem to represent dicts, you may want to `eval` those, @nathanpierson's suggestion should be good after that. Also FYI: "whereas" doesn't mean what you think it means. – Grismar Oct 02 '20 at 22:20
  • The json in the fragment (only a single element) seems to have syntax problems - there's issues with the escaping of quotes for the `"source"` element. The newlines should probably also be escaped as `\\n` – Grismar Oct 02 '20 at 22:25
  • This is, unfortunately, what the script returned, including single escaped newlines. – Tea Tree Oct 02 '20 at 22:32
  • Understandable, but then I'd recommend writing a bit of processing that finds and replaces the mistakes and then pass it on. They are actual errors, so the quality of you data source leaves something to be desired. Replacing the `\n` would be easy, finding the correct replacements for the `"` without escaping may be trickiers, but maybe a straight up replacement with a fitting regex is enough if you're lucky. – Grismar Oct 02 '20 at 22:35

1 Answers1

1

Although your example data contains several issues, if you fix those, this works:

import json
import pandas as pd

fragment = '{"url": "https://url.com", "date": "2020-10-02T21:25:20+00:00", "content": "mycontent\\\\nmorecontent\\\\nmorecontent", "renderedContent": "myrenderedcontent", "id": 123, "username": "somename", "user": {"username": "somename", "displayname": "some name", "id": 123, "description": "my description", "rawDescription": "my description", "descriptionUrls": [], "verified": false, "created": "2020-02-00T02:00:00+00:00", "followersCount": 1, "friendsCount": 1, "statusesCount": 1, "favouritesCount": 1, "listedCount": 1, "mediaCount": 1, "location": "", "protected": false, "linkUrl": null, "linkTcourl": null, "profileImageUrl": "https://myprofile.com/mypic.jpg", "profileBannerUrl": "https://myprofile.com/mypic.jpg"}, "outlinks": [], "outlinks2": "", "outlinks3": [], "outlinks4": "", "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0, "conversationId": 123, "lang": "en", "source": "<a href=\\"mysource.com\\" rel=\\"something\\">Sometext</a>", "media": [{"previewUrl": "smallpic.jpg", "fullUrl": "largepic.jpg", "type": "photo"}], "forwarded": null, "quoted": null, "mentionedUsers": [{"username": "name1", "displayname": "name 1", "id": 345, "description": null, "rawDescription": null, "descriptionUrls": null, "verified": null, "created": null, "followersCount": null, "friendsCount": null, "statusesCount": null, "favouritesCount": null, "listedCount": null, "mediaCount": null, "location": null, "protected": null, "linkUrl": null, "link2url": null, "profileImageUrl": null, "profileBannerUrl": null}]}'

data = json.loads(fragment)
df = pd.DataFrame([data])
df.to_csv('test_out.csv')

Note: the example data has been fixed in this example, changes:

  • " was properly escaped in 'source'
  • \n was escaped as \\\\n, could be \\n as well, but I don't think you want the newlines in your csv

If results is a list of these:

import json
import pandas as pd

results = get_results_somewhere()

df = pd.DataFrame([json.loads(r) for r in results])
df.to_csv('test_out.csv')

If the errors in your input are limited to the above, you could fix them like this:

def fix_input(s):
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', regex.sub(r'(?<=<[^>]*?)(\\)', '', regex.sub('\n', '\\\\\\\\n', s)))

This unescapes the previously escaped \\" inside <> and then replaces all " inside <> with \\" and it also 'fixes' the newlines. If you have trouble understanding why the regexes work the way they do, that's probably a separate question.

The whole thing:

import json
import regex
import pandas as pd


def fix_input(s):
    return regex.sub('(?<=<[^>]*?)(")', r'\\"', regex.sub(r'(?<=<[^>]*?)(\\)', '', regex.sub('\n', '\\\\\\\\n', s)))


results = get_results_somewhere()
fixed_results = fix_input(results)

df = pd.DataFrame([json.loads(r) for r in fixed_results])
df.to_csv('test_out.csv')

Note: this uses the third party regex instead of re since it uses a variable length lookbehind.

Grismar
  • 27,561
  • 4
  • 31
  • 54
  • This works, awesome! Is there a way to flatten the results (e.g. everything from the "user" entry)? Also, why do we put square brackets around the for loop? – Tea Tree Oct 02 '20 at 23:29
  • If by flattening you mean you only want the contents of the "user" element in the dataframe, that can be done by replacing `json.loads(r)` with `json.loads(r)['user']`. If instead you mean you want to flatten "user" up into the rest of the record, you'd need an additional step before turning it into a dataframe. As for the square brackets - that expression is a so-called list comprehension https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions – Grismar Oct 03 '20 at 11:03
  • I ended up using ```df = pd.concat([df.drop(['user'], axis=1), df.user.apply(pd.Series)], axis=1)``` to flatten the nested dictionary. – Tea Tree Oct 03 '20 at 20:20