I have a large JSONL file (6GB+) that I need to convert to .csv format. After running:
import json

with open(root_dir + 'filename.json') as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data)
Many records in the following format are printed:
{'url': 'https://twitter.com/CHItraders/status/945958273861275648', 'date': '2017-12-27T10:03:22+00:00', 'content': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'renderedContent': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - BizNews.com biznews.com/wealth-buildin…', 'id': 945958273861275648, 'username': 'CHItraders', 'user': {'username': 'CHItraders', 'displayname': 'CHItraders', 'id': 185663478, 'description': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'rawDescription': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'descriptionUrls': [], 'verified': False, 'created': '2010-09-01T14:52:28+00:00', 'followersCount': 1196, 'friendsCount': 490, 'statusesCount': 38888, 'favouritesCount': 10316, 'listedCount': 58, 'mediaCount': 539, 'location': 'Chicago, IL', 'protected': False, 'linkUrl': None, 'linkTcourl': None, 'profileImageUrl': 'https://pbs.twimg.com/profile_images/623935252357058560/AaeCRlHB_normal.jpg', 'profileBannerUrl': 'https://pbs.twimg.com/profile_banners/185663478/1437592670'}, 'outlinks': ['http://BizNews.com', 'https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter'], 'outlinksss': 'http://BizNews.com https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter', 'tcooutlinks': ['https :// t.co/C1kEhoLaHr', 'https :// t.co/sZT43PBDrM'], 'tcooutlinksss': 'https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'replyCount': 0, 'retweetCount': 0, 'likeCount': 0, 'quoteCount': 0, 'conversationId': 945958273861275648, 'lang': 'en', 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'media': None, 'retweetedTweet': None, 'quotedTweet': None, 'mentionedUsers': None}
Due to the size of the file, I can't use the conversion:
from io import StringIO
import pandas as pd

with open(root_dir + 'filename.json', 'r', encoding='utf-8-sig') as f:
    data = f.readlines()  # loads the whole file into memory

data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
newdf = pd.read_json(StringIO(data_json_str))
newdf.to_csv(root_dir + 'output.csv')
because it raises a MemoryError. Instead, I am trying to use the generator below and write each record to the CSV one at a time, which should avoid the MemoryError:
def yield_line_delimited_json(path):
    """
    Read a line-delimited json file yielding each row as a record
    :param str path:
    :rtype: Iterator[object]
    """
    with open(path, 'r') as json_file:
        for line in json_file:
            yield json.loads(line)

new = yield_line_delimited_json(root_dir + 'filename.json')
with open(root_dir + 'output.csv', 'w') as f:
    for x in new:
        f.write(str(x))
However, the data is not written in .csv format. Any advice on why the data isn't being written to the file as CSV would be greatly appreciated!
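
For context, f.write(str(x)) writes the Python repr of each dict with no delimiter, quoting, or newline, so the output cannot be valid CSV. Below is a minimal sketch of one possible streaming approach using the standard library's csv.DictWriter, not a definitive solution. The helper names flatten and jsonl_to_csv are hypothetical, and the sketch rests on some assumptions: nested dicts such as 'user' are flattened into dotted column names, lists such as 'outlinks' are stored as JSON strings in a single cell, and the first record defines the columns (fields that first appear in later records are dropped).

```python
import csv
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts, e.g. {'user': {'id': 1}} -> {'user.id': 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            # Assumption: lists are kept as a JSON string in one cell.
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


def jsonl_to_csv(jsonl_path, csv_path):
    """Stream a line-delimited JSON file into a CSV, one record at a time."""
    with open(jsonl_path, "r", encoding="utf-8-sig") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = None
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = flatten(json.loads(line))
            if writer is None:
                # Assumption: the first record carries every column of interest.
                writer = csv.DictWriter(
                    dst,
                    fieldnames=list(row),
                    extrasaction="ignore",  # drop keys not seen in the first record
                    restval="",             # blank cell for keys missing later
                )
                writer.writeheader()
            writer.writerow(row)
```

With the paths from the question, this would be called as jsonl_to_csv(root_dir + 'filename.json', root_dir + 'output.csv'). Only one line is held in memory at a time, so the file size should not matter. If the columns vary between records, a first pass over the file to collect every key would be needed before writing.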