I have parsed JSON-LD (structured) data from the tool Screaming Frog. The format this tool exports the data in is unworkable, as the parent/child relationship (cross-reference) is not on one row in Excel. Edit: this serialized format is called N-Triples. Below is an example of the output, where the subject column holds the index relationships:

Subject     Predicate                           Object
subject27   schema.org/aggregateRating          subject28
subject27   schema.org/offers                   subject29
subject27   schema.org/operatingSystem          ANDROID
subject27   type                                schema.org/SoftwareApplication
subject28   schema.org/ratingCount              15559
subject28   schema.org/ratingValue              3.597853422
subject28   type                                schema.org/AggregateRating
subject29   schema.org/price                    0
subject29   type                                schema.org/Offer

Below is an example of the final output required, where each nested level (up to 4 deep) is mapped into its own pair of columns, repeating the parent path information on every row:

Predicate L1                Object L1                       Predicate L2            Object L2
type                        schema.org/SoftwareApplication      
schema.org/operatingSystem  ANDROID     
schema.org/aggregateRating  subject28                       schema.org/ratingCount  15559
schema.org/aggregateRating  subject28                       schema.org/ratingValue  3.597853422
schema.org/aggregateRating  subject28                       type                    schema.org/AggregateRating
schema.org/offers           subject29                       schema.org/price        0
schema.org/offers           subject29                       type                    schema.org/Offer

I have looked for existing unflatten solutions, but these either expect the path information stored in a single column (with each "lowest level value" on its own row) or don't rebuild the original structure based on the indices.

I am planning to do this with a combination of for loops and SQL JOINs, but I feel there must be a more elegant solution. This could be in Python, PHP, JS or SQL, or a combination, or even inserting each "subject" into a MongoDB document and applying a merge operation there. A rough sketch of the self-join idea I have in mind follows below.
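
To illustrate the direction rather than commit to it, here is a minimal, untested sqlite3 sketch of that self-join idea; the table name is a placeholder, and every additional nesting level would need one more LEFT JOIN:

import sqlite3

# untested sketch: load the sample triples and self-join one level deep
triples = [
    ('subject27', 'schema.org/aggregateRating', 'subject28'),
    ('subject27', 'schema.org/offers', 'subject29'),
    ('subject27', 'schema.org/operatingSystem', 'ANDROID'),
    ('subject27', 'type', 'schema.org/SoftwareApplication'),
    ('subject28', 'schema.org/ratingCount', '15559'),
    ('subject28', 'schema.org/ratingValue', '3.597853422'),
    ('subject28', 'type', 'schema.org/AggregateRating'),
    ('subject29', 'schema.org/price', '0'),
    ('subject29', 'type', 'schema.org/Offer'),
]

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (subject TEXT, predicate TEXT, object TEXT)')
con.executemany('INSERT INTO t VALUES (?, ?, ?)', triples)

# root subjects are those that never appear as an object;
# each deeper level repeats the LEFT JOIN pattern
rows = con.execute("""
    SELECT t1.predicate AS 'Predicate L1', t1.object AS 'Object L1',
           t2.predicate AS 'Predicate L2', t2.object AS 'Object L2'
    FROM t AS t1
    LEFT JOIN t AS t2 ON t2.subject = t1.object
    WHERE t1.subject NOT IN (SELECT object FROM t)
""").fetchall()

for row in rows:
    print(row)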

Edit: Updating the title to optimize this question for search. The serialized format of the RDF and JSON-LD data I am working with is called N-Triples. Read more here: https://medium.com/wallscope/understanding-linked-data-formats-rdf-xml-vs-turtle-vs-n-triples-eb931dbe9827

Aldert
  • Your links are broken. I'm sure you can convey the necessary information without resorting to screenshots. – Tomalak Sep 03 '19 at 16:10
  • Thanks for flagging. Fixing + adding the data without resorting to an image (it says more than a 100 words though :) – Aldert Sep 03 '19 at 16:22
  • Giving people some actual text they can copy is always better. And it increases visibility to search engines. – Tomalak Sep 03 '19 at 16:27
  • You are right! Moved it into a code snippet. – Aldert Sep 03 '19 at 16:39
  • Is the data in the first sample the original output format from your tool, or just what Excel has turned it into? – Tomalak Sep 03 '19 at 16:56
  • Also would it not make sense to work with the input JSON directly instead of trying to fix the broken output of some tool? – Tomalak Sep 03 '19 at 17:04
  • It is the output of the tool. It includes one more column, namely "URL". So for each URL the tool crawls, it will have all the JSON-LD data in a flat structure. This can easily grow to hundreds of thousands of rows. Additionally there is some validation data in columns on the right. But once I know the best approach to rebuild the levels of the original structure, moving the additional meta-data is easy. – Aldert Sep 03 '19 at 17:04

1 Answer

This is probably all kinds of ugly and very surely un-pythonic in more ways than one, but it gets the job done on your sample data:

import re

def group_items(items, prop):
    group = {}
    for item in items:
        key = item[prop]
        if key not in group:
            group[key] = []
        group[key].append(item)
    return group

with open('input.txt', encoding='utf8') as f:
    # analyze column widths on the example of the header row
    # this allows for flexible column widths in the input data
    header_row = next(f)
    columns = re.findall(r'\S+\s*', header_row.rstrip('\n'))
    i = 0
    cols = []
    headers = []
    for c in columns:
        headers.append( c.strip() )
        cols.append( [i, i + len(c)] )
        i += len(c)
    cols[-1][1] = 100000   # generous data length for last column

    # extract one item per line, using those column widths
    items = []
    for line in f:
        item = {}
        for c, col in enumerate(cols):
            item[headers[c]] = line[col[0]:col[1]].strip()
        items.append(item)

# group items to figure out which ones are at the root
items_by_subject = group_items(items, 'Subject')
items_by_object = group_items(items, 'Object')

# root keys are those that are not anyone else's subject
root_keys = set(items_by_subject.keys()) - set(items_by_object.keys())
root_items = [items_by_subject[k] for k in root_keys]

# recursive function to walk the tree and determine the leafs
leafs = []
def unflatten(items, parent=None, level=1):
    for item in items:
        item['Parent'] = parent
        item['Level'] = level
        key = item['Object']
        if key in items_by_subject:
            unflatten(items_by_subject[key], item, level+1)
        else:
            leafs.append(item)

# ...which needs to be called for each group of root items
for group in root_items:
    unflatten(group)

# this is not limited to 4 levels
max_level = max(item['Level'] for item in leafs)

# recursive function to fill in parent data
def fill_data(item, output):
    parent = item['Parent']
    if parent is not None:
        fill_data(parent, output)
    output['Predicate L%s' % item['Level']] = item['Predicate']
    output['Object L%s' % item['Level']] = item['Object']

# ...which needs to be called once per leaf
result = []
for leaf in reversed(leafs):
    output = {}
    for l in range(1, max_level + 1):
        output['Predicate L%s' % l] = None
        output['Object L%s' % l] = None
    fill_data(leaf, output)
    result.append(output)

# output result
for item in result:
    print(item)

Given your sample input as input.txt, this is the output I get:

{'Predicate L1': 'type', 'Object L1': 'schema.org/SoftwareApplication', 'Predicate L2': None, 'Object L2': None}
{'Predicate L1': 'schema.org/operatingSystem', 'Object L1': 'ANDROID', 'Predicate L2': None, 'Object L2': None}
{'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29', 'Predicate L2': 'type', 'Object L2': 'schema.org/Offer'}
{'Predicate L1': 'schema.org/offers', 'Object L1': 'subject29', 'Predicate L2': 'schema.org/price', 'Object L2': '0'}
{'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'type', 'Object L2': 'schema.org/AggregateRating'}
{'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'schema.org/ratingValue', 'Object L2': '3.597853422'}
{'Predicate L1': 'schema.org/aggregateRating', 'Object L1': 'subject28', 'Predicate L2': 'schema.org/ratingCount', 'Object L2': '15559'}

I'll leave putting this into some sort of file as an exercise.
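
In case a starting point helps: here is a minimal sketch with the stdlib csv module (the output file name is just a placeholder), reusing the result and max_level variables from above:

import csv

# build the header row: Predicate L1, Object L1, Predicate L2, Object L2, ...
fieldnames = []
for l in range(1, max_level + 1):
    fieldnames += ['Predicate L%s' % l, 'Object L%s' % l]

# write one CSV row per leaf; 'output.csv' is a placeholder name
with open('output.csv', 'w', newline='', encoding='utf8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(result)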

Tomalak
  • Amazing! Totally different from the basic loops I had in mind, so a great example to learn from! I will have to plug in the data from CSV, as some fields will have text with spaces etc. (I just shared it in this format for readability purposes; will give the data exactly as output next time :D). – Aldert Sep 03 '19 at 19:04
  • Take it with a grain of salt, it only represents my level of understanding of the problem, and it's not elegant code by any measure. But by all means, do wrap your head around it, so you are able to adapt it when necessary. – Tomalak Sep 03 '19 at 19:09
  • If the input data is CSV, you can use Python's `csv` module to read it and throw out a good chunk of my code (see the sketch after these comments). Next time please always give the input data *exactly* as you have it and the output data *exactly* as needed, it will save time for everybody involved, yourself included. – Tomalak Sep 03 '19 at 19:16
  • Understood! Indeed instead of trying to be helpful, it actually increases work for both. Again thanks so much and will study your solution once the sun is up again. What part would specifically be considered unpythonic? – Aldert Sep 03 '19 at 19:27
  • I don't know. People who are better at writing Python code than I am can probably point out tons of stuff that can be done much more elegantly. :) – Tomalak Sep 03 '19 at 19:29
  • P.S. now you have enough reputation to be able to insert images in posts. Use it sparingly though, using plain text is almost always better. – Tomalak Sep 03 '19 at 19:36
  • Hi @Tomalak, so I found the keyword for this specific format: it is called N-Triples (https://en.wikipedia.org/wiki/Resource_Description_Framework#Serialization_formats). I am using your code, but anyone wanting an existing solution would probably take the route of deserializing the data to JSON and using that to create the CSV (see the rdflib sketch after these comments). – Aldert Sep 06 '19 at 09:59
  • Thanks for the update! Also good to hear that you got it working and that it does what you need. – Tomalak Sep 06 '19 at 10:37
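
Following up on the csv module suggestion in the comments above, a minimal, untested sketch of the reading side; the file name and the header names are assumptions based on the sample data. It would replace the fixed-width parsing block in the answer:

import csv

# read the export as CSV; extra columns such as 'URL' are simply carried
# along in each dict and ignored by the rest of the code
with open('input.csv', newline='', encoding='utf8') as f:
    items = list(csv.DictReader(f))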
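
And for anyone taking the deserialization route from the last comments: an untested sketch using the third-party rdflib library (pip install rdflib). It assumes the data is available as syntactically valid N-Triples (input.nt is a placeholder), not the tool's column layout:

from rdflib import Graph

# parse a real N-Triples file (URIs in angle brackets, dot-terminated lines)
g = Graph()
g.parse('input.nt', format='nt')

# the graph iterates as (subject, predicate, object) triples
for subject, predicate, obj in g:
    print(subject, predicate, obj)

# rdflib (version 6 and later) can also re-serialize straight to JSON-LD
print(g.serialize(format='json-ld'))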