Recently I've been having to scrape significantly larger amounts of data and changed from using the feed format 'json' to 'jsonlines' to avoid having it all scrambled and duplicated. The issue is that now none of my programs recognize the exported files as JSON since it removes the beginning and end square brackets and the comma after each item. The first example shows what the data looks like, the second what I would like to achieve.

    {"name": "Color TV", "price": "1200"}
    {"name": "DVD player", "price": "200"}

    ---------------------------------------

    {"data" : [
    {"name": "Color TV", "price": "1200"},
    {"name": "DVD player", "price": "200"},
    {"name": "Color TV", "price": "1200"}
    ]}

Is there a way of manually adding the commas and making it an array while still using the JsonLinesItemExporter?

The only piece of code from my crawler I'd imagine is relevant is my yield statement, but I'm happy to show the full code. I'm not using PHP or MySQL.

Thank you very much in advance.

    yield {
            "name": name,
            "old_price": old_price,
            "discount_price": discount_price
        }
  • You could wrap or modify `JsonLinesItemExporter`. Or you could wrap the file to translate every `\n` into `,\n`, which may be easier even though it's a lot hackier. But you still need to get the square brackets at start and end, don't you? – abarnert Mar 20 '18 at 18:42
  • Or you could use a streaming JSON encoder instead of a JSONlines encoder. – abarnert Mar 20 '18 at 18:43
  • Both the first solutions seem like they could work. I need to use JSON lines otherwise my data gets scrambled. How would I go about wrapping/modifying the JsonLinesItemExporter? If you have a link to some documentation or an example that would be much appreciated. – Daniel Johnson Maia Mar 20 '18 at 19:02
  • I found the exporters.py file and added in the comma before the `\n`, thank you very much for your solution. Still trying to find a way to add the square brackets. – Daniel Johnson Maia Mar 20 '18 at 19:23
  • I don't know the API for the lib, or how you're using it, but… if you just create a file, wrap an exporter around it, keep it open until the end, close it at shutdown, and never write to that file again, it should be pretty easy—just add the `[\n` in `__init__` and `]\n` in `close`. If you need to reopen and append to the same file repeatedly, that's a bit more complicated, and you'd probably want to do it out of band instead of as part of the exporter. – abarnert Mar 20 '18 at 19:57

1 Answer

First, the commas.

The nicest solution would be to wrap JsonLinesItemExporter so that it adds a comma at the end of each item.

If the appropriate method isn't exposed in a way that you can override it, super it, and add the comma, you may have to reimplement the method in your subclass, or even monkeypatch the exporter class. Less nice.

Alternatively, you can hook the file you pass into the exporter to make writes do a replace('\n', ',\n'). This is hacky, so I wouldn't do it if you can hook the exporter instead, but it does have the virtue of being simple.
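As a rough sketch of the file-hooking idea, assuming the exporter only ever calls `write` with bytes on the file it is given: the class name `CommaAppendingFile` is made up for illustration, and an in-memory `io.BytesIO` stands in for the real output file.

```python
import io


class CommaAppendingFile:
    """Hypothetical wrapper around a binary file object.

    Every write is forwarded to the wrapped file with each newline
    replaced by a comma followed by a newline.
    """

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def write(self, data):
        return self._wrapped.write(data.replace(b'\n', b',\n'))

    def __getattr__(self, name):
        # Anything other than write (close, flush, ...) goes to the real file.
        return getattr(self._wrapped, name)


# Stand-in for the file you would normally hand to the exporter.
buffer = io.BytesIO()
out = CommaAppendingFile(buffer)
out.write(b'{"name": "Color TV", "price": "1200"}\n')
out.write(b'{"name": "DVD player", "price": "200"}\n')
result = buffer.getvalue()
```

Note that this also puts a comma after the last item, so it still needs the brackets, and the trailing comma has to be dealt with before a strict JSON parser will accept the file.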


Now, the brackets at start and end of file. Without knowing the library you're using or the way you're using it, this will be pretty vague.

If you're using a single "session" of the exporter per file (that is, you open it at startup, write a bunch of items to it, then close it, and never reopen it to append), this is pretty easy. Let's assume you solved the first problem by subclassing the exporter class to hook its writes, something like this:

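class JsonArrayExporter(JsonLinesItemExporter):
    # _write_bytes is a guessed hook name; use whatever method actually writes a line.
    def _write_bytes(self, encoded_bytes):
        # Add a comma before each newline, then let the parent do the real write.
        encoded_bytes = encoded_bytes.replace(b'\n', b',\n')
        return super()._write_bytes(encoded_bytes)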

I'm guessing at what the implementation looks like, but you've already discovered the right thing to do, so you should be able to translate from my guess to reality. Now, you need to add two methods like this:

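    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Go through the parent's writer so the comma-adding override
        # doesn't turn the opening bracket into '[,\n'.
        super()._write_bytes(b'[\n')

    def close(self):
        if not self.closed():
            super()._write_bytes(b']\n')
        super().close()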

You may need a flush somewhere before the _write_bytes call if the exporter class has its own buffer inside it, but that's the only extra complexity I'd expect to see.
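One caveat: a comma after every item leaves a trailing comma before the `]`, which strict JSON parsers reject. Here is a self-contained sketch of one way around that, writing the separator before every item except the first. `LineExporter` is a hypothetical stand-in for the real exporter (the real class and its method names will differ), and the `_first` and `_finished` attributes are likewise invented for the example.

```python
import json
import os
import tempfile


class LineExporter:
    """Hypothetical stand-in for a JSON-lines exporter (not the real class)."""

    def __init__(self, file):
        self.file = file

    def export_item(self, item):
        self._write_bytes(json.dumps(item).encode('utf-8') + b'\n')

    def _write_bytes(self, encoded_bytes):
        self.file.write(encoded_bytes)

    def close(self):
        self.file.close()


class JsonArrayExporter(LineExporter):
    def __init__(self, file):
        super().__init__(file)
        self._first = True
        self._finished = False
        self.file.write(b'[\n')

    def _write_bytes(self, encoded_bytes):
        # Separator goes before every item except the first,
        # so nothing trails the last item.
        if not self._first:
            super()._write_bytes(b',\n')
        self._first = False
        super()._write_bytes(encoded_bytes.rstrip(b'\n'))

    def close(self):
        # Write the closing bracket only once, then close the file.
        if not self._finished:
            self.file.write(b'\n]\n')
            self._finished = True
        super().close()


path = os.path.join(tempfile.mkdtemp(), 'items.json')
exporter = JsonArrayExporter(open(path, 'wb'))
exporter.export_item({"name": "Color TV", "price": "1200"})
exporter.export_item({"name": "DVD player", "price": "200"})
exporter.close()

# The result is now a plain JSON array that json.load accepts.
with open(path, 'rb') as f:
    items = json.load(f)
```

If you want the `{"data": [...]}` shape from the question, the same idea works with `{"data": [` and `]}` as the opening and closing markers.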

If you're reopening files and appending to them in each session, this obviously won't work. You could do something like this pseudocode in __init__:

if file is empty:
    write('[\n')
else:
    seek to end of file
    if last two bytes are ']\n':
        seek back 2 bytes

That has the advantage of being transparent to your client code, but it's a bit hacky. If your client code knows when it's opening a new file rather than appending to an old one, and knows when it's finishing off a file for good, it's probably cleaner to add addStartMarker and addEndMarker methods and call those, or just have the client manually write the brackets to the file before initializing/after closing the exporter.
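For the reopen-and-append case, the pseudocode could be fleshed out roughly as follows. This is only a sketch: `append_items` is a hypothetical helper, not part of any exporter, and it assumes each session ends by writing `]\n`. It puts the comma at the start of each item after the first, so the array stays valid JSON after every session.

```python
import json
import os
import tempfile


def append_items(path, items):
    """Hypothetical helper: append items to a JSON array stored at path."""
    # 'r+b' keeps existing content and allows seeking back over the bracket;
    # append mode ('ab') would force every write to the end of the file.
    mode = 'r+b' if os.path.exists(path) else 'w+b'
    with open(path, mode) as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size == 0:
            f.write(b'[\n')
        elif size >= 2:
            f.seek(-2, os.SEEK_END)
            if f.read(2) == b']\n':
                # Drop the closing bracket left by the previous session.
                f.seek(-2, os.SEEK_END)
                f.truncate()
        # Anything beyond the opening '[\n' means items already exist.
        has_items = f.tell() > len(b'[\n')
        for item in items:
            prefix = b',' if has_items else b''
            f.write(prefix + json.dumps(item).encode('utf-8') + b'\n')
            has_items = True
        f.write(b']\n')


path = os.path.join(tempfile.mkdtemp(), 'items.json')
append_items(path, [{"name": "Color TV", "price": "1200"}])
append_items(path, [{"name": "DVD player", "price": "200"},
                    {"name": "Color TV", "price": "1200"}])

with open(path, 'rb') as f:
    loaded = json.load(f)
```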

abarnert
  • I'm fairly certain you just solved my problem by telling me to subclass the exporter class and add the overrides. I'll try to see if that works as soon as I am able to. I am not reopening files and appending once they've been closed, so your solution is ideal. Do you know where I might find some documentation or an example showing how to add `__init__` and `close` overrides and use them to write something? I've got fairly simple code to parse the data, just a for loop with CSS selectors and then a yield statement to define the 3 fields I'm using. Would I add the overrides in that parse function? – Daniel Johnson Maia Mar 21 '18 at 13:33
  • @DanielJohnsonMaia I've added sample code for `__init__` and `close` overrides. The exact details are unlikely to be the same in your real code, but since you didn't provide your real code or a link to the library you're using, I have to guess something. – abarnert Mar 21 '18 at 16:54