
We're working with an API-based data provider that lets us analyze large sets of GIS data against provided GeoJSON areas and specified timestamps. When the provider finishes aggregating the data, it marks the report as complete and notifies our service via a callback URL. From there, we have a list of the reports we've run with their relevant download links. One of the reports we need to work with is a TSV file with 4 columns that looks like this:

deviceId | timestamp | lat | lng

Sometimes, if the area we're analyzing is large enough, these files can be 60+ GB. The download link points to a zipped version of the file, so we can't read it directly from the download URL. We're trying to get the data in this TSV grouped by deviceId and sorted by timestamp so we can route along road networks using the lat/lng pairs in our routing service. We've used JavaScript for most of our application so far, but this service poses unique problems that may require additional software and/or languages.

Curious how others have approached the problem of handling and processing data of this size.

We've tried downloading the file, reading it through a ReadStream, and allocating all the available cores on the machine to process batches of the data individually. This works, but it's not nearly as fast as we would like (even with 36 cores).

Fell
  • I would start by doing some rough math to determine how fast each file entry has to be processed (for a given file size) in order for the time consumed to be within your acceptable range. Your problem involves dealing with a data set and the time consumed is going to be intrinsically proportional to the size; I don't think there's any way around that. – Pointy Jun 06 '19 at 19:15

1 Answer


From Wikipedia:

Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because ... only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures.

In other words, if you try to do it without looking at the end of the zip file first, you may end up accidentally including deleted files. So you can't trust streaming unzippers. However, if the zip file hasn't been modified since it was created, perhaps streaming parsers can be trusted. If you don't want to risk it, then don't use a streaming parser. (Which means you were right to download the file to disk first.)
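To illustrate what "looking at the end first" means: a correct reader scans backwards from the end of the file for the End of Central Directory (EOCD) signature. A minimal Node.js sketch, operating on an in-memory buffer for brevity:

```javascript
// Locate the End of Central Directory (EOCD) record by scanning
// backwards from the end, as the ZIP spec requires. The signature is
// 0x06054b50 (little-endian "PK\x05\x06").
function findEOCD(buf) {
  // The EOCD is at least 22 bytes; a variable-length comment can push
  // it back by up to 65535 bytes, so scan at most that far from the end.
  const minOffset = Math.max(0, buf.length - 22 - 65535);
  for (let i = buf.length - 22; i >= minOffset; i--) {
    if (buf.readUInt32LE(i) === 0x06054b50) return i;
  }
  return -1; // not a valid ZIP archive
}

// A valid empty ZIP archive is just a 22-byte EOCD record.
const emptyZip = Buffer.alloc(22);
emptyZip.writeUInt32LE(0x06054b50, 0);
console.log(findEOCD(emptyZip)); // 0
```

In practice you'd use a library that does this for you; the point is that it needs random access to the end of the file, which is why the archive has to be on disk first.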

To some extent it depends on the structure of the zip archive: if it consists of many moderately sized files, and if they can all be processed independently, then you don't need to have very much of it in memory at any one time. On the other hand, if you try to process many files in parallel then you may run into the limit on the number of file handles that can be open. But you can get round this using something like a queue.
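By "something like a queue" I mean a concurrency limiter; libraries such as p-limit or async's queue do this, but the idea fits in a few lines (names here are illustrative):

```javascript
// Minimal concurrency limiter: at most `limit` tasks run at once, the
// rest wait in a queue. This bounds the number of simultaneously open
// file handles when processing many archive entries in parallel.
function createLimiter(limit) {
  let running = 0;
  const waiting = [];
  const next = () => {
    if (running >= limit || waiting.length === 0) return;
    running++;
    const { task, resolve, reject } = waiting.shift();
    task().then(resolve, reject).finally(() => {
      running--;
      next();
    });
  };
  // Returns a promise for the task's eventual result.
  return (task) => new Promise((resolve, reject) => {
    waiting.push({ task, resolve, reject });
    next();
  });
}

// Usage sketch: `processEntry` is a placeholder for whatever reads and
// handles one file from the archive.
// const run = createLimiter(100);
// await Promise.all(entries.map((e) => run(() => processEntry(e))));
```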

You say you have to sort the data by device ID and timestamp. That's another part of the process that can't be streamed. If you need to sort a large list of data, I'd recommend you save it to a database first; that way you can make it as big as your disk will allow, but also structured. You'd have a table where the columns are the columns of the TSV. You can stream from the TSV file into the database, and also index the database by deviceId and timestamp. And by this I mean a single index that uses both of those columns, in that order.
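As a sketch of that schema and index (table and column names are just the TSV's; any SQL database works the same way):

```sql
-- Table mirroring the TSV columns.
CREATE TABLE pings (
  deviceId  TEXT    NOT NULL,
  timestamp INTEGER NOT NULL,
  lat       REAL    NOT NULL,
  lng       REAL    NOT NULL
);

-- One composite index covering both the grouping and the sort order.
CREATE INDEX idx_pings_device_ts ON pings (deviceId, timestamp);

-- Reading the data back grouped by device and sorted by time is then
-- a single index-ordered scan, with no separate sort step.
SELECT deviceId, timestamp, lat, lng
FROM pings
ORDER BY deviceId, timestamp;
```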

If you want a distributed infrastructure, maybe you could store different device IDs on different disks with different CPUs etc ("sharding" is the word you want to google). But I don't know whether this will be faster. It would speed up the disk access. But it might create a bottleneck in network connections, through either latency or bandwidth, depending on how interconnected the different device IDs are.

Oh, and if you're going to be running multiple instances of this process in parallel, don't forget to create separate databases, or at the very least add another column to the database to distinguish separate instances.

David Knipe