We're working with an API-based data provider that lets us analyze large sets of GIS data against provided GeoJSON areas and specified timestamps. When the provider finishes aggregating the data, it marks the report as complete and alerts our service via a callback URL. From there, we have a list of the reports we've run, each with its download link. One of the reports we need to work with is a TSV file with 4 columns, and looks like this:
deviceId | timestamp | lat | lng
Sometimes, if the area we're analyzing is large enough, these files can be 60+ GB. The download URL points to a zipped version of the file, so we can't read it directly from the URL. We need the data in this TSV grouped by deviceId and sorted by timestamp so we can route along road networks using the lat/lng pairs in our routing service. We've used JavaScript for most of our application so far, but this service poses unique problems that may require additional software and/or languages.
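To be concrete about the streaming part: the response can be decompressed in flight, so the uncompressed file never has to exist on disk or in memory all at once. This is a minimal sketch, assuming the archive is gzip-compressed (a true .zip container can't be gunzipped this way and would need a streaming unzip library instead); REPORT_URL is a placeholder for the signed download link:

```js
const https = require('https');
const zlib = require('zlib');
const readline = require('readline');

const REPORT_URL = 'https://example.com/reports/123.tsv.gz'; // placeholder

https.get(REPORT_URL, (res) => {
  // Decompress in flight so the 60+ GB file never lands anywhere uncompressed.
  const lines = readline.createInterface({ input: res.pipe(zlib.createGunzip()) });
  lines.on('line', (line) => {
    const [deviceId, timestamp, lat, lng] = line.split('\t');
    // ...process one row at a time here
  });
});
```

Reading this way keeps memory flat, but it's inherently a single sequential stream, which is part of why we started fanning work out to other cores (described below).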
Curious how others have approached the problem of handling and processing data of this size.
We've tried downloading the file, reading it with a stream, and allocating all the available cores on the machine to process batches of the data in parallel. This works, but it's not nearly as fast as we would like (even with 36 cores).
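Roughly, that attempt looks like the sketch below (heavily simplified; BATCH_SIZE and the ./process-batch.js worker script are placeholders, not our real values):

```js
const { Worker } = require('worker_threads');
const os = require('os');
const fs = require('fs');
const readline = require('readline');

const BATCH_SIZE = 10000; // placeholder
const workers = Array.from({ length: os.cpus().length }, () => new Worker('./process-batch.js'));
let batch = [];
let next = 0;

const rl = readline.createInterface({ input: fs.createReadStream('report.tsv') });
rl.on('line', (line) => {
  batch.push(line);
  if (batch.length >= BATCH_SIZE) {
    workers[next++ % workers.length].postMessage(batch); // round-robin batches across cores
    batch = [];
  }
});
rl.on('close', () => {
  if (batch.length) workers[next % workers.length].postMessage(batch);
  workers.forEach((w) => w.postMessage(null)); // signal end of input; workers exit on null
});
```

Each worker only sees its own batch, so the global group-by deviceId / sort-by-timestamp still has to happen in a separate pass afterwards, which seems to be where most of our time goes.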