
We're creating gamma-cat, an open data collection for gamma-ray astronomy, and are looking for advice (here, or links to resources, formats, tools, packages) on how best to set it up.

The data we have consists of measurements for different sources, taken from different papers. It's pretty heterogeneous: sometimes one paper contains data for multiple sources, each source usually has several papers, and a given source may have no spectrum, one spectrum, or many.

Currently we just collect the data in an input folder as YAML and CSV files, and now we'd like to expose it to users: mainly for access from Python, but also from JavaScript, and browsable on a static website.

The question is what format and organisation we should use for the data, and whether there are Python packages that would help us generate the output files as a set of linked data, as well as Python and JavaScript packages that would help us access it.

We would like to support multiple "views" or simple "queries" of the data, e.g. "list of all sources", "list of all papers", "list of all spectra for source X", or "spectrum A from paper B for source C".

As for the format, JSON would probably be a good choice? YAML is a bit nicer to read, though, and supports comments and ordered maps. We're storing the output files in a git repo and have had a lot of meaningless diffs for JSON files, because the key order changes all the time.
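As an aside, the key-order diffs can be avoided with the standard library alone; a minimal sketch (the record contents are made-up placeholders):

```python
import json

# Made-up example record; the point is the serialization call below.
record = {"source_id": "source-x", "paper_id": "paper-b"}

# sort_keys=True gives a deterministic key order, so regenerating the
# files no longer produces spurious diffs in the git repo.
with open("example.json", "w") as fh:
    json.dump(record, fh, sort_keys=True, indent=2)
```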

To make the datasets discoverable and linked, I don't know what to use. I found e.g. http://jsonapi.org/ but that seems to be for REST APIs, not for just a series of flat JSON files on a static webserver? Maybe it could still be used that way? I also found http://json-ld.org/ which looks relevant, but also pretty complex. Would either of those or something else be a good choice?
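For reference, a minimal JSON-LD-style record could look like the sketch below; the `@context`, `@id` and `@type` keys are JSON-LD keywords, while every URL and value is a made-up placeholder:

```python
import json

# Sketch of a JSON-LD record: "@context" maps short keys to IRIs,
# "@id" gives the record a stable URL, "@type" states its class.
spectrum = {
    "@context": {
        "name": "http://schema.org/name",
        "source": {"@id": "http://example.org/vocab/source", "@type": "@id"},
    },
    "@id": "http://example.org/spectra/spectrum-a",
    "@type": "http://example.org/vocab/Spectrum",
    "name": "Spectrum A",
    "source": "http://example.org/sources/source-x",
}

print(json.dumps(spectrum, sort_keys=True, indent=2))
```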

And finally, we'd like to use Python scripts to generate the linked and discoverable files in output from the somewhat organised YAML and CSV files in input. So far we've just written a bunch of Python classes and scripts based on Python dicts / lists and YAML / JSON files. Is there a Python package that would help with the task of generating the linked data files?
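To make this concrete, here is a minimal sketch of the kind of generation script we have in mind; the input layout, file names, and keys like `source_id` are made-up placeholders, not our actual schema:

```python
import json
from pathlib import Path

import yaml  # PyYAML

# Hypothetical layout: one YAML file per source under input/sources/.
INPUT = Path("input/sources")
OUTPUT = Path("output")
OUTPUT.mkdir(exist_ok=True)

sources = [yaml.safe_load(p.read_text()) for p in sorted(INPUT.glob("*.yaml"))]

# View 1: list of all sources.
with (OUTPUT / "sources.json").open("w") as fh:
    json.dump([s["source_id"] for s in sources], fh, sort_keys=True, indent=2)

# View 2: one file per source, linking to its spectra by relative path.
for s in sources:
    view = {
        "source_id": s["source_id"],
        "spectra": ["spectra/{}.json".format(sid) for sid in s.get("spectra", [])],
    }
    with (OUTPUT / "{}.json".format(s["source_id"])).open("w") as fh:
        json.dump(view, fh, sort_keys=True, indent=2)
```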

Apologies for the long and complex question! I hope it's still in scope for SO and someone will have some advice to share.

Christoph
  • You have asked multiple questions here, none of which are in scope for SO. – jonrsharpe Jan 09 '17 at 09:09
  • @jonrsharpe - Apologies! Is there another forum besides SO where this question would be OK to ask? The question is long, but IMO it could be answered with just a few lines, pointing to other projects that did something similar and mentioning what formats / tools they used. That would be very helpful to me. – Christoph Jan 09 '17 at 09:12
  • Not on the SE network, as far as I'm aware, maybe a forum would be a better bet. A "list question" like that is not a good fit for SO. – jonrsharpe Jan 09 '17 at 09:19

1 Answer


Judging from the breadth of your question, you are new to linked data. The least "strange" format for you might be the Data Package. In the most common case it's just a zip archive of a CSV file plus JSON metadata, and there is a Python package for working with it.
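For illustration, a minimal `datapackage.json` descriptor could be written like this; the `name`, `resources`, `path` and `schema` fields come from the Data Package spec, while the concrete values are placeholders:

```python
import json

# Minimal Data Package descriptor. The field names follow the Data
# Package spec; the values are placeholders for this sketch.
descriptor = {
    "name": "gamma-cat",
    "resources": [
        {
            "name": "sources",
            "path": "sources.csv",
            "schema": {
                "fields": [
                    {"name": "source_id", "type": "string"},
                    {"name": "ra", "type": "number"},
                    {"name": "dec", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as fh:
    json.dump(descriptor, fh, sort_keys=True, indent=2)
```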

If you need to run queries against the data, you should consider a database (a triplestore) with a SPARQL endpoint. Take a look at Fuseki. You can then use Turtle or RDF/XML for file export.
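Once the data is loaded into Fuseki, it could be queried from Python, e.g. with the SPARQLWrapper package; in this sketch the dataset name and all IRIs are placeholders:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Fuseki serves a dataset at http://localhost:3030/<dataset>/query by
# default; "gammacat" and the IRIs in the query are placeholders.
sparql = SPARQLWrapper("http://localhost:3030/gammacat/query")
sparql.setQuery("""
    SELECT ?spectrum WHERE {
        ?spectrum <http://example.org/vocab/source>
                  <http://example.org/sources/source-x> .
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["spectrum"]["value"])
```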

If the data comes from some kind of tool, you can model the domain it represents using Eclipse Lyo (tutorial).

These tools are maintained by three different communities; you can reach out to their user mailing lists separately if you have further questions about them.

berezovskyi
  • > Judging from the breadth of your question, you are new to linked data. Oh yes. Thanks for the answer! I'll check it out in the coming ~ day. – Christoph Jan 09 '17 at 13:37
  • Is there an example of a "data package" with many interlinked files? Do the links ("path") always go from the central "datapackage.json" to each file, or can there be links from different files to other files? – Christoph Jan 10 '17 at 00:33
  • Currently not; multiple files in a data package have to be of the same shape and form: http://specs.frictionlessdata.io/data-packages/#data-in-multiple-files – berezovskyi Jan 10 '17 at 11:06
  • Then "data package" doesn't work for us. We have different pieces of data (e.g. a few different types of tables, but also JSON/YAML files), see files in https://github.com/gammapy/gamma-cat/tree/master/input/data). What's the "next simplest" format that can expose this data as a linked dataset I could look at? – Christoph Jan 10 '17 at 11:19
  • I would suggest converting all the data into RDF and using a file format like http://www.rdfhdt.org/. It should also work well for large volumes of measurement data. As a benefit, you get into proper linked data and gain the ability to run SPARQL queries for "free" (a minimal conversion sketch follows below). – berezovskyi Apr 25 '17 at 10:06
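A minimal sketch of that conversion step with rdflib; the vocabulary namespace and property names are hypothetical, and the HDT export would be a separate step using the rdfhdt command-line tools:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical vocabulary for the catalogue; every IRI is a placeholder.
GCAT = Namespace("http://example.org/gamma-cat/vocab/")

g = Graph()
source = URIRef("http://example.org/gamma-cat/sources/source-x")
g.add((source, RDF.type, GCAT.Source))
g.add((source, GCAT.name, Literal("Source X")))

# Serialize as Turtle; the .ttl file can then be loaded into Fuseki,
# or converted to HDT with the rdf2hdt command-line tool.
g.serialize(destination="sources.ttl", format="turtle")
```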