I am trying to find a software solution that will allow our group to easily upload datasets (scriptable and or through some UI), tag those datasets, retrieve those datasets, access control for the datasets, search the tags, search the files name/attributes/metadata (e.g. file creation date). The datasets can be anything from CSV files, image(binary) datasets, texts, server logs, folders within folders of images, zip files of csv data. It can be anything. We will need to be storing GBs to potentially PBs of data. A single file can range from a few KB to 100's of GB. Usable API to retrieve these datasets programmatically.
We just want to have a centralized location of finding information and we want to be able to answer a question such as "Hey do you know if we have any lightening strike datasets?" If there is a file/folder/zip file tagged with "lightening" when I search it should pull back that dataset.
A possible solution would be something like Dataverse, Dspace, Fedora Commons, CKAN. However, those seem to be really geared towards academia and publications or small datasets. On top of that they remove any type of complex folder structure that might exist (e.g. Folder1-->subFolder1-->subFolder2). I also question the scalability of having a 10 million 100kb files within one of these systems.
A filesystem share would allow us to simply store whatever we want but I don't know of a reasonable way of enabling tagging of data.
It is almost like I am looking for a combination of the two. Does someone know of a tool preferably open source that would be able to do something like this?