
I have more than a million images that I would like to use as training data. How do I make this data freely available without compromising security?

I want users to be able to use it quickly for training purposes, without giving hackers a chance to rebuild the original images from the open-sourced data. At the same time, I do not want the training quality to be affected in any way.

In other words how do I safely open-source images?


For example, this code generates a numpy array. I just want to make it very difficult to reconstruct the original image from the ndarray x in this case.

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

# Load the image, convert it to a numpy array, and add a leading batch dimension
i = load_img('some_image.jpg')
x = img_to_array(i)
x = x.reshape((1,) + x.shape)

I can share the array x once I know that hackers cannot use the data to recreate the original image.

shantanuo
  • What do you mean by "without giving hackers a chance to rebuild images from the open source data"? When you say publish, do you mean available on the internet with or without access restrictions? Could you elaborate on the term "source" to better define it in your context? Also, do you have to address intellectual property or copyright concerns? – jlandercy Apr 28 '19 at 06:26
  • One way to do it is to train a GAN-style model which samples from a random noise vector z, so you don't actually release the images directly. The generation quality can be improved if you have conditioning attributes such as hair styles for human faces. But it's a bit tricky to say whether such a model can successfully encode the capacity of 1 million images, i.e. the training accuracy when using such generated images might very well be lower than when using the actual images. You can also do some kind of clustering on the original dataset and release multiple models. Just a thought – Zaw Lin Apr 28 '19 at 19:15
  • Maybe an XY problem. What I cannot understand at the moment is what you mean by safely open-source images. There should be no limitation on distribution (anyone can have access to open source; that is directly linked to the concept itself). Using the keyword safe may suggest that you have sensitive pictures you do not want to or cannot disclose (GDPR, IP, copyright, etc.), so how could you release them as open source? – jlandercy Apr 29 '19 at 14:03
  • @jlandercy, just to clarify, by noise vector z I do not mean adversarial noise or any form of spatial noise at all. It is a Gaussian noise distribution onto which you map an image. It's a pretty standard technique in variational autoencoders. I know for a fact that such techniques have been employed by companies in the medical field that cannot disclose original images due to privacy concerns but would like to give the images to another company for a different task. In a sense, images are just a distribution, so you just model that distribution using a network. – Zaw Lin Apr 30 '19 at 09:44

2 Answers


If you aim to publish open-source pictures, a good start would be to understand how WikiCommons works. They have had to face many challenges of this kind, and there is a lot to learn from them.

If your audience needs the complete picture to be served for their models to work, then it does not matter how you try to obfuscate the array containing the data: smart people with enough time and creativity will be able to reconstruct the original picture. This is not a viable solution; it only provides a false sense of security.

If you choose a destructive approach, serving not the actual picture but some digest/hash/fingerprint of it, then you will probably reduce the risk of the original picture being reconstructed (beware, there are very clever people with strong cryptographic skills). But then your audience will not be able to learn from the picture itself, so you may not achieve your goal.
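For illustration, a minimal sketch of this destructive approach in Python, assuming x is the array from the question: only a one-way fingerprint is published, which cannot be inverted back into pixels, but which also carries nothing a model could learn from.

import hashlib
import numpy as np

def image_digest(x: np.ndarray) -> str:
    """Return a SHA-256 fingerprint of an image array.

    The digest is one-way: it cannot be turned back into pixels,
    but it also contains no visual information to train on.
    """
    return hashlib.sha256(x.tobytes()).hexdigest()

# x is the array produced by img_to_array in the question
# print(image_digest(x))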

Less destructive, though it may not fit your requirements: adding noise. It will not prevent disclosure of sensitive material (the human eye and brain are rather good at classification) and it is a well-known technique for confusing AI models. Not a good solution either.
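As a rough sketch of the noise approach (not a recommendation, for the reasons above), again assuming x holds pixel values in the 0-255 range as in the question:

import numpy as np

def add_gaussian_noise(x: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Add zero-mean Gaussian noise to a pixel array with values in 0-255.

    The content often remains recognisable to a human eye, and the noise
    can hurt training quality, which is why this is not a good solution.
    """
    noisy = x + np.random.normal(0.0, sigma, size=x.shape)
    return np.clip(noisy, 0, 255).astype(x.dtype)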

In any case, if you carelessly serve sensitive material that does not fit open source, you may get yourself and other people into trouble. This is not a good option.

My advice:

  • If your pictures really suit an open-source policy, then serve them as they are and do not worry about hackers; they are consumers as well;
  • If your pictures are sensitive, then do not serve them as open source. Instead, provide a framework with a layer of security and implement the regulations you must take into account (ToS, IP, copyright, GDPR).
jlandercy

All machine learning algorithms take the real images, convert them to tensors, and process them in batches (multiple images at a time).
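For example, using the ImageDataGenerator already imported in the question (the directory name, image size and batch size below are placeholders), images are streamed from disk and converted to tensors batch by batch:

from keras.preprocessing.image import ImageDataGenerator

# Stream images from a directory as normalised tensors, 32 at a time
batches = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    'images/',               # placeholder path to the training images
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
)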

A couple of options for you:

  • You can share your images with your teammates and rely on trust.
  • You can obfuscate the images as a bunch of files, or you can create an algorithm that converts them to numpy arrays (or tensors), obfuscates them, and provides a procedure to revert them without loss.

But in all these cases, unwanted people may guess or reverse your procedure/obfuscation, as the sketch below illustrates.
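To make that concrete, here is a hypothetical obfuscation scheme (a fixed-seed pixel permutation, not an established method): it is lossless and easy to revert, but anyone who learns or guesses the seed can revert it just as easily.

import numpy as np

SEED = 12345  # anyone who obtains or guesses this can undo the obfuscation

def obfuscate(x: np.ndarray) -> np.ndarray:
    """Shuffle the flattened pixels with a fixed pseudo-random permutation."""
    perm = np.random.default_rng(SEED).permutation(x.size)
    return x.ravel()[perm].reshape(x.shape)

def deobfuscate(x_obf: np.ndarray) -> np.ndarray:
    """Invert the permutation and recover the original image exactly."""
    perm = np.random.default_rng(SEED).permutation(x_obf.size)
    out = np.empty(x_obf.size, dtype=x_obf.dtype)
    out[perm] = x_obf.ravel()
    return out.reshape(x_obf.shape)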

Ideally, you would train a machine learning model (like VGG, ResNet, Inception) on your images and then distribute that model, which has learned what you planned from your images.

Bottom line: in ML you need the images in order to learn something from them, not the images per se.
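A minimal sketch of that idea with Keras (the library used in the question); the tiny architecture and file name below are placeholders, in practice you would use something like VGG, ResNet or Inception:

from keras.models import Sequential
from keras.layers import Conv2D, GlobalAveragePooling2D, Dense

# A small placeholder classifier to be trained on the private images
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    GlobalAveragePooling2D(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# model.fit(...) runs on the private images, behind closed doors

# Only the trained weights are distributed; the images never leave
model.save('shared_model.h5')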

Privacy is really a problem, as we can see from this document dealing with how copyright is causing a decay in public datasets.

There are not many solutions to this problem, because privacy really matters. However, the idea of using GANs may be encouraging.
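To make the GAN idea concrete, here is a hedged sketch: it assumes you have already trained a generator (saved as the hypothetical file generator.h5) that maps a random noise vector z to an image, so that only the generator, or synthetic images sampled from it, is released instead of the originals.

import numpy as np
from keras.models import load_model

# Hypothetical pre-trained generator: maps a latent vector z to an image
generator = load_model('generator.h5')
latent_dim = generator.input_shape[-1]

# Sample synthetic images from random noise; these, not the originals,
# are what gets shared
z = np.random.normal(size=(16, latent_dim))
synthetic_images = generator.predict(z)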

If you don't use GANs, it is hard to tell what the right set of transforms would be to address the privacy concerns.

Just flipping the images, scaling them, removing the metadata, normalizing them, or altering a single pixel is not enough. You would need to make them indistinguishable from the originals.

prosti
  • No one can reconstruct the images from a trained model, but I should not open-source the training data if the images are private. Is this statement correct? – shantanuo May 02 '19 at 13:34
  • If the images are private, this falls under privacy regulations, which differ depending on the country you live/work in. – prosti May 02 '19 at 13:50
  • "No one can reconstruct the images from a trained model." This is correct @shantanuo. – prosti May 03 '19 at 14:09