This program uses sockets to transfer highly redundant 2D byte arrays (image-like). While the transfer rate is comparatively high (10 Mbps), the arrays are also highly redundant (e.g. each row may contain several consecutive similar values). I have tried zlib and lz4 and the results were promising, but I am still looking for a better compression method; please remember that it should be relatively fast, like lz4. Any suggestions?
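For context, a minimal sketch of how one flattened array might be compressed with the stock lz4 C API before being sent over the socket (buffer handling simplified; `CompressFrame` is an illustrative name):

```cpp
#include <lz4.h>
#include <vector>

// Compress one flattened 2D byte array with lz4 before sending it.
// LZ4_compressBound gives the worst-case output size for a given input.
std::vector<char> CompressFrame(const std::vector<char>& frame) {
    std::vector<char> out(LZ4_compressBound(static_cast<int>(frame.size())));
    const int written = LZ4_compress_default(
        frame.data(), out.data(),
        static_cast<int>(frame.size()), static_cast<int>(out.size()));
    out.resize(written > 0 ? written : 0); // 0 means compression failed
    return out;
}
```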
-
You have tagged "image-compression". Is the data you are compressing a stream of images? If so, I would suggest you use a lossless video/image codec. – Aron Aug 30 '13 at 16:34
-
The data are not real images, but they meet all the requirements to be treated like images, and I have taken a look at lossless video codecs. However, the data is generated in real time, and video codecs tend to be slow in the compression phase. – beebee Aug 30 '13 at 16:41
-
Try giving [this paper](https://www.usenix.org/legacy/event/fast11/tech/full_papers/Meyer.pdf) a read. – jxh Aug 30 '13 at 18:17
-
Thanks jxh, I reviewed the paper. I am not sure how exactly it is related to the problem. – beebee Aug 30 '13 at 19:36
2 Answers
You should look at the PNG algorithms for filtering image data before compressing. They range from simple to more sophisticated methods for predicting values in a 2D array based on previous values. To the extent that the predictions are good, the filtering can make for dramatic improvements in the subsequent compression step.
You should simply try these filters on your data, and then feed it to lz4.
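For illustration, a minimal sketch of the simplest of these filters, PNG filter type 1 ("Sub"), which replaces each byte with its difference from the byte to its left. Runs of similar values turn into runs of near-zero bytes, which lz4 or zlib then compress much better. Function names are illustrative:

```cpp
#include <cstddef>
#include <vector>

// PNG filter type 1 ("Sub"): store each byte as its modulo-256 difference
// from the byte to its left (0 for the first byte of the row).
std::vector<unsigned char> SubFilter(const std::vector<unsigned char>& row) {
    std::vector<unsigned char> out(row.size());
    unsigned char prev = 0;
    for (std::size_t i = 0; i < row.size(); ++i) {
        out[i] = static_cast<unsigned char>(row[i] - prev);
        prev = row[i];
    }
    return out;
}

// The receiver reverses the filter after decompressing.
std::vector<unsigned char> UnSubFilter(const std::vector<unsigned char>& filtered) {
    std::vector<unsigned char> out(filtered.size());
    unsigned char prev = 0;
    for (std::size_t i = 0; i < filtered.size(); ++i) {
        prev = static_cast<unsigned char>(prev + filtered[i]);
        out[i] = prev;
    }
    return out;
}
```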

-
Thanks Mark, I got the concept, and I think the idea of neighbour pixels could be extended beyond the immediately surrounding pixels... I am thinking of having a window of distance n pixels in all directions and then perhaps using filter type 3... – beebee Aug 30 '13 at 17:43
-
But I'm still not sure how to 1) find n in reasonable time and 2) handle the edges... – beebee Aug 30 '13 at 17:47
-
Start with the distance-1 filters and see how far that gets you before trying to use more previous data. In general you will get diminishing returns, and even worse compression, as you look farther. – Mark Adler Aug 31 '13 at 15:23
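For reference, the "filter type 3" discussed in the comments above is PNG's Average filter, which predicts each byte from the mean of its left and upper neighbours. A minimal sketch, where `prior` is the previous unfiltered row (empty for the first row):

```cpp
#include <cstddef>
#include <vector>

// PNG filter type 3 ("Average"): predict each byte as the mean of the byte
// to its left and the byte directly above, then store the difference.
std::vector<unsigned char> AverageFilter(const std::vector<unsigned char>& row,
                                         const std::vector<unsigned char>& prior) {
    std::vector<unsigned char> out(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) {
        const unsigned left = (i > 0) ? row[i - 1] : 0;
        const unsigned up   = (i < prior.size()) ? prior[i] : 0;
        out[i] = static_cast<unsigned char>(row[i] - (left + up) / 2);
    }
    return out;
}
```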
You could create your own: if the data in rows is similar, you can create a resource/index map, substantially reducing the size. Something like this:
Original file:
row 1: 1212,34,45,1212,45,34,56,45,56
row 2: 34,45,1212,78,54,87,...
You could create a list of unique values, then use an index as a replacement:
34,45,54,56,78,87,1212
row 1: 6,0,1,6,1,0,3,1,3
This can potentially save you 30% or more in data transfer, but it depends on how redundant the data is.
UPDATE
Here is a simple implementation:
#include &lt;set&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

typedef std::vector&lt;std::vector&lt;int&gt; &gt; DataTable; // assuming a 2D vector implementation

std::vector&lt;int&gt; uniqueValues;  // sorted list of unique values
std::string indexMap;           // rows of indexes into uniqueValues
std::string fileCompressed;     // payload to transfer

// Linear search: position of value in uniqueValues, or -1 if absent.
int Find(int value) {
    for (std::size_t i = 0; i &lt; uniqueValues.size(); ++i) {
        if (uniqueValues[i] == value) return static_cast&lt;int&gt;(i);
    }
    return -1;
}

void Compress(const DataTable&amp; my2dData) {
    // Create the list of unique values (std::set sorts and removes duplicates).
    std::set&lt;int&gt; seen;
    for (std::size_t i = 0; i &lt; my2dData.size(); ++i) {
        for (std::size_t j = 0; j &lt; my2dData[i].size(); ++j) {
            seen.insert(my2dData[i][j]);
        }
    }
    uniqueValues.assign(seen.begin(), seen.end());

    // Create the indexes: replace each value with its position in uniqueValues.
    for (std::size_t i = 0; i &lt; my2dData.size(); ++i) {
        std::ostringstream tmpRow;
        for (std::size_t j = 0; j &lt; my2dData[i].size(); ++j) {
            if (j &gt; 0) tmpRow &lt;&lt; ",";
            tmpRow &lt;&lt; Find(my2dData[i][j]);
        }
        indexMap += tmpRow.str() + "\r\n";
    }

    // Create the file to transfer: an "i:" line with the unique values,
    // then a "d:" section with the index rows.
    std::ostringstream values;
    for (std::size_t k = 0; k &lt; uniqueValues.size(); ++k) {
        values &lt;&lt; (k == 0 ? "i:" : ",") &lt;&lt; uniqueValues[k];
    }
    fileCompressed = values.str() + "\r\nd:" + indexMap;
}
Now on the receiving end you just do the opposite: if the line starts with "i" you read the unique-value list, and if it starts with "d" you read the data. A matching decoder sketch is below.
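A minimal decoder sketch, assuming the payload layout produced by the `Compress` function above (parsing kept deliberately simple):

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Rebuild the 2D table: the "i:" line carries the unique values, the lines
// from "d:" onward each carry one row of indexes into that list.
std::vector<std::vector<int> > Decompress(const std::string& payload) {
    std::istringstream in(payload);
    std::string line;
    std::vector<int> values;
    std::vector<std::vector<int> > table;
    while (std::getline(in, line)) {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);           // strip the \r of \r\n
        if (line.empty()) continue;
        const bool isValueLine = line.compare(0, 2, "i:") == 0;
        const std::string body =
            (isValueLine || line.compare(0, 2, "d:") == 0) ? line.substr(2) : line;
        std::istringstream fields(body);
        std::string field;
        std::vector<int> row;
        while (std::getline(fields, field, ','))
            row.push_back(std::atoi(field.c_str()));
        if (isValueLine) {
            values = row;                          // the unique-value list
        } else {
            for (std::size_t j = 0; j < row.size(); ++j)
                row[j] = values[row[j]];           // map index back to value
            table.push_back(row);
        }
    }
    return table;
}
```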

-
Thanks Fabrizio. I have something similar in mind; however, before implementing such a method, I am looking for a standard compression algorithm designed for redundant data (with the specific pattern mentioned). – beebee Aug 30 '13 at 16:44
-
I think @Fabrizio is right, but I guess zlib is also quite an acceptable solution to your problem. You need to find the balance point between high performance and high complexity. – Netherwire Aug 30 '13 at 16:52
-
1the library you mentioned do a pretty good job at it, but as any general purpose library are implemented to be "general" which may not be the best for all situations, the example I provided you is used by the .obj 3d data files format, and shouldn't take long to implement and is quiet powerful http://en.wikipedia.org/wiki/Wavefront_.obj_file – LemonCool Aug 30 '13 at 16:53