I have a record type and a dynamic array made up of that record type. I pass it to a mergesort routine and try to set one of it's field properties which is boolean to true but seems does not take effect.
I looked into sorting array of record by other means(see this quicksort for customrecord array: http://en.wikibooks.org/wiki/Algorithm_Implementation/Sorting/Quicksort#Delphi) or here: Best way to sort an array (I could not get none of these suggestions to work from here mostly because of creating a comaring function). This question: Sorting of Arrays Alphabetically? was helpful and works but this sorting is excruciatingly slow.
CODE:
type
TCustomRecord = Record
fLine : AnsiString; //full line
fsubLine : AnsiString; // part of full line
isDuplicate : boolean; //is that subline duplicate in another line
isRefrence : boolean; // is this line from a refrence file or the one being deduped
fIndex : Cardinal; // original order line was loaded
end;
TCustomRecordArray = array of TCustomRecord;
function Merge2(var Vals: array of TCustomRecord ):Integer;
var
AVals: array of TCustomRecord;
//returns index of the last valid element
function Merge(I0, I1, J0, J1: Integer):Integer;
var
i, j, k, LC:Integer;
begin
LC := I1 - I0;
for i := 0 to LC do
AVals[i]:=Vals[i + I0];
//copy lower half or Vals into temporary array AVals
k := I0;
i := 0;
j := J0;
while ((i <= LC) and (j <= J1)) do
if (AVals[i].fsubLine < Vals[j].fsubLine) then
begin
Vals[k] := AVals[i];
if Vals[k].isRefrence = False then
Vals[k].isDuplicate := False;
inc(i);
inc(k);
end
else if (AVals[i].fsubLine > Vals[j].fsubLine) then
begin
Vals[k]:=Vals[j];
if Vals[k].isRefrence = False then
Vals[k].isDuplicate := False;
inc(k);
inc(j);
end else
begin //duplicate
Vals[k] := AVals[i];
if Vals[k].isRefrence = False then
Vals[k].isDuplicate := True;
inc(i);
inc(j);
inc(k);
end;
//copy the rest
while i <= LC do begin
Vals[k] := AVals[i];
inc(i);
inc(k);
end;
if k <> j then
while j <= J1 do begin
Vals[k]:=Vals[j];
inc(k);
inc(j);
end;
Result := k - 1;
end;
//returns index of the last valid element
function PerformMergeSort(ALo, AHi:Integer): Integer; //returns
var
AMid, I1, J1:Integer;
begin
//It would be wise to use Insertion Sort when (AHi - ALo) is small (about 32-100)
if (ALo < AHi) then
begin
AMid:=(ALo + AHi) shr 1;
I1 := PerformMergeSort(ALo, AMid);
J1 := PerformMergeSort(AMid + 1, AHi);
Result := Merge(ALo, I1, AMid + 1, J1);
end else
Result := ALo;
end;
begin
//SetLength(AVals, Length(Vals) + 1 div 2);
SetLength(AVals, Length(Vals) div 2 + 1);
Result := 1 + PerformMergeSort(0, High(Vals));
end;
QUESTION: How can I sort efficiently, preferably using mergesort, this array of record and set some of it's properties according to that sort? Thank you.
UPDATE: I added a pointer type and did a modified mergesort on array of pointers. This turned out to be very fast way of sorting the array of record. I added also a compare routine which added the flags I needed. The only part I am not able to do is to add a flag for duplicates based on if they belonged to file A or Reference file.
CODE:
type
PCustomRecord = ^TCustomRecord;
TCustomRecord = Record
fLine : AnsiString; //full line
fsubLine : AnsiString; // part of full line
isDuplicate : boolean; //is that subline duplicate in another line
isRefrence : boolean; // line from a refrence file or the one being deduped
isUnique : boolean; //flag to set if not refrence and not dupe
fIndex : Cardinal; // original order line was loaded
end;
TCustomRecordArray = array of TCustomRecord;
PCustomRecordList = ^TCustomRecordArray;
//set up actual array
//set up pointer array to point at actual array
//sort by mergesort first
// then call compare function - this can be a procedure obviously
function Compare(var PRecords: array of PCustomRecord; iLength: int64): Integer;
var
i : Integer;
begin
for i := 0 to High(PRecords) do
begin
Result := AnsiCompareStr(PRecords[i]^.fsubline, PRecords[i+1]^.fsubline);
if Result=0 then
begin
if (PRecords[i].isrefrence = False) then
PRecords[i].isduplicate := True
else if (PRecords[i+1].isrefrence = False) then
PRecords[i+1].isduplicate := True;
end;
end;
end;
procedure MergeSort(var Vals:array of PCustomRecord;ACount:Integer);
var AVals:array of PCustomRecord;
procedure Merge(ALo,AMid,AHi:Integer);
var i,j,k,m:Integer;
begin
i:=0;
for j:=ALo to AMid do
begin
AVals[i]:=Vals[j];
inc(i);
//copy lower half or Vals into temporary array AVals
end;
i:=0;j:=AMid + 1;k:=ALo;//j could be undefined after the for loop!
while ((k < j) and (j <= AHi)) do
if (AVals[i].fsubline) <= (Vals[j].fsubline) then
begin
Vals[k]:=AVals[i];
inc(i);inc(k);
end
else if (AVals[i].fsubline) > (Vals[j].fsubline) then
begin
Vals[k]:=Vals[j];
inc(k);inc(j);
end;
{locate next greatest value in Vals or AVals and copy it to the
right position.}
for m:=k to j - 1 do
begin
Vals[m]:=AVals[i];
inc(i);
end;
//copy back any remaining, unsorted, elements
end;
procedure PerformMergeSort(ALo,AHi:Integer);
var AMid:Integer;
begin
if (ALo < AHi) then
begin
AMid:=(ALo + AHi) shr 1;
PerformMergeSort(ALo,AMid);
PerformMergeSort(AMid + 1,AHi);
Merge(ALo,AMid,AHi);
end;
end;
begin
SetLength(AVals, ACount div 2 + 1);
PerformMergeSort(0,ACount - 1);
end;
This is all very fast on small files taking less than one second. Deduping the items in the array that carry a duplicate flag and NOT a reference flag is quite challenging though. As mergesort is a stable sort I tried resorting by boolean flag but did not get what I expected. I used a TStringlist
to see if my previous flags are being set up correctly and it works perfectly. The time went up from 1 second to 6 seconds. I know there has to be an easy way to mark the isUnique
flag without TStringlist
.
Here is what I tried:
function DeDupe(var PRecords: array of PCustomRecord; iLength: int64): Integer;
var
i : Integer;
begin
for i := 0 to High(PRecords) do
begin
if (PRecords[i]^.isrefrence = False) and (PRecords[i+1]^.isrefrence = false)then
begin
Result := AnsiCompareStr(PRecords[i]^.isduplicate, PRecords[i+1]^.isduplicate);
if Result = 0 then PRecords[i]^.isUnique := True;
end
else
begin
Continue;
end;
end;
end;
This doesn't get all the values and I did not see a difference with it as I still see lots of duplicates. I think the logic is wrong.
Thanks to all the great souls helping out. To all please allow me the benefit that I may already know how to derive a TObject
and how to use a TStringList
so the focus is on arrays.
QUESTION: Help me do a function or procedure as above to mark the repeated items with the: isRefrence = false and isDuplicate = True and unique
EDIT 3:
I was able to achieve the elimination of duplicates through the use of boolean flags. this helped in keeping the array stable without changing the size of the array. I believe it is much much faster than using TList
descendant or TStringList
. The use of a basic container such as an array has limitations in ease of coding but is very efficient so I would not pass on it. The pointers made the sorting a breeze. I'm not sure how after I set the pointers to my array when i used the pointer array exactly like I'm using my regular array. And it made no difference whether I derefrenced it or not. I set up the pointer array as such:
iLength := Length(Custom_array); //get length of actual array
SetLength(pcustomRecords, iLength); // make pointer array equal + 1
for M := Low(Custom_array) to High(Custom_array) do //set up pointers
begin
pcustomRecords[M] := @Custom_array[M];
end;
I tried seperating the sorting from the actual data being sorted as much as I can, but I'm sure there can be improvement.
///////////////////////////////////////////////////////////////////
function Comparesubstring(Item1, Item2: PCustomRecord): Integer;
begin
Result := AnsiCompareStr(item1^.fsubline, item2^.fsubline);
end;
///////////////////////////////////////////////////////////////////
function CompareLine(Item1, Item2: PCustomRecord): Integer;
begin
Result := AnsiCompareStr(item1^.fLine, item2^.fLine);
end;
///////////////////////////////////////////////////////////////////
function Compare(var PRecords: array of PCustomRecord; iLength: int64): Integer;
var
M, i : Integer;
begin
M := Length(PRecords);
for i := 1 to M-1 do
begin
Result := Comparesubstring(PRecords[i-1], PRecords[i]);
if Result=0 then
begin
if (PRecords[i-1].isRefrence = False) then
PRecords[i-1].isduplicate := True
else if (PRecords[i].isRefrence = False) then
PRecords[i].isduplicate := True;
end;
end;
end;
///////////////////////////////////////////////////////////////////