The Delphi Geek: 10/2006

Friday, October 27, 2006

Faster than the speed of binary search

[Updated 2006-10-28, new text and code at the end.]

Everybody (at least everybody competent enough to be worth mentioning) know how to search in sorted array with the binary search algorithm. But everybody (or almost everybody) thinks that this is the fastest approach to searching. Is everybody (or almost everybody) correct regarding that topic? It depends.

On general, the answer is true - binary search is fastest. But that only holds if we know nothing about the data stored in the array - in other words, if we pretend that it is random (but sorted, of course).

Want an example? Take an array [1..99] of byte. Fill it with numbers in the range 1..100 but leave (random) one out. Use each number only once. Sort it. You'll get an array with first N numbers in the same position as the array index (A[i] = i for all i in 1..N) and the rest of it will be off by one (A[i] = i+1 for all i in N+1..99). Do you think that binary search is fastest possible search in that case? Of course no - to find a random number (say m) you only have to check A[m] and A[m-1]. Two lookups, that's all.

Are there more general guidelines to speeding up a binary search. Yes, it's quite simple - we only have to be able to guess the position of the value we're searching for better than with the "it's somewhere in the middle" approach. In a way, that's exactly what we did in the example above.

This approach is very simple to use if data in the array is distributed uniformly. In that case, we can take the value at low and high end of the interval (not the indices as in the binary search, but values at those indices) and use a linear interpolation to guess a possible position of the value.

Something similar happend to me last week. I was solving some DVB-related problem and built an array of video fragments (video PES, if you know anything about the DVB internals) in a video file (actually I only stored file offset for each fragment). Later, I needed to find (many times) a fragment closest to some file offset (every time a different offset). Of course, I could use a simple binary search, but it was such a nice opportunity to try out a better approach and run some tests ...

For test data I used a small, 40 MB video file. There were 51737 entries in the array and the code executed 3128 searches. With standard bisection, the code did 15.45 rounds through the bisection loop per search (on average). With interpolated search, this number was decreased to 4.13 rounds per search. That's better by a factor 3.7! And in the real life, the file will be some 50x larger - around 2 GB - and difference will be even bigger.

The number of rounds/number of searches distribution for both approaches:

We can see that standard binary search needed 16 rounds for majority of the searches and almost no searches were completed in less than 15 rounds.

On the other hand, interpolated binary search shows almost gaussian distribution from 2 to 6 rounds.

But as the code is more complicated (see below), the total speedup was not that impressive, only somewhere around 1.6 times. Still, it was a personal satisfaction to see the theoretically better code run faster in reality, too :)

function TGpTSPIDIndex.Find(transportStreamOffset: int64; var index: integer): boolean;
var
  L, H, I, C: integer;
begin
  {More work is needed to check for boundary conditions}
  if Count = 0 then begin
    index := 0;
    Result := false;
  end
  else if transportStreamOffset <= Chains[0].TransportStreamOffset then begin
    index := 0;
    Result := (transportStreamOffset = Chains[0].TransportStreamOffset);
  end
  else if transportStreamOffset = Chains[Count-1].TransportStreamOffset then begin
    index := Count-1;
    Result := true;
  end
  else if transportStreamOffset >= Chains[Count-1].TransportStreamOffset then begin
    index := Count;
    Result := false;
  end
  else begin
    Result := false;
    L := 0;
    H := Count - 1;
    while Chains[L].TransportStreamOffset <= Chains[H].TransportStreamOffset do begin
      //instead of
      //I := (L + H) shr 1;
      //we're using
      if L = H then
        I := L
      else begin
        I := Round((transportStreamOffset - Chains[L].TransportStreamOffset) /
             (Chains[H].TransportStreamOffset - Chains[L].TransportStreamOffset) * (H-L)) + L;
        if I < L then
          I := L
        else if I > H then
          I := H;
      end;
      //offsets are int64 numbers
      C := Int64Compare(Chains[I].TransportStreamOffset, transportStreamOffset);
      if C < 0 then
        L := I + 1
      else begin
        H := I - 1;
        if C = 0 then begin
          Result := true;
          L := I;
        end;
      end;
    end;
    index := L;
  end;
end; { TGpTSPIDIndex.Find }

[Updated 2006-10-28]

While thinking on the comment posted by the anonymous I looked at the code from a new view and suddenly I realized that I'm an idiot. The 'while' condition in the bisection loop in my version should not differ from the condition in standard binary search! When I put this right, I could suddenly remove all extra tests at the beginning as they were no longer needed. New code (tested to work the same as the old code and as the standard binary search - at least on my data) is posted below.

A word of warning: all this works only if there are no repetitive elements in the sorted array! If this condition is not satisfied, you'll soon get a 'division by 0' exception in the calculation of the weighted midpoint.

function TGpTSPIDIndex.Find(transportStreamOffset: int64; var index: integer): boolean;
var
  L, H, I, C: integer;
begin
  Result := false;
  L := 0;
  H := Count - 1;
  while L <= H do begin
    //standard binary search
    //I := (L + H) shr 1;
    //weighted version
    if L = H then
      I := L
    else begin
      I := Round((transportStreamOffset - Chains[L].TransportStreamOffset) /
           (Chains[H].TransportStreamOffset - Chains[L].TransportStreamOffset) * (H-L)) + L;
      if I < L then
        I := L
      else if I > H then
        I := H;
    end;
    C := Int64Compare(Chains[I].TransportStreamOffset, transportStreamOffset);
    if C < 0 then
      L := I + 1
    else begin
      H := I - 1;
      if C = 0 then begin
        Result := true;
        L := I;
      end;
    end;
  end;
  index := L;
end; { TGpTSPIDIndex.Find }

Tuesday, October 17, 2006

Delphi to the rescue (and one little Win32 API call)

A new day, a new hardware to support ...

I plugged the board (DVB receiver/transmitter in case you are interested) into the computer, powered it up and got an "interesting" hardware wizard window

This one is localized into Slovenian, but you probably know it by heart - it is a standard Windows wizard for adding new hardware devices. Except that this time it came with a twist - for some reason the window was only 200 pixels high. Did I have to mention that this wizard is not resizable? Probably not. (I hate non-resizable applets!)

So what did I do? Reboot, of course. You can probably guess that it didn't help (or I wouldn't be writing this). HW wizard appeared after the next boot and it was completely the same as in the picture above.

An ordinary user would probably start kicking the computer, blaming it all on the Microsoft and calling his/her best friend that supposedly knows "everything" about computers. Me? I kicked the computer, blamed it all on the Microsoft and fired up Delphi 2006.

First, I started one of my old applications that lists all windows on the system (using EnumWindows API) and located window handle of the HW wizard. I could also do this with Microsofts Spy++ but at that moment I was positively mad at guys in Redmond and didn't want to use their tools.

Next I wrote a small app that called SetWindowPos API with this handle and bigger window boundaries. And, behold, HW wizard was magically enlarged! Sometimes it's really good to know your Win32 API ...

OK, so I lied a little. I didn't wrote a small app that called ... etc. as I already had something lying handly. An old app of mine that searches for a window with specified caption and resizes it. As it is a burden to type "Čarovnik za najdeno novo strojno opremo" into a command line (and I would be running into all sorts of problems with codepages in DOS window that way) I updated it to also accept window handle instead of a windows caption. (Yes, I could type everything into Start, Run without any codepage problems, but this solution was way more satisfying :) )

If you run into something like that in the future, feel free to use my little app.

program SetWindowPlacement;

{$APPTYPE CONSOLE}

uses
  Windows,
  SysUtils;

  procedure SetWindowPos(windowTitle: string; windowLeft, windowTop,
    windowWidth, windowHeight: integer);
  var
    flags  : cardinal;
    hWindow: THandle;
  begin
    if Copy(windowTitle, 1, 1) = '#' then begin
      Delete(windowTitle, 1, 1);
      hWindow := StrToInt(windowTitle);
    end
    else
      hWindow := FindWindow(nil, PChar(windowTitle));
    if hWindow = 0 then
      Writeln('Window >', windowTitle, '< not found!')
    else begin
      flags := SWP_NOACTIVATE OR SWP_NOZORDER;
      if (windowWidth = 0) and (windowHeight = 0) then
        flags := flags OR SWP_NOSIZE;
      Windows.SetWindowPos(hWindow, 0, windowLeft, windowTop, windowWidth, windowHeight,
        flags);
    end;
  end; { SetWindowPos }

  procedure Usage;
  begin
    Writeln('SetWindowPlacement');
    Writeln('(c) 2006 Primoz Gabrijelcic');
    Writeln;
    Writeln('Usage: SetWindowPlacement (<window title>|#<window handle>) <left> <top> [<width> <height>]');
  end; { Usage }

begin
  if (ParamCount < 3) or (ParamCount > 5) then
    Usage
  else try
    if ParamCount = 3 then
      SetWindowPos(ParamStr(1), StrToInt(ParamStr(2)), StrToInt(ParamStr(3)),
        0, 0)
    else
      SetWindowPos(ParamStr(1), StrToInt(ParamStr(2)), StrToInt(ParamStr(3)),
        StrToInt(ParamStr(4)), StrToInt(ParamStr(5)));
  except Usage; end;
end.

Monday, October 16, 2006

My very personal Algorithms and Data Structures

Thanks to the wonders of the printing on demand, I have received my very personal (printed just for me! merry!) copy of Julian M Bucknall's Algorithms and Data Structures. (previous post on that topic)

I won't comment on the content much - this is a great book and every serious Delphi programmer should have it on the shelf. Plus the Julian's writing is fluid and well organized and always to the point.

Printing could be improved a little, though. Font on inner pages is slightly too black for my taste and shadings under the code fragments are waaay too dense. Covers are weird - it seems that the font on the front and back cover was aliased to a different color than used on the cover for the background. It looks unsharp from the distance and halo'd if you look closer.

Still, the book is totally readable and printing is much better than I expected. Binding is fine, too. I have no worries that the book will fall apart in my hands after only a short use.

I hope Julian won't mind if I attach here few photos for the yet undecided. (Yes, the bottom right corner is slightly damaged, but that's courtesy of mail service. The book could be wrapped and packed better, though, so that such things wouldn't happen.)

Wednesday, October 04, 2006

Algorithms and Data Structures

Julian Bucknall has made his excellent book The Tomes of Delphi: Algorithms and Data Structures available on Lulu.

If you don't have it yet, don't wait. Get your copy now! Every Delphi developer should have it!

[I don't know if it shows ;) but I'm a great fan of Julian's writing.]