Sunday, January 19, 2025

Delphi and AI[5]: Encoding a multi-line string

Recently I had to convert a multi-line string into a single-line string value for storage (ignore the 'why' of it; let's just blame it on legacy code). My first idea was to Base64-encode the string, but I was in the mood for some fun, so I asked my friendly AI helpers:

"I have a multiline string containing newline ASCII characters (TMemo.Text). I want to change it to a single-line string with only printable ASCII characters. I could do that with BASE64 encoding, for example. I would, however, like to keep the text as much readable as possible by "encoding" only non-printable characters. Is there a simple way to do that?"

Let's see what they came up with!

Full transcripts, as usual, are on GitHub. This time I also created a program containing all implementations.

[Gemini]

Gemini wrote a simple function that leaves characters from #32 to #126 intact and replaces the rest with a # followed by a two-digit hexadecimal representation of the character. It also wrote a decoding function and some test code. (Function names in this post are the same as in the accompanying code, not as they were generated by the AI.)

function Gemini_EncodeNonPrintableChars(const InputStr: string): string;
var
  i: Integer;
  ch: Char;
  EncodedStr: string;
begin
  EncodedStr := '';
  for i := 1 to Length(InputStr) do
  begin
    ch := InputStr[i];
    if (ch >= #32) and (ch <= #126) then // Printable ASCII range
      EncodedStr := EncodedStr + ch
    else
      EncodedStr := EncodedStr + Format('#%2.2x', [Ord(ch)]); // Encode as #XX
  end;
  Result := EncodedStr;
end;

The encoding function has a big problem. Let's see if Gemini can spot it:

Can you check this code for any problems it may exhibit?

It did not spot the big problem but instead found something that is not really an issue:

The `Format('#%2.2x', [Ord(ch)])` part of the encoding function assumes that all characters can be represented by a two-digit hexadecimal code. This is true for ASCII characters (0-255), but Delphi strings are Unicode-based, and characters beyond #$FF (255) would require more than two hex digits.  This would lead to incorrect decoding for those characters.

This is not really a problem, as %2.2x will correctly write numbers above 255 with more than two hexadecimal digits. So I had to be more explicit in my question:
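To see how the precision specifier behaves, here is a minimal sketch (the program name and sample values are made up for illustration). The precision in `%2.2x` is a minimum number of digits, not a maximum, so larger values simply come out longer:

```delphi
program FormatHexSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

begin
  // Precision 2 pads short values with a leading zero...
  Writeln(Format('#%2.2x', [Ord(#10)]));    // #0A
  // ...but does not truncate values that need more than two digits.
  Writeln(Format('#%2.2x', [Ord(#$0161)])); // #161
end.
```

Any decoder paired with this format therefore has to cope with escape sequences of varying length.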

What about the case when character '#' is present in the string to be encoded?

Now the AI got it:

You're absolutely right.  If the original string contains a literal `#` character, the decoding function will misinterpret it as the start of an encoded sequence.  

Here's the corrected code that addresses this issue by escaping literal `#` characters during encoding:

function Gemini_EncodeNonPrintableChars_Fixed(const InputStr: string): string;
var
  i: Integer;
  ch: Char;
  EncodedStr: string;
begin
  EncodedStr := '';
  for i := 1 to Length(InputStr) do
  begin
    ch := InputStr[i];
    if ch = '#' then
      EncodedStr := EncodedStr + '##' // Escape literal # characters
    else if (ch >= #32) and (ch <= #126) then
      EncodedStr := EncodedStr + ch
    else
      EncodedStr := EncodedStr + Format('#%.2x', [Ord(ch)]);
  end;
  Result := EncodedStr;
end;

This indeed works, but the new decoding function that it wrote can't decode newlines anymore. Eh, whatever. Let's see what the other contenders have to say.
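For reference, a decoder for this format does not have to be complicated. What follows is a minimal sketch, not Gemini's code: the name `DecodeHashEscaped` is hypothetical, and it assumes every escaped character fits in exactly two hex digits (code <= $FF), which holds for newlines and other control characters but not for arbitrary Unicode text.

```delphi
program HashDecoderSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical decoder for the '##' / '#XX' format.
// Assumption: every escape is exactly two hex digits (code <= $FF).
function DecodeHashEscaped(const Encoded: string): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(Encoded) do
  begin
    if Encoded[i] <> '#' then
    begin
      // Ordinary printable character, copy as is
      Result := Result + Encoded[i];
      Inc(i);
    end
    else if (i < Length(Encoded)) and (Encoded[i + 1] = '#') then
    begin
      // '##' stands for a literal '#'
      Result := Result + '#';
      Inc(i, 2);
    end
    else if i + 2 <= Length(Encoded) then
    begin
      // '#XX': StrToInt accepts a '$' prefix for hexadecimal input
      Result := Result + Chr(StrToInt('$' + Copy(Encoded, i + 1, 2)));
      Inc(i, 3);
    end
    else
      raise EConvertError.Create('Truncated escape sequence');
  end;
end;

begin
  // Round-trip check: literal '#' plus an encoded CR/LF pair
  Writeln(DecodeHashEscaped('a##b#0D#0Ac') = 'a#b'#13#10'c'); // TRUE
end.
```

Because '#' is not a hexadecimal digit, the decoder can always tell an escaped '#' from the start of a hex sequence.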

[Ollama]

Codellama was clear and concise and incorrect. Well, we know we can't have all three ;)

Sure! You can use the `StrUtils` unit's `QuotedStr` function to escape any special characters in your string, such as newline and control characters. Here's an example:

```
var
  s: String;
begin
  s := 'Hello\nWorld\r'; // Newline and return characters
  Writeln(s);
  s := QuotedStr(s); // Escapes special characters
  Writeln(s);
end.
```
This will output the following:
```
Hello
World

'Hello\nWorld\r'
```

As you can see, the `QuotedStr` function has escaped the newline and return characters in your string, making it a single-line string with only printable ASCII characters.

Not really. There are no newline characters in 'Hello\nWorld\r' (this is Delphi!). Also, QuotedStr does not escape newlines. Plus this will not handle other non-printable characters.
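A quick sketch to back this up (the program name and sample strings are made up). Note that `QuotedStr` lives in `SysUtils`, not `StrUtils`, and all it does is wrap the string in single quotes and double any quotes inside it:

```delphi
program QuotedStrSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  s: string;
begin
  // Delphi has no backslash escapes: this is 14 ordinary characters
  s := 'Hello\nWorld\r';
  Writeln(Length(s));                  // 14

  // A real line break is written with character codes (or sLineBreak)
  s := 'Hello'#13#10'World';
  Writeln(Length(s));                  // 12

  // QuotedStr leaves the line feed in place...
  Writeln(Pos(#10, QuotedStr(s)) > 0); // TRUE

  // ...and only doubles embedded single quotes
  Writeln(QuotedStr('It''s'));         // 'It''s'
end.
```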

After some back and forth, it suggested using AnsiStrings.StringToHex and AnsiStrings.HexToString. As there are no such functions in the AnsiStrings unit, this gives us a nice place to move on.

[Claude]

Similarly to Gemini, Claude wrote a function that did not escape the escape character. In this case, however, my generic question worked:

Can you check this code for any problems it may exhibit?

Claude:

... skipped lots of code and text ...

Improvements in the enhanced version:
1. Uses `TStringBuilder` for better performance
2. Handles Unicode characters properly
3. Includes proper error handling
4. Escapes backslash character to prevent ambiguity
5. Includes bounds checking
6. Handles empty strings
7. Includes a custom exception type
8. Includes test cases
9. Uses 4-digit hex values for full Unicode support
10. Properly handles string indexes using Low/High

Many of the changes were completely unnecessary, but it did manage to escape the backslash character.

function Claude_EncodeToSingleLine_Fixed(const AText: string): string;
const
  MinPrintable = 32;
  MaxPrintable = 126;
var
  SB: TStringBuilder;
begin
  if AText = '' then
    Exit('');
 
  SB := TStringBuilder.Create(Length(AText) * 2);
  try
    for var I := Low(AText) to High(AText) do
    begin
      var Ch := AText[I];
      if (Ord(Ch) <= MaxPrintable) and (Ord(Ch) >= MinPrintable)
         and (Ch <> '\') then // Escape backslash too
        SB.Append(Ch)
      else
        SB.Append('\x').Append(IntToHex(Ord(Ch), 4));
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;

This version, although too convoluted (and probably slower, as it uses TStringBuilder), seems to be working just fine.

[CoPilot]

CoPilot initially wrote a solution very similar to Gemini's. It did not provide a decoding function, though. To get a fixed version, I also had to ask it the explicit question about an input string containing the escape character. This is the fixed version:

function CoPilot_EncodeNonPrintableChars_Fixed(const Input: string): string;
var
  i: Integer;
  ch: Char;
  sb: TStringBuilder;
begin
  sb := TStringBuilder.Create;
  try
    for i := 1 to Length(Input) do
    begin
      ch := Input[i];
      if ch = '#' then
        sb.Append('##') // Escape the '#' character
      else if (Ord(ch) < 32) or (Ord(ch) > 126) then
        sb.AppendFormat('#%d;', [Ord(ch)])  // Replace non-printable characters
      else
        sb.Append(ch);  // Keep printable characters as is
    end;
    Result := sb.ToString;
  finally
    sb.Free;
  end;
end;
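Since CoPilot did not supply a decoder, here is a minimal sketch of what one could look like for its `##` / `#N;` format. The name `DecodeHashDecimal` is hypothetical, and the code assumes well-formed input apart from the two error cases it checks.

```delphi
program HashDecimalDecoderSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical decoder for the '##' / '#<decimal>;' format.
function DecodeHashDecimal(const Encoded: string): string;
var
  i, SemiPos: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(Encoded) do
  begin
    if Encoded[i] <> '#' then
    begin
      // Ordinary printable character, copy as is
      Result := Result + Encoded[i];
      Inc(i);
    end
    else if (i < Length(Encoded)) and (Encoded[i + 1] = '#') then
    begin
      // '##' stands for a literal '#'
      Result := Result + '#';
      Inc(i, 2);
    end
    else
    begin
      // '#N;': read decimal digits up to the terminating semicolon
      SemiPos := Pos(';', Encoded, i + 1);
      if SemiPos = 0 then
        raise EConvertError.Create('Unterminated escape sequence');
      // StrToInt raises EConvertError if the digits are missing or invalid
      Result := Result + Chr(StrToInt(Copy(Encoded, i + 1, SemiPos - i - 1)));
      i := SemiPos + 1;
    end;
  end;
end;

begin
  Writeln(DecodeHashDecimal('a##b#13;#10;c') = 'a#b'#13#10'c'); // TRUE
end.
```

The terminating semicolon makes variable-length codes unambiguous, so this format also survives characters above #255.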

[OpenAI]

O1-mini initially offered four different solutions (it does like to talk a lot!), all of them missing the problem of escaping the escape character. One of them even used TRegEx‽ Why‽ While producing all that code, it failed to write a decoder. (Truth be told, I did not ask it for one.)

When I asked it to fix the code, it wandered away talking about Unicode surrogate characters and whatnots. See the appropriate log for its explanations.

Only after I asked about an input string containing a backslash did it write a fixed version. See it here in all its monstrosity:

function OpenAI_MakePrintableUnicode_Fixed(const S: string): string;
var
  i: Integer;
  Code: Integer;
  SB: TStringBuilder;
  HighSurrogate, LowSurrogate: Char;
  CombinedCode: UInt32;
begin
  SB := TStringBuilder.Create(Length(S));
  try
    i := 1;
    while i <= Length(S) do
    begin
      Code := Ord(S[i]);
      if (Code >= $D800) and (Code <= $DBFF) then // High Surrogate
      begin
        if (i + 1) <= Length(S) then
        begin
          LowSurrogate := S[i + 1];
          if (Ord(LowSurrogate) >= $DC00) and (Ord(LowSurrogate) <= $DFFF) then
          begin
            // Combine surrogate pair into a single code point
            CombinedCode := ((Code - $D800) shl 10) + (Ord(LowSurrogate) - $DC00) + $10000;
            SB.AppendFormat('\U%.8X', [CombinedCode]);
            Inc(i, 2);
            Continue;
          end;
        end;
        // Invalid surrogate pair
        SB.AppendFormat('\u%.4X', [Code]);
      end
      else if (Code >= $DC00) and (Code <= $DFFF) then // Low Surrogate without preceding High Surrogate
      begin
        // Invalid surrogate
        SB.AppendFormat('\u%.4X', [Code]);
      end
      else
      begin
        case Code of
          Ord('\'): SB.Append('\\');            // Escape backslash
          9: SB.Append('\t');                    // Tab
          10: SB.Append('\n');                   // Line Feed
          13: SB.Append('\r');                   // Carriage Return
          32..126:
            SB.Append(S[i]);                     // Printable ASCII
        else
          if Code <= $FFFF then
            SB.AppendFormat('\u%.4X', [Code])    // Unicode \uXXXX
          else
            SB.AppendFormat('\U%.8X', [Code]);   // Unicode \UXXXXXXXX
        end;
      end;
      Inc(i);
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;

The accompanying decoding function did not compile at first, but after fixing one stupid problem (marked with *** in the code) and removing a few non-working (and non-essential) lines of code (also marked with ***), it worked correctly. Truly, an impressive "business"-class implementation!

Conclusion

I must say I like Gemini's solution the most - especially as it did not overcomplicate it when fixing the problem. Too bad it did not spot the problem in the first place.

And what did I use in the end, you may ask? TNetEncoding.Base64.Encode and TNetEncoding.Base64.Decode :)
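A minimal round-trip sketch of the Base64 route (the program name and sample text are made up). One caveat, to the best of my knowledge: the shared TNetEncoding.Base64 instance wraps long output at 76 characters per line, so this sketch assumes you want a strictly single-line result and creates its own TBase64Encoding with line wrapping turned off.

```delphi
program Base64Sketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.NetEncoding;

var
  Enc: TBase64Encoding;
  Original, OneLine: string;
begin
  Original := 'First line'#13#10'Second line';

  // CharsPerLine = 0 is assumed here to disable line wrapping
  Enc := TBase64Encoding.Create(0);
  try
    OneLine := Enc.Encode(Original);
    Writeln(OneLine);
    // The encoded value contains no line breaks and decodes back intact
    Writeln(Pos(#10, OneLine) = 0);
    Writeln(Enc.Decode(OneLine) = Original);
  finally
    Enc.Free;
  end;
end.
```

For short values, TNetEncoding.Base64.Encode and TNetEncoding.Base64.Decode can be called directly instead.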
