The Delphi Geek: Delphi and AI[5]: Encoding a multi-line string

Recently I had to convert a multi-line string into a single-line string value for storage (ignore the 'why' of it; let's just blame it on legacy code). My first idea was to Base64-encode the string, but I was in the mood for some fun, so I asked my friendly AI helpers:

"I have a multiline string containing newline ASCII characters (TMemo.Text). I want to change it to a single-line string with only printable ASCII characters. I could do that with BASE64 encoding, for example. I would, however, like to keep the text as much readable as possible by "encoding" only non-printable characters. Is there a simple way to do that?"

Let's see what they came up with!

Full transcripts, as usual, are on GitHub. This time I also created a program containing all the implementations.

[Gemini]

Gemini wrote a simple function that leaves characters from #32 to #126 intact and replaces the rest with a # followed by a two-digit hexadecimal representation of the character. It also wrote a decoding function and some test code. (Function names in this post match the accompanying code, not the names the AI generated.)

function Gemini_EncodeNonPrintableChars(const InputStr: string): string;
var
  i: Integer;
  ch: Char;
  EncodedStr: string;
begin
  EncodedStr := '';
  for i := 1 to Length(InputStr) do
  begin
    ch := InputStr[i];
    if (ch >= #32) and (ch <= #126) then // Printable ASCII range
      EncodedStr := EncodedStr + ch
    else
      EncodedStr := EncodedStr + Format('#%2.2x', [Ord(ch)]); // Encode as #XX
  end;
  Result := EncodedStr;
end;

The encoding function has a big problem. Let's see if Gemini can spot it:

Can you check this code for any problems it may exhibit?

It did not spot the big problem but instead found something that is not really an issue:

The `Format('#%2.2x', [Ord(ch)])` part of the encoding function assumes that all characters can be represented by a two-digit hexadecimal code. This is true for ASCII characters (0-255), but Delphi strings are Unicode-based, and characters beyond #$FF (255) would require more than two hex digits. This would lead to incorrect decoding for those characters.

This is not really a problem, as %2.2x only sets a minimum width and will correctly write numbers above 255 with more than two hexadecimal digits. So I had to be more explicit in my question:

What about the case when character '#' is present in the string to be encoded?

Now the AI got it:

You're absolutely right. If the original string contains a literal `#` character, the decoding function will misinterpret it as the start of an encoded sequence.
Here's the corrected code that addresses this issue by escaping literal `#` characters during encoding:

function Gemini_EncodeNonPrintableChars_Fixed(const InputStr: string): string;
var
  i: Integer;
  ch: Char;
  EncodedStr: string;
begin
  EncodedStr := '';
  for i := 1 to Length(InputStr) do
  begin
    ch := InputStr[i];
    if ch = '#' then
      EncodedStr := EncodedStr + '##' // Escape literal # characters
    else if (ch >= #32) and (ch <= #126) then
      EncodedStr := EncodedStr + ch
    else
      EncodedStr := EncodedStr + Format('#%.2x', [Ord(ch)]);
  end;
  Result := EncodedStr;
end;

This indeed works, but the new decoding function that it wrote can't decode newlines anymore. Eh, whatever. Let's see what the other contenders have to say.
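As an aside, it is worth seeing why the `##` rule matters: without it, the literal text `A#09B` and the string `A`, tab, `B` would both encode to `A#09B`, so no decoder could tell them apart. Below is a minimal round-trip sketch. It is not Gemini's decoder; `EncodeHash` and `DecodeHash` are hypothetical names, and the decoder assumes every escape is exactly two hex digits, which only holds when all escaped characters are below #$100.

```delphi
program HashEscapeRoundTrip;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Encoder in the style of the fixed version above: '#' becomes '##',
// anything outside #32..#126 becomes '#' plus (at least) two hex digits.
function EncodeHash(const S: string): string;
var
  Ch: Char;
begin
  Result := '';
  for Ch in S do
    if Ch = '#' then
      Result := Result + '##'
    else if (Ch >= #32) and (Ch <= #126) then
      Result := Result + Ch
    else
      Result := Result + '#' + IntToHex(Ord(Ch), 2);
end;

// Hypothetical decoder: assumes each escape is exactly two hex digits.
function DecodeHash(const S: string): string;
var
  I: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    if S[I] <> '#' then
    begin
      Result := Result + S[I];      // ordinary printable character
      Inc(I);
    end
    else if (I < Length(S)) and (S[I + 1] = '#') then
    begin
      Result := Result + '#';       // '##' is a literal '#'
      Inc(I, 2);
    end
    else
    begin
      if I + 2 > Length(S) then
        raise EConvertError.Create('Truncated escape sequence');
      // StrToInt with a '$' prefix parses hexadecimal text
      Result := Result + Char(StrToInt('$' + Copy(S, I + 1, 2)));
      Inc(I, 3);
    end;
  end;
end;

var
  Original, Encoded: string;
begin
  Original := 'Line #1' + #13#10 + 'Line #2';
  Encoded := EncodeHash(Original);
  Writeln(Encoded);                         // Line ##1#0D#0ALine ##2
  Writeln(DecodeHash(Encoded) = Original);  // TRUE
end.
```

Because `#` is never a hex digit, the decoder can always tell `##` from `#XX`. Supporting characters above #$FF would need a fixed wider escape or a terminator.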

[Ollama]

Codellama was clear and concise and incorrect. Well, we know we can't have all three ;)

Sure! You can use the `StrUtils` unit's `QuotedStr` function to escape any special characters in your string, such as newline and control characters. Here's an example:
```
var
s: String;
begin
s := 'Hello\nWorld\r'; // Newline and return characters
Writeln(s);
s := QuotedStr(s); // Escapes special characters
Writeln(s);
end.
```
This will output the following:
```
Hello
World

'Hello\nWorld\r'
```
As you can see, the `QuotedStr` function has escaped the newline and return characters in your string, making it a single-line string with only printable ASCII characters.

Not really. There are no newline characters in 'Hello\nWorld\r' (this is Delphi!). Also, QuotedStr does not escape newlines. Plus this will not handle other non-printable characters.
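Both points are easy to check. The following is a small sketch, assuming a console application (the variable names are made up for illustration); note that `QuotedStr` lives in `System.SysUtils`, not `StrUtils`.

```delphi
program QuotedStrDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Literal, Multiline, Quoted: string;
begin
  // Delphi has no backslash escapes: '\' and 'n' are two ordinary characters.
  Literal := 'Hello\nWorld';
  Writeln(Length(Literal));            // 12
  Writeln(Pos(#10, Literal));          // 0, there is no line feed in the string

  // A real line break has to be built from character codes.
  Multiline := 'It''s' + #13#10 + 'two lines';
  Quoted := QuotedStr(Multiline);

  // QuotedStr wraps the text in apostrophes and doubles embedded ones...
  Writeln(Quoted[1] = '''');           // TRUE
  // ...but the CR/LF pair passes through unchanged.
  Writeln(Pos(#13#10, Quoted) > 0);    // TRUE
end.
```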

After some back and forth it suggested using AnsiStrings.StringToHex and AnsiStrings.HexToString. As there are no such functions in the AnsiStrings unit, this gives us a nice place to move on.

[Claude]

Similarly to Gemini, Claude wrote a function that did not escape the escape character. In this case, however, my generic question worked:

Can you check this code for any problems it may exhibit?

Claude:

... skipped lots of code and text ...
Improvements in the enhanced version:
1. Uses `TStringBuilder` for better performance
2. Handles Unicode characters properly
3. Includes proper error handling
4. Escapes backslash character to prevent ambiguity
5. Includes bounds checking
6. Handles empty strings
7. Includes a custom exception type
8. Includes test cases
9. Uses 4-digit hex values for full Unicode support
10. Properly handles string indexes using Low/High

Many of the changes were completely unnecessary but it did manage to escape the backslash character.

function Claude_EncodeToSingleLine_Fixed(const AText: string): string;
const
  MinPrintable = 32;
  MaxPrintable = 126;
var
  SB: TStringBuilder;
begin
  if AText = '' then
    Exit('');

  SB := TStringBuilder.Create(Length(AText) * 2);
  try
    for var I := Low(AText) to High(AText) do
    begin
      var Ch := AText[I];
      if (Ord(Ch) <= MaxPrintable) and (Ord(Ch) >= MinPrintable)
        and (Ch <> '\') then // Escape backslash too
        SB.Append(Ch)
      else
        SB.Append('\x').Append(IntToHex(Ord(Ch), 4));
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;

This version, although too convoluted (and probably slower, as it uses TStringBuilder), seems to be working just fine.
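Whether TStringBuilder really is slower than plain concatenation for this kind of loop is something that can be measured rather than guessed. Here is a rough timing sketch, assuming a console application and a made-up workload: the `Iterations` constant and both `BuildWith...` functions are hypothetical and not part of the accompanying code. Results will vary with compiler version, memory manager, and build settings, so no numbers are claimed here.

```delphi
program AppendTiming;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

const
  Iterations = 5000000; // hypothetical workload size, adjust to taste

// Appends one character at a time with the + operator.
function BuildWithConcat: string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Iterations do
    Result := Result + 'x';
end;

// Appends one character at a time through TStringBuilder.
function BuildWithBuilder: string;
var
  I: Integer;
  SB: TStringBuilder;
begin
  SB := TStringBuilder.Create;
  try
    for I := 1 to Iterations do
      SB.Append('x');
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;

var
  SW: TStopwatch;
  S: string;
begin
  SW := TStopwatch.StartNew;
  S := BuildWithConcat;
  Writeln('Concatenation:  ', SW.ElapsedMilliseconds, ' ms, length ', Length(S));

  SW := TStopwatch.StartNew;
  S := BuildWithBuilder;
  Writeln('TStringBuilder: ', SW.ElapsedMilliseconds, ' ms, length ', Length(S));
end.
```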

[CoPilot]

CoPilot initially wrote a solution very similar to Gemini's. It did not provide a decoding function, though. To get a fixed version I also had to ask it the explicit question about an input string containing the escape character. This is the fixed version:

function CoPilot_EncodeNonPrintableChars_Fixed(const Input: string): string;
var
  i: Integer;
  ch: Char;
  sb: TStringBuilder;
begin
  sb := TStringBuilder.Create;
  try
    for i := 1 to Length(Input) do
    begin
      ch := Input[i];
      if ch = '#' then
        sb.Append('##') // Escape the '#' character
      else if (Ord(ch) < 32) or (Ord(ch) > 126) then
        sb.AppendFormat('#%d;', [Ord(ch)]) // Replace non-printable characters
      else
        sb.Append(ch); // Keep printable characters as is
    end;
    Result := sb.ToString;
  finally
    sb.Free;
  end;
end;
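
Since no decoder was provided for this format, here is a minimal sketch of what one could look like. `DecodeDecimalEscapes` is a hypothetical name, not CoPilot's code. The terminating `;` makes the escape self-delimiting, so character codes of any width decode without ambiguity.

```delphi
program DecimalEscapeDecode;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical decoder for the '##' / '#<decimal>;' format shown above.
function DecodeDecimalEscapes(const S: string): string;
var
  I, SemiPos: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    if S[I] <> '#' then
    begin
      Result := Result + S[I];      // ordinary printable character
      Inc(I);
    end
    else if (I < Length(S)) and (S[I + 1] = '#') then
    begin
      Result := Result + '#';       // '##' is a literal '#'
      Inc(I, 2);
    end
    else
    begin
      // Find the ';' that closes this escape, searching from the first digit
      SemiPos := Pos(';', S, I + 1);
      if SemiPos = 0 then
        raise EConvertError.Create('Unterminated escape at position ' + IntToStr(I));
      // StrToInt raises EConvertError if the text between '#' and ';' is not a number
      Result := Result + Char(StrToInt(Copy(S, I + 1, SemiPos - I - 1)));
      I := SemiPos + 1;
    end;
  end;
end;

begin
  // 'a#b', CR, LF, 'c' encodes to 'a##b#13;#10;c' in this scheme
  Writeln(DecodeDecimalEscapes('a##b#13;#10;c') = 'a#b'#13#10'c'); // TRUE
end.
```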

[OpenAI]

O1-mini initially offered four different solutions (it does like to talk a lot!), all of them missing the problem of escaping the escape character. One of them even used TRegEx‽ Why‽ While producing all that code it failed to write a decoder. (Truth be told, I did not ask it for one.)

When I asked it to fix the code, it wandered away talking about Unicode surrogate characters and whatnot. See the appropriate log for its explanations.

Only after I asked about an input string containing a backslash did it write a fixed version. See it here in all its monstrosity:

function OpenAI_MakePrintableUnicode_Fixed(const S: string): string;
var
  i: Integer;
  Code: Integer;
  SB: TStringBuilder;
  HighSurrogate, LowSurrogate: Char;
  CombinedCode: UInt32;
begin
  SB := TStringBuilder.Create(Length(S));
  try
    i := 1;
    while i <= Length(S) do
    begin
      Code := Ord(S[i]);
      if (Code >= $D800) and (Code <= $DBFF) then // High Surrogate
      begin
        if (i + 1) <= Length(S) then
        begin
          LowSurrogate := S[i + 1];
          if (Ord(LowSurrogate) >= $DC00) and (Ord(LowSurrogate) <= $DFFF) then
          begin
            // Combine surrogate pair into a single code point
            CombinedCode := ((Code - $D800) shl 10) + (Ord(LowSurrogate) - $DC00) + $10000;
            SB.AppendFormat('\U%.8X', [CombinedCode]);
            Inc(i, 2);
            Continue;
          end;
        end;
        // Invalid surrogate pair
        SB.AppendFormat('\u%.4X', [Code]);
      end
      else if (Code >= $DC00) and (Code <= $DFFF) then // Low Surrogate without preceding High Surrogate
      begin
        // Invalid surrogate
        SB.AppendFormat('\u%.4X', [Code]);
      end
      else
      begin
        case Code of
          Ord('\'): SB.Append('\\');            // Escape backslash
          9: SB.Append('\t');                   // Tab
          10: SB.Append('\n');                  // Line Feed
          13: SB.Append('\r');                  // Carriage Return
          32..126:
            SB.Append(S[i]);                    // Printable ASCII
        else
          if Code <= $FFFF then
            SB.AppendFormat('\u%.4X', [Code])   // Unicode \uXXXX
          else
            SB.AppendFormat('\U%.8X', [Code]);  // Unicode \UXXXXXXXX
        end;
      end;
      Inc(i);
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;
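
The surrogate-pair branch above boils down to a single formula. As a worked example, the code point U+1F600 is stored in a UTF-16 string as the pair $D83D $DE00, and ($3D shl 10) + $200 + $10000 = $F400 + $200 + $10000 = $1F600. A tiny sketch of just that step (`CombineSurrogates` is a hypothetical helper, not part of the generated code):

```delphi
program SurrogateDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Combines a UTF-16 high/low surrogate pair into one Unicode code point.
// Assumes the caller has already checked that both values are in range.
function CombineSurrogates(Hi, Lo: Char): Integer;
begin
  Result := ((Ord(Hi) - $D800) shl 10) + (Ord(Lo) - $DC00) + $10000;
end;

begin
  Writeln(IntToHex(CombineSurrogates(#$D83D, #$DE00), 8)); // 0001F600
end.
```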

The accompanying decoding function did not compile at first, but after fixing one stupid problem (marked with *** in the code) and removing a few non-working (and non-essential) lines of code (also marked with ***) it worked correctly. Truly, an impressive "business"-class implementation!

Conclusion

I must say I like Gemini's solution the most - especially as it did not overcomplicate it when fixing the problem. Too bad it did not spot the problem in the first place.

And what did I use in the end, you may ask? TNetEncoding.Base64.Encode and TNetEncoding.Base64.Decode :)
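
One caveat if the result really must stay on a single line: the shared `TNetEncoding.Base64` instance is generally reported to wrap its output with a line break after every 76 characters, which matters for longer memo texts. The sketch below assumes that `TBase64Encoding.Create(0)` disables wrapping in your RTL version; treat both that and the wrapping default as assumptions to verify, which is why the last line prints the observed behavior instead of asserting it.

```delphi
program Base64SingleLine;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.NetEncoding;

var
  Original, Encoded: string;
  Encoder: TBase64Encoding;
begin
  // Long enough that the Base64 text exceeds 76 characters
  Original := StringOfChar('x', 100) + sLineBreak + 'second line';

  // Assumption: CharsPerLine = 0 asks the encoder not to insert line breaks
  Encoder := TBase64Encoding.Create(0);
  try
    Encoded := Encoder.Encode(Original);
    Writeln('Single line: ', Pos(#10, Encoded) = 0);
    Writeln('Round trip:  ', Encoder.Decode(Encoded) = Original);
  finally
    Encoder.Free;
  end;

  // For comparison: does the shared default instance wrap long output?
  Writeln('Default wraps: ', Pos(#10, TNetEncoding.Base64.Encode(Original)) > 0);
end.
```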

Sunday, January 19, 2025
