Recently I had to convert a multi-line string into a single-line string value for storage (ignore the 'why' of it; let's just blame it on legacy code). My first idea was to Base64-encode the string, but I was in the mood for some fun, so I asked my friendly AI helpers:
"I have a multiline string containing newline ASCII characters (TMemo.Text). I want to change it to a single-line string with only printable ASCII characters. I could do that with BASE64 encoding, for example. I would, however, like to keep the text as much readable as possible by "encoding" only non-printable characters. Is there a simple way to do that?"
Let's see what they came up with!
Full transcripts, as usual, are on GitHub. This time I also created a program containing all implementations.
[Gemini]
Gemini wrote a simple function that leaves characters from #32 to #126 intact and replaces the rest with a # followed by a two-digit hexadecimal representation of the character. It also wrote a decoding function and some test code. (Function names in this post are the same as in the accompanying code, and not as they were generated by the AI.)
function Gemini_EncodeNonPrintableChars(const InputStr: string): string;
var
  i: Integer;
  ch: Char;
  EncodedStr: string;
begin
  EncodedStr := '';
  for i := 1 to Length(InputStr) do
  begin
    ch := InputStr[i];
    if (ch >= #32) and (ch <= #126) then // Printable ASCII range
      EncodedStr := EncodedStr + ch
    else
      EncodedStr := EncodedStr + Format('#%2.2x', [Ord(ch)]); // Encode as #XX
  end;
  Result := EncodedStr;
end;
The encoding function has a big problem. Let's see if Gemini can spot it:
Can you check this code for any problems it may exhibit?
It did not spot the big problem but instead found something that is not really an issue:
The `Format('#%2.2x', [Ord(ch)])` part of the encoding function assumes that all characters can be represented by a two-digit hexadecimal code. This is true for ASCII characters (0-255), but Delphi strings are Unicode-based, and characters beyond #$FF (255) would require more than two hex digits. This would lead to incorrect decoding for those characters.
This is not really a problem, as %2.2x specifies a minimum of two digits and will correctly write numbers above 255 with more than two hexadecimal digits. So I had to be more explicit in my question:
What about the case when character '#' is present in the string to be encoded?
Now the AI got it:
You're absolutely right. If the original string contains a literal `#` character, the decoding function will misinterpret it as the start of an encoded sequence.
Here's the corrected code that addresses this issue by escaping literal `#` characters during encoding:
function Gemini_EncodeNonPrintableChars_Fixed(const InputStr: string): string;
var
  i: Integer;
  ch: Char;
  EncodedStr: string;
begin
  EncodedStr := '';
  for i := 1 to Length(InputStr) do
  begin
    ch := InputStr[i];
    if ch = '#' then
      EncodedStr := EncodedStr + '##' // Escape literal # characters
    else if (ch >= #32) and (ch <= #126) then
      EncodedStr := EncodedStr + ch
    else
      EncodedStr := EncodedStr + Format('#%.2x', [Ord(ch)]);
  end;
  Result := EncodedStr;
end;
This indeed works but the new decoding function that it wrote can't decode newlines anymore. Eh, whatever. Let's see what the other contenders have to say.
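For reference, a decoder for the '##' / '#XX' scheme does not have to be complicated. The following is only a minimal sketch and not the code Gemini produced: the name DecodeHashHex is made up for this illustration, and it assumes that every escaped character was written with exactly two hexadecimal digits (character codes up to 255), which does not hold for arbitrary Unicode text.

```pascal
program HashHexDecodeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not from the transcripts): reverses the '##' / '#XX' encoding.
// Assumption: every escape has exactly two hex digits, i.e. the code is <= 255.
function DecodeHashHex(const Encoded: string): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(Encoded) do
  begin
    if Encoded[i] <> '#' then
    begin
      Result := Result + Encoded[i]; // plain printable character
      Inc(i);
    end
    else if (i < Length(Encoded)) and (Encoded[i + 1] = '#') then
    begin
      Result := Result + '#'; // '##' is an escaped literal '#'
      Inc(i, 2);
    end
    else if i + 2 <= Length(Encoded) then
    begin
      // '#XX': StrToInt accepts a '$' prefix for hexadecimal input and
      // raises EConvertError if the two characters are not hex digits
      Result := Result + Char(StrToInt('$' + Copy(Encoded, i + 1, 2)));
      Inc(i, 3);
    end
    else
      raise EConvertError.Create('Truncated escape sequence');
  end;
end;

begin
  // 'a##b#0D#0Ac' should decode to 'a#b', CR, LF, 'c'
  Writeln(DecodeHashHex('a##b#0D#0Ac') = 'a#b'#13#10'c'); // TRUE
end.
```

Because '##' is checked before the hex branch, a literal '#' followed by an encoded newline ('###0A') decodes without ambiguity.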
[Ollama]
Codellama was clear and concise and incorrect. Well, we know we can't have all three ;)
Sure! You can use the `StrUtils` unit's `QuotedStr` function to escape any special characters in your string, such as newline and control characters. Here's an example:
```
var
  s: String;
begin
  s := 'Hello\nWorld\r'; // Newline and return characters
  Writeln(s);
  s := QuotedStr(s); // Escapes special characters
  Writeln(s);
end.
```
This will output the following:
```
Hello
World
'Hello\nWorld\r'
```
As you can see, the `QuotedStr` function has escaped the newline and return characters in your string, making it a single-line string with only printable ASCII characters.
Not really. There are no newline characters in 'Hello\nWorld\r' (this is Delphi!). Also, QuotedStr does not escape newlines. Plus this will not handle other non-printable characters.
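To make both objections concrete, here is a small console sketch (the program and variable names are mine, not part of the transcript). It shows that a backslash is an ordinary character in a Delphi string literal and that QuotedStr, which lives in System.SysUtils, only adds and doubles single quotes.

```pascal
program QuotedStrDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Literal, WithBreaks: string;
begin
  // In Delphi a backslash has no special meaning inside a string literal
  Literal := 'Hello\nWorld\r';
  Writeln(Length(Literal)); // 14: '\' and 'n' are two separate characters

  // Control characters are written with the #nn syntax instead
  WithBreaks := 'Hello'#10'World'#13;
  Writeln(Length(WithBreaks)); // 12

  // QuotedStr only wraps the text in single quotes and doubles embedded quotes
  Writeln(QuotedStr('It''s')); // 'It''s'
  Writeln(Pos(#10, QuotedStr(WithBreaks)) > 0); // TRUE: the line feed is still there
end.
```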
After some back and forth it suggested using AnsiStrings.StringToHex and AnsiStrings.HexToString. As there are no such functions in the AnsiStrings unit, this gives us a nice place to move on.
[Claude]
Similarly to Gemini, Claude wrote a function that did not escape the escape character. In this case, however, my generic question worked:
Can you check this code for any problems it may exhibit?
Claude:
... skipped lots of code and text ...
Improvements in the enhanced version:
1. Uses `TStringBuilder` for better performance
2. Handles Unicode characters properly
3. Includes proper error handling
4. Escapes backslash character to prevent ambiguity
5. Includes bounds checking
6. Handles empty strings
7. Includes a custom exception type
8. Includes test cases
9. Uses 4-digit hex values for full Unicode support
10. Properly handles string indexes using Low/High
Many of the changes were completely unnecessary but it did manage to escape the backslash character.
function Claude_EncodeToSingleLine_Fixed(const AText: string): string;
const
  MinPrintable = 32;
  MaxPrintable = 126;
var
  SB: TStringBuilder;
begin
  if AText = '' then
    Exit('');
  SB := TStringBuilder.Create(Length(AText) * 2);
  try
    for var I := Low(AText) to High(AText) do
    begin
      var Ch := AText[I];
      if (Ord(Ch) <= MaxPrintable) and (Ord(Ch) >= MinPrintable)
        and (Ch <> '\') then // Escape backslash too
        SB.Append(Ch)
      else
        SB.Append('\x').Append(IntToHex(Ord(Ch), 4));
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;
This version, although too convoluted (and probably slower, as it uses TStringBuilder), seems to be working just fine.
[CoPilot]
CoPilot initially wrote a solution very similar to Gemini's. It did not provide a decoding function, though. To get a fixed version I again had to ask the explicit question about the input string containing the escape character. This is the fixed version:
function CoPilot_EncodeNonPrintableChars_Fixed(const Input: string): string;
var
  i: Integer;
  ch: Char;
  sb: TStringBuilder;
begin
  sb := TStringBuilder.Create;
  try
    for i := 1 to Length(Input) do
    begin
      ch := Input[i];
      if ch = '#' then
        sb.Append('##') // Escape the '#' character
      else if (Ord(ch) < 32) or (Ord(ch) > 126) then
        sb.AppendFormat('#%d;', [Ord(ch)]) // Replace non-printable characters
      else
        sb.Append(ch); // Keep printable characters as is
    end;
    Result := sb.ToString;
  finally
    sb.Free;
  end;
end;
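Since CoPilot did not supply a decoder, here is a rough sketch of what one could look like for the '##' / '#NNN;' format. This is my own assumption of a matching decoder, not something from the transcript; DecodeHashDecimal is a made-up name, and error handling is limited to whatever StrToInt reports.

```pascal
program HashDecimalDecodeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not from the transcripts): reverses the '##' / '#NNN;' encoding.
function DecodeHashDecimal(const Encoded: string): string;
var
  i, SemiPos: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(Encoded) do
  begin
    if Encoded[i] <> '#' then
    begin
      Result := Result + Encoded[i]; // plain printable character
      Inc(i);
    end
    else if (i < Length(Encoded)) and (Encoded[i + 1] = '#') then
    begin
      Result := Result + '#'; // '##' is an escaped literal '#'
      Inc(i, 2);
    end
    else
    begin
      // '#NNN;': decimal character code terminated by a semicolon
      SemiPos := Pos(';', Encoded, i + 1);
      if SemiPos = 0 then
        raise EConvertError.Create('Unterminated escape sequence');
      // StrToInt raises EConvertError if the text between '#' and ';' is not a number
      Result := Result + Char(StrToInt(Copy(Encoded, i + 1, SemiPos - i - 1)));
      i := SemiPos + 1;
    end;
  end;
end;

begin
  // The trailing ';' is ordinary text and must survive decoding
  Writeln(DecodeHashDecimal('a##b#13;#10;c;') = 'a#b'#13#10'c;'); // TRUE
end.
```

The semicolon terminator means the decoder never has to guess how many digits belong to a code, so this format also copes with characters above 255.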
[OpenAI]
When I asked it to fix the code, it wandered off talking about Unicode surrogate characters and whatnot. See the appropriate log for its explanations.
function OpenAI_MakePrintableUnicode_Fixed(const S: string): string;
var
  i: Integer;
  Code: Integer;
  SB: TStringBuilder;
  HighSurrogate, LowSurrogate: Char;
  CombinedCode: UInt32;
begin
  SB := TStringBuilder.Create(Length(S));
  try
    i := 1;
    while i <= Length(S) do
    begin
      Code := Ord(S[i]);
      if (Code >= $D800) and (Code <= $DBFF) then // High Surrogate
      begin
        if (i + 1) <= Length(S) then
        begin
          LowSurrogate := S[i + 1];
          if (Ord(LowSurrogate) >= $DC00) and (Ord(LowSurrogate) <= $DFFF) then
          begin
            // Combine surrogate pair into a single code point
            CombinedCode := ((Code - $D800) shl 10) + (Ord(LowSurrogate) - $DC00) + $10000;
            SB.AppendFormat('\U%.8X', [CombinedCode]);
            Inc(i, 2);
            Continue;
          end;
        end;
        // Invalid surrogate pair
        SB.AppendFormat('\u%.4X', [Code]);
      end
      else if (Code >= $DC00) and (Code <= $DFFF) then // Low Surrogate without preceding High Surrogate
      begin
        // Invalid surrogate
        SB.AppendFormat('\u%.4X', [Code]);
      end
      else
      begin
        case Code of
          Ord('\'): SB.Append('\\'); // Escape backslash
          9: SB.Append('\t'); // Tab
          10: SB.Append('\n'); // Line Feed
          13: SB.Append('\r'); // Carriage Return
          // Range split around 92 (backslash, handled above) to avoid a duplicate case label
          32..91, 93..126:
            SB.Append(S[i]); // Printable ASCII
        else
          if Code <= $FFFF then
            SB.AppendFormat('\u%.4X', [Code]) // Unicode \uXXXX
          else
            SB.AppendFormat('\U%.8X', [Code]); // Unicode \UXXXXXXXX
        end;
      end;
      Inc(i);
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;
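The surrogate arithmetic in the middle of that function is easier to follow with a concrete value. The sketch below applies the same formula to the surrogate pair $D83D $DE00, which is the UTF-16 form of U+1F600. The example value is my choice and does not come from the transcript.

```pascal
program SurrogateDemo;

{$APPTYPE CONSOLE}

const
  // UTF-16 encoding of U+1F600, chosen here purely as an example
  HighSurrogate = $D83D;
  LowSurrogate = $DE00;
var
  CodePoint: Cardinal;
begin
  // Same arithmetic as in the function above
  CodePoint := ((HighSurrogate - $D800) shl 10) + (LowSurrogate - $DC00) + $10000;
  // ($3D shl 10) + $200 + $10000 = $F400 + $200 + $10000 = $1F600
  Writeln(CodePoint = $1F600); // TRUE
end.
```

A Delphi Char is a 16-bit UTF-16 code unit, so a single S[i] never exceeds $FFFF. Code points above that only appear as surrogate pairs, which is why the function has to combine them explicitly.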
Conclusion
I must say I like Gemini's solution the most, especially as it did not overcomplicate things when fixing the problem. Too bad it did not spot the problem in the first place.
And what did I use in the end, you may ask? TNetEncoding.Base64.Encode and TNetEncoding.Base64.Decode :)