Memory efficient strings in .NET

I’m sure all of you know the basic fundamentals about strings in .NET: They’re immutable, the equality operator compares value instead of reference, etc. But did you know each character in a string takes up 2 bytes of memory? That’s right. Every standard string in .NET is a unicode string in memory, meaning the string “abc” would look like: 61 00 62 00 63 00 in memory. This might not sound like a big deal, but think about a string that is, say, 32768 characters long would take up 65536 bytes in memory (if we exclude metadata stuff). So why is implemented like this? Most likely because of compatibility and uniform of code. For example, a lot of WinAPIs have ugly suffixes on their names such as “MessageBoxW/MessageBoxA” or “LoadLibraryW/LoadLibraryA” depending on if they take a wide unicode string or a standard ASCII string. If every string is always assumed to be unicode, this isn’t an issue.

But what if we have a string only containing characters included in the ASCII table (0-127), for example: The quick brown fox jumps over the lazy dog”. It feels redundant to add an extra 00 byte after each character, doesn’t it?

So I thought I could create a class that stores strings in memory with ASCII encoding rather than unicode, meaning we could essentially half the memory usage for each string object. But after some googling I found out Jon Skeet had a great article about this very subject. I’m gonna base my struct on his example, but extend it a bit to make it a bit easier to use in your code.

View the struct here: http://pastebin.com/tLhzTacj

Keep in mind it’s not finished, and there’s just very basic safety checks which needs to be improved. But the struct should you give you an idea on how it could be done. If you wish to take the challenge on completing the struct to mimic the .NET string better, you could look at the .NET String implementation and work from there. Also, there’s some neat implicit operator overloads that allow you to do this:

 static void Main(string[] args)
 {
     AsciiString str1 = "hello world"; // about half memory of str2
     string str2 = "hello world";      // about double memory of str1

     Console.WriteLine(str1);
     Console.WriteLine(str2);

     Console.ReadLine();
 }

But enough talk. Let’s take a look at the memory usage of these two strings:

String of size 100:

ss (2014-05-27 at 12.17.53)

String of size 1000:

ss (2014-05-27 at 12.18.35)

You have to keep in mind though that since there’s no official support for Ascii strings in .NET the AsciiString object will be converted back to a full size unicode string when used in methods accepting a .NET string. Let’s just hope that in the future the .NET developers might consider adding an additional type and native support for this. 🙂

Advertisements

4 comments

    1. I could, but it wouldn’t really make any difference since the AsciiString struct only allows characters from 0-127. And since UTF8 only encodes on characters > 127 they would both be the same size. 🙂

      1. True. My point however is that you could make a string that can hold any Unicode character and still using a smaller amount of bytes than a normal Unicode string when using the UTF-8 encoding. Using the UTF-8 encoding would also mean you keep the uniformity. Now you would have two classes, an ASCII string and a Unicode string, which will require you to have these “ugly” suffixes or overloads you’ve been talking about ;).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s