String comparisons must be one of the most common things we do in C#; and string comparisons can be really expensive! So its worthwhile knowing the what, when and why to improving string comparison performance.
In this article I will explore one way – string interning.
What is string interning?
String interning refers to having a single copy of each unique string in an string intern pool, which is via a hash table in the.NET common language runtime (CLR). Where the key is a hash of the string and the value is a reference to the actual String object.
So if I have the same string occurring 100 times, interning will ensure only one occurrence of that string is actually allocated any memory. Also, when I wish to compare strings, if they are interned, then I just need to do a reference comparison.
String interning mechanics
In this example, I explicitly intern string literals just for demonstration purposes.
var s1 = string.Intern("stringy"); var s2 = string.Intern("stringy"); var s3 = string.Intern("stringy");
- This new “stringy” is hashed and the hash is looked up in our pool (of course its not there yet)
- A copy of the “stringy” object would be made
- A new entry would be made to the interning hash table, with the key being the hash of “stringy” and the value being a reference to the copy of “stringy”
- Assuming application no longer references original “stringy”, the GC can reclaim that memory
Line 2: This new “stringy” is hashed and the hash is looked up in our pool (which we just put there). The reference to the “stringy” copy is returned
Line 3: Same as line 2
Interning depends on string immutability
Take a classic example of string immutability:
var s1 = "stringy"; s1 += " new string";
We know that when line 2 is executed, the “stringy” object is de-referenced and left for garbage collection; and s1 then points to a new String object “stringy new string”.
String interning works because the CLR knows, categorically, that the String object referenced cannot change. Here I’ve added a fourth line to the earlier example:
var s1 = string.Intern("stringy"); var s2 = string.Intern("stringy"); var s3 = string.Intern("stringy"); s1 += " new string";
Line 4: s1 doesn’t change because it is immutable; it now points to a new String object ” stringy new string”.
s2 and s3 will still safely reference the copy that was made at line 1
Using Interning in the .NET CLR
You’ve already seen a bit in the examples above. .NET offers two static string methods:
It hashes string
str and checks the intern pool hash table and either:
- returns a reference to the (already) interned string, if interned; or
- a references to str is added to the intern pool and this reference is returned
It hashes string
str and checks the intern pool hash table. Rather than a standard
bool, it returns either:
- null, if not interned
- a reference to the (already) interned string, if interned
String literals (not generated in code) are automatically interned, although CLR versions have varied in this behaviour, so if you expect strings interned, it is best to always do it explicitly in your code.
My simple test Setup
I’ve run some timed tests on my aging i5 Windows 10 PC at home to provide some numbers to help explore the potential gains from string interning. I used the following test setup:
- Strings randomly generated
- All string comparisons are ordinal
- Strings are all the same length of 512 characters as I want the CLR will compare every character to force the worst-case (the CLR checks string length first for ordinal comparisons)
- The string added last (so to the end of the
List<T>) is also stored as a ‘known’ search term. This is because I am only exploring the worst-case approach
- For the
List<T>interned, every string added to the list, and also the search term, were wrapped in the
I timed how long populating each collection took and then how long searching for the known search term took also, to the nearest millisecond.
The collections/approaches used for my tests:
List<T>with no interning used, searched via a foreach loop and string.Equals on each element
List<T>with no interning used, searched via the Contains method
List<T>with interning used, searched via a foreach loop and
HashSet<T>, searched via the Contains method
The main objective is to observe string search performance gains from using string interning with
HashSet is just included as a baseline for known fast searches.
My simple test Results
In Figure 1 below, I have plotted the size of collections in number of strings added, against the time taken to add that number of randomly generated strings. Clearly there is no significant difference in this operation, when using a
HashSet<T> compared to a
List<T> without interning. Perhaps if had run to larger sets the gap would widen further based on trend?
There is slightly more overhead when populating the
List<T> with string interning, which is initially of no consequence but is growing faster than the other options.