March 12th, 2020 at 5:02 AM
The Background:
So I recently logged into a forum I used to frequent after hiatus of about 3-4 months. I figured I needed to purge most of my digital life to focus on other stuff at the time, but I've finally lifted the hosts file entry for that site so that I could check in again. There, I found a thread about parsing a string for certain values. Specifically, that string was a datestamp, in the following format:
02/01/2000
Not the exact value, but the one I'll be using here.
That would be 02 January 2000, day/month/year, the true Euro way.
Now, this is a site where people share random code samples or whatever just to say they've 'contributed' something even if it's garbage, so 99% of the code there is garbage, and this thread was no exception.
The Code:
Pretty useless, right? Anyone with half a brain would just use a regex for the data they need, or for the web-developers, just use the JavaScript built-in function var.split('/') to get the three values they need.
So, let's count the bad code practices:
Stroustrup is okay too, especially given this is C++ and he essentially added useless bloat to the glorious C language made the language himself, so the else would be on a different line. That makes for easier-to-read code, anyway.
Side note as to answering the all-eternal question "Why?"
I like to rewrite pieces of code like this because, even though it's useless, it helps me get some practice programming in and also thinking more about how to solve a problem, even if it's trivial.
Not to mention the nerd who wrote it might actually learn something from my solution. We need better devs nowadays, I don't want more JavaScript frameworks that add 1.9MB to each time I try to visit a webpage, I want software that works and is fast and doesn't completely kill my CPU every time I need to do a bit of processing.
What is a String?
If asked this question, you're most likely to respond with 'a set of character-type data' without specifying what kind of character encoding it is (although generally ASCII is the default). Makes sense, and that would be a satisfactory answer, just a 'string' of bytes that are interpreted in a certain way.
But the C language never actually had a 'string' data type, unless you used <string.h> which we might need later on. But still, the C language stores strings as arrays of characters, with a nullbyte at the end to denote when the string ends.
So if you had a variable myString with the value "Hello World", in memory, it would appear as an array like this:
Makes sense, right? I mean, each letter needs to take up at least a byte of data anyway, so just putting them together would be the obvious way to do it.
But if you actually use that data in a function like printf() (to print a value) then you also need a format specifier string, so the whole function call would be this:
%s means that the function should take in a value from the next argument in the function call, and interpret that data as a string-type. The \n is the escape code for a newline character, if you didn't already know.
But again, what exactly is a string and how does the program actually interpret it?
If you try to declare a string in C you might have an issue with the following code:
Why though? That seems simple enough?
And if you used the array representation above, you'd be fine.
But double-quoted literals like this example are not of type char[], instead they are of the type const char*. So the program has trouble assigning the type char[] to const char* because const char* should just be a pointer, i.e. storing a memory address (between 4 bytes and 8 bytes depending on 32bit vs. 64bit) of an immutable type char, (one byte) even though we're trying to store a whole twelve characters (eleven if you forgot the nullbyte).
Do note that in recent years, this has changed and most compilers won't really care about that anymore, so the above code sample may very well work (and in my case in Visual Studio 2019 CE, it does work.)
So, back to actual interpretation of the abstract string-type.
Well, remember that nullbyte-termination at the very end?
Basically, what printf() will do is take in the myString argument not as an array of data, but rather as a pointer to where that data starts. It will then increment that pointer, and read the value in memory byte-by-byte, just like a loop, and print the value of each byte.
And once it encounters a nullbyte during its incrementation, it'll stop reading values and finish up the function call with a nice return value.
So that's why string-literal data (double-quoted) is actually of a pointer type, and not a character/array type.
So why does it print byte-by-byte (aside from keeping it simple)?
The short answer is because that's generally how the operating system API is built. The write() function in the Linux syscall API can only write a single byte, using the registers RSI (where to get the byte), RDI (where to write the byte to), RAX (the syscall code itself, in the case of write() it's 1 on x86_64 architectures), and RDX (optional, only if you want to loop to write multiple bytes.) My experience is mainly for x86_64 assembly (nasm) on Linux so the registers for syscalls will vary between architectures.
Now, the std::string dataype for C++ is a full object with member methods and data fields. As such, it takes a ton of memory space compared to a simple C-string which is just a character array. So the first optimization I'd make to this program is to get rid of anything that has a string type, and consequently remove the <string> header because we don't need it.
I'll leave the 'using namespace std;' part intact because quite frankly in this scenario it really doesn't matter and just makes shorthanding std::endl and std::cout a bit easier.
The Rewrite:
So, we'll declare the original string as a char[] type.
While we're at it, we're going to also declare two pieces of data we will need later as well, offsets and the length of the string.
strlen() is exactly what you think it is, it counts the number of bytes in a (null-terminated) string. It's technically part of the <string.h> C header (<cstring> for C++) and I don't like using too much bloat, but it's a small header file with just a couple functions, so it's not a big deal. And on Windows (Visual Studio) I don't actually need to #include the header; it'll link it automatically during compilation.
We'll need this length later when you see what I do to parse the data, so keep it in mind.
The offsets piece of data is an array with two values, both of type integer. I'm going to use them to record where each piece of data begins and ends. Why only two? Because, just like how arrays start at 0, 2 is the third number from 0.
You'll understand soon enough.
So let's get down to the parsing.
I opted for a single loop, because quite frankly I don't need any more than that. It's a notch down from the original, so I'll take it.
I also need two iterators for this one. If you've used a c-like language before, you know that you can do multiple variable declarations of the same type on one line, so that's what I do for my iterators:
But I'm only using one iterator.
You'll see.
You can probably guess from i < len, I'm going to iterate over the string byte-by-byte. And inside the loop, I'm going to have a std::string.find() alternative (because like I said, std::string types are bloat).
So this should mostly make sense upon first glance.
What it does is when it encounters a character, it replaces it with a different character. In this case, I'm replacing the slash with a nullbyte (\0 is the escape code for a nullbyte). Since our string is an array of characters, it's no different than iterating over an array and finding the value that matches.
So what's up with the offsets[j] line?
Well, the first time it gets encountered is when j = 0, so offsets[0] is the first location Then j gets incremented. Then it encounters the second slash, and by then, j = 1, so it calls offsets[1], i.e. the second offset.
Yes, we increment j again and offsets[2] would be an out-of-bounds exception, but we never make the call to offsets[2] so it's fine.
And in offsets[0] or [1] we save the offset at which int i it encountered a slash in the string, or rather the byte after it.
Do you see where this is going yet? We're almost done, so back to a bit of theory. Remember when I declared len = strlen(input) and I said I'd explain later?
When I first wrote this program, I ran into a weird error that took me a solid minute to figure out what went wrong (using the Windows Debugger and peeking at the input string).
Why doesn't this work?:
Well, once it encounters the first slash, it turns it into a nullbyte and does that offset magic.
Then int i gets incremented and the loop starts again. At this point, it would be at the value 3 after incrementing.
But when the loop starts up again, it checks the value of int i against strlen(input).
strlen(input) isn't saved, and now that we need to call it again to verify the condition, strlen(input) returns 2, because the string is "02\001/2000" (Notice the \0 escaped nullbyte?) So strlen() only counts the first two characters then stops, and returns 2.
And i = 3, which is greater than 2, not less than 2, so the loop breaks then tries to continue execution, which will give you very, uh, 'bizarre' results.
But this is exactly the behaviour I'm looking for.
So altogether, this turns the original string "02/01/2000" into "02\001\02001" and while it's at it, it also saves the values at which '01' and '2000' begin. (so the offset for 0 (3) and the offset for 2 (6))
Now all that's left is to print it.
This is where we get a little f***, but bear with me.
The original used std::cout because of C++, so we'll do the same:
As described above about the strlen() problem in the loop, this will print nothing more than '02' and a newline (std::endl is a #define for a newline character in the C++ standard library.)
So what about the rest?
Well, strings are interpreted as pointers, so let's use pointers to get the rest of the job done.
First, we need to change the offset at where the string starts.
Well, the 'string' is passed as a pointer to std::cout, so let's change where that pointer actually points to.
A pointer saves a memory address, so first we need to actually access that memory address, then we just add our offset (that we saved in offset[0]) to it.
To access the address that a pointer stores rather than the data it's pointing at, we use the & operator:
This would return the actual address in memory (or rather the allocated memory of the program) which in my case is 0x010FFCC8.
Now, remember: Strings in C are just arrays of characters, so you might think that we could just call it like this:
And you're wrong. That returns a character, because we defined the string as the type char[] (rather, type char) earlier. In our case, it would just return '0' (because of 01).
So now you're thinking: "Let's type-cast it back to a type char*!"
Guess what: still doesn't work. std::cout also has its own implementations of strlen() and whatnot for when it needs to try and print a certain type of data, in our case a string. But when it does its own function calls here, it also tries to type-cast back to the original type of data.
C++ doesn't like to play nice with all these type-casts, and will throw an exception when it happens. Sorry, I don't make the rules, get pissed off at Stroustrup, not me.
So we need to actually use the offset we saved.
Now, arithmetic would say just do this: (&input + offsets[0]) but that'll also get garbled. That actually adds 33 to the memory address we have above (0x21). The reason is actually quite interesting and had me stumped for a while when I was trying to figure out the behaviour without needing to plug the thing into gdb or IDA.
Now, back to the subject at hand. How do we print the value at the right offset?
Well, we almost had it earlier when we tried input[offset[0]]. But we just need to use that ampersand character to indicate the memory address.
&input is not of type char, or char[]. It's the type char* already. And if you recall, char* is what strings really are.
So it starts reading the data starting from input[3] which happens to be '0' in our case.
The full code here that will match the output in the original sample is as follows:
Might be a little f*** because of the pointer arithmetic we did at the bottom, but it'll run faster, and when compiled (under Visual Studio 2019, default debug settings) it's 80% smaller in filesize (from 89kb to 49kb.)
So I recently logged into a forum I used to frequent after hiatus of about 3-4 months. I figured I needed to purge most of my digital life to focus on other stuff at the time, but I've finally lifted the hosts file entry for that site so that I could check in again. There, I found a thread about parsing a string for certain values. Specifically, that string was a datestamp, in the following format:
02/01/2000
Not the exact value, but the one I'll be using here.
That would be 02 January 2000, day/month/year, the true Euro way.
Now, this is a site where people share random code samples or whatever just to say they've 'contributed' something even if it's garbage, so 99% of the code there is garbage, and this thread was no exception.
The Code:
Code:
#include <iostream>
#include <string>
using namespace std;
int main()
{
int slash;
int i = 0;
string in;
string out[3];
in = "02/01/2000";
in = in + "/";
while(in != "\0")
{
slash = in.find("/");
for(int f = 0; f < slash; f ++)
{
out[i] = out[i] + in[f];
}
in.erase(0, slash + 1);
i ++;
}
cout << "Day : " << out[0] << endl;
cout << "Month : " << out[1] << endl;
cout << "Year : " << out[2] << endl;
}
Returns the following:
Day: 02
Month: 01
Year: 2000
So, let's count the bad code practices:
- No return value. Not a huge issue, it'll compile fine, but return values are used by the OS to verify that the application executed correctly or if there was an error.
- Two loops, meaning O(n^2) time in this case. The while loop doesn't need to check for a non-nullbyte value because if it was non-nullbyte it would return true anyway.
- Added abstraction with the <string> header (for C++)
- Use of 'using namespace std;' Not a glaring issue either, but if you're writing larger C++ (OO) programs, blending around namespaces like this can get confusing really fast.
- Use of (slow) string methods such as 'find.'
- Randomly terminating the string with a / for some reason
- Storing the output data in a separate array. std::cout is for outputting stream data, you could just use that (with stream operators and some basic logic) to output the data as needed.
Code:
if(cond) {
doStuff();
} else {
doSomethingElse();
}
Stroustrup is okay too, especially given this is C++ and he essentially added useless bloat to the glorious C language made the language himself, so the else would be on a different line. That makes for easier-to-read code, anyway.
Side note as to answering the all-eternal question "Why?"
I like to rewrite pieces of code like this because, even though it's useless, it helps me get some practice programming in and also thinking more about how to solve a problem, even if it's trivial.
Not to mention the nerd who wrote it might actually learn something from my solution. We need better devs nowadays, I don't want more JavaScript frameworks that add 1.9MB to each time I try to visit a webpage, I want software that works and is fast and doesn't completely kill my CPU every time I need to do a bit of processing.
What is a String?
If asked this question, you're most likely to respond with 'a set of character-type data' without specifying what kind of character encoding it is (although generally ASCII is the default). Makes sense, and that would be a satisfactory answer, just a 'string' of bytes that are interpreted in a certain way.
But the C language never actually had a 'string' data type, unless you used <string.h> which we might need later on. But still, the C language stores strings as arrays of characters, with a nullbyte at the end to denote when the string ends.
So if you had a variable myString with the value "Hello World", in memory, it would appear as an array like this:
Code:
myString = ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '\0']
But if you actually use that data in a function like printf() (to print a value) then you also need a format specifier string, so the whole function call would be this:
Code:
printf("%s\n", myString);
But again, what exactly is a string and how does the program actually interpret it?
If you try to declare a string in C you might have an issue with the following code:
Code:
char myString[] = "Hello World";
And if you used the array representation above, you'd be fine.
But double-quoted literals like this example are not of type char[], instead they are of the type const char*. So the program has trouble assigning the type char[] to const char* because const char* should just be a pointer, i.e. storing a memory address (between 4 bytes and 8 bytes depending on 32bit vs. 64bit) of an immutable type char, (one byte) even though we're trying to store a whole twelve characters (eleven if you forgot the nullbyte).
Do note that in recent years, this has changed and most compilers won't really care about that anymore, so the above code sample may very well work (and in my case in Visual Studio 2019 CE, it does work.)
So, back to actual interpretation of the abstract string-type.
Well, remember that nullbyte-termination at the very end?
Basically, what printf() will do is take in the myString argument not as an array of data, but rather as a pointer to where that data starts. It will then increment that pointer, and read the value in memory byte-by-byte, just like a loop, and print the value of each byte.
And once it encounters a nullbyte during its incrementation, it'll stop reading values and finish up the function call with a nice return value.
So that's why string-literal data (double-quoted) is actually of a pointer type, and not a character/array type.
So why does it print byte-by-byte (aside from keeping it simple)?
The short answer is because that's generally how the operating system API is built. The write() function in the Linux syscall API can only write a single byte, using the registers RSI (where to get the byte), RDI (where to write the byte to), RAX (the syscall code itself, in the case of write() it's 1 on x86_64 architectures), and RDX (optional, only if you want to loop to write multiple bytes.) My experience is mainly for x86_64 assembly (nasm) on Linux so the registers for syscalls will vary between architectures.
Now, the std::string dataype for C++ is a full object with member methods and data fields. As such, it takes a ton of memory space compared to a simple C-string which is just a character array. So the first optimization I'd make to this program is to get rid of anything that has a string type, and consequently remove the <string> header because we don't need it.
I'll leave the 'using namespace std;' part intact because quite frankly in this scenario it really doesn't matter and just makes shorthanding std::endl and std::cout a bit easier.
The Rewrite:
So, we'll declare the original string as a char[] type.
While we're at it, we're going to also declare two pieces of data we will need later as well, offsets and the length of the string.
Code:
char input[] = "02/01/2000";
int offsets[2];
int len = strlen(input);
We'll need this length later when you see what I do to parse the data, so keep it in mind.
The offsets piece of data is an array with two values, both of type integer. I'm going to use them to record where each piece of data begins and ends. Why only two? Because, just like how arrays start at 0, 2 is the third number from 0.
You'll understand soon enough.
So let's get down to the parsing.
I opted for a single loop, because quite frankly I don't need any more than that. It's a notch down from the original, so I'll take it.
I also need two iterators for this one. If you've used a c-like language before, you know that you can do multiple variable declarations of the same type on one line, so that's what I do for my iterators:
Code:
for(int i = 0, j = 0; i < len; ++i) {
}
You'll see.
You can probably guess from i < len, I'm going to iterate over the string byte-by-byte. And inside the loop, I'm going to have a std::string.find() alternative (because like I said, std::string types are bloat).
Code:
for (int i = 0, j = 0; i < len; ++i) {
if (input[i] == '/') {
input[i] = '\0';
offsets[j] = i + 1;
++j;
}
}
So this should mostly make sense upon first glance.
What it does is when it encounters a character, it replaces it with a different character. In this case, I'm replacing the slash with a nullbyte (\0 is the escape code for a nullbyte). Since our string is an array of characters, it's no different than iterating over an array and finding the value that matches.
So what's up with the offsets[j] line?
Well, the first time it gets encountered is when j = 0, so offsets[0] is the first location Then j gets incremented. Then it encounters the second slash, and by then, j = 1, so it calls offsets[1], i.e. the second offset.
Yes, we increment j again and offsets[2] would be an out-of-bounds exception, but we never make the call to offsets[2] so it's fine.
And in offsets[0] or [1] we save the offset at which int i it encountered a slash in the string, or rather the byte after it.
Do you see where this is going yet? We're almost done, so back to a bit of theory. Remember when I declared len = strlen(input) and I said I'd explain later?
When I first wrote this program, I ran into a weird error that took me a solid minute to figure out what went wrong (using the Windows Debugger and peeking at the input string).
Why doesn't this work?:
Code:
for (int i = 0, j = 0; i < strlen(input); ++i) {
}
Then int i gets incremented and the loop starts again. At this point, it would be at the value 3 after incrementing.
But when the loop starts up again, it checks the value of int i against strlen(input).
strlen(input) isn't saved, and now that we need to call it again to verify the condition, strlen(input) returns 2, because the string is "02\001/2000" (Notice the \0 escaped nullbyte?) So strlen() only counts the first two characters then stops, and returns 2.
And i = 3, which is greater than 2, not less than 2, so the loop breaks then tries to continue execution, which will give you very, uh, 'bizarre' results.
But this is exactly the behaviour I'm looking for.
So altogether, this turns the original string "02/01/2000" into "02\001\02001" and while it's at it, it also saves the values at which '01' and '2000' begin. (so the offset for 0 (3) and the offset for 2 (6))
Now all that's left is to print it.
This is where we get a little f***, but bear with me.
The original used std::cout because of C++, so we'll do the same:
Code:
cout << "Day: " << input << endl;
So what about the rest?
Well, strings are interpreted as pointers, so let's use pointers to get the rest of the job done.
First, we need to change the offset at where the string starts.
Code:
NOTE: I'm using spaces for nullbytes because it's easier to visualize.
02 01 2000
^ - Original starting point
02 01 2000
^ - Desired starting point
Well, the 'string' is passed as a pointer to std::cout, so let's change where that pointer actually points to.
A pointer saves a memory address, so first we need to actually access that memory address, then we just add our offset (that we saved in offset[0]) to it.
To access the address that a pointer stores rather than the data it's pointing at, we use the & operator:
Code:
cout << &input << endl;
Now, remember: Strings in C are just arrays of characters, so you might think that we could just call it like this:
Code:
cout << input[offsets[0]] << endl;
So now you're thinking: "Let's type-cast it back to a type char*!"
Guess what: still doesn't work. std::cout also has its own implementations of strlen() and whatnot for when it needs to try and print a certain type of data, in our case a string. But when it does its own function calls here, it also tries to type-cast back to the original type of data.
C++ doesn't like to play nice with all these type-casts, and will throw an exception when it happens. Sorry, I don't make the rules, get pissed off at Stroustrup, not me.
So we need to actually use the offset we saved.
Now, arithmetic would say just do this: (&input + offsets[0]) but that'll also get garbled. That actually adds 33 to the memory address we have above (0x21). The reason is actually quite interesting and had me stumped for a while when I was trying to figure out the behaviour without needing to plug the thing into gdb or IDA.
Quote:33 is oddly close to 32, so that's what I initially thought it might be, but all the data types I was using were 4 bytes (the length of the address &input and the length of the integer at offsets[0])
Then I realized that the string I was using as input (02/01/2000) was exactly eleven characters long, and the value I was trying to add was 3.
3 * 11 = 33.
I null-terminated the input string manually to make it 12 chars, and suddenly I was getting a bad offset of 36 (0x24) or 3 * 12.
So what appears to be happening is that the offset changes not by three, but three times the length of the original piece of data stored at the address, which is indicated by the type of &input.
If you call typeid(&input).name() you get char (*)[11], if you'd like to verify this.
So I'm assuming it tries to just add char[11] three times, or char[33] instead of just adding 3 to the base memory pointer.
So, as a side note, an alternative to what I'm about to do would be this:
(char*)((int)&input + offset[0])
Type-cast the address to an integer first, then add another integer to it, then type-cast it back to a char*.
But that's messy, so we won't do that. There's a much more elegant solution.
Now, back to the subject at hand. How do we print the value at the right offset?
Well, we almost had it earlier when we tried input[offset[0]]. But we just need to use that ampersand character to indicate the memory address.
Code:
cout << &input[offsets[0]] << endl;
So it starts reading the data starting from input[3] which happens to be '0' in our case.
The full code here that will match the output in the original sample is as follows:
Code:
#include <iostream>
using namespace std;
int main() {
char input[] = "02/01/2000";
int offsets[2];
int len = strlen(input);
for (int i = 0, j = 0; i < len; ++i) {
if (input[i] == '/') {
input[i] = '\0';
offsets[j] = i + 1;
++j;
}
}
cout << "Day: " << input << endl;
cout << "Month: " << &input[offsets[0]] << endl;
cout << "Year: " << &input[offsets[1]] << endl;
return 0;
}
Might be a little f*** because of the pointer arithmetic we did at the bottom, but it'll run faster, and when compiled (under Visual Studio 2019, default debug settings) it's 80% smaller in filesize (from 89kb to 49kb.)