Makestation
[Guide] Pointers and Strings - Printable Version

+- Makestation (https://makestation.net)
+-- Forum: Technical Arts (https://makestation.net/forumdisplay.php?fid=45)
+--- Forum: Software (https://makestation.net/forumdisplay.php?fid=104)
+--- Thread: [Guide] Pointers and Strings (/showthread.php?tid=2769)



[Guide] Pointers and Strings - Lain - March 12th, 2020

The Background:

So I recently logged into a forum I used to frequent, after a hiatus of about 3-4 months. I figured I needed to purge most of my digital life to focus on other stuff at the time, but I've finally lifted the hosts-file entry for that site so I could check in again. There, I found a thread about parsing a string for certain values. Specifically, that string was a datestamp, in the following format:
02/01/2000
Not the exact value, but the one I'll be using here.
That would be 02 January 2000, day/month/year, the true Euro way.

Now, this is a site where people share random code samples or whatever just to say they've 'contributed' something even if it's garbage, so 99% of the code there is garbage, and this thread was no exception.

The Code:
Code:
#include <iostream>
#include <string>

using namespace std;

int main()
{
  int slash;
  int i = 0;
  string in;
  string out[3];

  in = "02/01/2000";
  in = in + "/";

  while(in != "\0")
  {
    slash = in.find("/");

    for(int f = 0; f < slash; f ++)
    {
      out[i] = out[i] + in[f];
    }

    in.erase(0, slash + 1);
    i ++;
  }

  cout << "Day : " << out[0] << endl;
  cout << "Month : " << out[1] << endl;
  cout << "Year : " << out[2] << endl;
}

Returns the following:
Day: 02
Month: 01
Year: 2000
Pretty useless, right? Anyone with half a brain would just use a regex for the data they need, or for the web developers, the built-in JavaScript split method, e.g. dateString.split('/'), to get the three values they need.


So, let's count the bad code practices:
  1. No return value from main(). Not a huge issue -- C++ defines main() to return 0 implicitly, so it compiles fine -- but an explicit return value makes it obvious what the OS/shell sees when checking whether the application executed correctly or hit an error.
  2. Nested loops, plus in.erase() shifting all remaining characters down on every pass, so the work grows much faster than it needs to. And the while condition compares the std::string against "\0", which (since "\0" is just an empty C-string) is a roundabout way of writing !in.empty().
  3. Added abstraction with the <string> header (for C++).
  4. Use of 'using namespace std;'. Not a glaring issue either, but if you're writing larger C++ (OO) programs, pulling whole namespaces into scope like this can get confusing really fast.
  5. Use of (slow) string methods such as 'find.'
  6. Appending a trailing / to the string purely so the find() loop picks up the last field.
  7. Storing the output data in a separate array. std::cout is for outputting stream data; you could just use that (with stream operators and some basic logic) to output the data as needed.
And, as another half-point, I don't like Allman brace styling (i.e. opening curly brace goes on a separate line than the loop/function/struct that uses it) and I generally like the K&R 'one true brace style' of:
Code:
if(cond) {
  doStuff();
} else {
  doSomethingElse();
}

Stroustrup style is okay too, especially given this is C++ and Stroustrup made the language himself (even if he essentially bolted a pile of bloat onto the glorious C language), so the else would go on its own line. That makes for easier-to-read code, anyway.

Side note as to answering the all-eternal question "Why?"
I like to rewrite pieces of code like this because, even though it's useless, it helps me get some practice programming in and also thinking more about how to solve a problem, even if it's trivial.
Not to mention the nerd who wrote it might actually learn something from my solution. We need better devs nowadays; I don't want more JavaScript frameworks that add 1.9MB to every page I try to visit, I want software that works, is fast, and doesn't completely kill my CPU every time I need to do a bit of processing.

What is a String?
If asked this question, you're most likely to respond with 'a set of character-type data' without specifying what kind of character encoding it is (although generally ASCII is the default). Makes sense, and that would be a satisfactory answer, just a 'string' of bytes that are interpreted in a certain way.

But the C language never actually had a 'string' data type -- even <string.h> (which we might need later on) only provides functions, not a type. C stores strings as arrays of characters, with a null byte at the end to denote where the string ends.

So if you had a variable myString with the value "Hello World", in memory, it would appear as an array like this:
Code:
myString = ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '\0']
Makes sense, right? I mean, each letter needs to take up at least a byte of data anyway, so just putting them together would be the obvious way to do it.

But if you actually use that data in a function like printf() (to print a value) then you also need a format specifier string, so the whole function call would be this:
Code:
printf("%s\n", myString);
%s means that the function should take in a value from the next argument in the function call, and interpret that data as a string-type. The \n is the escape code for a newline character, if you didn't already know.
But again, what exactly is a string and how does the program actually interpret it?

If you try to declare a string in C you might expect an issue with the following code:
Code:
char myString[] = "Hello World";
Why though? That seems simple enough?
And if you used the array representation above, you'd obviously be fine.
The subtlety is that the double-quoted literal itself is not of type char[]; in C++ it has the type const char[12] (the eleven characters plus the null byte), which decays to a const char* -- a pointer (4 to 8 bytes depending on 32-bit vs. 64-bit) to immutable character data. What the array declaration above actually does is copy those twelve bytes (eleven if you forgot the nullbyte) into a fresh, writable array, so it's perfectly legal. What modern compilers do reject is char* myString = "Hello World";, because that throws away the const and invites you to scribble over read-only literal data.
(In my case, under Visual Studio 2019 CE, the array form works exactly as you'd hope -- and the fact that it's a writable copy is something we'll exploit later.)

So, back to actual interpretation of the abstract string-type.
Well, remember that nullbyte-termination at the very end?
Basically, what printf() will do is take in the myString argument not as an array of data, but rather as a pointer to where that data starts. It will then increment that pointer, and read the value in memory byte-by-byte, just like a loop, and print the value of each byte.
And once it encounters a nullbyte during its incrementation, it'll stop reading values and finish up the function call with a nice return value.
So that's why string-literal data (double-quoted) behaves like a pointer type rather than a plain character value: the array of characters decays to a pointer to its first byte, and the null terminator tells the reader where to stop.

So why does it read byte-by-byte (aside from keeping it simple)?
The short answer is that a C-string carries no length field, so the only way to find the end is to scan for the null byte. Once you know where the data starts and how many bytes it holds, it gets handed to the operating system: on Linux x86_64, the write() syscall takes the syscall number in RAX (1 for write), the file descriptor to write to in RDI, the buffer address in RSI, and the byte count in RDX. My experience is mainly x86_64 assembly (nasm) on Linux, and the registers for syscalls will vary between architectures.



Now, the std::string datatype in C++ is a full object with member methods and bookkeeping fields (size, capacity, possibly a heap allocation). As such, it carries a lot of baggage compared to a simple C-string, which is just a character array. So the first optimization I'd make to this program is to get rid of anything that has a string type, and consequently remove the <string> header because we don't need it.

I'll leave the 'using namespace std;' part intact because quite frankly in this scenario it really doesn't matter and just makes shorthanding std::endl and std::cout a bit easier.

The Rewrite:
So, we'll declare the original string as a char[] type.
While we're at it, we're going to also declare two pieces of data we will need later as well, offsets and the length of the string.
Code:
char input[] = "02/01/2000";
int offsets[2];
int len = strlen(input);
strlen() is exactly what you think it is: it counts the number of bytes in a (null-terminated) string, not counting the terminator. It's technically part of the <string.h> C header (<cstring> for C++), and while I don't like pulling in bloat, it's a small header with just a handful of functions, so it's not a big deal. (On my Visual Studio 2019 setup the code happens to compile without the explicit #include -- something else pulls it in -- but you should include <cstring> to be portable.)
We'll need this length later when you see what I do to parse the data, so keep it in mind.
The offsets piece of data is an array with two integer values. I'm going to use them to record where the later pieces of data begin. Why only two for three fields? Because the first field always starts at offset 0, so I only need to mark where the second and third ones begin.
You'll understand soon enough.


So let's get down to the parsing.
I opted for a single loop, because quite frankly I don't need any more than that. It's a notch down from the original, so I'll take it.
I also need two loop variables for this one. If you've used a C-like language before, you know that you can declare multiple variables of the same type on one line, so that's what I do:
Code:
for(int i = 0, j = 0; i < len; ++i) {

}
But only i actually drives the loop; j is just a counter.
You'll see.
You can probably guess from i < len, I'm going to iterate over the string byte-by-byte. And inside the loop, I'm going to have a std::string.find() alternative (because like I said, std::string types are bloat).
Code:
for (int i = 0, j = 0; i < len; ++i) {
  if (input[i] == '/') {
    input[i] = '\0';
    offsets[j] = i + 1;
    ++j;
  }
}

So this should mostly make sense upon first glance.
What it does is: when it encounters a slash character, it replaces it with a null byte (\0 is the escape code for a nullbyte). Since our string is an array of characters, this is no different from iterating over any array and overwriting the values that match.

So what's up with the offsets[j] line?
Well, the first slash gets encountered while j = 0, so offsets[0] records the first location. Then j gets incremented. When it encounters the second slash, j = 1 by then, so offsets[1] records the second offset.
Yes, j gets incremented once more after that, and indexing offsets[2] would be an out-of-bounds access (raw C++ arrays don't bounds-check, so that's undefined behaviour rather than a tidy exception) -- but we never actually index offsets[2], so it's fine.

And in offsets[0] and offsets[1] we save the position at which i encountered a slash in the string -- or rather, the position of the byte just after it (i + 1).
Do you see where this is going yet? We're almost done, so back to a bit of theory. Remember when I declared len = strlen(input) and I said I'd explain later?

When I first wrote this program, I ran into a weird error that took me a solid minute to figure out what went wrong (using the Windows Debugger and peeking at the input string). 
Why doesn't this work?:
Code:
for (int i = 0, j = 0; i < strlen(input); ++i) {

}
Well, once it encounters the first slash, it turns it into a null byte and does that offset magic.
Then i gets incremented and the loop comes back around; by this point i is 3.
But now the loop re-checks the value of i against strlen(input).
The length wasn't saved anywhere, and the buffer currently holds 02, then a null byte, then 01/2000. strlen() counts the first two characters, hits the null byte, and returns 2.
And i = 3 is greater than 2, not less than 2, so the loop exits after the very first slash and the rest of the string never gets parsed.
The null-byte surgery itself is exactly the behaviour I'm looking for -- I just need the loop bound frozen up front, which is why len gets computed before the loop starts.

So altogether, this turns the original string "02/01/2000" into 02, a null byte, 01, a null byte, 2000 -- and while it's at it, it also saves the positions where '01' and '2000' begin (offsets[0] = 3 and offsets[1] = 6).

Now all that's left is to print it.
This is where we get a little f***, but bear with me.
The original used std::cout because of C++, so we'll do the same:

Code:
cout << "Day: " << input << endl;
As described above with the strlen() problem in the loop, this will print nothing more than '02' and a newline (std::endl isn't just a #define for a newline, by the way -- it's a stream manipulator that inserts '\n' and flushes the stream).
So what about the rest?
Well, strings are interpreted as pointers, so let's use pointers to get the rest of the job done.

First, we need to change the offset at where the string starts. 
Code:
NOTE: I'm using spaces for nullbytes because it's easier to visualize.

02 01 2000
^ - Original starting point

02 01 2000
   ^ - Desired starting point

Well, the 'string' reaches std::cout as a pointer, so let's change where that pointer actually points.
A pointer is just a memory address, so first we need to get hold of the address our data lives at, then add our saved offset (offsets[0]) to it.
To take the address of an object -- here, the whole input array -- we use the & operator:
Code:
cout << &input << endl;
This would return the actual address in memory (or rather the allocated memory of the program) which in my case is 0x010FFCC8.

Now, remember: Strings in C are just arrays of characters, so you might think that we could just call it like this:
Code:
cout << input[offsets[0]] << endl;
And you'd be wrong. That prints a single character, because input[offsets[0]] is one element of the char array -- type char, not a string. In our case it would just print '0' (the first character of '01').
So now you're thinking: "Let's type-cast it back to a type char*!"
You can make that work, but it's a minefield: apply the cast to the wrong expression, or do the arithmetic on the wrong type, and you don't get a friendly compiler error -- you get a pointer into the middle of nowhere and garbage output (as we're about to see with &input). Sorry, I don't make the rules; get pissed off at Stroustrup, not me.

So we need to actually use the offset we saved.
Now, arithmetic would say just do this: (&input + offsets[0]) but that'll also get garbled. That actually adds 33 to the memory address we have above (0x21). The reason is actually quite interesting and had me stumped for a while when I was trying to figure out the behaviour without needing to plug the thing into gdb or IDA.

Quote:33 is oddly close to 32, so that's what I initially thought it might be, but all the data types I was using were 4 bytes (the length of the address &input and the length of the integer at offsets[0])
Then I realized that the string I was using as input (02/01/2000) was exactly eleven characters long, and the value I was trying to add was 3.
3 * 11 = 33.

I null-terminated the input string manually to make it 12 chars, and suddenly I was getting a bad offset of 36 (0x24) or 3 * 12.

So what's happening is that the offset moves not by three bytes, but by three times the size of the thing &input points to -- and &input points to the whole array.
If you call typeid(&input).name() you get char (*)[11], if you'd like to verify this.
Pointer arithmetic works in units of the pointed-to type, so &input + 3 advances by 3 * sizeof(char[11]) = 33 bytes, instead of just adding 3 to the base memory address.
So, as a side note, an alternative to what I'm about to do would be this:

(char*)((uintptr_t)&input + offsets[0])
Type-cast the address to an integer type wide enough to hold a pointer (uintptr_t from <cstdint> -- plain int would truncate on 64-bit), add the offset as an ordinary number, then cast the result back to a char*.
But that's messy, so we won't do that. There's a much more elegant solution.

Now, back to the subject at hand. How do we print the value at the right offset?
Well, we almost had it earlier when we tried input[offset[0]]. But we just need to use that ampersand character to indicate the memory address.
Code:
cout << &input[offsets[0]] << endl;
&input[offsets[0]] is not of type char, or char[]. It's already a char* -- the address of the element at index offsets[0]. And if you recall, char* is what strings really are.
So it starts reading the data starting from input[3] which happens to be '0' in our case.

The full code here that will match the output in the original sample is as follows:
Code:
#include <cstring>  // for strlen()
#include <iostream>

using namespace std;

int main() {

  char input[] = "02/01/2000";
  int offsets[2];
  int len = strlen(input);

  for (int i = 0, j = 0; i < len; ++i) {
    if (input[i] == '/') {
      input[i] = '\0';
      offsets[j] = i + 1;
      ++j;
    }
  }

  cout << "Day: " << input << endl;
  cout << "Month: " << &input[offsets[0]] << endl;
  cout << "Year: " << &input[offsets[1]] << endl;

  return 0;
}

Might be a little f*** because of the pointer arithmetic we did at the bottom, but it'll run faster, and when compiled (under Visual Studio 2019, default debug settings) it's about 45% smaller in filesize (from 89kb down to 49kb).


RE: [Guide] Pointers and Strings - Darth-Apple - March 12th, 2020

I read through. I was waiting the entire time for you to put "int len = " and to use the for loop with the variable instead of the full statement. And I saw at the end. You so, so made me wait for it. Finna

Honestly, I see a lot of programmers make that particular mistake. What they don't realize is that in many languages, len(some string) is an O(n) operation. A single "for (i = 0; i < strlen(string); i++)" operation becomes an O(n^2) operation just like that.

I'm glad that there is a surge of programmers (along with resources to get them started). The science of things like this is anything but beginner friendly, but at some point, however, it becomes necessary to know. Things like this are exactly why a very solid backing in the inner-workings of low-level code are so important.

Also, I'm very surprised that the OS takes things a single byte at a time. On one hand, operations that involve computations on a single byte have no choice. (32 or 64 bits does not mean 4 or 8 operations. Just that one operation is 32 or 64 bits. And a simple "is this character 'x'" operation is an 8 bit operation fundamentally.) I suppose I'm always a little slow to use string operations for that reason, but when I do use them, I use the available tools. Regex is great (still doesn't make total sense to me, but I have the hang of it enough to know my way around it).

But copying/writing strings is something where I'm surprised on that one. I suppose there is a reason for this too, or perhaps it's convention also (much like 75% of the C++ language. Tongue)

--------------------------------------------------
Regarding 1.9MB Javascript libraries...
I couldn't agree more. Not only with javascript (because the browser has to load these and that takes time), but in principle, just in general. If I can do something WITHOUT a framework, why not? Seriously. 90% of the time, it's not hard to do without. (You already have to learn the base language. With a framework, you have to learn the base language AND the framework. It's like learning two languages to do one thing.)

That's something I LOVE about MyBB. Yes, they use Jquery (I'm okay with that, jquery makes javascript much easier), but they don't use symfony or any other PHP libraries. The pages are in actual .php files. It's not modernized at all, but it's so, so much cleaner and easier to code for. Unfortunately, the world is moving towards more bloated approaches, and people don't appreciate the simplicity of old ways like they used to.

Anyway, thanks for writing this! I really like your post, and learned a lot just reading through it. Smile


RE: [Guide] Pointers and Strings - Lain - March 12th, 2020

(March 12th, 2020 at 2:59 PM)Darth-Apple Wrote: I read through. I was waiting the entire time for you to put "int len = " and to use the for loop with the variable instead of the full statement. And I saw at the end. You so, so made me wait for it. Finna

Honestly, I see a lot of programmers make that particular mistake. What they don't realize is that in many languages, len(some string) is an O(n) operation. A single "for (i = 0; i < strlen(string); i++)" operation becomes an O(n^2) operation just like that.

I'm glad that there is a surge of programmers (along with resources to get them started). The science of things like this is anything but beginner friendly, but at some point, however, it becomes necessary to know. Things like this are exactly why a very solid backing in the inner-workings of low-level code are so important.

Also, I'm very surprised that the OS takes things a single byte at a time. On one hand, operations that involve computations on a single byte have no choice. (32 or 64 bits does not mean 4 or 8 operations. Just that one operation is 32 or 64 bits. And a simple "is this character 'x'" operation is an 8 bit operation fundamentally.) I suppose I'm always a little slow to use string operations for that reason, but when I do use them, I use the available tools. Regex is great (still doesn't make total sense to me, but I have the hang of it enough to know my way around it).

But copying/writing strings is something where I'm surprised on that one. I suppose there is a reason for this too, or perhaps it's convention also (much like 75% of the C++ language. Tongue)


One note: I'm pretty sure the C standard library's strlen() isn't a naive byte-by-byte loop. It's still O(n) overall, but I remember reading an article on it a while back: glibc reads a whole word (4 or 8 bytes) at a time and uses bit tricks to spot the null byte, which drastically improves throughput for larger strings (but not so much for smaller ones).

You'd think you could make a simple implementation of the same function using a single loop or, like four lines of code, but the actual glibc implementation is closer to 25 SLOC (that's without all the comments) as a result.
I've been watching musl-libc as well because I was interested in using VoidLinux a while ago (which takes pride in using musl over glibc for the sake of being lightweight), and musl-libc basically tries to make a tinier glibc while trying to keep most of the functionality and optimizations. It's pretty cool, and their implementation is like 22 lines as well (including other includes and defines)

But either way, calling it n times isn't exactly great either.

I considered using <regex> for this because I've used it in the past for other C++ projects, but sadly that's just more bloat, and it's been known to compile badly on linux machines under gcc/g++. Thus I stayed away from it.
Reading/writing data is pretty cool when you study some assembly to see how it's actually done under the hood. Each load or store targets a single memory address at a time (even if the chunk transferred can be wider than one byte), and for the write syscall, RDX carries the byte count, so the kernel effectively does the multi-byte looping for you.

Loop unrolling makes you think that everything's f*** when it comes to optimizations. I hate it.
But on the other hand my CPU can push somewhere around 2.7-3.5 billion cycles per second per core, so f*** it, I can spare a couple of extra instructions here and there for the sake of not spending hours writing this haha.


RE: [Guide] Pointers and Strings - Darth-Apple - March 20th, 2020

CPUs are insanely fast these days. The ~3 billion clock cycles/second can translate into even more instructions per second, as modern CPUs can execute 4-5 in parallel (7 in the case of the new iPhone). But lack of optimization is part of why software has not gotten faster at the rate that CPUs have. My Mac still takes about 5-10 seconds to open Word. Heck, a comparable text editor in 2000 would have loaded in about the same amount of time on a Pentium 3.

But hey, I do know that one thing modern CPUs are focusing on is memory optimization. AMD is throwing a ridiculous amount of cache onto their new chips. Generally, each doubling of cache buys only a slight performance increase -- most of the cache-hungry data fits in the first ~2MB or so -- but when you're loading a string that could be megabytes long and need to recalculate its length over and over again, more cache is good.

I'm often amazed at how well optimized glibc truly is. Clearlinux has some optimizations in addition to the standard GNU stack, which amazes me even more.