VOGONS

Reply 20 of 43, by Scali

Rank l33t
Errius wrote:

UTF-16 is a messy kludge. Most of these APIs were originally designed in the 1990s when UTF-16 didn't exist. Shoehorning UTF-16 into routines designed for UCS-2 causes endless headaches.

No different from shoehorning UTF-8 into ASCII, as is done in many other systems.
It is what it is.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 21 of 43, by eL_PuSHeR

Rank l33t++

I still can't see the so-called benefits of going UEFI. It's pretty badly implemented, in my opinion. I have seen some BIOSes with awful UEFI implementations.

MBR partitioning scheme is still working perfectly fine and it's less messy.

Intel i7 5960X
Gigabyte GA-X99-Gaming 5
8 GB DDR4 (2100)
8 GB GeForce GTX 1070 G1 Gaming (Gigabyte)

Reply 22 of 43, by dr_st

Rank l33t
eL_PuSHeR wrote:

MBR partitioning scheme is still working perfectly fine and it's less messy.

Limited to 2 terabyte drives, though, no?

https://cloakedthargoid.wordpress.com/ - Random content on hardware, software, games and toys

Reply 23 of 43, by Jo22

Rank l33t++
eL_PuSHeR wrote:

I still can't see the so-called benefits of going UEFI. It's pretty badly implemented, in my opinion. I have seen some BIOSes with awful UEFI implementations.

This reminds me of this old video ad: https://youtu.be/EUvrXJ2fT_M
When UEFI was introduced, it was advertised as being super user-friendly with a -hold on- graphical user interface with mouse support! 😁
The people at Asus who made the ad (for example) clearly hadn't used a WinBIOS before.

"Time, it seems, doesn't flow. For some it's fast, for some it's slow.
In what to one race is no time at all, another race can rise and fall..." - The Minstrel

//My video channel//

Reply 24 of 43, by gdjacobs

Rank l33t++
dr_st wrote:
eL_PuSHeR wrote:

MBR partitioning scheme is still working perfectly fine and it's less messy.

Limited to 2 terabyte drives, though, no?

All my big drives have raw storage on them. No partition table required.

All hail the Great Capacitor Brand Finder

Reply 25 of 43, by Jo22

Rank l33t++

My backup drive is 5TB in capacity and uses NTFS and an MBR "partition" (or none at all?). Since its interface is USB 3.x, partition limits are no issue.
The controller handles the drive geometry internally and does the translation to the host.
Even Windows XP x86 can access it safely without any trouble or restrictions.

Anyway, if the enclosure fails at some point, things will get a bit complicated, of course. 😉
I'd have to mount it read-only in macOS, Linux, BSD, etc. and pray that retrieving stuff works somehow. 😅

"Time, it seems, doesn't flow. For some it's fast, for some it's slow.
In what to one race is no time at all, another race can rise and fall..." - The Minstrel

//My video channel//

Reply 26 of 43, by SirNickity

Rank Oldbie

MBR and LBA are two totally different factors in drive size. And I'm not really sure how we got to MBR from UEFI.

Just in case any of these things are being confused:
UEFI is a BIOS replacement.
LBA is a block-level addressing method used instead of Cylinder/Head/Sector -- which hasn't been physically accurate for ages, and which has many (different) limits on how large those numbers can be depending on which BIOS or OS you're talking about.
MBR is a partition table, and has fixed-width fields to track the starting point and size of partitions. It can use LBA or CHS.

So, if you have a >2TB drive, you can't describe the size of a single partition with a 32-bit LBA field. Ergo, with an MBR table, you can have a max part size of 2TB. You can have 2x 2TB partitions on a 4TB disk, but that's pretty much the end of the road. Your only alternatives are to use GPT or no partition table at all (which is what I do on my NAS drives.)
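
For illustration, the 2 TB figure follows directly from those fixed-width fields (a quick sketch, assuming the traditional 512-byte sector size):

// MBR partition entries store the start LBA and the sector count as 32-bit values.
// With 512-byte sectors that caps a partition at:
//   2^32 sectors * 512 bytes = 2,199,023,255,552 bytes = 2 TiB
unsigned long long max_mbr_partition_512 = 4294967296ULL * 512ULL;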

Reply 27 of 43, by SirNickity

Rank Oldbie
Jo22 wrote:

Well, VB Classic vs VB .net is a bit like being a bus driver vs a fighter jet pilot. 😉
Claiming that a VB.Net/C++ programmer is making better programs per se is misleading.

I totally agree with you. The accessibility of VB is what got me into Windows programming, and allowed me (and swaths of others) to quickly and easily make a lot of useful applications. That doesn't change the fact that the language itself is terribly designed. And, since the barrier of entry is pretty low, many VB programs are likewise written by people (90s me says hello) with very little experience designing interfaces, writing sensible code, etc., so those programs are also often... terrible. 😀 But, you gotta start somewhere, and better terrible programs than nonexistent ones, eh?

SquallStrife wrote:

Classic VB has a well-documented and consistent way of making calls into non-ActiveX DLLs. ... In fact, it's not too far removed from what you'd do in C# to make calls into native/unmanaged code

Yep, but... there are a lot of things you can't do without the help of add-on modules (VBX / OCX) -- like hooking into messages between the OS and the application. All you get by default are the baked-in events. There are workarounds (using a proxy control to pass those messages on to your VB code), but it's a kludge.

SquallStrife wrote:

I think there are some gaps in your knowledge, it's not really a "mess" of APIs any more than the array of libc variants, desktop environments, and widget toolkits available to you on Linuxes.

True beyond a doubt, on both counts. 😀 Everything I've written in Linux so far has been CLI -- partly due to a lack of necessity for a GUI so far, but also because of the quagmire that is Linux window managers and graphics toolkits. Yikes!

Scali wrote:

[ lots of useful stuff ]

I remember the ASCII / Wide API calls from back in my VB days. What confused me most was TCHAR, since I would see some code examples written with ASCII syntax, some with multi-byte data types, and some with TCHAR -- which was totally new to me. By that point, I had no idea how to declare a string type, and without being able to print text to the console or specify filenames, I was pretty much done before I started. (Yes, I'm aware "string" isn't a type in C -- I just mean an array of chars meant to be used as text.)

Your concise explanation of what TCHAR really does is very helpful, thank you. I still think the "UTF-8 in ASCII" method is so much more elegant and straight-forward. Your app can migrate from ASCII to UTF-8 pretty much without changes. The only thing you lose is (e.g.) the ability to count the length of strings by the number of bytes, so you have to start using Unicode-aware functions when that kind of thing matters. But you can get a LOT done before you have to care about what's actually in those strings. Granted, Microsoft was in uncharted territory with UCS-2, so some of this is hindsight. Having not migrated to something more sensible since then is really the issue I have, and now there's so much legacy code that it's unlikely to happen.
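
For anyone reading along, a minimal TCHAR sketch (my own example, built against <windows.h>/<tchar.h>; the message text is arbitrary):

#include <windows.h>
#include <tchar.h>

int main(void)
{
    // TCHAR, _T() and _tcslen() expand to char / "..." / strlen() in an ANSI build,
    // and to wchar_t / L"..." / wcslen() when UNICODE and _UNICODE are defined.
    TCHAR text[] = _T("Hello, world");
    size_t len = _tcslen(text);   // length in TCHAR elements, not bytes
    (void)len;
    MessageBox(NULL, text, _T("TCHAR example"), MB_OK);
    return 0;
}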

Reply 29 of 43, by jmarsh

Rank Oldbie
SirNickity wrote:

So, if you have a >2TB drive, you can't describe the size of a single partition with a 32-bit LBA field. Ergo, with an MBR table, you can have a max part size of 2TB. You can have 2x 2TB partitions on a 4TB disk, but that's pretty much the end of the road. Your only alternatives are to use GPT or no partition table at all (which is what I do on my NAS drives.)

If you have a >2TB drive it will likely have 4KB sectors, meaning the maximum size of an MBR partition (restricted by 32-bit LBA fields) is 16TB.
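
Same arithmetic with 4 KB sectors (sketch):

// With 4096-byte logical sectors the same 32-bit LBA fields reach much further:
//   2^32 sectors * 4096 bytes = 17,592,186,044,416 bytes = 16 TiB
unsigned long long max_mbr_partition_4k = 4294967296ULL * 4096ULL;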

Reply 30 of 43, by Scali

Rank l33t
SirNickity wrote:

I still think the "UTF-8 in ASCII" method is so much more elegant and straight-forward.

I don't.
Firstly, "UTF-8 in ASCII" is mostly equivalent to "UTF-16 in UCS-2" obviously (the Win32 API was natively meant as a Unicode API, with UCS-2 being the original default. ASCII was only available to make legacy code from Win16 easier to port... The fact that Win9x also focused on ASCII didn't quite help there).

Secondly, I think both are pretty bad... Having characters that may or may not take more than one array element is just not very easy to handle.
The big advantage of UTF-16 over UTF-8 is that because your array elements are 16 bits instead of 8 bits, the occurrence of multi-element characters is much lower, and therefore the chance of possible errors is much smaller. Most languages fit fine in UCS-2, so you never actually need multi-element characters at all. So UTF-16 in UCS-2 is the more elegant and straightforward one.
You basically only get errors with 'esoteric' languages such as Asian languages. Languages you could never use ASCII for in the first place, so you presumably are already using unicode and you already know what you're doing.
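
To make the 'multi-element character' point concrete, a small sketch (assuming Windows, where wchar_t is a 16-bit UTF-16 code unit; the sample characters are just examples):

// Most characters are a single 16-bit code unit; only code points above U+FFFF
// need a surrogate pair.
wchar_t euro[]  = L"\u20AC";     // U+20AC EURO SIGN -> 1 code unit, wcslen() == 1
wchar_t emoji[] = L"\U0001F600"; // U+1F600          -> 2 code units (surrogate pair), wcslen() == 2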

SirNickity wrote:

Your app can migrate from ASCII to UTF-8 pretty much without changes.

Oh the bugs I've had to fix from all those developers who can't get their heads around UTF-8...
This is an EXTREMELY error-prone process.
If you ask me, UTF-8, UTF-16 and other variations should mainly be used for storage, not for actual processing. Saves a lot of headaches and inefficiencies.
I also think you are over-simplifying ASCII. Remember that ASCII is only 128 characters (values 0-127). That meant you always had to use a specific codepage for your language, even for 'simple' Western ones such as German or English, to define the 'extra' characters in the top 128 values of the byte, which obviously affects how the bytes map to UTF-8 (also, not converting them means they will all be interpreted as multi-byte sequences in UTF-8, which will break your application). The conversion between (extended) ASCII and UTF-8 depends on choosing the proper codepage, which 99 out of 100 developers get wrong most of the time.
This problem doesn't really exist in UCS-2, as at least all Western languages fit into the standard range of UCS-2.

SirNickity wrote:

The only thing you lose is (e.g.) the ability to count the length of strings by the number of bytes, so you have to start using Unicode-aware functions when that kind of thing matters.

Which is the same for UTF-8 and UTF-16 as I said above.

SirNickity wrote:

Having not migrated to something more sensible since then is really the issue I have, and now there's so much legacy code that it's unlikely to happen.

I think UTF-16 is the more sensible one, at least for actual manipulation. Storage/transport is best done in UTF-8 because it is more compact. But conversion between UTF-8/16 is trivial: just a different way to pack the same bits into bytes/words.
The more fundamental problem is that the C-style string with zero-terminator is fundamentally a poor design for handling unicode.
Which is why modern languages such as C# and Java take the Pascal approach where you store the length of the string. This avoids unnecessary strlen() calls, and also guarantees that your string length is the one you expect with UTF-8 or UTF-16 encoding: the count of unicode characters.
You shouldn't use C for that, but C++, where similar string classes exist, to make string manipulation more efficient and less error prone than the 70s approach of zero-terminated arrays.
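
A rough sketch of the difference (std::u16string here is just the standard C++ counterpart of the length-carrying approach described above):

#include <string>

// C style: the length has to be recomputed by scanning for the terminator, and it comes back in bytes.
const char* c_str = "\xC3\xA9\xC3\xA9n";   // "één" as UTF-8: 5 bytes, 3 characters; strlen() == 5

// Pascal/C#/Java/C++ style: the length is stored alongside the data.
std::u16string s = u"\u00E9\u00E9n";       // the same word as UTF-16
// s.size() == 3 code units, available in O(1), no terminator scan needed.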

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 31 of 43, by SirNickity

Rank Oldbie

I'm going to admit that I'm not a professional software developer, and so I could very well be committing some of those errors that you speak of -- just being ignorant of what I don't know I'm doing wrong. But.... with UTF-8, the difference in the way I do the vast vast majority of what I have to do with strings is 100% transparent to me whether the user has set their terminal to a codepage, or UTF-8. If all I'm doing is printing text that I got from input, or files, or API calls, and I don't specifically need to deal with "how many bytes vs. how many characters", then there isn't really anything for me to worry about.

That isn't quite true of Win32, right? IIUC, you kind of have to pick a side since all the API calls require ASCII or 16-bit chars.

The choice of UCS-2 (and later UTF-16) probably made sense at the time, but frankly, it's just the worst of all possible options. It always takes more space than UTF-8, unless you're creating content composed SOLELY of multi-byte characters (which AFAIK is pretty uncommon), and it requires you to deal with multiple bytes even when you're never going to use more than the basic Western Latin alphabet, or just numbers. On the flip side, you don't get the luxury of knowing that EVERY character you will ever handle is the same width, as you would with...

Errius wrote:

Why can't we just have 4-byte chars?

... that. UTF-32.

I started writing a FAT library for funsies, and that was what introduced me to handling Unicode. Everything up to that point "just worked," but now I had to read data from a file whose codepage was not necessarily the same as the local system's, and in fact, isn't defined in the metadata anywhere, so you're guessing at worst, and taking the user's word for it (by asking them to specify what it would've been) at best. At that point, Unicode support became something I needed to know. Up til then, it was just handled as a matter of course by the system API. ***

(For those following along, FAT allows char values >127 in file names, but requires that all names be case-folded to uppercase -- which can only be done accurately if you know what character values in the set of 128-255 are letters in a given language / codepage, and what the equivalent uppercase character would be. I suspect many FAT implementations don't bother with this, and technically allow mixed case, and case-insensitive name collisions.)
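
To illustrate why the codepage matters for that case-folding, a tiny sketch (the byte values are from codepage 437; the single table entry is just an example, a real implementation needs the full per-codepage mapping):

#include <ctype.h>

// Codepage 437 stores 'é' as byte 0x82 and 'É' as byte 0x90.
unsigned char oem_upcase_cp437(unsigned char c)
{
    if (c == 0x82) return 0x90;           // é -> É, one entry of a per-codepage upcase table
    return (unsigned char)toupper(c);     // toupper() in the default "C" locale only folds a-z
}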

*** EDIT: OK, no, now I'm realizing that I was indeed just ignorant. If I had been dealing with text files and treating them all as if they were "local codepage" then I would've been in the exact same wrong position. Making assumptions that weren't true. In that case, I kind of see your point... the hassle of UTF-16 means you're forced to consider Unicode, where UTF-8 allows the false comfort of thinking you're doing the right thing when you may not be. Hm. Sudden realization that I don't know as much as I think I do. That's always fun. ;-D

Reply 32 of 43, by Errius

Rank l33t

Zipped UTF-32 is probably the textfile format of the future.

I understand that in some Asian languages UTF-16 produces smaller files than UTF-8.

Is this too much voodoo?

Reply 33 of 43, by Scali

Rank l33t
SirNickity wrote:

But.... with UTF-8, the difference in the way I do the vast vast majority of what I have to do with strings is 100% transparent to me whether the user has set their terminal to a codepage, or UTF-8.

It isn't, that's my point.
If you *strictly* stick to pure ASCII, so only values 0-127, then you're okay.
However, in many languages, you're not.
For example, in my language, you will use characters like É, Ë, È etc.
These characters do NOT fall in the range of 0-127. What's more, their value differs depending on which codepage I use.
In codepage 850 (default for Europe), they are codes 144, 211 and 212.
In codepage 437 (default for US), the first character is code 144, but the other two simply aren't in the codepage at all.
In unicode, they are 201, 203 and 200.

Which means that ASCII text in my language is always extended ASCII, and usually with codepage 850. Codepage 437 simply can't express all characters. If I mistakenly use codepage 437 (or any other codepage), the characters in the string will be incorrect.
Also, if I were to just interpret the string as UTF-8, it would also break, because these characters would be interpreted as multi-byte sequences, since their value is > 127 (eg a strlen() for UTF-8 would give different results than a strlen() for codepage 850).
So I can only interpret this string correctly (and convert it to UTF-8 properly), if I know that it is actually regular 8-bit extended ASCII encoding, and which codepage it uses.

As you can see, it is not transparent at all, because codepage 437, codepage 850 and unicode, the three examples I've picked here, all have different encodings for the same characters (which as I said are regular characters in my language, even though they might not be in US English).
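
To put actual byte values to that, a small sketch (the arrays just spell out the encodings mentioned above):

// The string "ÉËÈ" as raw bytes in three encodings:
unsigned char cp850[] = { 0x90, 0xD3, 0xD4, 0x00 };                   // codepage 850: 144, 211, 212
unsigned char cp437[] = { 0x90, 0x00 };                               // codepage 437: only É exists (144)
unsigned char utf8[]  = { 0xC3, 0x89, 0xC3, 0x8B, 0xC3, 0x88, 0x00 }; // UTF-8: U+00C9, U+00CB, U+00C8, two bytes each
// Feeding the cp850 bytes to a UTF-8 decoder breaks: 0x90 is a lone continuation byte,
// and 0xD3/0xD4 look like lead bytes of two-byte sequences whose continuation never arrives.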

SirNickity wrote:

If all I'm doing is printing text that I got from input, or files, or API calls, and I don't specifically need to deal with "how many bytes vs. how many characters", then there isn't really anything for me to worry about.

There is, see above. You were probably just blissfully ignorant of it.

SirNickity wrote:

That isn't quite true of Win32, right? IIUC, you kind of have to pick a side since all the API calls require ASCII or 16-bit chars.

As I said, that is basically just 'syntactic sugar' handled by how you configure your project.
The Win32 API simply has both versions for all relevant calls, with either an A or W suffix.
You can call these directly. So with my previous example:

MessageBox(hwnd, text, caption, MB_OK);  // Calls the default for your project settings, either ASCII or Unicode
MessageBoxA(hwnd, text, caption, MB_OK); // Calls the ASCII version of the API, text and caption need to be ASCII strings
MessageBoxW(hwnd, text, caption, MB_OK); // Calls the unicode version of the API, text and caption need to be UTF-16 strings

In practice you'd never use the A or W directly. You just pick one and consistently use the same type of strings throughout your application, and convert any strings from external sources to your internal representation immediately, to avoid errors.
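
For completeness, the usual boundary conversion on Win32 is MultiByteToWideChar / WideCharToMultiByte (a minimal sketch: fixed buffer, error handling omitted):

#include <windows.h>

// Convert external UTF-8 input to UTF-16 once, at the boundary, then use the W APIs internally.
void show_utf8_message(HWND hwnd, const char* utf8)
{
    wchar_t wide[256];
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 256);
    MessageBoxW(hwnd, wide, L"Example", MB_OK);
}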

SirNickity wrote:

It takes more space than UTF-8 always, unless you're creating content comprised SOLELY of multi-byte characters (which AFAIK is pretty uncommon)

That probably was not seen as an issue. The Win32 API mainly handles strings that you want to display on the GUI. How much text can you really put on your screen?
As said, larger strings can be stored in UTF-8 and converted on-the-fly.

Where the gains originally were with UCS-2 is that ALL characters would fit into 16 bits. This meant that no string APIs had to worry about multi-word encodings. With UTF-8 you'd always have to use a more complex strlen() function, and likewise with any character-based manipulation you'd always have to take care of multi-byte encodings.

This was eventually negated by the fact that unicode was expanded to more than 65535 codepoints, so even UCS-2 didn't cut it anymore, and it was extended to UTF-16.
However, you still have the advantage that you don't need to mess with codepages for most encodings, which did fit in UCS-2.
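
To illustrate the 'more complex strlen()' point above: counting UTF-8 code points means skipping continuation bytes, something a plain byte count never has to do (a rough sketch, no validation):

#include <stddef.h>

// Count UTF-8 code points by skipping continuation bytes (those of the form 10xxxxxx).
size_t utf8_codepoints(const char* s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
// utf8_codepoints("\xC3\xA9\xC3\xA9n") == 3 ("één"), while strlen() on the same bytes returns 5.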

SirNickity wrote:

... that. UTF-32.

Yes, the UTF-32 of today is the UCS-2 of the 90s. All codepoints fit into 32-bit... for now.

SirNickity wrote:

Making assumptions that weren't true. In that case, I kind of see your point... the hassle of UTF-16 means you're forced to consider Unicode, where UTF-8 allows the false comfort of thinking you're doing the right thing when you may not be. Hm. Sudden realization that I don't know as much as I think I do. That's always fun. ;-D

Exactly, UTF-8 is more error-prone.
I wrote a blog about some of the 'classic' problems that some developers never seem to get their head around. Unicode was one of them:
https://scalibq.wordpress.com/2017/01/24/prog … -from-the-boys/

I've seen so much code that *tries* to do some conversions from ASCII or UTF-8 to unicode and back, and so many times they get it completely wrong. But with most simple test-cases, it all looks sorta right. Regular ASCII text will be fine, and strlen() will only be off by a few bytes at most in practice.

When I have to fix such bugs and talk to the developer in question, they often seem clueless about character encoding.
I recall some guy trying to debug some string data containing an XML coming in from some message bus system. He had some garbled characters, and couldn't figure it out.
So I just showed him how to look into the memory with the debugger, then I pointed out: "Look, you're expecting a special character here, but apparently there's only one byte encoding it, so apparently you're not getting UTF-8. You need to convert it to UTF-8 first, before feeding it to the XML parser".
Dude was totally amazed that you could just use the debugger to look in memory in the first place, and that characters just appear as numbers in memory. Let alone that someone would actually know off the top of their head which numbers to expect for which characters, and thereby derive the character encoding on the fly.
For many programmers, data, characters, strings, objects and such are just abstract concepts. They have little idea of how that is stored in memory.

Reminds me of another time when I showed another programmer the basics of a Von Neumann architecture: code is data. You can just look at your code in memory. What's more, you can actually read and write that code from other code. Total amazement ensued. From both sides. Because I was amazed that people can actually write working code, and even get a paid programming job, when they don't actually know how strings, code or whatever is stored in memory, and what the basics of a computer really are. I thought Von Neumann was a concept that everyone would have heard of in the first year of whatever software/computer-related education.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 34 of 43, by jmarsh

Rank Oldbie
Scali wrote:

Also, if I were to just interpret the string as UTF-8, it would also break, because these characters would be interpreted as multi-byte sequences, since their value is > 127 (eg a strlen() for UTF-8 would give different results than a strlen() for codepage 850).

This isn't true because strlen() returns the length of a string in bytes, not characters. It's just going to return the index of the first byte equal to 0 regardless of the set codepage.

Reference: https://www.gnu.org/software/libc/manual/html … ing-Length.html

Reply 35 of 43, by Errius

Rank l33t

The code point (i.e. character) length of a string is distinct from the code unit length, and only in ASCII (0-127) are they always the same.

(In ASCII and UTF-8, code units are 1 byte. In UCS-2 and UTF-16 they are 2 bytes. In UTF-32 they are 4 bytes.)

The various strlen functions return the number of code points in the string, but they don't tell you the number of code units, and therefore you need some other way of determining the byte-length of the string.

This problem isn't even just confined to Unicode. In some of the non-Unicode East Asian code pages, you have certain characters that are encoded with 2 bytes, so that strlen [or rather mbslen] will again not reliably give you the string's byte length.

Is this too much voodoo?

Reply 36 of 43, by Scali

Rank l33t
jmarsh wrote:

This isn't true because strlen() returns the length of a string in bytes, not characters. It's just going to return the index of the first byte equal to 0 regardless of the set codepage.

Reference: https://www.gnu.org/software/libc/manual/html … ing-Length.html

I think you're missing the point here.
Obviously I didn't mean literally the name 'strlen'... But the functional equivalent depending on whatever character encoding you use (as in the only semantically meaningful type of strlen, which in the case of TCHAR will always be _tcslen()):
https://docs.microsoft.com/en-us/cpp/c-runtim … -l?view=vs-2019

"Each of these functions returns the number of characters in str, excluding the terminal null"

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 37 of 43, by jmarsh

Rank Oldbie
Scali wrote:

I think you're missing the point here.

I'm not.
Getting the number of characters is not what you want if you need to allocate storage for a string. Using _tcslen/mbslen on a UTF-8 string for that purpose would be a mistake, so it is obviously not "the only semantically meaningful type of strlen".

The whole reason UTF-8 exists is that in C it can mostly be treated the same as regular NUL-terminated strings - existing functions that only distinguish between NUL and non-NUL characters, like strlen(), strcpy(), strcat() etc., all work correctly.

Reply 38 of 43, by Scali

Rank l33t
jmarsh wrote:

I'm not.
Getting the number of characters is not what you want if you need to allocate storage for a string. Using _tcslen/mbslen on a UTF-8 string for that purpose would be a mistake, so it is obviously not "the only semantically meaningful type of strlen".

Semantically, no. You're talking about something that is semantically different from what we were talking about. So you were missing the point.
What you're describing would semantically be something like 'buflen'. You want the length of the buffer, in bytes.
strlen should give you the length of the string, where a string is a collection of characters, hence the character count.

wcslen() more or less proves my point there. It returns the length in characters, not in bytes (and you can't use strlen() on a wide character string, it will stop at the first zero-byte, so at the first code point <= 255).
So that's the semantics. If they wanted you to use these functions for allocation, they wouldn't use character count.
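
A concrete illustration of that (assuming a little-endian Windows build, where wchar_t is 16 bits):

wchar_t w[] = L"ABC";   // stored in memory as 41 00 42 00 43 00 00 00
// wcslen(w) == 3 characters
// strlen((const char*)w) == 1 -- the 00 high byte of 'A' already terminates it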

jmarsh wrote:

The whole main reason UTF-8 exists is because in C it can mostly be treated the same as regular NUL terminated strings - existing functions that only distinguish between NUL and non-NUL characters like strlen(), strcpy(), strcat() etc. all work correctly.

Well, on strlen() we obviously disagree. strcpy() and strcat() happen to work. Something like strtok(), strchr() will break.
Which brings us back to the conclusion that UTF-8 may falsely give you the idea that your code will work.

It's not only MSDN that defines strlen() as the length of a string in characters.
cplusplus.com does the same: http://www.cplusplus.com/reference/cstring/strlen/

The length of a C string is determined by the terminating null-character: A C string is as long as the number of characters between the beginning of the string and the terminating null character (without including the terminating null character itself).

This should not be confused with the size of the array that holds the string.

Last edited by Scali on 2019-09-12, 23:04. Edited 1 time in total.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 39 of 43, by Errius

Rank l33t

The multibyte functions like mbslen are for extended-ASCII codepages that use 2 bytes for some characters (like Japanese Shift-JIS). When using regular single-byte code pages (like 437) they behave identically to the equivalent 1970s ASCII functions. This has nothing to do with Unicode.

Last edited by Errius on 2019-09-12, 23:10. Edited 2 times in total.

Is this too much voodoo?