How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding? I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here. Also, why should encoding be taken into consideration? Can't I simply get the bytes the string has been stored in? Why is there a dependency on character encodings?


Every string is stored as an array of bytes, right? Why can't I simply have those bytes?

2019年07月21日59分53秒

The encoding is what maps the characters to the bytes. For example, in ASCII, the letter 'A' maps to the number 65. In a different encoding, it might not be the same. The high-level approach to strings taken in the .NET framework makes this largely irrelevant, though (except in this case).

2019年07月20日59分53秒
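To make the mapping concrete, here is a minimal sketch that prints the bytes produced for the same one-character string under several encodings. The class name `EncodingDemo` and the helper `ToHex` are made up for illustration; only the `System.Text.Encoding` calls come from .NET.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: encodes `text` with `encoding` and formats the bytes as hex.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(ToHex("A", Encoding.ASCII));   // 41
        Console.WriteLine(ToHex("A", Encoding.UTF8));    // 41
        Console.WriteLine(ToHex("A", Encoding.Unicode)); // 41-00 (UTF-16, little-endian)
        Console.WriteLine(ToHex("A", Encoding.UTF32));   // 41-00-00-00
    }
}
```

The character is the same in every call; only the chosen encoding decides how many bytes come out and what they are.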

To play devil's advocate: If you wanted to get the bytes of an in-memory string (as .NET uses them) and manipulate them somehow (e.g. CRC32), and NEVER EVER wanted to decode it back into the original string... it isn't straightforward why you'd care about encodings or how you choose which one to use.

2019年07月20日59分53秒

Surprised no-one has given this link yet: joelonsoftware.com/articles/Unicode.html

2019年07月20日59分53秒

A char is not a byte and a byte is not a char. A char is both a key into a font table and a lexical tradition. A string is a sequence of chars. (Words, paragraphs, sentences, and titles also have their own lexical traditions that justify their own type definitions, but I digress.) Like integers, floating point numbers, and everything else, chars are encoded into bytes. There was a time when the encoding was a simple one-to-one mapping: ASCII. However, to accommodate all of human symbology, the 256 values of a byte were insufficient, and encodings were devised to selectively use more bytes.

2019年07月20日59分53秒

What's ugly about this one is that GetString and GetBytes need to be executed on systems with the same endianness to work. So you can't use this to get bytes you want to turn into a string elsewhere. So I have a hard time coming up with a situation where I'd want to use this.

2019年07月21日59分53秒

CodeInChaos: Like I said, the whole point of this is if you want to use it on the same kind of system, with the same set of functions. If not, then you shouldn't use it.

2019年07月20日59分53秒

-1 I guarantee that someone (who doesn't understand bytes vs characters) is going to want to convert their string into a byte array, they will google it and read this answer, and they will do the wrong thing, because in almost all cases, the encoding IS relevant.

2019年07月20日59分53秒

artbristol: If they can't be bothered to read the answer (or the other answers...), then I'm sorry, then there's no better way for me to communicate with them. I generally opt for answering the OP rather than trying to guess what others might do with my answer -- the OP has the right to know, and just because someone might abuse a knife doesn't mean we need to hide all knives in the world for ourselves. Though if you disagree that's fine too.

2019年07月20日59分53秒

This answer is wrong on so many levels, but foremost because of its declaration "you DON'T need to worry about encoding!". The two methods, GetBytes and GetString, are superfluous inasmuch as they are merely re-implementations of what Encoding.Unicode.GetBytes() and Encoding.Unicode.GetString() already do. The statement "As long as your program (or other programs) don't try to interpret the bytes" is also fundamentally flawed, as implicitly they mean the bytes should be interpreted as Unicode.

2019年07月21日59分53秒
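The claim that a raw memory copy just reproduces `Encoding.Unicode.GetBytes()` can be checked with a short sketch. It assumes the "no encoding" approach copies the string's chars with `Buffer.BlockCopy` (mentioned elsewhere in this thread); `RawBytesDemo` and `RawBytes` are illustrative names, and the sample string is arbitrary.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RawBytesDemo
{
    // Copies the UTF-16 code units of `s` into a byte array without naming an encoding.
    public static byte[] RawBytes(string s)
    {
        var bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    public static void Main()
    {
        string s = "h\u00E9llo";
        bool same = RawBytes(s).SequenceEqual(Encoding.Unicode.GetBytes(s));

        // Encoding.Unicode is UTF-16 little-endian, so the two results match
        // on little-endian machines for strings without unpaired surrogates.
        Console.WriteLine("Little-endian: " + BitConverter.IsLittleEndian);
        Console.WriteLine("Same bytes:    " + same);
    }
}
```

On a big-endian machine the raw copy would differ from `Encoding.Unicode`, which is the endianness caveat raised earlier in the comments.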

But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory?

2019年07月20日59分53秒

.NET strings are always encoded as Unicode (UTF-16). So use System.Text.Encoding.Unicode.GetBytes() to get the set of bytes that .NET would use to represent the characters. However, why would you want that? I recommend UTF-8, especially when most characters are in the Western Latin set.

2019年07月20日59分53秒

Also: the exact bytes used internally in the string don't matter if the system that retrieves them doesn't handle that encoding or handles it as the wrong encoding. If it's all within .NET, why convert to an array of bytes at all? Otherwise, it's better to be explicit with your encoding.

2019年07月20日59分53秒

Joel, be careful with System.Text.Encoding.Default, as it could be different on each machine it is run on. That's why it's recommended to always specify an encoding, such as UTF-8.

2019年07月20日59分53秒

You don't need the encodings unless you (or someone else) actually intend(s) to interpret the data, instead of treating it as a generic "block of bytes". For things like compression, encryption, etc., worrying about the encoding is meaningless. See my answer for a way to do this without worrying about the encoding. (I might have given a -1 for saying you need to worry about encodings when you don't, but I'm not feeling particularly mean today. :P)

2019年07月20日59分53秒

The accepted answer is not only very complicated but also a recipe for disaster.

2019年07月20日59分53秒

In case the accepted answer gets changed, for record purposes, it is Mehrdad's answer at this current time and date. Hopefully the OP will revisit this and accept a better solution.

2019年07月20日59分53秒

Good in principle, but the encoding should be System.Text.Encoding.Unicode to be equivalent to Mehrdad's answer.

2019年07月20日59分53秒

AMissico, your suggestion is buggy unless you are sure your string is compatible with your system default encoding (a string containing only ASCII chars in your system default legacy charset). But the OP states that nowhere.

2019年07月21日59分53秒

AMissico It can cause the program to give different results on different systems though. That's never a good thing. Even if it's for making a hash or something (I assume that's what OP means with 'encrypt'), the same string should still always give the same hash.

2019年07月21日59分53秒

You could use the same BinaryFormatter instance for all of those operations

2019年07月20日59分53秒

Very interesting. Apparently it will drop any high surrogate Unicode character. See the documentation on BinaryFormatter.

2019年07月21日59分53秒

ErikA.Brandstadmoen See my tests here: stackoverflow.com/a/10384024

2019年07月20日59分53秒

"1 character could be represented by 1 or more bytes" I agree. I just want those bytes regardless of what encoding the string is in. The only way a string can be stored in memory is in bytes. Even characters are stored as 1 or more bytes. I merely want to get my hands on them bytes.

2019年07月20日59分53秒

You don't need the encodings unless you (or someone else) actually intend(s) to interpret the data, instead of treating it as a generic "block of bytes". For things like compression, encryption, etc., worrying about the encoding is meaningless. See my answer for a way to do this without worrying about the encoding.

2019年07月20日59分53秒

Mehrdad: Totally, but the original question, as stated when I initially answered, didn't specify what the OP was going to do with those bytes after converting them, and for future searchers the information around that is pertinent. This is covered by Joel's answer quite nicely, and as you state within your answer: provided you stick within the .NET world, and use your methods to convert to/from, you're happy. As soon as you step outside of that, encoding will matter.

2019年07月21日59分53秒

One code point can be represented by up to 4 bytes. (One UTF-32 code unit, a UTF-16 surrogate pair, or 4 bytes of UTF-8.) The values that UTF-8 would need more than 4 bytes for are outside the 0x0..0x10FFFF range of Unicode. ;-)

2019年07月20日59分53秒

In general, it is not correct to set byteCount to twice the string length. For Unicode code points outside the Basic Multilingual Plane, there will be two 16-bit code units for each character.

2019年07月20日59分53秒

Jan: That's correct, but the string length already gives the number of code units (not code points).

2019年07月20日59分53秒

Thanks for pointing that out! From MSDN: "The Length property [of String] returns the number of Char objects in this instance, not the number of Unicode characters." Your example code is therefore correct as written.

2019年07月20日59分53秒
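The code unit versus code point distinction in the exchange above can be seen with a single character outside the Basic Multilingual Plane. This is a minimal sketch; `CodeUnitDemo` and `CodePointCount` are illustrative names, and U+1F600 is just an arbitrary sample code point.

```csharp
using System;
using System.Text;

public static class CodeUnitDemo
{
    // Counts code points by treating each valid surrogate pair as one unit.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            count++;
            if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
        }
        return count;
    }

    public static void Main()
    {
        string s = "\U0001F600"; // one code point outside the BMP

        Console.WriteLine(s.Length);                         // 2 (UTF-16 code units)
        Console.WriteLine(CodePointCount(s));                // 1 (code points)
        Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 4 (2 bytes per code unit)
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 4
        Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 4
    }
}
```

Because `Length` already counts code units, `Length * sizeof(char)` is the right size for a UTF-16 byte buffer even when surrogate pairs are present.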

TomBlodget: Interestingly, if one takes instances of Globalization.SortKey, extracts the KeyData, and packs the resulting bytes from each into a String [two bytes per character, MSB first], calling String.CompareOrdinal upon the resulting strings will be substantially faster than calling SortKey.Compare on the instances of SortKey, or even calling memcmp on those instances. Given that, I wonder why KeyData returns a Byte[] rather than a String?

2019年07月20日59分53秒

TomBlodget: You don't need fixed or unsafe code, you can also do var gch = GCHandle.Alloc("foo", GCHandleType.Pinned); var arr = new byte[sizeof(char) * ((string)gch.Target).Length]; Marshal.Copy(gch.AddrOfPinnedObject(), arr, 0, arr.Length); gch.Free();

2019年07月20日59分53秒

Don't surrogates have to appear in pairs to form valid code points? If that's the case, I can understand why the data would be mangled.

2019年07月20日59分53秒

dtanders: Yes, those are my thoughts too. They have to appear in pairs; unpaired surrogate characters only happen if you deliberately put them in a string and leave them unpaired. What I don't know is why other devs keep harping on that we should use an encoding-aware approach instead, because they deem that the serialization approach (my answer, which was the accepted answer for more than 3 years) doesn't keep the unpaired surrogate character intact. But they forgot to check that their encoding-aware solutions don't keep the unpaired surrogate character either. The irony ツ

2019年07月20日59分53秒

If there's a serialization library that uses System.Buffer.BlockCopy internally, all the encoding-advocacy folks' arguments will be moot.

2019年07月21日59分53秒

MichaelBuen: It seems to me that the main issue is that you are saying in big bold letters that something doesn't matter, rather than saying that it does not matter in their case. As a result, you are encouraging people who look at your answer to make basic programming mistakes which will cause others frustration in the future. Unpaired surrogates are invalid in a string. It is not a char array, so it makes sense that converting a string to another format would result in an error (U+FFFD) on that character. If you want to do manual string manipulation, use a char[] as recommended.

2019年07月20日59分53秒

dtanders: A System.String is an immutable sequence of Char; .NET has always allowed a String object to be constructed from any Char[] and export its content to a Char[] containing the same values, even if the original Char[] contains unpaired surrogates.

2019年07月20日59分53秒
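The disagreement about unpaired surrogates can be reproduced with a short sketch: a `char[]` copy keeps the lone surrogate, while a round trip through `Encoding.Unicode` (with its default replacement fallback) substitutes U+FFFD. `LoneSurrogateDemo` and `RoundTripUtf16` are illustrative names.

```csharp
using System;
using System.Text;

public static class LoneSurrogateDemo
{
    // Encodes to UTF-16 bytes and decodes back, using the default replacement fallback.
    public static string RoundTripUtf16(string s)
    {
        return Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(s));
    }

    public static void Main()
    {
        // A string with a high surrogate that has no matching low surrogate.
        string lone = new string(new[] { 'a', '\uD800', 'b' });

        char[] viaChars = lone.ToCharArray();
        Console.WriteLine(viaChars[1] == '\uD800');    // True: char[] keeps the raw value

        string viaEncoding = RoundTripUtf16(lone);
        Console.WriteLine(viaEncoding[1] == '\uFFFD'); // True: replaced by U+FFFD
    }
}
```

Whether that replacement is a bug or a safeguard depends on whether the string is meant to be text or an opaque container of 16-bit values, which is the point both sides are arguing.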

There are areas in .NET where you do have to get byte arrays for strings. Many of the .NET cryptography classes contain methods such as ComputeHash() that accept a byte array or a stream. You have no alternative but to convert a string to a byte array first (choosing an Encoding) and then optionally wrap it in a stream. However, as long as you choose an encoding (e.g. UTF-8) and stick with it, there are no problems with this.

2019年07月21日59分53秒
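As a rough illustration of the hashing case, the sketch below hashes the same text under two encodings. SHA-256 is chosen here only as an example algorithm; `HashDemo` and `HashHex` are illustrative names, not anything from the answers in this thread.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashDemo
{
    // Encodes `text` with the given encoding, hashes the bytes, and returns lowercase hex.
    public static string HashHex(string text, Encoding encoding)
    {
        byte[] data = encoding.GetBytes(text);
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(data);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    public static void Main()
    {
        // Same text, different encodings, different input bytes, different hashes.
        Console.WriteLine(HashHex("abc", Encoding.UTF8));
        Console.WriteLine(HashHex("abc", Encoding.Unicode));
    }
}
```

The hash is only reproducible across programs and machines if everyone agrees on the encoding used to produce the input bytes, which is why picking one and sticking with it matters.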

Then try this System.Text.Encoding.UTF8.GetBytes("Árvíztűrő tükörfúrógép");, and cry! It will work, but System.Text.Encoding.UTF8.GetBytes("Árvíztűrő tükörfúrógép").Length != System.Text.Encoding.UTF8.GetBytes("Arvizturo tukorfurogep").Length while "Árvíztűrő tükörfúrógép".Length == "Arvizturo tukorfurogep".Length

2019年07月20日59分53秒

mg30rg: Why do you think your example is strange? Surely in a variable-width encoding not all characters have the same byte length. What's wrong with it?

2019年07月20日59分53秒
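The variable-width behaviour being discussed can be sketched as follows. `Utf8WidthDemo` and `ExtraUtf8Bytes` are illustrative names, and the sketch assumes the source file is saved as UTF-8 with precomposed accented letters.

```csharp
using System;
using System.Text;

public static class Utf8WidthDemo
{
    // Number of UTF-8 bytes beyond one byte per UTF-16 code unit.
    public static int ExtraUtf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s) - s.Length;
    }

    public static void Main()
    {
        string accented = "Árvíztűrő tükörfúrógép";
        string plain = "Arvizturo tukorfurogep";

        Console.WriteLine(accented.Length == plain.Length); // same number of chars
        Console.WriteLine(ExtraUtf8Bytes(plain));           // 0: ASCII is one byte per char
        Console.WriteLine(ExtraUtf8Bytes(accented));        // > 0: accented letters take 2 bytes each
    }
}
```

This is expected behaviour for UTF-8 rather than a defect: the byte count depends on which characters appear, not only on how many there are.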

Instead of using your custom method to convert a byte array to base64, all you had to do was use the built-in converter: Convert.ToBase64String(arr);

2019年07月21日59分53秒

Makotosan: Thank you, but I did use Convert.ToBase64String(arr); for the base64 conversions byte[] (data) <-> string (serialized data to store in an XML file). But to get the initial byte[] (data) I needed to do something with a String that contained binary data (it's the way MSSQL returned it to me). So the functions above are for String (binary data) <-> byte[] (easily accessible binary data).

2019年07月20日59分53秒

Allow me to clarify: An encoding has been used to translate "hello world" to physical bytes. Since the string is stored on my computer, I am sure that it must be stored in bytes. I merely want to access those bytes to save them on disk or for any other reason. I do not want to interpret these bytes. Since I do not want to interpret these bytes, the need for an encoding at this point is as misplaced as requiring a phone line to call printf.

2019年07月20日59分53秒

But again, there is no concept of text-to-physical-bytes translation unless you use an encoding. Sure, the compiler stores the strings somehow in memory, but it is just using an internal encoding, which you (or anyone except the compiler developer) do not know. So, whatever you do, you need an encoding to get physical bytes from a string.

2019年07月20日59分53秒

Agnel Kurian: It is of course true that a string has a bunch of bytes somewhere that store its content (UTF-16, AFAIR). But there is a good reason to prevent you from accessing them: strings are immutable, and if you could obtain the internal byte[] array, you could modify it, too. This breaks immutability, which is vital because multiple strings may share the same data. Using a UTF-16 encoding to get the bytes will probably just copy the data out.

2019年07月20日59分53秒

Gnafoo, A copy of the bytes will do.

2019年07月20日59分53秒

VUP this one solved my problem (byte[] ff = ASCIIEncoding.ASCII.GetBytes(barcodetxt.Text);)

2019年07月21日59分53秒

But, why should encoding be taken into consideration? Why can't I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn't the String object itself know what encoding is being used and simply dump what is in memory?

2019年07月20日59分53秒

This doesn't always work. Some special characters can get lost when using such a method, as I've found out the hard way.

2019年07月21日59分53秒

If the charset was UTF, it wouldn't work!

2019年07月21日59分53秒

UTF-8 is compact only if the majority of your characters are in the English (ASCII) character set. If you had a long string of Chinese characters, UTF-16 would be a more compact encoding than UTF-8 for that string. This is because UTF-8 uses one byte to encode ASCII, and 3 (or maybe 4) otherwise.

2019年07月20日59分53秒
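The size comparison in the comment above is easy to check. A minimal sketch, with `CompactnessDemo` and `Report` as illustrative names and two arbitrary sample strings:

```csharp
using System;
using System.Text;

public static class CompactnessDemo
{
    // Formats the UTF-8 and UTF-16 byte counts of `s` for comparison.
    public static string Report(string s)
    {
        return "UTF-8: " + Encoding.UTF8.GetByteCount(s)
             + ", UTF-16: " + Encoding.Unicode.GetByteCount(s);
    }

    public static void Main()
    {
        Console.WriteLine(Report("hello"));                    // UTF-8: 5, UTF-16: 10
        Console.WriteLine(Report("\u4F60\u597D\u4E16\u754C")); // UTF-8: 12, UTF-16: 8
    }
}
```

The second string is four common Chinese characters written as escapes, so the result does not depend on how the source file is saved.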

True. But how can you not know about encoding if you're familiar with handling Chinese text?

2019年07月20日59分53秒

ASCIIEncoding..... is not needed. Simply using Encoding.UTF8.GetBytes(text) is preferred.

2019年07月20日59分53秒

OP specifically asks to NOT specify an encoding... "without manually specifying a specific encoding"

2019年07月21日59分53秒

These characters are not supported by UTF-8 or UTF-16 or even UTF-32, for example: 񩱠 & (Char) 55906 & (Char) 55655. So you may be wrong, and Mehrdad's answer is a safe conversion that works without considering what type of encoding is used.

2019年07月21日59分53秒

Raymon, the characters are already represented by some Unicode value, and all Unicode values can be represented by all the UTFs. Is there a longer explanation of what you are talking about? What character encoding do those two values (or 3..) exist in?

2019年07月20日59分53秒

They are invalid characters which are not supported by any encoding range. This does not mean they are 100% useless. Code that converts any type of string to its byte array equivalent regardless of the encoding is not a wrong solution at all, and has its own uses on the right occasions.

2019年07月21日59分53秒

Ok, then I think you are not understanding the problem. We know it is a Unicode-compliant array; in fact, because it is .NET, we know it is UTF-16. So those characters will not exist there. You also didn't fully read my comment about internal representations changing. A String is an object, not an encoded byte array. So I'm going to disagree with your last statement. You want code to convert all Unicode strings to any UTF encoding. This does what you want, correctly.

2019年07月20日59分53秒

Objects are sequences of data, ultimately sequences of bits that describe an object in its current state. So all data in a programming language is convertible to an array of bytes (each byte is 8 bits), as you may need to keep the state of any object in memory. You can save and hold a sequence of bytes in a file or in memory and cast it to an integer, bigint, image, ASCII string, UTF-8 string, encrypted string, or your own defined datatype after reading it from disk. So you cannot say objects are something different from byte sequences.

2019年07月20日59分53秒

Is the value of RuntimeHelpers.OffsetToStringData a multiple of 8 on the Itanium versions of .NET? Because otherwise this will fail due to the unaligned reads.

2019年07月21日59分53秒

Wouldn't it be simpler to invoke memcpy? stackoverflow.com/a/27124232/659190

2019年07月20日59分53秒

...and lose all characters with a code point higher than 127. In my native language it is perfectly valid to write "Árvíztűrő tükörfúrógép.". System.Text.ASCIIEncoding.Default.GetBytes("Árvíztűrő tükörfúrógép.").ToString(); will return "Árvizturo tukörfurogép.", losing information which cannot be retrieved. (And I haven't yet mentioned Asian languages, where you would lose all characters.)

2019年07月20日59分53秒
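To see the information loss with plain ASCII, here is a minimal sketch using `Encoding.ASCII`, whose default fallback substitutes `?` for anything it cannot represent. Note this uses `Encoding.ASCII` rather than the `ASCIIEncoding.Default` call quoted above, whose result depends on the machine; `AsciiLossDemo` and `RoundTripAscii` are illustrative names.

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Encodes to ASCII bytes and decodes back; unrepresentable chars come back as '?'.
    public static string RoundTripAscii(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripAscii("Arviz"));            // Arviz
        Console.WriteLine(RoundTripAscii("\u00C1rv\u00EDz"));  // ?rv?z
    }
}
```

Once a character has been replaced by `?`, the original cannot be recovered from the bytes, which is the loss the comment describes.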

What about multibyte characters?

2019年07月21日59分53秒

c.ToByte() is private :S

2019年07月20日59分53秒

AgnelKurian: MSDN says "This method returns an unsigned byte value that represents the numeric code of the Char object passed to it. In the .NET Framework, a Char object is a 16-bit value. This means that the method is suitable for returning the numeric codes of characters in the ASCII character range or in the Unicode C0 Controls and Basic Latin, and C1 Controls and Latin-1 Supplement ranges, from U+0000 to U+00FF."

2019年07月20日59分53秒

It's hardly more faster, let alone most fastest. It's certainly an interesting alternative, but it's essentially the same as Encoding.Default.GetBytes(s) which, by the way, is way faster. Quick testing suggests that Encoding.Default.GetBytes(s) performs at least 79% faster. YMMV.

2019年07月20日59分53秒

Try it with a €. This code will not crash, but will return a wrong result (which is even worse). Try casting to a short instead of byte to see the difference.

2019年07月20日59分53秒
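The euro sign case can be sketched like this. It assumes the approach being criticised casts each `char` straight to `byte`; `CastDemo` and `LowByte` are illustrative names.

```csharp
using System;

public static class CastDemo
{
    // Keeps only the low 8 bits of the 16-bit char value.
    public static byte LowByte(char c)
    {
        return unchecked((byte)c);
    }

    public static void Main()
    {
        char euro = '\u20AC'; // the euro sign

        Console.WriteLine(LowByte(euro)); // 172 (0xAC): the high byte 0x20 is silently dropped
        Console.WriteLine((short)euro);   // 8364 (0x20AC): the full code unit
        Console.WriteLine(LowByte('A'));  // 65: ASCII characters survive the cast
    }
}
```

No exception is thrown in the truncating case, so the corruption goes unnoticed until the bytes are turned back into text.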