Principles of the GBK, UTF-8 and UNICODE character sets

First, from a macro perspective

At first, operating systems only supported a-z, A-Z, 0-9 and a few simple symbols, because they were English-only systems.

Then computers spread to other countries, and the languages of all those countries had to be supported. Each country compiled its own character set; for Chinese characters these are GB2312 and GBK, and other countries have their own encodings.

Then, to unify the encodings of different countries, UNICODE appeared; other encodings can be supported through mapping tables. For example, the UNICODE code of the character 汉 is 27721 in decimal, 6C49 in hexadecimal, while its GBK code is BA BA, so converting 6C49 to BA BA requires looking it up in a table. A mapping table is also required for conversion to any other character set. These tables are presumably produced by the maintainers of each character set from the tables published by the Unicode Consortium (I guess).
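As a quick illustration (a minimal sketch, not from the original article; the class name HanDemo is made up, and it assumes the JDK ships the GBK charset), the JDK's Charset machinery performs exactly this kind of table lookup:

import java.nio.charset.Charset;

public class HanDemo {
    public static void main(String[] args) {
        String han = "汉"; // Unicode code point U+6C49 (27721 decimal)
        System.out.printf("code point: %X%n", han.codePointAt(0)); // 6C49

        // Encode the same character with GBK: the bytes come from the GBK mapping table.
        byte[] gbk = han.getBytes(Charset.forName("GBK"));
        for (byte b : gbk) {
            System.out.printf("%02X ", b & 0xFF); // BA BA
        }
        System.out.println();
    }
}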

Because UNICODE takes a lot of storage space, UTF-8 with variable-length storage appeared later. Unlike other character sets, converting between UNICODE and UTF-8 needs no mapping table; it can be done by an algorithm. Search Baidu for the details, there are many articles; here is a brief introduction:

UTF-8 stores characters with a variable length; the corresponding UNICODE code decides how many bytes are used:

1 byte  0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are the bits that can be filled in: 1 byte holds 7 bits, 2 bytes hold 11 bits, 3 bytes hold 16 bits.

For example, if the system reads a byte whose high bit is 0, it knows the character occupies one byte and converts it to UNICODE directly.

If the high bits read are 110, it reads two bytes and converts them to UNICODE, and so on.
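A minimal sketch of that leading-byte check (the class and method names are illustrative, not a real library API):

public class Utf8LengthSketch {
    // Given the first byte of a UTF-8 sequence, count how many bytes the whole
    // sequence occupies by examining its leading 1 bits, as described above.
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((b & 0xFC) == 0xF8) return 5; // 111110xx (historic form, unused today)
        if ((b & 0xFE) == 0xFC) return 6; // 1111110x (historic form, unused today)
        throw new IllegalArgumentException("10xxxxxx is a continuation byte, not a leading byte");
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0xE6)); // 3 - the leading byte of 汉 in UTF-8
    }
}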

The UNICODE code of the character 汉 is 6C49, binary 01101100 01001001. It needs 16 bits, so according to the table above it must be represented with 3 bytes.

Therefore, UNICODE is needed as an intermediate step when converting between GBK and UTF-8, and the mapping tables exist at the bottom of the operating system.

Fill the 16 bits 0110 110001 001001 from the left into the boxes below:

1110▢▢▢▢ 10▢▢▢▢▢▢ 10▢▢▢▢▢▢

11100110 10110001 10001001

Convert to hex: E6 B1 89

Java test code:

import java.io.UnsupportedEncodingException;

public class Test1 {

	public static String bytesToHexString(byte[] bArr) {
		StringBuffer sb = new StringBuffer(bArr.length);
		String sTmp;

		for (int i = 0; i < bArr.length; i++) {
			sTmp = Integer.toHexString(0xFF & bArr[i]);
			if (sTmp.length() < 2)
				sb.append(0);
			sb.append(sTmp.toUpperCase() + " ");
		}

		return sb.toString();
	}

	public static void main(String[] arg) {
		try {
			// 汉 is the character whose UNICODE code is 6C49
			System.out.println(Test1.bytesToHexString("汉".getBytes("UTF-8")));
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		}
	}
}

//Output: E6 B1 89 
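For comparison, here is a small sketch that packs 6C49 into the three bytes by hand, following the bit boxes above (the class name is made up, and it only handles this one 3-byte case):

public class ManualUtf8Sketch {
    public static void main(String[] args) {
        int cp = 0x6C49; // Unicode code point of 汉

        // Split the 16 bits into 4 + 6 + 6 and drop them into 1110xxxx 10xxxxxx 10xxxxxx.
        int b1 = 0xE0 | (cp >> 12);          // 1110 + top 4 bits
        int b2 = 0x80 | ((cp >> 6) & 0x3F);  // 10 + middle 6 bits
        int b3 = 0x80 | (cp & 0x3F);         // 10 + bottom 6 bits

        System.out.printf("%02X %02X %02X%n", b1, b2, b3); // E6 B1 89
    }
}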

 

The following shows some concrete applications.

First of all, note that a String has no particular character set; it is always UNICODE.
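A quick way to see this (an illustrative snippet, not part of the original tests): decode the same character from UTF-8 bytes and from GBK bytes, and both Strings hold the identical Unicode code point.

public class StringIsUnicode {
    public static void main(String[] args) throws Exception {
        // The same character 汉, decoded from two different byte representations.
        String fromUtf8 = new String(new byte[] {(byte) 0xE6, (byte) 0xB1, (byte) 0x89}, "UTF-8");
        String fromGbk  = new String(new byte[] {(byte) 0xBA, (byte) 0xBA}, "GBK");
        // Both Strings contain the same code point U+6C49.
        System.out.printf("%X %X%n", fromUtf8.codePointAt(0), fromGbk.codePointAt(0)); // 6C49 6C49
    }
}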

First, convert UTF-8 to GBK and then back:

import java.io.UnsupportedEncodingException;

public class Test1 {

	public static String bytesToHexString(byte[] bArr) {
		StringBuffer sb = new StringBuffer(bArr.length);
		String sTmp;

		for (int i = 0; i < bArr.length; i++) {
			sTmp = Integer.toHexString(0xFF & bArr[i]);
			if (sTmp.length() < 2)
				sb.append(0);
			sb.append(sTmp.toUpperCase() + " ");
		}

		return sb.toString();
	}

	public static void main(String[] arg) {
		try {
			String s = "测试";//UNICODE code
			byte[] b = s.getBytes("UTF-8");//Convert to UTF-8 encoding, length 6 bytes: E6 B5 8B E8 AF 95
			System.out.println(Test1.bytesToHexString(b));//Output: E6 B5 8B E8 AF 95
			String sU2F = new String(b, "GBK");//The bytes are really UTF-8, but we tell the system they are GBK. The system pairs them up (GBK stores Chinese in double bytes): (E6 B5) (8B E8) (AF 95)
			System.out.println(sU2F);//Displays mojibake: because of the wrong mapping the original two characters become three garbled characters

			String sF2U = new String(sU2F.getBytes("GBK"), "UTF-8");//Recreate the String: get the bytes back as GBK and tell the system the byte[] is UTF-8; the String now decodes correctly
			System.out.println(sF2U);//Output: 测试
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		}
	}
}
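As a side note, here is a sketch of the same round trip written with Charset objects instead of charset-name strings (the class name is made up, and Charset.forName("GBK") assumes the JDK includes the GBK charset). This style avoids the checked UnsupportedEncodingException:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripSketch {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");

        byte[] utf8Bytes = "测试".getBytes(StandardCharsets.UTF_8); // E6 B5 8B E8 AF 95
        String misread   = new String(utf8Bytes, gbk);              // wrong label: mojibake
        String recovered = new String(misread.getBytes(gbk), StandardCharsets.UTF_8);

        System.out.println(recovered); // 测试 - recovered because no byte was lost
    }
}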
Now the same experiment with three characters, i.e. nine UTF-8 bytes, which cannot be split evenly into GBK double bytes:

	public static void main(String[] arg) {
		try {
			String s = "测试啊";//UNICODE code
			byte[] b = s.getBytes("UTF-8");//Convert to UTF-8 encoding, length 9 bytes: E6 B5 8B E8 AF 95 E5 95 8A
			System.out.println(Test1.bytesToHexString(b));//Output: E6 B5 8B E8 AF 95 E5 95 8A
			String sU2F = new String(b, "GBK");//The bytes are really UTF-8, but we tell the system they are GBK. The system pairs them up (GBK stores Chinese in double bytes): (E6 B5) (8B E8) (AF 95) (E5 95) (8A XX)
			System.out.println(sU2F);//Displays mojibake: the original 3 characters become 5 display characters, and the last byte has no partner, so an unknown byte is introduced
			String sF2U = new String(sU2F.getBytes("GBK"), "UTF-8");//Recreate the String and tell the system the byte[] is UTF-8; most of it decodes correctly, but the unpaired byte at the end was lost
			System.out.println(sF2U);//Output: the first two characters come back, the last one stays garbled
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		}
	}

The following describes the other direction: GBK bytes treated as UTF-8.

	public static void main(String[] arg) {
		try {
			String s = "测试";//UNICODE code
			byte[] b = s.getBytes("GBK");//Convert to GBK encoding, length 4 bytes: B2 E2 CA D4
			System.out.println(Test1.bytesToHexString(b));//Output: B2 E2 CA D4
			String sU2F = new String(b, "UTF-8");//The bytes are really GBK, but we tell the system they are UTF-8
			System.out.println(Test1.bytesToHexString(sU2F.getBytes("UTF-8")));//EF BF BD EF BF BD EF BF BD EF BF BD
			//How did 4 bytes become 12 bytes?
			//UTF-8 has the coding rules described above. Every byte that violates the rules is replaced by the
			//replacement character, whose UTF-8 form is EF BF BD, so 4 invalid bytes become 4 x 3 = 12 bytes.

			System.out.println(sU2F);//Displays mojibake: EF BF BD is the UTF-8 replacement character

			//Because the original bytes have been replaced by EF BF BD, the information is completely lost, so converting back to GBK is also wrong.
			//In the UTF-8-to-GBK case above, every byte pair still mapped to something in the GBK table, so although the text was wrong, no data was lost.
			String sF2U = new String(sU2F.getBytes("UTF-8"), "GBK");
			System.out.println(sF2U);//Output mojibake: GBK bytes treated as UTF-8 almost never form valid sequences, so you mostly get the classic 锟斤拷 characters
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		}
	}
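If you would rather detect the problem than silently get EF BF BD, a CharsetDecoder can be asked to report malformed input instead of replacing it (a sketch under that assumption; the class name is made up):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeSketch {
    public static void main(String[] args) {
        byte[] gbkBytes = {(byte) 0xB2, (byte) 0xE2, (byte) 0xCA, (byte) 0xD4}; // 测试 in GBK

        try {
            // Ask the decoder to fail instead of quietly inserting the replacement character.
            String s = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(gbkBytes))
                    .toString();
            System.out.println(s);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e); // this branch is taken
        }
    }
}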

 

Conclusion

For any byte[], you must know what encoding it uses, otherwise mistakes are almost guaranteed, for example when reading a file or receiving data over the network (see the sketch below).

All Strings are UNICODE. There is no such thing as "this String is UTF-8" or "this String is GBK".

If you treat the byte[] of a GBK string as UTF-8, data is lost and there is no way to convert it back.

If the byte[] of a UTF-8 string is treated as GBK, it depends on luck: with an even number of Chinese characters it can be converted back; with an odd number, the last character leaves an unpaired byte, so even after converting back the end is garbled.
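For example (a minimal sketch; input.txt and the choice of UTF-8 are assumptions for illustration), reading a file with an explicitly named charset instead of the platform default:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // "input.txt" is a placeholder path; the point is to name the charset explicitly.
        byte[] raw = Files.readAllBytes(Paths.get("input.txt"));
        String text = new String(raw, StandardCharsets.UTF_8); // only correct if the file really is UTF-8
        System.out.println(text);
    }
}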

 

 

 

 
