Instant Voice Communication

Preface

We recently built an instant voice chat feature, and this post records what we learned. The client pushes and pulls streams over the RTMP protocol, and the server is Red5, an open-source project on GitHub. At present, on a stable network, the delay is 200-500 ms.

RTMP/RTSP Protocol Description

RTMP runs only over TCP, while RTSP supports both TCP and UDP. At the push end, whether RTMP or RTSP is used, TCP is needed to guarantee that the source data arrives intact. At the pull end, RTSP can run over UDP, and UDP is the recommended choice there.

Using RTMP at the Pull End

When network conditions change, the pull end may not consume audio fast enough, so data piles up in the buffer and the delay becomes severe. For instant messaging this must be solved, and even without a real-time requirement an overly long delay is a problem. We therefore measure the size of the cache, calculate how long it would take to play out, and disconnect and reconnect when that time exceeds what our use case tolerates. RTMP runs only over TCP, so as long as the connection stays up no data is dropped and the whole backlog has to be played out. Whether to use RTSP over UDP, or RTMP over TCP with automatic reconnection, depends on the situation; each has its advantages and disadvantages.
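
The cache check described above comes down to simple arithmetic on the audio format. The sketch below is only an illustration, not our production code: the class name `BufferDelay`, its method names and the 1000 ms threshold are hypothetical, and it assumes the backlog is uncompressed PCM whose byte count is known.

```java
public class BufferDelay {
    /** Milliseconds of audio contained in a backlog of uncompressed PCM bytes. */
    public static long bufferedMillis(long bufferedBytes, int sampleRate,
                                      int channels, int bytesPerSample) {
        long bytesPerSecond = (long) sampleRate * channels * bytesPerSample;
        return bufferedBytes * 1000 / bytesPerSecond;
    }

    /** True when the backlog would take longer to play out than we tolerate. */
    public static boolean shouldReconnect(long bufferedBytes, int sampleRate,
                                          int channels, int bytesPerSample,
                                          long maxDelayMillis) {
        return bufferedMillis(bufferedBytes, sampleRate, channels, bytesPerSample)
                > maxDelayMillis;
    }

    public static void main(String[] args) {
        // 176400 bytes of 44100 Hz, mono, 16-bit PCM is two seconds of audio.
        long backlog = 176400;
        System.out.println(bufferedMillis(backlog, 44100, 1, 2));        // 2000
        System.out.println(shouldReconnect(backlog, 44100, 1, 2, 1000)); // true
    }
}
```

With compressed audio the byte count no longer maps directly to time, so a real player would more likely compare timestamps of the newest and oldest buffered packets.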

Recording

Initialize AudioRecord

int bufferSize = AudioRecord.getMinBufferSize(sampleRate,AudioFormat.CHANNEL_IN_MONO,AudioFormat.ENCODING_PCM_16BIT);
audioRecord = new AudioRecord(MediaRecorder.AudioSource.VOICE_COMMUNICATION, sampleRate,
AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize);

Start recording

audioRecord.startRecording();

Getting Audio Stream

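int bufferReadResult;
byte[] data = new byte[bufferSize];
// read() blocks until audio is available; it returns the number of bytes read,
// or a negative error code, which ends the loop
while (isRecording() && (bufferReadResult = audioRecord.read(data, 0, bufferSize)) > 0) {
    // data is reused on every pass, so the listener must copy what it wants to keep
    listener.onAudioRecorded(data, bufferReadResult);
}

// Helper method, declared at class level
boolean isRecording() {
    return audioRecord != null
        && audioRecord.getRecordingState() == AudioRecord.RECORDSTATE_RECORDING;
}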
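
To make the contents of the recorded byte array concrete, here is a small sketch of how a listener might interpret it. It is an assumption-laden example: `PcmPeak` and `peakAmplitude` are hypothetical names, and the code assumes 16-bit PCM stored little-endian, which is the native byte order on typical Android devices.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PcmPeak {
    /** Largest absolute 16-bit sample value in the first `length` bytes of `data`. */
    public static int peakAmplitude(byte[] data, int length) {
        // Ignore a trailing odd byte; each sample occupies two bytes.
        ByteBuffer buffer = ByteBuffer.wrap(data, 0, length - (length % 2))
                .order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
        int peak = 0;
        while (buffer.remaining() >= 2) {
            int sample = buffer.getShort();
            peak = Math.max(peak, Math.abs(sample));
        }
        return peak;
    }

    public static void main(String[] args) {
        // Three samples (1000, -20000, 300) written as little-endian byte pairs.
        byte[] data = {(byte) 0xE8, 0x03, (byte) 0xE0, (byte) 0xB1, 0x2C, 0x01};
        System.out.println(peakAmplitude(data, data.length)); // 20000
    }
}
```

A peak value like this is a cheap way to drive a volume meter or to skip sending frames that contain only silence.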

Stop recording

audioRecord.stop();    // stop capturing before freeing the native resources
audioRecord.release();

Audio parameters

As the AudioRecord initialization shows, there are four parameters we need to understand:
Sampling rate (sampleRate)
Channel configuration (AudioFormat.CHANNEL_IN_MONO)
Sample format (AudioFormat.ENCODING_PCM_16BIT)
Audio source type (MediaRecorder.AudioSource.VOICE_COMMUNICATION)

Capture

Sampling rate

private int sampleRate = 44100;

The higher the sampling rate, the more faithfully and naturally the sound is reproduced. On today's mainstream capture hardware, the sampling rate generally falls into five levels: 11025 Hz, 22050 Hz, 24000 Hz, 44100 Hz and 48000 Hz. 11025 Hz reaches the sound quality of AM radio, 22050 Hz and 24000 Hz reach the sound quality of FM radio, 44100 Hz is CD quality, and 48000 Hz is slightly more accurate still.
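
One way to see why a higher rate sounds more faithful: sampled audio can only represent frequencies up to half the sampling rate (the Nyquist limit), and the price is a proportionally larger data stream. The sketch below prints both figures for the rates listed above; `SampleRateInfo` and its methods are hypothetical names used for illustration.

```java
public class SampleRateInfo {
    /** Highest frequency (Hz, rounded down) a given sampling rate can represent. */
    public static int nyquistHz(int sampleRate) {
        return sampleRate / 2;
    }

    /** Bytes of uncompressed PCM produced per second of audio. */
    public static int bytesPerSecond(int sampleRate, int channels, int bitsPerSample) {
        return sampleRate * channels * (bitsPerSample / 8);
    }

    public static void main(String[] args) {
        int[] rates = {11025, 22050, 24000, 44100, 48000};
        for (int rate : rates) {
            System.out.println(rate + " Hz: up to " + nyquistHz(rate)
                    + " Hz, " + bytesPerSecond(rate, 1, 16) + " bytes/s (mono, 16-bit)");
        }
    }
}
```

For voice chat, the useful part of speech sits well below the limit of the lower rates, which is why a lower sampling rate is often an acceptable trade for bandwidth.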

Channels

The common choices are mono and stereo. Mono carries a single channel, so it can drive only one speaker (or the same channel can be sent to two speakers). Stereo carries two channels, so both speakers produce sound, usually with different content on the left and right, which gives a better sense of space. Formats with more channels also exist.

CHANNEL_IN_DEFAULT = 1;
CHANNEL_IN_LEFT = 0x4;
CHANNEL_IN_RIGHT = 0x8;
CHANNEL_IN_FRONT = 0x10;
CHANNEL_IN_BACK = 0x20;
CHANNEL_IN_LEFT_PROCESSED = 0x40;
CHANNEL_IN_RIGHT_PROCESSED = 0x80;
CHANNEL_IN_FRONT_PROCESSED = 0x100;
CHANNEL_IN_BACK_PROCESSED = 0x200;
CHANNEL_IN_PRESSURE = 0x400;
CHANNEL_IN_X_AXIS = 0x800;
CHANNEL_IN_Y_AXIS = 0x1000;
CHANNEL_IN_Z_AXIS = 0x2000;
CHANNEL_IN_VOICE_UPLINK = 0x4000;
CHANNEL_IN_VOICE_DNLINK = 0x8000;
CHANNEL_IN_MONO = CHANNEL_IN_FRONT;
CHANNEL_IN_STEREO = (CHANNEL_IN_LEFT | CHANNEL_IN_RIGHT);
CHANNEL_IN_FRONT_BACK = CHANNEL_IN_FRONT | CHANNEL_IN_BACK;

Generally, CHANNEL_IN_MONO is used for mono recording and CHANNEL_IN_STEREO for two-channel (stereo) recording.
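
The constants above are bit masks, so the number of channels in a configuration is the number of bits set. The sketch below copies four of the values from the list into a standalone class so it runs without Android; `ChannelMasks` and `channelCount` are hypothetical names.

```java
public class ChannelMasks {
    // Values copied from the AudioFormat constants listed above.
    public static final int CHANNEL_IN_LEFT = 0x4;
    public static final int CHANNEL_IN_RIGHT = 0x8;
    public static final int CHANNEL_IN_FRONT = 0x10;
    public static final int CHANNEL_IN_MONO = CHANNEL_IN_FRONT;
    public static final int CHANNEL_IN_STEREO = CHANNEL_IN_LEFT | CHANNEL_IN_RIGHT;

    /** Number of channels selected by a channel mask (one per set bit). */
    public static int channelCount(int mask) {
        return Integer.bitCount(mask);
    }

    public static void main(String[] args) {
        System.out.println(channelCount(CHANNEL_IN_MONO));          // 1
        System.out.println(channelCount(CHANNEL_IN_STEREO));        // 2
        System.out.println(Integer.toHexString(CHANNEL_IN_STEREO)); // c
    }
}
```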

Bit depth

The bit depth is the size of each sample value, that is, how finely the sample amplitude is quantized. It measures how precisely changes in the sound wave are recorded, and can be thought of as the resolution of the sound card. The larger the value, the higher the resolution and the finer the sound.  
Each sample records an amplitude, and the precision of that amplitude depends on the number of bits per sample:
One byte (8 bits) can record only 256 values, so the amplitude can be divided into only 256 levels;
Two bytes (16 bits) can record 65536 values, which is the CD standard.
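
The level counts above follow from 2 raised to the number of bits. As a rough illustration, the sketch below also prints the theoretical ratio between the largest and smallest step in decibels, computed as 20 * log10(levels); `BitDepth` and its methods are hypothetical names.

```java
public class BitDepth {
    /** Number of distinct amplitude levels for a given number of bits per sample. */
    public static long levels(int bits) {
        return 1L << bits;
    }

    /** Theoretical ratio between full scale and one quantization step, in dB. */
    public static double dynamicRangeDb(int bits) {
        return 20.0 * Math.log10(levels(bits));
    }

    public static void main(String[] args) {
        for (int bits : new int[] {8, 16}) {
            System.out.printf("%d bits: %d levels, about %.1f dB%n",
                    bits, levels(bits), dynamicRangeDb(bits));
        }
    }
}
```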

ENCODING_INVALID = 0;
ENCODING_DEFAULT = 1;

ENCODING_PCM_16BIT = 2;
ENCODING_PCM_8BIT = 3;
ENCODING_PCM_FLOAT = 4;

ENCODING_AC3 = 5;
ENCODING_E_AC3 = 6;

ENCODING_DTS = 7;
ENCODING_DTS_HD = 8;

ENCODING_MP3 = 9;

ENCODING_AAC_LC = 10;
ENCODING_AAC_HE_V1 = 11;
ENCODING_AAC_HE_V2 = 12;

ENCODING_PCM_16BIT is the usual choice of sample format and meets most needs.

Audio Source Type

This mainly concerns mobile phones. Depending on the scenario the application is used in, the phone offers several audio source types to choose from. An app can also use the plain microphone, with no processing applied to the captured data. Since several of these sources involve noise reduction, it helps to understand the microphones on a phone first.

* Microphone Background*

In the early years, mobile phones had a single microphone, usually near the charging port at the bottom of the phone. With only one microphone, calls easily pick up noise, which hurts call quality.  
In recent years, with the rapid growth and competition of the electronics industry, manufacturers have worked hard to build higher-quality phones, and the number of microphones has grown to two. Check whether there is a small hole at the top of your phone; that hole is the second microphone. During a normal call we use the microphone at the bottom of the phone. With speakerphone or a video call turned on, the top microphone is used instead, so if the other party cannot hear you, don't keep talking into the bottom microphone.  
Some phones, such as the iPhone, have three microphones to improve call quality. The third is usually near the flash of the rear camera and is generally used for recording. Each microphone has its own job, and together they play an important role in reducing call noise.  
Usually the small hole at the bottom of the phone is the main microphone, and the one at the top is the noise-reduction microphone. The main microphone picks up the caller's voice, so it sits close to the mouth; the noise-reduction microphone has to keep some distance from the main one, so it is usually placed at the top, away from the mouth. During a call, the mouth is at different distances from the two microphones, so the voice reaches them at different levels (about 6 dB apart), while the ambient noise reaches both at roughly the same level. By inverting the phase of the noise-reduction microphone's signal and combining it with the main signal inside the phone, the ambient noise is largely cancelled and call quality is preserved. With dual-microphone noise reduction, the voice generally stays clear even on a busy street. Without it, the transmitted voice is very noisy and call quality drops sharply.
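
The phase-inversion idea can be shown with a toy model. The sketch below is a deliberate simplification under stated assumptions: both microphones hear exactly the same noise, the noise-reduction microphone hears the voice at half the amplitude (roughly the 6 dB difference mentioned above), and the signals are perfectly time-aligned. Real phones use far more elaborate adaptive processing; `TwoMicSketch` and `subtractReference` are hypothetical names.

```java
import java.util.Arrays;

public class TwoMicSketch {
    /** Adds the inverted reference signal to the main signal, sample by sample. */
    public static int[] subtractReference(int[] mainMic, int[] noiseMic) {
        int n = Math.min(mainMic.length, noiseMic.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = mainMic[i] + (-noiseMic[i]); // inverted phase, then mixed
        }
        return out;
    }

    public static void main(String[] args) {
        int[] voice = {1000, -2000, 3000, -4000};
        int[] noise = {500, 500, -700, 200};
        int[] mainMic = new int[voice.length];
        int[] noiseMic = new int[voice.length];
        for (int i = 0; i < voice.length; i++) {
            mainMic[i] = voice[i] + noise[i];      // full voice plus noise
            noiseMic[i] = voice[i] / 2 + noise[i]; // half-amplitude voice plus the same noise
        }
        // The noise cancels and the voice survives at half amplitude.
        System.out.println(Arrays.toString(subtractReference(mainMic, noiseMic)));
        // [500, -1000, 1500, -2000]
    }
}
```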

* Android phones offer the following audio source types (actual behavior depends on whether each manufacturer's handset supports the source, and on its hardware and processing):*

1, DEFAULT

Default audio source

2, MIC

Microphone audio source

3, VOICE_UPLINK

Voice call uplink (TX) audio source. (Requires a permission that only system apps are granted)

4, VOICE_DOWNLINK

Voice call downlink (RX) audio source. (Requires a permission that only system apps are granted)

5, VOICE_CALL

Voice call uplink + downlink audio source. (Requires a permission that only system apps are granted)

6, CAMCORDER

Microphone audio source tuned for video recording, with the same orientation as the camera (if available).

7, VOICE_RECOGNITION

Microphone audio source tuned for speech recognition.

8, VOICE_COMMUNICATION

A microphone audio source tuned for voice communications such as VoIP. For example, it will use echo cancellation or automatic gain control (if available).

9, REMOTE_SUBMIX

Audio source for a submix of audio streams to be presented remotely. An application can use this source to capture a mix of audio streams that should be transmitted to a remote receiver such as a Wi-Fi display. While recording is active, these audio streams are redirected to the remote submix instead of being played on the device speaker or headset. (Requires a permission that only system apps are granted)

10, UNPROCESSED

Microphone audio source tuned for unprocessed (raw) sound if available; otherwise it behaves like DEFAULT.

11, RADIO_TUNER

An audio source for capturing the output of a radio tuner. (Hidden API, for system use only)

12, HOTWORD

Audio source for preemptible, low-priority software hotword detection. It uses the same gain and pre-processing tuning as VOICE_RECOGNITION. An application should use this source when it wants always-on software hotword detection while gracefully giving up the microphone to any other application that wants to read from it.  
(Hidden API, for system use only, and requires a permission)

Summary: of these, ordinary phone apps can use only DEFAULT, MIC, CAMCORDER, VOICE_RECOGNITION and VOICE_COMMUNICATION. MIC is the usual default during development, CAMCORDER is for video recording, VOICE_RECOGNITION is for speech recognition, and VOICE_COMMUNICATION is the one to choose for automatic noise reduction. In addition, on a phone with two microphones, VOICE_COMMUNICATION uses the top microphone and MIC uses the bottom one.

Turning off the speakerphone while recording

AudioManager audioManager = (AudioManager) context.getSystemService(Context.AUDIO_SERVICE);
audioManager.setSpeakerphoneOn(false);
// Check whether the speakerphone is currently on
boolean isOpenSpeaker = audioManager.isSpeakerphoneOn();

Setting the Playback Sound Type

In many apps and games, media audio is clearly playing, yet adjusting the volume changes the call volume instead. How does that happen? See below:

// From blog https://www.cnblogs.com/loveflycforever/p/4881945.html
// Get the AudioManager instance
AudioManager audioManager = (AudioManager) context.getSystemService(Context.AUDIO_SERVICE);
// Stream type used in the calls below (example value); one of STREAM_VOICE_CALL (call), STREAM_SYSTEM (system sound), STREAM_RING (ringtone), STREAM_MUSIC (music) or STREAM_ALARM (alarm)
int streamType = AudioManager.STREAM_MUSIC;
// Get the maximum and current volume index of that stream
int max = audioManager.getStreamMaxVolume(streamType);
int current = audioManager.getStreamVolume(streamType);
// Get the current ringer mode; returns RINGER_MODE_NORMAL (normal), RINGER_MODE_SILENT (silent) or RINGER_MODE_VIBRATE (vibrate)
int rMode = audioManager.getRingerMode();
// Get the current audio mode; returns MODE_NORMAL (normal), MODE_RINGTONE (ringing), MODE_IN_CALL (telephony call) or MODE_IN_COMMUNICATION (VoIP call)
int mode = audioManager.getMode();

// Set the volume. First parameter: stream type; second: volume index, from 0 to getStreamMaxVolume(streamType); third: optional flags, such as AudioManager.FLAG_SHOW_UI to show the volume UI.
audioManager.setStreamVolume(streamType, current, AudioManager.FLAG_SHOW_UI);
// Set the ringer mode: RINGER_MODE_NORMAL (normal), RINGER_MODE_SILENT (silent) or RINGER_MODE_VIBRATE (vibrate)
audioManager.setRingerMode(AudioManager.RINGER_MODE_NORMAL);
// Set the audio mode: MODE_NORMAL (normal), MODE_RINGTONE (ringing), MODE_IN_CALL (telephony call) or MODE_IN_COMMUNICATION (VoIP call)
audioManager.setMode(AudioManager.MODE_IN_COMMUNICATION);
// Mute or unmute a stream; second parameter: true (mute) or false (unmute). Deprecated since API 23 in favor of adjustStreamVolume with ADJUST_MUTE / ADJUST_UNMUTE.
audioManager.setStreamMute(streamType, false);

// Adjust the volume by one step; the second parameter is the direction: ADJUST_LOWER (lower), ADJUST_RAISE (raise) or ADJUST_SAME (unchanged)
audioManager.adjustStreamVolume(streamType, AudioManager.ADJUST_RAISE, AudioManager.FLAG_SHOW_UI);

The code above covers getting and setting the various sound types.  
A few points need attention in practice:
1. With setMode(), an ordinary app can set only MODE_NORMAL, MODE_RINGTONE and MODE_IN_COMMUNICATION, and it must declare the android.permission.MODIFY_AUDIO_SETTINGS permission in AndroidManifest.xml. MODE_IN_CALL can be set only by system apps.  
2. After setMode() with MODE_NORMAL or MODE_RINGTONE, adjusting the volume changes the media volume; after MODE_IN_COMMUNICATION, it changes the call volume.
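
One more practical detail: the volume index passed to setStreamVolume is bounded by whatever getStreamMaxVolume returns for that stream, and that maximum differs between streams and devices. A UI slider therefore needs a small conversion step. The sketch below shows one way to do it in plain Java; `VolumeIndex` and `toIndex` are hypothetical names, and `maxIndex` stands in for the value read from getStreamMaxVolume.

```java
public class VolumeIndex {
    /**
     * Maps a slider position in [0.0, 1.0] to a volume index in [0, maxIndex].
     * Out-of-range positions are clamped instead of being rejected.
     */
    public static int toIndex(double fraction, int maxIndex) {
        double clamped = Math.max(0.0, Math.min(1.0, fraction));
        return (int) Math.round(clamped * maxIndex);
    }

    public static void main(String[] args) {
        System.out.println(toIndex(0.5, 14)); // 7
        System.out.println(toIndex(2.0, 7));  // 7 (clamped to the maximum)
        System.out.println(toIndex(-1.0, 7)); // 0 (clamped to the minimum)
    }
}
```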

Tags: Mobile Android network github

Posted on Mon, 16 Sep 2019 23:15:10 -0700 by fragger