WO2009090281A1

WO2009090281A1 - Method of converting 5.1 sound format to hybrid binaural format

Info

Publication number: WO2009090281A1
Application number: PCT/ES2008/070246
Authority: WO
Inventors: Ivan Portas Arrondo
Original assignee: Auralia Emotive Media Systems, S.L.
Priority date: 2008-01-17
Filing date: 2008-12-30
Publication date: 2009-07-23
Also published as: ES2323563A1; ES2323563B1

Abstract

Method of converting 5.1 sound format to hybrid binaural format, comprising obtaining the signals from the FL, FR, C, SL, SR and LFE channels in 5.1 format which it is desired to convert into hybrid binaural format; auralizing the FL, FR, SL and SR channels in the following positions: FL: elevation from 0° to 30°, azimuth from -10° to -30°; FR: elevation from 0° to 30°, azimuth from +10° to +30°; SL: elevation from 175° to 195°, azimuth from -30° to -60°; SR: elevation from 175° to 195°, azimuth from +30° to +60°, thus obtaining the signals FL1, FR1, SL1 and SR1; modelling the response from the enclosure on the basis of the signals, introducing a reverberation effect; and mixing the signals FL2, FR2, SL2 and SR2 obtained in the previous step with the original LFE and C signals to obtain the two left and right output signals.

Description

PROCEDURE FOR CONVERTING 5.1 SOUND FORMAT TO HYBRID BINAURAL

DESCRIPTION

OBJECT OF THE INVENTION

The main object of the present invention is a method for converting sound in 5.1 format, commonly used for the digital recording and reproduction of cinematic content, into hybrid binaural format.

BACKGROUND OF THE INVENTION

Currently, the 5.1 format represents the standard for the domestic sound reproduction of cinema. A sound system in 5.1 format is composed of six audio channels where music, voice, sound effects, etc. are mixed in different proportions. Each of the channels corresponds to a speaker, and in turn each of the speakers must be located in a specific location in relation to the user to achieve an optimal sound sensation.

The main speakers (FL and FR in Figure 1) ideally form an equilateral triangle with the user's position (O). In addition, the lines joining the surround speakers (SL and SR) to the user (O) form an angle of approximately 110° with the vertical axis of the figure (the straight line joining O and C). The LFE (Low Frequency Enhancement) loudspeaker reinforces bass sounds to produce a striking effect on reproduction. Its location is not critical, since the information it carries generally lies below 100 Hz, where sound is effectively omnidirectional; that is, the listener cannot determine where the sound comes from.
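As a worked illustration of the layout just described, the following sketch places the loudspeakers on a circle around the listener O. The ±30° angle of FL and FR follows from the equilateral triangle; the unit radius, the coordinate convention (y axis pointing from O towards C, angles measured from that axis, negative to the left) and the function names are assumptions made only for this illustration.

```python
import math

def speaker_xy(angle_deg, radius=1.0):
    """Return (x, y) of a speaker; angle is measured from the O-C axis, negative = left."""
    a = math.radians(angle_deg)
    return (radius * math.sin(a), radius * math.cos(a))

# Angles relative to the O-C axis. FL/FR at +-30 degrees make FL, FR and O
# an equilateral triangle; SL/SR sit at approximately +-110 degrees.
LAYOUT_DEG = {"FL": -30.0, "C": 0.0, "FR": 30.0, "SL": -110.0, "SR": 110.0}

positions = {name: speaker_xy(angle) for name, angle in LAYOUT_DEG.items()}

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# The side FL-FR has the same length as the sides O-FL and O-FR (the radius).
fl_fr = distance(positions["FL"], positions["FR"])
o_fl = distance((0.0, 0.0), positions["FL"])
```

With these angles the surround speakers end up behind the listener (negative y), which is why they carry the rear-plane information.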

A drawback of audio systems based on the 5.1 format is that the user's sound sensation deteriorates rapidly when the user is not in the optimal location with respect to the speakers. Headphones, however, keep the user optimally positioned at all times, since the transducers are attached to the user's head and their position relative to it does not change.

However, a human being is a volumetric sound receiver; that is, the sound that arrives is shaped by, for example, reflections from the shoulders and torso and diffraction as it travels around the head. Human hearing is by nature binaural: the entire sound reception process ends in only two channels, the right ear and the left ear. The term "binaural" refers to this nature of human hearing, since people are able to capture all the spatial sound information through a single pair of ears.

When this phenomenology is not taken into account, so-called "intracranial sound" is usually produced, as when listening to traditional stereo sound through headphones. Intracranial sound is the sensation that the sound sources are inside the user's skull, at a point between the two earpieces, so traditional stereo is not an advisable format for representing three-dimensional sound spaces realistically.

There are fundamentally two ways to achieve binaural reproductions:

The first consists in replacing the pair of point receivers normally used with volumetric receivers, such as dummies, so that the sound reaching them is processed naturally. In this way a binaural stereo recording is achieved in which all the phenomenology described above is already present.

The second is based on performing an auralization procedure. For this, the response of a certain receiver (a dummy or a human being, for example) to an impulse signal (usually a broadband noise) emitted from a certain point in space around the user is measured or modeled.

US Patent 2007213990 describes a method for transforming a traditional two-channel stereo signal into a binaural signal, focusing on the preparation that the input signal must undergo before being transformed into three-dimensional sound. Specifically, it describes how to divide the input signal into frequency bands, auralize each sub-band and finally recombine them to form the two output channels in binaural format.

DESCRIPTION OF THE INVENTION

The present invention describes a new method for real-time audio auralization in 5.1 format. To achieve an optimal result, each channel is treated and auralized independently, so that it is possible to assign specific acoustic parameters to each of them in order to make the reproduction more realistic and spectacular.

The most important advantages of the process of the invention can be summarized in the following:

Optimum reproduction is achieved in all cases because the headphones are attached to the user, so the relative position between the playback system and the user does not vary.

The hybrid model described, which combines the auralization of the FL, FR, SL and SR channels with the original monophonic C and LFE channels, allows greater intelligibility of the dialogue, since there is no interference between the front channels and the C channel, as well as superior immersion, due to the constant unconscious referencing made by the brain between the monophonic C channel and the auralized channels.

The readjustment of the proportions of the different types of information, by means of source separation and subsequent remixing, makes it possible to optimize the content of the different channels from the outset.

The specific virtual placement of the FL and FR channels, together with the modeling of a specific enclosure, allows a good balance with the dialogue channel C, without interfering with its intelligibility and giving the frontal plane just the right depth.

The specific virtual placement of the SL and SR channels, together with the modeling of a different enclosure for the front-plane and rear-plane channels, provides a striking sensation of rear depth, giving the system differentiated planes of sound reproduction and thereby creating a highly immersive experience.

The reinforcement of the LFE channel makes it possible to recreate the sensations produced by the bass components in cinemas, balancing the reproduction system.

In this document, the term "auralize" refers to the processing of the different channels to get the user to have the impression that they come from specific places in space, thus achieving optimized spectacularity and intelligibility.

Similarly, the term "channel" refers to the signal of each of the speakers that make up the 5.1 sound format or the hybrid binaural sound format. Thus, we will talk about the FL, FR, C, SL, SR or LFE channels, which are the input channels in 5.1 format and the L and R channels, which are the output channels in binaural format. The letters "L" and "R" will be used to distinguish between the positions of the channels located to the left (left, in English) and right (right, in English) of the user. The terms "frontal plane" and "rear plane" will also be used to refer to the position of the channels in front of the user or behind the user, as well as "right side plane" or "left side plane" to refer to the position of the channels to the sides of the user.

On the other hand, the term "source" refers to a signal that contains sounds from a single physical process, that is, the sources will be, in general, music, voice and effects.

The term "hybrid binaural" is also defined as a sound format that mixes auralized channels with non-auralized or monophonic channels. Specifically, the present invention mixes the auralized channels FL, FR, SL and SR with the non-auralized channels C and LFE.

In accordance with one aspect of the present invention, a method of converting from sound format 5.1 to hybrid binaural is described, characterized in that it comprises the following operations:

1) Obtain the signals of the FL, FR, C, SL, SR and LFE channels of the 5.1 format that you want to convert into hybrid binaural format. The information contained in these signals is usually a mixture of several sources, where:

FL: contains mainly music, and to a lesser extent voice and effects.

FR: contains mainly music, and to a lesser extent voice and effects.

C: contains mainly voice, and to a lesser extent music and effects.

SL: contains mainly effects, and to a lesser extent music.

SR: contains mainly effects, and to a lesser extent music.

LFE: contains only bass.

2) Auralize the FL, FR, SL and SR channels in the following positions:

FL: elevation from 0° to 30°; azimuth from -10° to -30°.

FR: elevation from 0° to 30°; azimuth from +10° to +30°.

SL: elevation from 175° to 195°; azimuth from -30° to -60°.

SR: elevation from 175° to 195°; azimuth from +30° to +60°.

resulting in the signals FL₁, FR₁, SL₁ and SR₁.
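The angular ranges listed above can be captured as a small lookup table. The sketch below is only an illustration: the dictionary layout and the function name are assumptions, while the numeric ranges and the example positions (those of the preferred embodiment given later in this document) come from the text.

```python
# Angular ranges in degrees, stored as (min, max), taken from the list above.
RANGES = {
    "FL": {"elevation": (0, 30), "azimuth": (-30, -10)},
    "FR": {"elevation": (0, 30), "azimuth": (10, 30)},
    "SL": {"elevation": (175, 195), "azimuth": (-60, -30)},
    "SR": {"elevation": (175, 195), "azimuth": (30, 60)},
}

def in_described_range(channel, elevation, azimuth):
    """True if the virtual position lies inside the range given for the channel."""
    lo_e, hi_e = RANGES[channel]["elevation"]
    lo_a, hi_a = RANGES[channel]["azimuth"]
    return lo_e <= elevation <= hi_e and lo_a <= azimuth <= hi_a

# (elevation, azimuth) pairs of the preferred embodiment.
PREFERRED = {"FL": (15, -20), "FR": (15, 20), "SL": (180, -40), "SR": (180, 40)}

all_ok = all(in_described_range(ch, el, az) for ch, (el, az) in PREFERRED.items())
```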

We will say that "auralizing" a channel in a certain position means virtually locating that channel so that the reproduction through headphones of the resulting signals, one for the right channel and one for the left channel, gives the user the sensation that the sounds of that channel come from that particular position in space.

In other words, auralizing is a process by which a channel lacking spatial information (usually monophonic, as in this case; that is, anechoic or dry) is processed, by a procedure called convolution, with the impulse response of a particular listener (the response in time and frequency to a given acoustic stimulus from a certain point in space).

However, due to physical differences between different users (size, distance between ears, etc.), not all of them respond equally to the new FL₁, FR₁, SL₁ and SR₁ channels.

To know the response of each type of user, the response of a certain receiver (a dummy or a human being, for example) to an impulse signal (usually broadband noise) emitted from a certain point around the user is modeled or measured. This impulse response is later used to process a monophonic source (without spatial information) through convolution, thus achieving the effect of hearing said source at the point from which the impulse was emitted.
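The convolution step described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names are hypothetical, the direct-form convolution is chosen for clarity rather than speed, and the two short impulse responses are toy values, not measured data.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def auralize(mono, ir_left, ir_right):
    """Convolve one dry channel with a left-ear and a right-ear impulse response."""
    return convolve(mono, ir_left), convolve(mono, ir_right)

# Toy impulse responses (hypothetical): a source on the listener's left reaches
# the left ear first and at full level, the right ear later and attenuated.
ir_left = [1.0, 0.0, 0.0]
ir_right = [0.0, 0.0, 0.5]

mono = [1.0, 2.0, 3.0]
left, right = auralize(mono, ir_left, ir_right)
# left  -> [1.0, 2.0, 3.0, 0.0, 0.0]
# right -> [0.0, 0.0, 0.5, 1.0, 1.5]
```

Each auralized channel therefore yields two signals, one per ear, whose time and level differences carry the position of the virtual source.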

The inventors have discovered that virtually placing the FL, FR, SL and SR channels within the angular ranges described above gives all users an optimally spectacular listening sensation.

The angular ranges of the front speakers (FL and FR) are kept narrow for two reasons: to avoid the loss of intelligibility of the dialogue channel (C) caused by an excessively wide stereo image of the music, that is, by the energy of the FL channel going almost completely to L and the energy of FR going almost completely to R; and to avoid a large amount of energy reaching the lateral planes, near the ears, where it would interfere with the localization of the rear-plane channels (SL and SR).

The dialogue channel (C) is not processed in the processing operation of the signals of the FL, FR, SL and SR channels, since maintaining it as a source provides two great advantages to the final output of the procedure.

The first is a gain in intelligibility with respect to the input format: by keeping this channel intact and auralizing those of the frontal (FL and FR) and rear (SL and SR) planes, the dialogue (C) stands out in the central position, reducing the listening fatigue involved in following it.

The second advantage lies in the fact that it constitutes an auditory reference point for the brain, since maintaining its intracranial nature makes its combination with the auralized channels ideal. In this way, the brain constantly compares the position of this channel with the auralized ones, making the user's auditory experience much more spectacular.

The LFE channel is also not processed in this operation, due to the non-directional nature of the frequencies it contains; that is, it gives the sensation of being heard from all positions. This is why the speakers intended for the reproduction of this channel can be placed practically anywhere in the enclosure.

3) Model independent enclosure responses for the front and rear planes.

The front (FL₁, FR₁) and rear (SL₁, SR₁) plane channels are processed independently using the impulse responses of two different optimized enclosures. Processing the front and rear channels separately makes it possible to use two different virtual enclosures, giving more depth only to the rear channels, which are the ones with the most spectacular effects. Excessive depth in the front channels, by contrast, would impair the intelligibility of the dialogue.

In accordance with preferred embodiments of the present invention, the reverberation introduced in the FL₁ and FR₁ channels is within the range of 0.5 seconds to 1 second, and the reverberation introduced in the SL₁ and SR₁ channels is within the range of 1 second to 3.5 seconds.

Thus, after the operation of modeling the response of the enclosure, the front-plane signals FL₂ and FR₂ and the rear-plane signals SL₂ and SR₂ are obtained as output.
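To illustrate what enclosure responses with different reverberation times might look like, the sketch below builds synthetic impulse responses from exponentially decaying noise. Several things here are assumptions and not part of the described method: the reverberation figures are read as RT60 values (the time for the level to fall by 60 dB), real enclosure responses would be measured or modeled rather than generated from noise, and the 8000 Hz sample rate and the function names are chosen only for the example.

```python
import random

def decay_envelope(t, rt_seconds):
    """Amplitude envelope: 1.0 at t = 0 and 0.001 (-60 dB) at t = rt_seconds."""
    return 10.0 ** (-3.0 * t / rt_seconds)

def synthetic_room_ir(rt_seconds, sample_rate=8000, seed=0):
    """Hypothetical stand-in for an enclosure response: decaying white noise."""
    rng = random.Random(seed)
    n = int(rt_seconds * sample_rate)
    return [rng.uniform(-1.0, 1.0) * decay_envelope(i / sample_rate, rt_seconds)
            for i in range(n)]

# Ranges stated in the text, in seconds.
FRONT_RT_RANGE = (0.5, 1.0)
REAR_RT_RANGE = (1.0, 3.5)

front_rt, rear_rt = 0.5, 2.0          # values used in the preferred embodiment
front_ir = synthetic_room_ir(front_rt)  # shorter tail: less depth, clearer dialogue
rear_ir = synthetic_room_ir(rear_rt)    # longer tail: more rear depth
```

Each auralized signal would then be convolved with the response of its plane, so the rear channels receive the longer tail.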

4) Mix the signals obtained in the previous operation together with the original LFE and C signals to obtain the output signals of the left channel and the right channel (L and R).
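The final mixing operation can be sketched as a plain sum. The text does not specify mixing gains, so the unit gains below, the function name and the sample values are assumptions; the sketch also assumes that each processed channel is held as a (left-ear, right-ear) pair and that C and LFE are added identically to both ears, which is what keeps them monophonic.

```python
def mix_to_binaural(auralized, c, lfe, c_gain=1.0, lfe_gain=1.0):
    """Sum per-ear auralized signals with the monophonic C and LFE signals.

    auralized maps a channel name to a (left_ear, right_ear) pair of sample lists.
    """
    lengths = [len(c), len(lfe)]
    lengths += [len(ear) for pair in auralized.values() for ear in pair]
    n = max(lengths)
    left, right = [0.0] * n, [0.0] * n
    for left_ear, right_ear in auralized.values():
        for i, v in enumerate(left_ear):
            left[i] += v
        for i, v in enumerate(right_ear):
            right[i] += v
    # C and LFE go unchanged, and equally, to both output channels.
    for mono, gain in ((c, c_gain), (lfe, lfe_gain)):
        for i, v in enumerate(mono):
            left[i] += gain * v
            right[i] += gain * v
    return left, right

auralized = {
    "FL2": ([1.0, 0.0], [0.5, 0.0]),
    "FR2": ([0.5, 0.0], [1.0, 0.0]),
    "SL2": ([0.0, 1.0], [0.0, 0.0]),
    "SR2": ([0.0, 0.0], [0.0, 1.0]),
}
out_left, out_right = mix_to_binaural(auralized, c=[2.0, 2.0], lfe=[0.0, 1.0])
```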

In accordance with a preferred embodiment of the present invention, the procedure for converting 5.1 sound format to hybrid binaural comprises, prior to the final mixing operation, compressing the LFE channel signal to obtain an LFE' signal.
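The text does not say which kind of compression is applied to the LFE channel. Assuming it means dynamic range compression, a minimal static (sample-by-sample) compressor could look like the sketch below; the threshold, the ratio and the function name are hypothetical, and a practical compressor would also smooth its gain over time.

```python
def compress(samples, threshold=0.5, ratio=4.0):
    """Reduce the part of each sample's magnitude above threshold by ratio."""
    out = []
    for x in samples:
        mag = abs(x)
        if mag > threshold:
            mag = threshold + (mag - threshold) / ratio
        out.append(mag if x >= 0 else -mag)
    return out

lfe = [0.25, -0.5, 1.0, -0.9]
lfe_prime = compress(lfe)
# Samples at or below the threshold pass unchanged; 1.0 becomes 0.625.
```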

Another preferred embodiment of the invention comprises, prior to the auralization operation, the operations of:

a) Separate the signals of the FL, FR, C, SL and SR channels into the sources that compose them: music L, music R, voice and front effects, rear effects L and rear effects R. The separation is performed using an independent component analysis algorithm. This analysis compares the different inputs (channels), which contain redundant information in different proportions. Starting from the premise that signals can be considered independent if they come from different physical processes, it is possible to isolate the different components, which in this case are voice, music and effects.

b) Mix the sources music L, music R, voice and front effects, rear effects L and rear effects R to obtain the signals that will constitute the input to the subsequent auralization operation of the channels. This mixing operation reconstructs the signals FL, FR, C, SL and SR with the optimal proportions of the sources that were separated in the previous operation.

In accordance with a preferred embodiment of the present invention, the mixing of the sources music L, music R, voice and front effects, rear effects L and rear effects R to obtain the channels is performed according to the following percentage ranges:

FL: 70-90% music L, 30-10% voice and front effects

FR: 70-90% music R, 30-10% voice and front effects

C: 70-90% voice and front effects, 30-10% music L and R

SL: 70-90% rear effects L, 30-10% music L

SR: 70-90% rear effects R, 30-10% music R

The objective of these two optional operations is to ensure that each channel entering the auralization process contains the appropriate proportion of the different components, since the original 5.1 mix was optimized for reproduction through six physical speakers, a completely different scheme from a pair of headphones. When reproducing through headphones, the redundant information characteristic of quadraphonic systems such as 5.1 hinders the perception of spatial realism, which is why this readjustment is necessary.

The LFE bass channel is already an independent component in itself, and therefore its information is not redundant in the other channels. For this reason it is not included in the optional initial separation and mixing operations.

According to another aspect of the invention, this also extends to computer programs, in particular computer programs contained in a carrier, adapted to carry out the operations of the described procedure. The program can be in the form of a source code, object code or an intermediate code between the source code and the object code, as a partially compiled form, or in any other suitable way to implement the operations of the invention.

The carrier can be any device or entity capable of carrying the program. For example, the carrier can comprise a storage medium, such as a ROM, a CD-ROM or a magnetic storage medium, for example a floppy disk or a hard disk. In addition, the carrier can be a transmissible carrier, such as an electrical or optical signal that can be conveyed by electrical or optical cable, by radio or by any other means.

Alternatively, the carrier can be an integrated circuit in which the program is stored, the circuit being adapted to carry out the operations of the procedure. In particular, it could be an ASIC, an FPGA, a DSP, a microprocessor or a microcontroller.

DESCRIPTION OF THE DRAWINGS

To complement the description and to aid understanding of the characteristics of the invention, according to a preferred example of its practical implementation, a set of drawings is attached as an integral part of said description, in which the following is represented for purposes of illustration and not limitation:

Figure 1.- Shows a view of the location of the physical speakers of a cinema in a 5.1 sound format.

Figure 2.- Shows an explanatory scheme of the elevation (α) and azimuth (β) angles.

Figure 3.- Shows a general scheme of the operations of the process according to the present invention.

PREFERRED EMBODIMENT OF THE INVENTION

The starting point is the original sound of a film in 5.1 format that is to be converted into hybrid binaural, which in this case is recorded on a DVD-type disc. Figure 1 shows the position of the speakers of the channels in a movie theater in relation to the position in which the user must be located for an optimum sound experience.

In this example, the procedure is carried out by a computer that first, as shown in Figure 3, obtains from the DVD the signals of the original channels in 5.1 format (FL, FR, C, SL, SR, LFE). The LFE channel is separated and processed independently in parallel, undergoing only a compression that yields the LFE' signal.

In this example, a selector (S) is provided that allows the user to enable or skip the optional operations of extracting the sources from the original channels and remixing them in new proportions to enhance the spectacular nature of the film. For this, the sources (music L, music R, voice and front effects, rear effects L and rear effects R) are separated, for example using the independent component analysis source separation algorithm 'FastICA', developed by HUT (Helsinki University of Technology), and then remixed according to new optimized proportions. In this example we assume that the film is an action film, which implies the presence of characteristic sounds such as explosions, shots, engine noise, etc. To achieve the greatest possible spectacularity in this type of film, the following optimal mixing ratios have been determined:

FL': 80% music L + 20% voice and front effects

FR': 80% music R + 20% voice and front effects

C: 80% voice and front effects + 20% music L and R

SL': 80% rear effects L + 20% music L

SR': 80% rear effects R + 20% music R
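The mixing ratios above amount to a weighted sum of the separated sources. The following sketch applies them to one sample per source; the source key names, the function name and the sample values are hypothetical, and the even split of the 20% music share of C into 10% music L and 10% music R is an assumption, since the text only gives the combined figure.

```python
# Weights per output channel; each row sums to 1.0 (100%).
REMIX = {
    "FL'": {"music_L": 0.8, "voice_front": 0.2},
    "FR'": {"music_R": 0.8, "voice_front": 0.2},
    "C":   {"voice_front": 0.8, "music_L": 0.1, "music_R": 0.1},  # 10/10 split assumed
    "SL'": {"rear_fx_L": 0.8, "music_L": 0.2},
    "SR'": {"rear_fx_R": 0.8, "music_R": 0.2},
}

def remix(sources, weights=REMIX):
    """Rebuild each channel as a weighted sum of equally long source signals."""
    n = len(next(iter(sources.values())))
    return {
        channel: [sum(w * sources[name][i] for name, w in mix.items())
                  for i in range(n)]
        for channel, mix in weights.items()
    }

sources = {
    "music_L": [1.0],
    "music_R": [0.0],
    "voice_front": [0.5],
    "rear_fx_L": [0.0],
    "rear_fx_R": [1.0],
}
channels = remix(sources)
```

Keeping the weights in a table makes it straightforward to try other values inside the 70-90% / 30-10% ranges given earlier.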

Once the sources have been remixed into the channels in this optimized way, the dialogue channel (C) is set aside, and the channels FL', FR', SL' and SR' are each auralized at an optimal geometric position to enhance the spectacular nature of the user's sound experience. In this case, the listener is assumed to have the characteristics of a standard user, based on the impulse responses of a KEMAR dummy.

Below are the optimal positions of the channels, described through the elevation angle (α) and the azimuth angle (β) that form with the listener:

FL': elevation 15°; azimuth -20°

FR': elevation 15°; azimuth 20°

SL': elevation 180°; azimuth -40°

SR': elevation 180°; azimuth 40°

Figure 2 shows the reference for the elevation and azimuth angles, α and β respectively. After the auralization operation, the signals FL'₁, FR'₁, SL'₁ and SR'₁ are obtained.

Next, the signals FL'₁ and FR'₁ are processed with the impulse response of an enclosure similar to a movie theater, with a reverberation time (T_r) of approximately 0.5 seconds, and the SL'₁ and SR'₁ signals with the impulse response of another enclosure similar to a different movie theater, with a reverberation time of approximately 2 seconds.

Finally, the channels obtained in the previous operation, FL'₂, FR'₂, SL'₂ and SR'₂, are mixed with the LFE' and C channels to obtain only two signals in hybrid binaural format, corresponding to the L and R channels of the headphones.

Claims

1. Conversion procedure from sound format 5.1 to hybrid binaural, characterized in that it comprises the following operations:

obtain the signals of the FL, FR, C, SL, SR and LFE channels of the 5.1 format that you want to convert into a hybrid binaural format;

auralize the FL, FR, SL and SR channels in the following positions:

FL: elevation from 0° to 30°; azimuth from -10° to -30°.

FR: elevation from 0° to 30°; azimuth from +10° to +30°.

SL: elevation from 175 ° to 195 °; azimuth from -30 ° to -60 °.

SR: elevation from 175 ° to 195 °; azimuth from + 30 ° to + 60 °,

resulting in the signals FL₁, FR₁, SL₁ and SR₁;

independently process the signals from the front plane (FL₁ and FR₁) and those from the rear plane (SL₁ and SR₁), using the impulse responses of two different virtual enclosures, each optimized for said planes, resulting in the signals FL₂, FR₂, SL₂ and SR₂;

mix the signals FL₂, FR₂, SL₂ and SR₂ obtained in the previous operation together with the original signals LFE and C to obtain the two left and right output signals.

2. Conversion procedure of sound format 5.1 to hybrid binaural according to the preceding claim, characterized in that the impulse responses of the virtual enclosures used for the processing of the frontal and rear plane, comprise reverberation times of between 0.5 s and 1 s for the first, and between 1 s and 3.5 s for the second.

3. Conversion procedure from sound format 5.1 to hybrid binaural according to any of the preceding claims, characterized in that it comprises, prior to the final mixing operation, a compression of the LFE channel.

4. Conversion procedure from sound format 5.1 to hybrid binaural according to any of the preceding claims, characterized in that it comprises, before the auralization operation, the operations of:

separate the signals from the FL, FR, C, SL, SR channels in the sources that compose them: L music, R music, voice and front effects, rear effects L and rear effects R;

remix the estimated sources in optimized proportions for subsequent processes, rebuilding the FL, FR, C, SL and SR channels.

5. Conversion procedure of sound format 5.1 to hybrid binaural according to the preceding claim, characterized in that the operation of remixing the sources music L, music R, voice and front effects, rear effects L and rear effects R is performed according to the following percentage ranges:

FL: 70-90% music L, 30-10% voice and front effects

FR: 70-90% music R, 30-10% voice and front effects

C: 70-90% voice and front effects, 30-10% music L and R

SL: 70-90% rear effects L, 30-10% music L

SR: 70-90% rear effects R, 30-10% music R

6. Conversion procedure of sound format 5.1 to hybrid binaural according to the preceding claim, characterized in that it is carried out by a device from the following list: an ASIC, an FPGA, a DSP, a microprocessor and a microcontroller.

7. Computer program comprising program instructions that cause a computer to carry out the operations of the method according to any of the preceding claims.

8. Computer program according to claim 7, characterized in that it is stored in storage media.

9. Computer program according to claim 7, characterized in that it is transmitted through a carrier signal.