Audio techniques

I've been thinking about sound. It's possible to add sound together, right? So why is it so seldom seen that you subtract sound? I know about noise cancellation headphones, but I don't see subtracting sound used very much elsewhere.

And another more complex idea: You know how movies play in multiple languages? Technically, if you have the sound of the movie in each language, you should theoretically be able to eliminate the vocals completely and isolate just the ambiance and music. I don't normally see this either, though.

I'm sure this stuff is used quite commonly somewhere, but why is it so seldom heard of? Everyone I talk to thinks I'm misunderstanding some fundamental concept.
It seems obvious, but it's actually much more difficult than it sounds (no pun intended). The human ear and brain are magically capable of separating out different harmonics. We can hear two distinctly different sounds like a "crash" of a cymbal and the "jurrr" of an electric guitar chord and they clearly sound different to us, even when heard simultaneously.

But that's because our brains are magic. Doing the same thing (separating two sounds) is extremely difficult to do in software. When you have just a single complex sound wave, it is very difficult to know which harmonics belong to which "sound". Especially since you don't even know how many "sounds" are in the wave to begin with.

For example that "jurrr" electric guitar sound probably has dozens of harmonics forming it. Each of varying frequency and amplitude. You don't even want to know how many the cymbal crash has.


"Wave subtraction" really only works if you have the exact copy of the wave you want to remove. If the wave is even just a little off, subtraction won't work. You might only muffle the sound, or possibly even amplify or echo it depending on how far off you are.
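To make the "exact copy" point concrete, here's a minimal sketch in Python. Everything in it (the 1 kHz tone, the 8 kHz sample rate, the helper names) is made up for illustration, and a pure sine wave is far simpler than any real sound. It subtracts three copies of the same tone from the original: an exact copy, a copy that is one sample late, and a copy that is half a cycle late.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)

def sine(freq_hz, n_samples, delay_samples=0):
    """Return a sine wave as a list of floats, optionally delayed by some samples."""
    return [
        math.sin(2 * math.pi * freq_hz * (i - delay_samples) / SAMPLE_RATE)
        for i in range(n_samples)
    ]

def subtract(a, b):
    """Sample-by-sample subtraction of wave b from wave a."""
    return [x - y for x, y in zip(a, b)]

def peak(wave):
    """Largest absolute sample value, as a rough measure of loudness."""
    return max(abs(s) for s in wave)

original = sine(1000, 800)
exact_copy = sine(1000, 800)
one_sample_late = sine(1000, 800, delay_samples=1)
half_cycle_late = sine(1000, 800, delay_samples=4)  # 4 samples = half a cycle of 1 kHz at 8 kHz

print("exact copy:     ", peak(subtract(original, exact_copy)))       # silence
print("1 sample late:  ", peak(subtract(original, one_sample_late)))  # a clearly audible residue
print("half cycle late:", peak(subtract(original, half_cycle_late)))  # louder than the original
```

Only the exact copy cancels. A one-sample error (an eighth of a millisecond here) already leaves most of the tone behind, and the half-cycle error roughly doubles it instead of removing it.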



Case in point: have you ever seen any of those karaoke programs which try to remove vocals from songs? They all suck. Conceptually it's simple: find the vocals and subtract them out. But finding the vocals is crazy hard.


EDIT:

For an extremely contrived example...

Say you want to mix two strings of samples. One guitar and one cymbal:

guitar = 5, 3, 8
cymbal = 2, 9, 0

mix = 7, 12, 8 (the sum of both strings)

Simple enough, right?

Now try to do that in reverse. Here's another arbitrary mix:

mix = 15, 3, 7

Now try to subtract the guitar samples from that to leave only the cymbal samples. (Read: you can't without having an exact copy of the guitar wave)
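Here is the same contrived example as a small Python sketch (the function names and the candidate splits are mine, purely for illustration). Mixing is a plain sum, un-mixing works fine when you already hold one of the original tracks, and a mix on its own can be split into "guitar" and "cymbal" in any number of ways.

```python
def mix(a, b):
    """Mix two sample strings by adding them sample by sample."""
    return [x + y for x, y in zip(a, b)]

def unmix(mixed, known):
    """Remove a known track from a mix. Only works with an exact copy of that track."""
    return [m - k for m, k in zip(mixed, known)]

guitar = [5, 3, 8]
cymbal = [2, 9, 0]

mixed = mix(guitar, cymbal)
print(mixed)                 # [7, 12, 8]
print(unmix(mixed, guitar))  # [2, 9, 0]  (the cymbal, because we had the exact guitar)

# The arbitrary mix from above. Both of these (guitar, cymbal) pairs produce it,
# and so do infinitely many others, so the mix alone cannot tell us which is right.
mystery = [15, 3, 7]
candidates = [
    ([15, 3, 7], [0, 0, 0]),
    ([10, 1, 4], [5, 2, 3]),
]
for g, c in candidates:
    print(g, "+", c, "=", mix(g, c))
```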


And another more complex idea: You know how movies play in multiple languages? Technically, if you have the sound of the movie in each language, you should theoretically be able to eliminate the vocals completely and isolate just the ambiance and music. I don't normally see this either, though.


It wouldn't make sense to do this.

Let's say you have a dual audio film. English and Spanish.

The way most movies do it is they have 2 audio tracks: one in each language. Each audio track also has bgm and sound effects and all that other stuff.

What you're proposing would require 3 audio tracks:

1) English audio only
2) Spanish audio only
3) Bgm and sound effects

It would also require that at least two of these audio tracks be mixed during playback, which is extra (unnecessary) work.

It would be better/simpler to just have the BGM as its own audio track without having to mix them:

1) English audio (with bgm + sound effects)
2) Spanish audio (with bgm + sound effects)
3) Just bgm + sound effects


But not a whole lot of people want just bgm, so it's not worth it to put that on the movie.



--------------------

EDIT AGAIN:

Did I misunderstand the movie example? I read it again and it sounds like the movie would just have the normal 2 audio tracks (each language with combined bgm) and you could use those waves to "cancel out" the bgm?

That doesn't work either. This can be illustrated with some algebra:

let B = bgm only
let E = English + bgm
let P = Spanish + bgm
let e = english only
let p = spanish only

// we know the following:
e = E - B
p = P - B
B = E - e
B = P - p


// so given E and P, how can we find B?
// hint:  you can't
//
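// we can try cancelling out the BGM to isolate just the vocals:
nobgm = E - P   // = (e + B) - (p + B) = e - p

e ?= nobgm      // you'd think that with the bgm gone you'd be left with the vocals
   // but this doesn't work because we've just added the spanish vocals (inverted):
nobgm = e - p
e - p != e

// and subtracting one of the original tracks back out doesn't recover anything new:
nobgm - E =   (E - P) - E   =   -P
-P != e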
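A quick numeric check of that algebra, as a Python sketch. All of the sample values are invented. It shows two things: E - P really does equal e - p, with the bgm gone and both vocals still tangled together, and a completely different set of vocals and bgm can produce the very same E and P. That second part is why B can't be pinned down from E and P alone.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# Hypothetical "true" components (values made up for illustration)
e = [3, 0, -2, 5]   # english only
p = [0, 4, 1, -1]   # spanish only
B = [7, 7, -3, 2]   # bgm only

E = add(e, B)       # English + bgm
P = add(p, B)       # Spanish + bgm

nobgm = sub(E, P)
print(nobgm == sub(e, p))  # True: bgm cancelled, but both vocals remain
print(nobgm == e)          # False: this is not the English vocals

# Shift some arbitrary signal d out of the bgm and into both vocal tracks.
# The two full tracks E and P come out exactly the same as before.
d = [1, -2, 0, 3]
e_alt, p_alt, B_alt = add(e, d), add(p, d), sub(B, d)
print(add(e_alt, B_alt) == E and add(p_alt, B_alt) == P)  # True
print(B_alt == B)                                          # False
```

Two equations, three unknowns. Any d gives another valid answer, so there is nothing in E and P by themselves that singles out the real bgm.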
What I meant with the movie example is that if you have the movie sound in enough languages (generally 3 where I live; English, Spanish, and French), you should be able to find the similarities between them to eliminate the vocals. Obviously it wouldn't be perfect, but it should in theory be possible.
Yeah I see now. But it's not as simple as it sounds. As shown previously, a simple subtraction is not viable. And "finding similarities" between sound waves is surprisingly difficult.
Technically, light is a wave like sound is, plus or minus some technicalities. It's easier with images because of the way they are represented. Maybe we just need to represent sound in a different way? It just seems so strange that sound is so difficult to deal with compared to light.
Light isn't really any easier. Image recognition is just as complex a field as audio DSP.

EDIT:

For a comparison...

You can alpha blend 2 images together easily enough... But given only the final result image it's difficult/impossible to extract the original 2 images.
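As a rough sketch of that comparison in Python: the usual alpha blend formula is out = alpha * a + (1 - alpha) * b per pixel. The tiny one-row grayscale "images" below are invented for illustration. Two completely different pairs of source images blend to the identical result, so the result alone can't tell you what went in.

```python
def alpha_blend(img_a, img_b, alpha):
    """Blend two equally sized grayscale images (lists of pixel values)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(img_a, img_b)]

# Two different pairs of hypothetical source images
pair_1 = ([100, 200, 50], [200, 0, 150])
pair_2 = ([50, 120, 0], [250, 80, 200])

blend_1 = alpha_blend(*pair_1, alpha=0.5)
blend_2 = alpha_blend(*pair_2, alpha=0.5)

print(blend_1)
print(blend_2)
print(blend_1 == blend_2)  # True: same output, different inputs
```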
Man Disch, that got really deep.

I've thought about this kind of thing a lot as well. It seems so simple in my head.
For instance, I started digging into OpenCV about a month ago, and was trying to get a program to detect certain features on my computer screen from a screenshot. When I got down to trying to describe how I knew that the red button in the top left of my screen had meaning, it all started to fall apart. I know that the red circle is distinct from other bits of light I'm seeing, but I don't know why. I know that it has meaning, but I can't tell you how I know that. So yeah it gets pretty complex.
I spent about 7 hours understanding the meaning behind template matching algorithms (which are pretty simple in hindsight), which pretty much shift a template image across another image in every way possible and return result coefficients in a matrix roughly the size of the master image. You can use that matrix to find the position for the best "match" for the template on the screen.
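For anyone curious, the core idea can be sketched in a few lines of plain Python. This is not OpenCV's implementation, just a brute-force version under my own assumptions: grayscale images as lists of lists, and sum of squared differences as the score (lower is better). The function names and the toy image are made up. Note that the score matrix comes out slightly smaller than the image, since the template can't hang off the edge.

```python
def match_scores(image, template):
    """Slide the template over every position where it fits entirely inside
    the image and return a matrix of sum-of-squared-difference scores."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    scores = []
    for top in range(ih - th + 1):
        row = []
        for left in range(iw - tw + 1):
            ssd = sum(
                (image[top + r][left + c] - template[r][c]) ** 2
                for r in range(th)
                for c in range(tw)
            )
            row.append(ssd)
        scores.append(row)
    return scores

def best_match(scores):
    """Return (row, col) of the lowest score, i.e. the best template position."""
    positions = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    return min(positions, key=lambda rc: scores[rc[0]][rc[1]])

# Toy 4x5 "screenshot" with a bright 2x2 "button" in it
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 9],
    [9, 9],
]

scores = match_scores(image, template)
print(best_match(scores))  # (1, 2): top-left corner of the button
```

It also shows the limitation: the score only says how closely the pixels line up, so anything that looks similar enough to the template scores well too.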
That's cool and all, but it's pretty limited. If I have a template of a red button, all I can do is find a red button (and other things that look similar to a red button). And sometimes I can't even find that red button if there are too many other things that look similar to it.

I think a lot of the same things could apply to audio, but it will be equally complex. As Disch said, our brains are magic.

I also think that what you're saying has some sense to it, L B. If you have an exact copy of the voices for an entire movie, theoretically it would make sense that if you invert that track and play it simultaneously with the movie then you wouldn't hear any of the voices.
Waves are cool as hell in that aspect.
And it's also why I can't sleep without a ton of white noise in the background. It masks a lot of other sounds that might catch my attention and leaves me with a constant. I sleep like an absolute baby ever since I started doing that.
Advances in computer vision/hearing are something I'd really like to see in my lifetime.

On a side note, I wonder if it would be possible (it has to have been attempted) to model a neural network via programming. I suppose that's not saying much either, since we can't fully comprehend how those work; otherwise computer vision (or hearing) wouldn't be such a problem.
But at the same time I don't think these things were meant to be easy. After all, it took some 4 billion years for our chemical compositions to evolve into us. After 4 billion years of natural selection I suppose it only makes sense that we're able to do some of the things we're capable of.

Topics like these make me want to go sit outside and stare at the sky for ten hours and think.
I don't know why, Thumper, but your reply made me think of this:

http://www.youtube.com/watch?v=OQGzEIJaEVE



But yeah. The stuff our brain does is astounding. Even just basic coordination to do something we take for granted, like standing up, is surprisingly complex. Never mind all the sensory perception stuff.
Hahhh! oh man, I got a good laugh out of that. Mostly because I feel like that's how I live. Over-analyzing everything, being too astounded by life. That was too funny, man. I'm definitely going to use that in a joke or something in the future.

Speaking about unusual 'side-scenes' that play in our heads (from a tv show or something) when a similar situation occurs in life, that happens to me all the time. I'll be in the middle of a conversation with someone and then visualize a scene from something, usually Family Guy, American Dad, or the Whitest Kids You Know, and end up laughing hysterically. For example, my boss was talking about his time in the air force and he worked on building a system that trained pilots for what it would feel like if a plane malfunctioned and either leaked oxygen or went into rapid decompression. But he was going on about how if you lose oxygen subtly you won't even notice, but you'll slowly get really stupid. People would be told to do tests, simple things like writing their name or putting shapes through corresponding holes and not be able to do that, at which moment I thought of Stewie saying (as he's being rolled out of his room in his crib) "What are you doing Lois? This rectangular crib will never fit through that door. You need the circle to do that; oh man I really need to get better at shapes."
Thumper wrote:
I also think that what you're saying has some sense to it, L B. If you have an exact copy of the voices for an entire movie, theoretically it would make sense that if you invert that track and play it simultaneously with the movie then you wouldn't hear any of the voices.
No, I'm saying this: you have compound sound tracks A, B, and C:
A = X + Y
B = X + Z
C = X + W
Theoretically it should be possible to factor out X to some reasonable accuracy. It could not be perfect except in ideal scenarios, obviously, but it would be pretty good.
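One naive way to try this, sketched in Python: take the per-sample median of the three tracks. To be clear, this is just one guess at how the idea could be attempted, not an established technique, and the sample values are invented. It only recovers X in the ideal case where at most one of the voices is non-silent at any given sample. In a dubbed movie the voices mostly speak at the same time, and the second half of the sketch shows the estimate going wrong as soon as they overlap.

```python
from statistics import median

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def common_estimate(*tracks):
    """Guess the shared component as the per-sample median of the tracks."""
    return [median(samples) for samples in zip(*tracks)]

X = [4, -1, 3, 0, 2, 5]   # shared bgm + effects (hypothetical values)

# Ideal case: the three voices never sound at the same sample
Y = [9, 0, 0, 0, 0, 0]
Z = [0, 0, 7, 0, 0, 0]
W = [0, 0, 0, 0, -6, 0]
A, B, C = add(X, Y), add(X, Z), add(X, W)
print(common_estimate(A, B, C) == X)  # True

# Realistic case: two voices overlap at sample 0
Z_overlap = [8, 0, 7, 0, 0, 0]
B_overlap = add(X, Z_overlap)
print(common_estimate(A, B_overlap, C))       # sample 0 is wrong
print(common_estimate(A, B_overlap, C) == X)  # False
```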
Theoretically it should be possible to factor out X to some reasonable accuracy


It's certainly possible, but I think you're oversimplifying the concept.

Again I feel compelled to bring back your analogy of imagery.

If I handed you 3 pictures that had a common background, but each one had a unique image watermarked on it... you can imagine how difficult it would be to extract the background and remove the watermark from those images.

It's just as difficult to separate combined waveforms. Separating one part into two of its components is just much more difficult than combining two different parts.