WAV format - data chunk

I have found two pages which describe wave format:
https://sites.google.com/site/musicgapi/technical-documents/wav-file-format#data
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html

But I am a bit confused. I thought there should be two data values written for each sample channel: volume (16 bits) and frequency (16 bits) ... so 4 bytes for a mono sample. But it says only 16 bits of data are used for each channel. I did not find what this data means. If it is not volume nor frequency, then what does it represent? I see the two bytes are often represented as floats. Is it possible to calculate frequency and volume from the sample? It just does not make sense to me how they can use one short number to describe both.
If it is not volume nor frequency, then what does it represent?

Uncompressed WAV files generally work with discrete samples in the time domain, not frequency/magnitude (the phasor domain, Fourier transforms and all that jazz).

Look at the image in this wikipedia article:
https://en.wikipedia.org/wiki/Sampling_(signal_processing)
Each data point is the magnitude (edit: not magnitude, just position) of the waveform at a certain time.

Imagine a pure sine wave, and then measuring what y = sin(x) is at regular intervals, say 1000 times a second.
Your samples array will look like this:
{ sin(0), sin(1/1000), sin(2/1000), sin(3/1000), ... }

In CD-quality WAV files, the data is recorded at 44.1 kHz, meaning 44,100 samples a second.
Each of these samples is an "element" in the array of the data subchunk:
http://soundfile.sapp.org/doc/WaveFormat/
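
To make that concrete, here is a rough C++ sketch (my own, not from those pages) of what the contents of the data chunk would look like for a pure tone at CD quality. The 440 Hz frequency and the 0.5 amplitude are just example values I picked, and I'm assuming 16-bit signed mono PCM:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

int main()
{
    const double Fs = 44100.0; // sample rate: 44,100 samples per second (CD quality)
    const double f  = 440.0;   // example tone frequency
    const double A  = 0.5;     // amplitude, as a fraction of full scale

    // One second of audio = 44,100 samples. Each sample is just the
    // instantaneous value of the waveform at that moment in time.
    std::vector<std::int16_t> samples(44100);
    for (std::size_t n = 0; n < samples.size(); ++n)
    {
        double t = n / Fs; // time of this sample, in seconds
        double value = A * std::sin(2.0 * 3.14159265358979 * f * t);
        samples[n] = static_cast<std::int16_t>(value * 32767.0); // scale to the 16-bit range
    }
    // "samples" is exactly what would sit in the data subchunk of a mono,
    // 16-bit PCM WAV file (plus the headers described in the links above).
}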

______________________________________________

Sound is produced by longitudinal waves of alternating low and high pressure that travel through the air.
In WAV, the "low pressure" means lower data sample values, and the "high pressure" means higher data sample values. When played, it tells your speakers how to vibrate to produce sound.
Thank you a lot for the link http://soundfile.sapp.org/doc/WaveFormat/ which is very helpful. This is what I was missing - the images help me a lot to understand. In the image at the bottom of the page there is sample 2 (right channel) with values 0x24 0x17 and 0x1e 0xf3. Does that mean there are two samples in sample 2? There are two 16-bit values.

Is it possible to get volume/time information from the sample?
WAV format can be both mono and stereo (in fact, it can have many channels), but in the stereo version it is ordered like this: { Left channel at time 0, Right channel at time 0, Left channel at time 1, Right channel at time 1, Left channel at time 2, Right channel at time 2, etc. }
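
As a rough sketch (my own code, with made-up names "data" and "numFrames"), walking over the data chunk of a 16-bit stereo file that's already been read into memory would look something like this:

#include <cstddef>
#include <cstdint>

// "data" points at the raw 16-bit values of the data subchunk;
// "numFrames" is the number of points in time (samples per channel).
void walkStereo(const std::int16_t* data, std::size_t numFrames)
{
    for (std::size_t t = 0; t < numFrames; ++t)
    {
        std::int16_t left  = data[2 * t];     // left channel at time t
        std::int16_t right = data[2 * t + 1]; // right channel at time t
        (void)left; (void)right;              // ... do something with them ...
    }
}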

Does that mean there are two samples in sample 2?

Yes -- one for each channel. Note there can be some variation in terminology. By "sample" in the link you mentioned, they mean a point in time. In other words, in stereo, at every point in time, there are two channels (left and right), and a value for each channel.

In your example,
0x24 0x17 is the 16-bit value of the left channel at time 2 ("sample 2")
and 0x1e 0xf3 is the 16-bit value of the right channel at time 2.
Hope that makes sense, let me know if it doesn't.
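
One extra detail, in case you read the file byte by byte: the two bytes of each value are stored little-endian (low byte first), so 0x24 0x17 in the file is the 16-bit number 0x1724. A small sketch of my own showing how to put them together:

#include <cstdint>
#include <iostream>

// Combine two bytes from the data chunk (little-endian) into one signed 16-bit sample.
std::int16_t toSample(unsigned char lo, unsigned char hi)
{
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(lo) |
                                     (static_cast<std::uint16_t>(hi) << 8));
}

int main()
{
    std::cout << toSample(0x24, 0x17) << '\n'; // 0x1724 = 5924  (left channel, time 2)
    std::cout << toSample(0x1e, 0xf3) << '\n'; // 0xf31e = -3298 (right channel, time 2)
}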

Is it possible to get volume/time information from the sample?

This requires some calculation.

For pure sine waves, the volume is easy. It's just the amplitude of the sine wave, the A in A*sin(t). But for real signals you'll encounter, you need to do what's called "loudness metering" on a signal. One method to do this is to simply define a window that you'll collect samples in (say, a few hundred milliseconds worth of samples), and then find the maximum value in that array.
You move this window so that it looks at samples in range [N, N + k] at a time.


For example, if your signal is {0, 3, 4, -9, 4, 5, 7, 5, 4, 3, 2, 1, 0 },
and the window size is 3,

 {0, 3, 4, -9, 4, 5, 7, 5, 4, 3, 2, 1, 0 }
 [.  *  .] --> |max| = 4 at time 1. Volume is 4.

 {0, 3, 4, -9, 4, 5, 7, 5, 4, 3, 2, 1, 0 }
       [.  *   .] --> |max| = 9 at time 3. Volume is 9.




(Note: For samples that go into the negatives, you take the absolute value)
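
Here's a rough sketch of that sliding-window peak method, using the same example signal as above (the names and the window size are my own choices, nothing standard):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Peak "volume" of the window of samples starting at index start.
double windowPeak(const std::vector<double>& signal, std::size_t start, std::size_t windowSize)
{
    double peak = 0.0;
    for (std::size_t i = start; i < start + windowSize && i < signal.size(); ++i)
        peak = std::max(peak, std::fabs(signal[i])); // absolute value handles negative samples
    return peak;
}

int main()
{
    std::vector<double> signal { 0, 3, 4, -9, 4, 5, 7, 5, 4, 3, 2, 1, 0 };
    std::cout << windowPeak(signal, 0, 3) << '\n'; // window {0, 3, 4}  -> 4
    std::cout << windowPeak(signal, 2, 3) << '\n'; // window {4, -9, 4} -> 9
}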

The SO link below gives alternatives to the above method, which can be more accurate. Look at where the answerer talks about RMS.

The two StackExchange links might explain it better than I can:
https://dsp.stackexchange.com/questions/46147/how-to-get-the-volume-level-from-pcm-audio-data
https://stackoverflow.com/questions/8282394/finding-the-volume-of-a-wav-at-a-given-time
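
For completeness, a sketch of the RMS (root mean square) idea those answers talk about; this is just my own illustration, not code taken from them:

#include <cmath>
#include <cstddef>
#include <vector>

// RMS "volume" of one window of samples: square every sample, average, take the square root.
double windowRms(const std::vector<double>& signal, std::size_t start, std::size_t windowSize)
{
    double sumSquares = 0.0;
    std::size_t count = 0;
    for (std::size_t i = start; i < start + windowSize && i < signal.size(); ++i)
    {
        sumSquares += signal[i] * signal[i];
        ++count;
    }
    return count > 0 ? std::sqrt(sumSquares / count) : 0.0;
}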

The time information of a sample is simply the index of that sample in your array.
If you're sampling at 1000 Hz and the 1st sample is at 0 seconds, then the 2nd sample is at 1/1000 seconds, 3rd sample is at 2/1000, etc.

Edit: Fixed mistakes.
I did not check the image correctly. Now I see that there are arrows pointing to exact locations, but I thought there was a group of samples on the right and a group on the left, which was completely wrong :-)
Thanks for explaining loudness metering, great stuff.
You're welcome. I'm not too experienced with the absolute best way to get the volume, but the methods described in the links sound like they should do.

Going back to your first post,
Is it possible to calculate frequency and volume from the sample?

Frequency is a whole other beast.

Frequency requires taking the Fourier Transform -- in discrete signals, it's called the Discrete Fourier Transform, usually implemented as the Fast Fourier Transform (FFT). This gets really mathy fast, but is one of the fundamental parts of digital signal processing.

The Fourier Transform transforms time samples into complex sinusoid frequency samples (a complex number, a + bi). The magnitude of one of these complex samples is the amplitude of that given frequency.
https://en.wikipedia.org/wiki/Fourier_transform
https://en.wikipedia.org/wiki/Discrete_Fourier_transform
http://www.robots.ox.ac.uk/~sjrob/Teaching/SP/l7.pdf
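
If it helps to see it spelled out, here is a naive (slow, straight-from-the-definition) DFT in C++ -- just my own illustration of "time samples in, complex frequency samples out", not a replacement for a real FFT library:

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive DFT: for each frequency bin k, sum x[n] * e^(-i * 2*pi * k * n / N).
// The magnitude |X[k]| is the amplitude of the frequency that bin represents;
// bin k corresponds to the frequency k * Fs / N, where Fs is the sample rate.
std::vector<std::complex<double>> dft(const std::vector<double>& x)
{
    const std::size_t N = x.size();
    const double pi = 3.14159265358979323846;
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::exp(std::complex<double>(0.0, -2.0 * pi * k * n / N));
    return X;
}
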
It's worth noting that a Fourier transform is not a <single sample> -> <single frequency value> function. Any Fourier transform or equivalent will accept an array of samples (in what is known as the time domain) and return an array of amplitudes in the frequency domain. In other words, just like a waveform is amplitude as a function of time, the output of an FT is amplitude as a function of frequency. For example, if you pass a block of samples containing a pure 1 kHz sine wave to an FT, ideally you'll get back a spectrum that's completely zero except for a large peak at 1 kHz (in reality you'll get peaks at 1 kHz, 2 kHz, 3 kHz etc. due to harmonics).
Thanks, the way I said that probably wasn't the clearest.

Although, in my experience, the distortion of the FT isn't due to harmonic interference; rather, it's due to the finite window that the FFT is applied to. If the size of the FFT were infinite, you'd get a pure delta at positive and negative 1 kHz (but then, alas, you'd lose all time locality).

function test()

    % FFT size 1024,
    % Sampling freq 44.1 kHz
    % signal frequency 1000 Hz
    Fs = 44100;
    f = 1000.0;
    N = 1024;
    T = 1.0 / Fs;
    n = 0:N-1;
    y = sin(2 * pi * f * n * T);

    freq_bins = n * Fs / N;
    Fy = fft(y, N);
    %Fy = fftshift(Fy);
    Fy_mag = abs(Fy);
    plot(freq_bins, Fy_mag);
    ax = gca;
    ax.XLim = [0, Fs/2]; % Stop at the Nyquist frequency, Fs/2 = 22050 Hz
    ax.XTick = 0:1000:20000; % Tick every 1000 Hz
    set(gca,'XTickLabel', 0:1000:20000);

end

There are no harmonic peaks here, even at low N.

Edit: Ooooh you're referring to real signals? If so I 100% agree, a piano note or whatever will produce harmonics that also have significant peaks.
You're probably right. I honestly don't know all that much about Fourier analysis, just the absolute basics.
I should clarify what I said too: in a finite window, the power will "leak" into frequencies other than just 1000 Hz (in the example), especially frequency bins very close to the 1000 Hz one.
https://en.wikipedia.org/wiki/Window_function#/media/File:Spectral_leakage_caused_by_%22windowing%22.svg
The fact that there is spectral leakage is what's still important -- which is in essence what you were talking about.

In other words, there's always going to be some uncertainty to the measurement.
- We can take a large number of FFT bins, but that requires a lot of samples in time, making our measurement more uncertain in time (we don't know exactly when a particular frequency is happening in a signal).
- Or, we can take a very small number of FFT bins, which requires only a few samples in time, but this then makes our measurement more uncertain in frequency.
- It's a trade-off: you can't have both perfect frequency and time resolution (but windowing functions can help -- see the sketch below).
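
As a small illustration of the windowing part (my own sketch, same caveats as before): a Hann window tapers the block of samples to zero at both ends before you hand it to the FFT, which reduces the leakage into far-away bins at the cost of a slightly wider main peak:

#include <cmath>
#include <cstddef>
#include <vector>

// Multiply a block of samples by a Hann window before taking its FFT.
// Tapering the block to zero at both ends reduces spectral leakage into
// far-away frequency bins (the trade-off: the main peak gets a bit wider).
void applyHannWindow(std::vector<double>& block)
{
    const std::size_t N = block.size();
    if (N < 2) return;
    const double pi = 3.14159265358979323846;
    for (std::size_t n = 0; n < N; ++n)
        block[n] *= 0.5 * (1.0 - std::cos(2.0 * pi * n / (N - 1)));
}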