Is there a built-in method for detecting outliers?

Hello everyone, I am new to C++ and I need your help :)

When I was using MATLAB, I used the function "filloutliers". I was wondering if there is something similar in C++?

In other words, I want to know if some library has a built-in function that detects outliers in a data set and replaces them.

Thanks in advance
Not in pure C++.
There is probably a statistics library that will do it, or you may need to write one. Also, MATLAB can export C++ code, but it's terribly slow and uses a black-box library.
The C++ Standard Library does contain some generic algorithms, but the thing you seek is too specific; it is not in the standard.

There are third-party libraries and toolboxes that provide more specific algorithms. For example, the Wikipedia page about Matlab states that Matlab is written in "C, C++, Java". In other words Matlab's filloutliers could have been implemented with C++.

There are many third-party libraries. Quick search hits:
http://linasm.sourceforge.net/docs/api/statistics.php
https://www.gnu.org/software/gsl/doc/html/
(The GSL is C, but can be called from C++ code.)

I have no idea whether the exact algorithm you seek has already been written by anyone other than Matlab's developers.

There is a balance between the use of a tool that already does a thing right and the reinvention of a wheel.
@jonnin, well, I do not have the MATLAB Coder kit anyway, so that would not work. Are you familiar with any statistics libraries that I can refer to? Thanks
@keskiverto, you are absolutely right. However, I am still a beginner. I would like to check how it is done, try to understand it, and maybe write my own for the fun of it.
I don't know what the go-to libraries for stats are. I had my own coded up but it can't do this task -- mine was really just a correlation finder engine for training AI (you only want things that correlate to the correct answer in the known data samples as your inputs).

The good news is that MATLAB is documented to death. The documentation should explain exactly what the routine does, making it fairly easy to re-create.

It's probably not too hard to make a crude one... compare each point against the std-dev of the sample... with some threshold, say flag it if it's more than 50% out past that?
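A minimal sketch of that crude check, assuming "50% out past that" means a point is flagged when it lies more than 1.5 standard deviations from the sample mean. The function name stddev_outliers and the parameter k are made up for illustration; this is not a port of anything MATLAB does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the indices of samples lying more than k standard deviations
// from the sample mean. k = 1.5 is an assumed reading of "50% out past".
std::vector<std::size_t> stddev_outliers(const std::vector<double>& x,
                                         double k = 1.5)
{
    assert(k > 0.0);
    std::vector<std::size_t> flagged;
    if (x.size() < 2) return flagged;   // no spread to measure

    const double mean = std::accumulate(x.begin(), x.end(), 0.0) / x.size();

    double sum_sq = 0.0;
    for (double v : x) sum_sq += (v - mean) * (v - mean);
    const double sd = std::sqrt(sum_sq / (x.size() - 1));  // sample std-dev

    for (std::size_t i = 0; i < x.size(); ++i)
        if (std::fabs(x[i] - mean) > k * sd) flagged.push_back(i);
    return flagged;
}
```

One caveat with this approach: the mean and the std-dev are themselves pulled around by the outliers they are supposed to catch, so a large spike can hide smaller ones.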
Use std::partial_sort at both ends to identify whatever percentiles are required. Copy the intermediates to a new array.
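A rough sketch of that idea, assuming the percentiles have already been turned into a count k of points to drop at each end. The function name trim_extremes is hypothetical. Note that the surviving values come back in unspecified order, so this suits after-the-fact analysis rather than a live stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Drops the k smallest and k largest values and returns the rest.
// Takes the data by value because partial_sort reorders it.
std::vector<double> trim_extremes(std::vector<double> v, std::size_t k)
{
    assert(2 * k <= v.size());

    // Move the k smallest values to the front.
    std::partial_sort(v.begin(), v.begin() + k, v.end());

    // Within the remainder, move the k largest values to its front.
    std::partial_sort(v.begin() + k, v.begin() + 2 * k, v.end(),
                      std::greater<double>());

    // Everything after the two sorted ends is the "intermediate" data.
    return std::vector<double>(v.begin() + 2 * k, v.end());
}
```

For example, with 10 samples and k = 1 this discards the single lowest and single highest reading and keeps the other 8.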
@jonnin, yes, I believe that is what I will do if I do not find an existing implementation. Thanks a lot for your help!

@lastchance, I am not sure this will be the best approach for my purpose later. I am working on a real-time application. I am collecting data from a gyroscope and a very noisy accelerometer at a rate of up to 238 Hz. So I want to replace an outlier as soon as it is detected, based on the earlier data. Moreover, there will be sensor fusion, so I want the algorithm to be as fast as possible.
Real time... you will want to KEEP the outliers in one set and remove them from another, maybe.
Over a long time, the outliers could begin to be the trend if you are seeing a problem. If you discard a point, then add a new point and discard that one, and then do it again... you just tossed 3 points that were grouped up and starting to form a new normal that could be very important. Be careful what you throw away, and how you approach this. I am used to after-the-fact analysis, not in-flight... so run the stats at each point on the full set with nothing discarded, but display without outliers, sure...

Sensor fusion up front could correct outliers, if the thing you are doing stats on is what the multiple sensors are measuring. If you fuse first, you may have already handled any false outliers via fusion?

A cheap CPU does billions of operations/sec. 238 samples/sec is not that much, though you do want to be efficient, of course.
@jonnin

Thank you for this tip! I would not have thought of keeping the outliers to investigate the behavior later.

I am working on a project where I calculate the orientation of my board (Euler angles) using an accelerometer and a gyroscope. Both are noisy (the accelerometer somewhat more so).

So far I can read the raw data and convert them to Euler angles (each one individually, for investigation purposes).

For the sensor fusion, I have no information about my Q and R matrices, so I keep the board in a "hold still" position and read the first N readings. I calculate the standard deviation for each axis for both sensors. However, the accelerometer values are much smaller than the gyro ones (maybe because my units are in g, not m/s²?), so the filter always tends to trust the accelerometer more.

Anyway, before going on with fusion, I thought it would be useful to detect the outliers in my data and then try to smooth them, because the signals are all kind of spiky.


@HSafy,
What is your criterion for "outlier" and how many "outliers" in succession are you prepared to tolerate?

Do you have to replace the outlier data or can you simply ignore it?
There are lots of things you can try. You can average the last 3-5ish readings and use that as the data point, hoping to cut the noise down, if that is acceptable for your needs. Sometimes stupidly simple things work great in practice.
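One way to read this suggestion is a sliding window: every new reading is averaged with the most recent ones already stored, so one smoothed value comes out per sample. Below is a minimal sketch under that assumption; the class name MovingAverage and its interface are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sliding mean over the last `window` samples, using a ring buffer
// and a running sum so each push is O(1).
class MovingAverage {
public:
    explicit MovingAverage(std::size_t window) : buf_(window, 0.0)
    {
        assert(window > 0);
    }

    // Stores one sample and returns the mean of the samples currently
    // held (fewer than `window` while the buffer is still filling).
    double push(double sample)
    {
        if (count_ == buf_.size())
            sum_ -= buf_[head_];        // window full: drop the oldest
        else
            ++count_;
        buf_[head_] = sample;
        sum_ += sample;
        head_ = (head_ + 1) % buf_.size();
        return sum_ / count_;
    }

private:
    std::vector<double> buf_;
    std::size_t head_ = 0;    // slot that the next sample overwrites
    std::size_t count_ = 0;   // number of valid samples in the buffer
    double sum_ = 0.0;
};
```

Usage would be something like creating `MovingAverage ma(5);` once and calling `ma.push(reading)` for every new sensor value. Over very long runs the running sum can accumulate floating-point drift, so recomputing it from the buffer now and then may be worthwhile.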
Every additional filtering has the cost of extra time needed. If you're sampling at 258 Hz and need to average out 5 data points before getting "good" data, that lags the system by ~19.4 ms.
This applies whether we're working in MatLab or C++; I just wanted to point that out. MatLab's filloutliers works on a buffer just like anything else would, so the amount of lag in the signal processing is proportional to the size of that buffer.

lastchance wrote:
What is your criterion for "outlier" and how many "outliers" in succession are you prepared to tolerate?
If we take the question at face value, then merely copying filloutliers is what is required.
According to its documentation, filloutliers does the following:
Mathworks wrote:
By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) away from the median
I also do not know of an equivalent C/C++ library; keskiverto's suggestions are probably a good place to start.

Maybe even just implement MAD yourself: https://en.wikipedia.org/wiki/Median_absolute_deviation
and use test data to compare what MatLab gives to what your own program generates to make sure you implemented it correctly.
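A minimal sketch of what such a hand-rolled version might look like, based only on the rule quoted above (more than three scaled MADs from the median). The scale factor 1.4826 is the usual constant that makes the MAD comparable to a standard deviation for normally distributed data. Replacing flagged values with the median is an assumption made here for illustration, and the names median_of and fill_outliers_mad are hypothetical; compare against MatLab's output before trusting it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty sample. Takes a copy because nth_element reorders.
double median_of(std::vector<double> v)
{
    assert(!v.empty());
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    double m = v[mid];
    if (v.size() % 2 == 0) {
        // Even count: average the two middle values.
        const double lower = *std::max_element(v.begin(), v.begin() + mid);
        m = 0.5 * (lower + m);
    }
    return m;
}

// Replaces every value more than `threshold` scaled MADs away from the
// median with the median itself. Returns how many values were replaced.
std::size_t fill_outliers_mad(std::vector<double>& x, double threshold = 3.0)
{
    if (x.empty()) return 0;
    const double med = median_of(x);

    std::vector<double> dev(x.size());
    std::transform(x.begin(), x.end(), dev.begin(),
                   [med](double v) { return std::fabs(v - med); });
    const double scaled_mad = 1.4826 * median_of(dev);

    std::size_t replaced = 0;
    for (double& v : x) {
        if (std::fabs(v - med) > threshold * scaled_mad) {
            v = med;
            ++replaced;
        }
    }
    return replaced;
}
```

Note that if more than half of the samples are identical, the MAD is zero and every differing value gets replaced, so a real implementation would probably want a guard for that case.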
Every additional filtering has the cost of extra time needed. If you're sampling at 258 Hz and need to average out 5 data points before getting "good" data, that lags the system by ~19.4 ms.

only the first time.
time: 19ms, average 1,2,3,4,5
time: 23ms, average 2,3,4,5,6
time: 27ms, average 3,4,5,6,7
--- that is, every 1/258 sec you add a data point as before, and average it with the last N which you have stored from before. I was not suggesting read 5, average, discard, read 5, average, discard.

However, it does slow the response time somewhat, as it takes about 3 inputs to weight a 5-value average enough to register a new state. So it's about 1/2 of what you said... not 19 ms, but ~10 ms of lag? But you can play with that: try a 3-value window, or maybe 4 or 6, and see what you think of the lag vs. the quality.

And the approach may not be good enough. But it's something to try given its simplicity.
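To put numbers on the two figures being discussed, here is a small sketch with hypothetical helper names. The first function gives the time needed to collect a full window (the ~19.4 ms figure); the second gives the steady-state delay of a sliding mean, which for an n-point window is (n - 1) / 2 sample periods. For 5 points at 258 Hz that is 2 sample periods, roughly 7.8 ms, in the same ballpark as the ~10 ms estimate. At the 238 Hz rate mentioned earlier in the thread, the same formulas give about 21.0 ms and 8.4 ms.

```cpp
#include <cassert>
#include <cstddef>

// Time needed to collect n samples at rate_hz, in milliseconds.
double window_span_ms(std::size_t n, double rate_hz)
{
    assert(rate_hz > 0.0);
    return 1000.0 * n / rate_hz;
}

// Steady-state delay of an n-point sliding mean, in milliseconds:
// the output is centred (n - 1) / 2 sample periods in the past.
double sliding_mean_delay_ms(std::size_t n, double rate_hz)
{
    assert(n > 0 && rate_hz > 0.0);
    return 1000.0 * (n - 1) / 2.0 / rate_hz;
}
```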
I agree, you can do that, but as you said it slows the response (e.g. you're averaging a rising signal and you only want to trigger something once the signal goes above a particular threshold).
One last thing... know your system. What is the maximum and average expected delta velocity / rotation / movement? That gives you outliers right off: nothing can exceed the physics of the thing, unless something bad has happened. I mean, your little airplane may only allow a 10 degrees/sec turn rate, unless someone shot it and made it go into an 80 degrees/sec spin. At that point, do you still care?
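A sketch of how such a physical limit could be applied sample by sample, assuming a known maximum rate of change and a fixed sample period. A reading that jumps further than physically plausible is replaced by the last accepted value, but after a few implausible readings in a row the new level is accepted, in line with the earlier warning about discarding a "new normal". The class name RateGate and all of its parameters are hypothetical, and the caller would still log the raw readings separately.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Rejects samples that change faster than max_rate (units per second),
// unless more than max_rejects implausible samples arrive in a row.
class RateGate {
public:
    RateGate(double max_rate, double dt, std::size_t max_rejects)
        : max_step_(max_rate * dt), max_rejects_(max_rejects)
    {
        assert(max_rate > 0.0 && dt > 0.0);
    }

    // Returns the sample if it is accepted, otherwise the last accepted value.
    double filter(double sample)
    {
        if (!primed_) {                 // first sample: nothing to compare to
            last_ = sample;
            primed_ = true;
            return last_;
        }
        const bool plausible = std::fabs(sample - last_) <= max_step_;
        if (plausible || rejects_ >= max_rejects_) {
            last_ = sample;             // accept (possibly a new normal)
            rejects_ = 0;
        } else {
            ++rejects_;                 // hold the previous value
        }
        return last_;
    }

private:
    double max_step_;           // largest plausible change per sample
    std::size_t max_rejects_;   // consecutive rejections tolerated
    double last_ = 0.0;         // last accepted value
    std::size_t rejects_ = 0;   // current run of rejected samples
    bool primed_ = false;       // true once the first sample has arrived
};
```

For instance, `RateGate gate(10.0, 1.0 / 238.0, 3);` would allow at most 10 units/sec of change at 238 Hz and tolerate three implausible readings before following the signal to its new level.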
@jonnin Thanks for your constructive feedback. I will keep these points in mind.

@ganado, yes, that is what I am doing now. I followed the MATLAB documentation and I am trying to implement it in my program.

@lastchance, any rapid jump in the data is considered an outlier. However, jonnin made me realize that such a jump might be new behavior of the system. I do not know yet, but there will certainly be a limit on the number of outliers in succession. I am not yet sure about replacing versus ignoring; I will try both and see what works best for the fusion filter.
feel free to ask after you do some analysis. You are touching on the areas I have spent a lot of my career in.
@jonnin, thanks a lot! I am sure I will :)