Machine Learning

Hi all,

Lucky for me, I have been tasked with finding an ML solution for our current project, and I'm hoping some of you have a little more experience than me in the subject. I've trained TensorFlow models on the flowers and MNIST datasets, and read a lot on the internet :)

I have n channels of data. Normally we generate a histogram for each channel and are interested in the bin counts in specific ranges. Generally we create a plot from two histograms that looks like this (not one of ours): https://python-graph-gallery.com/wp-content/uploads/80_bivariate_kernel_density_plot2-300x300.png

We are interested in populations at specific locations on the plot, not necessarily their shape.

I tried training Google's Inception with a few hundred images but never seem to get over 80% accuracy, which leads me to think that "computer vision" may not be the best approach.

It would be nice to solve this from the raw channel data, but the binned data is also available, as well as the plots mentioned.

Does anyone have any ideas on what an "appropriate" ML approach would be for this kind of problem?
I think I don't even understand the problem.
* n channels of data? What does that mean? Your data is a sequence of n-dimensional points, or what?
* What do you mean you are "interested in populations at specific locations on the plot"? Interested in what way? What do you want to know about those populations?
* You don't get over 80% of what? What's the success criterion?
Sorry for the late reply, I've been helping my offspring move house.

In C++ terms, I have several arrays of values (millions of entries), each representing readings from a different sensor. Each entry is a single uint32_t value.

* n channels of data? What does that mean? Your data is a sequence of n-dimensional points, or what?


#include <cstdint>

// one fixed-size array per sensor channel; each entry is one raw reading
struct data {
   uint32_t chanA[1000000];
   uint32_t chanB[1000000];
   uint32_t chanC[1000000];
   ...
};


I take a channel (identified by the user) and bin the values to create a histogram for that channel.

I also take any combination of 2 channels (also identified by the user) and use those histogram bins to generate density plots like the one linked in the OP.

A user has to visually examine a density plot to determine if a population (density point) exists at a specific location on the plot (the intersection of two bin ranges). This human activity is what I would like to remove.
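For concreteness, here is a minimal sketch of the binning step. It is illustrative only: the 1024-bin count, the linear value-to-bin mapping, and all the names (kBins, binOf, histogram1d, histogram2d, maxVal) are my assumptions, not our production code.

#include <cstdint>
#include <cstddef>
#include <array>
#include <vector>

constexpr std::size_t kBins = 1024;

// Map a raw reading onto [0, kBins); maxVal is the largest possible reading.
std::size_t binOf(uint32_t v, uint32_t maxVal) {
    return static_cast<std::size_t>(
        (static_cast<uint64_t>(v) * kBins) / (static_cast<uint64_t>(maxVal) + 1));
}

// 1D histogram of one channel.
std::array<uint32_t, kBins> histogram1d(const std::vector<uint32_t>& chan, uint32_t maxVal) {
    std::array<uint32_t, kBins> h{};
    for (uint32_t v : chan) ++h[binOf(v, maxVal)];
    return h;
}

// 2D counts behind a density plot: hist2d[x][y] counts entries whose paired
// readings fall in bin x of channel A and bin y of channel B.
std::vector<std::vector<uint32_t>> histogram2d(const std::vector<uint32_t>& a,
                                               const std::vector<uint32_t>& b,
                                               uint32_t maxVal) {
    std::vector<std::vector<uint32_t>> h(kBins, std::vector<uint32_t>(kBins, 0));
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        ++h[binOf(a[i], maxVal)][binOf(b[i], maxVal)];
    return h;
}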

You don't get over 80% of what?

80% success rate; the trained Inception model is only correct 80% of the time at best, which has led me to think that Inception is inappropriate for determining if a population exists at a specific location. Rather, it is good at determining that a population is present somewhere, but doesn't really care where it is in the image.
> what an "appropriate" ML approach would be for this kind of problem?
I still don't understand what problem you are trying to solve.

> A user has to visually examine a density plot to determine if a population
> (density point) exists at a specific location on the plot.
// a simple threshold test at the known bin location
if (hist[x][y] > tol)
    return true;
What's a density point?

> 80% success rate
I'll repeat the question: what's the success criterion?


> i have n channels of data (...) generally we create a plot from two
> histograms (...) A user has to visually examine to determine if a population
> exists at a specific location
You are describing your current "manual" solution, not the problem that you're trying to solve.
What's the user looking for? Cluster separation? Is it necessary to limit it to 2D?


> I tried training googles Inception with a few hundred images
The number of training examples needed depends on the complexity of the NN used, the input variables, and the problem complexity.
"A few hundred" sounds small.
> A user has to visually examine a density plot to determine if a population
> (density point) exists at a specific location on the plot.
> if (hist[x][y] > tol)
>     return true;

In its most basic form, yes, but I've been asked to come up with a machine learning solution so it can be trained for different scenarios without us having to hard-code them all.

I'll repeat the question: what's the success criterion?

My boss will say 100% successful recognition that a population exists at a specific place in the data. It doesn't have to be 2D; the actual data has hundreds of channels, which we consider to be hundreds of dimensions for the dataspace.

You are describing your current "manual" solution...

Yes, I am looking for a cluster. We looked into principal component analysis, but that appears to provide relative distances between the clusters rather than an absolute location (perhaps I misunderstood).

I also looked at Self Organising Maps, but I got the impression they would be more effective at describing the "shape" of a population.

And yes, a few hundred is a very, very small data set; it took months to get hold of those! We're still trying to get more, of course, but it's a long drag in clinical systems.
So basically you want blob detection, right? A given population will cluster around a point and you want to find that point given a histogram. Does the specific shape of the blob matter, or only the size (I assume you're interested in the largest blob)?
Yes, blob detection is what I've been trying to do; it was my obvious "first step" based on existing user activity. I would prefer not to generate the plot at all and train the AI on the unbinned arrays, but I don't even know if that's possible.

A given population will cluster around a point and you want to find that point given a histogram.

Almost, Helios; I already know the point (roughly) and am trying to determine whether a cluster exists there.

In all of the training data I have, the blob is very isolated; I haven't seen a second cluster close to it, although with this small dataset I guess I can't make that call definitively.

The shape does not matter, and to a lesser extent neither does its size; the width and height appear pretty consistent (about 10% of the 1024 bins) across the dataset.

If there were more than one cluster, I'd be interested in the densest (if that's a real word).
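As an aside, if we ever did need to pick the densest of several candidates without ML, one non-ML sketch would be a sliding-window sum over the 2D bin counts, made cheap with a summed-area table. The window size (~10% of the 1024 bins) comes from the estimate above; densestWindow and Peak are made-up names.

#include <cstdint>
#include <cstddef>
#include <vector>

struct Peak { std::size_t x, y; uint64_t mass; };

// Densest win-by-win window; the summed-area table makes each window sum O(1).
Peak densestWindow(const std::vector<std::vector<uint32_t>>& h, std::size_t win = 102)
{
    const std::size_t nx = h.size(), ny = nx ? h[0].size() : 0;
    // S[x][y] = sum of h over the rectangle [0, x) x [0, y).
    std::vector<std::vector<uint64_t>> S(nx + 1, std::vector<uint64_t>(ny + 1, 0));
    for (std::size_t x = 0; x < nx; ++x)
        for (std::size_t y = 0; y < ny; ++y)
            S[x + 1][y + 1] = h[x][y] + S[x][y + 1] + S[x + 1][y] - S[x][y];

    Peak best{0, 0, 0};
    for (std::size_t x = 0; x + win <= nx; ++x)
        for (std::size_t y = 0; y + win <= ny; ++y) {
            const uint64_t mass = S[x + win][y + win] - S[x][y + win]
                                - S[x + win][y] + S[x][y];
            if (mass > best.mass) best = {x, y, mass};
        }
    return best;  // top-left corner and total count of the densest window
}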
For the picture that you showed in the original post, what "answer" would you like to get?
@lastchance
Note: "sepal" is just the name on the example plot axes; I have no idea what it means to the originator of the plot.

Also, I have two Inception categories, "found" and "not found".

what "answer" would you like to get?

Let's presume that when comparing sepal width/length, a cluster at x = 3.0..3.2 and y = 6.5..6.7 should result in "found", otherwise "not found"; so the plot given would be categorised as "found".

If we had a second example plot and that top cluster (on the example plot) was elsewhere, I would categorise that as "not found".

The "truth" that a cluster exists at the given position is the answer I'm trying to determine.
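Put as code, the most literal version of that test on the binned data might look like the sketch below; the window bounds and the minCount threshold are hypothetical placeholders that would have to be calibrated per scenario, which is exactly the hard-coding we want to avoid.

#include <cstdint>
#include <cstddef>
#include <vector>

// "found" iff enough entries land inside the target window of the 2D counts.
bool populationFound(const std::vector<std::vector<uint32_t>>& hist2d,
                     std::size_t x0, std::size_t x1,  // target bin range, channel A
                     std::size_t y0, std::size_t y1,  // target bin range, channel B
                     uint64_t minCount)               // calibrated density threshold
{
    uint64_t total = 0;
    for (std::size_t x = x0; x <= x1 && x < hist2d.size(); ++x)
        for (std::size_t y = y0; y <= y1 && y < hist2d[x].size(); ++y)
            total += hist2d[x][y];
    return total >= minCount;
}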
Is your picture (with "sepal width" and "sepal length") related to something like this classic statistical dataset:
https://en.wikipedia.org/wiki/Iris_flower_data_set
Dunno if reference 6 on that page (Farber et al.'s paper) might be relevant: not my field.
Or else: https://en.wikipedia.org/wiki/Cluster_analysis


Just thinking off the top of my head:
(1) Renormalise the pixel plot using min and max values at those pixels so that the range is 0 to 1. This might not be linear scaling.
(2) If (say) a tenth of the bins exceed a normalised value of (say) 0.75 then there is insufficient variation to justify any clusters; so any position will give "not found".
(3) Otherwise, clusters will exist in regions where the normalised value exceeds a threshold - say 0.9.

All those thresholds will need to be calibrated to some similar data (and the instrument capabilities of whatever is gathering it).

You may need to filter first to eliminate noise; e.g. average over local pixels or (better) FFT and low-pass filter.
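A rough, untested C++ transcription of steps (1)-(3), with the 10% / 0.75 / 0.9 numbers kept as parameters since they would need calibrating (foundAt is a made-up name, and px/py are assumed to be valid bin indices):

#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

bool foundAt(const std::vector<std::vector<uint32_t>>& hist2d,
             std::size_t px, std::size_t py,      // position being tested
             double flatFrac = 0.10,              // step 2: fraction of "hot" bins meaning no structure
             double flatLevel = 0.75,             // step 2: normalised level counted as "hot"
             double clusterLevel = 0.90)          // step 3: normalised level marking a cluster
{
    // Step 1: find min and max counts so we can renormalise to [0, 1].
    uint32_t lo = UINT32_MAX, hi = 0;
    std::size_t total = 0;
    for (const auto& row : hist2d)
        for (uint32_t c : row) { lo = std::min(lo, c); hi = std::max(hi, c); ++total; }
    if (total == 0 || hi == lo) return false;     // empty or completely flat plot

    auto norm = [&](uint32_t c) { return double(c - lo) / double(hi - lo); };

    // Step 2: too many near-maximal bins means insufficient variation.
    std::size_t hot = 0;
    for (const auto& row : hist2d)
        for (uint32_t c : row) if (norm(c) > flatLevel) ++hot;
    if (hot > flatFrac * total) return false;

    // Step 3: a cluster exists at (px, py) if that bin clears the threshold.
    return norm(hist2d[px][py]) > clusterLevel;
}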
Is your picture related to ...

It looks like that's where the image came from, but no, I just picked it from Google Images.

We already use distribution- and density-based clustering, and a colleague is working on a principal component analysis algorithm for us.

The problem we are trying to resolve is that these methods require the dev team to know the answers in order to code the "found"/"not found" tests (meaning we, the devs, need to know where to look for the cluster). ML would give us a mechanism that our userbase could train with their own data, thus deferring the "found"/"not found" test to the customers' training data. This would open up our markets significantly, and our customers would not have to provide us with their proprietary datasets.

