Fastest algorithm for a conditional search within two large text files

Hi experts,
I want to compare the contents of two large text files:
i.e.
In file A (n lines):
x11 y11 x21 y21 C1
x12 y12 x22 y22 C2
.
.
.
x1n y1n x2n y2n Cn

In file B (m lines):
a11 b11 ... X11 Y11 X21 Y21 ...
a12 b12 ... X12 Y12 X22 Y22 ...
.
.
.
a1m b1m ... X1m Y1m X2m Y2m ...

I want to check, for all i in 1..m,
whether {X1i, Y1i, X2i, Y2i} is inside {x1j, y1j, x2j, y2j} for some j in 1..n.

Right now my pseudocode is like this, which is brute force:
  for each line j in file A:
      for each line i in file B:
          if {X1i, Y1i, X2i, Y2i} is inside {x1j, y1j, x2j, y2j}:
              append Cj to [a1i b1i ... X1i Y1i X2i Y2i ...]
This takes a long time. Is there a faster way to do it?
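For concreteness, a minimal C++ version of that brute force, assuming "inside" means axis-aligned rectangle containment with (x1, y1) as the lower-left corner and (x2, y2) as the upper-right corner (the Box struct and helper names are only illustrative):

#include <cstddef>
#include <string>
#include <vector>

struct Box { double x1, y1, x2, y2; };   // lower-left (x1, y1), upper-right (x2, y2)

// assumption: "inside" means inner lies completely within outer
bool contains(const Box& outer, const Box& inner)
{
    return outer.x1 <= inner.x1 && outer.y1 <= inner.y1 &&
           inner.x2 <= outer.x2 && inner.y2 <= outer.y2;
}

// brute force: n * m containment tests
void bruteForce(const std::vector<Box>& boxesA,
                const std::vector<std::string>& labelsA,   // the Cj values from file A
                const std::vector<Box>& boxesB,
                std::vector<std::string>& linesB)          // the file B lines, appended to in place
{
    for (std::size_t j = 0; j < boxesA.size(); ++j)
        for (std::size_t i = 0; i < boxesB.size(); ++i)
            if (contains(boxesA[j], boxesB[i]))
                linesB[i] += " " + labelsA[j];
}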

What does it mean for {X1i, Y1i, X2i, Y2i} to be inside {x1j, y1j, x2j, y2j}?
Many fast search algorithms (for example, in strings) utilize the properties of this test to avoid extra work.

Can you give a small illustrative example?

How large is n relative to m?
Please give a snippet of real data (not made up) for the above.
And I have a weird question:

If it's looking for exact word matches (e.g. 123 is not matched against 1234), are all the terms < 9 bytes?


Thanks for the prompt reply.

Here are examples:
Where:
1. The coordinate values are the bounding box (xll, yll, xur, yur): the two pairs at the start of each File A line and the four numbers inside the braces in each File B line. (Don't worry about the formats, I will convert them.)
2. All coordinates in File A will be divided by 2000, so the units are consistent.

I want to check, line by line, whether each bounding box in File B is inside one of the bounding boxes in File A. If I still haven't explained it clearly enough, please let me know. Thank you all for the help.

File A:
(-2, -2) (2110, 2126) V: 0/42/51 H: 0/46/48
(2110, -2) (4210, 2126) V: 0/41/52 H: 0/45/48
(4210, -2) (6310, 2126) V: 0/38/52 H: 0/43/48
(6310, -2) (8410, 2126) V: 0/37/52 H: 0/40/48
(8410, -2) (10510, 2126) V: 0/38/52 H: 0/42/48

File B:
{0.48 3.312 0.768 3.888}/ 0.158830054012/ 1/ 1/ 1/ 0/ 0/ 0/ 17.9895355819/ 5/ 31.5921879642
{0.48 1.008 3.168 1.584}/ 0.0732938574735/ 4/ 1/ 0/ 1/ 0/ 0/ 55.5351438983/ 16/ 6549.4583645
{0.48 1.584 3.168 2.16}/ 0.0732938574735/ 4/ 1/ 0/ 1/ 0/ 0/ 55.5351438983/ 16/ 6549.4583645
If it's numeric, it will be faster to process as numbers, not as text, to start with. But it sounds like you are already doing that, because you said you divide by 2000, presumably not as text (that would add a lot of time!). Can you clarify?
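To show what I mean by processing them as numbers, here is a rough sketch that parses one line of each format from the samples above into doubles (no error handling; parseA/parseB are made-up names, and the divide-by-2000 follows your description):

#include <sstream>
#include <string>

struct Box { double x1, y1, x2, y2; };

// File A line, e.g. "(-2, -2) (2110, 2126) V: 0/42/51 H: 0/46/48"
// turn the punctuation into spaces, read four numbers, divide by 2000 to match file B units
Box parseA(const std::string& line)
{
    std::string s = line;
    for (char& c : s)
        if (c == '(' || c == ')' || c == ',') c = ' ';
    std::istringstream in(s);
    Box b;
    in >> b.x1 >> b.y1 >> b.x2 >> b.y2;
    b.x1 /= 2000; b.y1 /= 2000; b.x2 /= 2000; b.y2 /= 2000;
    return b;
}

// File B line, e.g. "{0.48 3.312 0.768 3.888}/ 0.158830054012/ ..."
// the box is the four numbers inside the braces; the rest of the line is ignored
Box parseB(const std::string& line)
{
    const auto open = line.find('{'), close = line.find('}');
    std::istringstream in(line.substr(open + 1, close - open - 1));
    Box b;
    in >> b.x1 >> b.y1 >> b.x2 >> b.y2;
    return b;
}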
One key specification here is the volume of these files. Suggestions will be considerably different depending on whether the files have billions of entries, millions, or a few thousand.

Modern machines with a few GB of RAM and 64-bit processors could easily load two files like this into RAM with tens of thousands of bounding boxes, and conceivably millions if done with reasonable efficiency. A good development machine with at least 32 GB of RAM could easily manage a few hundred million. Unless otherwise prohibited by specifications, I'd highly recommend loading the data into RAM and probably sorting one of the two files to make its contents quickly searchable (probably file B, if I follow your discussion correctly).

I say sorted, but there are lots of choices on organizing spatial data you could choose. A simple sort, however, can make searching faster.

I can't quite determine if you are searching exclusively for exact matches or for bounding boxes in B which contain a bounding box taken from A. The methods are quite different for these. Exact matches, of course, require that all coordinates match, and as such if file B is sorted by a composite key ll_x, ll_y, ur_x, ur_y (something like last name, first name), you could find the matching entries with a simple find (and subsequent sweep forward if there are multiple matches in B).
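For that exact-match case, a minimal sketch of the composite-key sort and lookup, assuming each box is held in a simple struct (the names are only illustrative):

#include <algorithm>
#include <tuple>
#include <vector>

struct Box { double x1, y1, x2, y2; };   // ll_x, ll_y, ur_x, ur_y

// composite key: order by ll_x, then ll_y, then ur_x, then ur_y
// (like sorting by last name, then first name)
bool lessByKey(const Box& a, const Box& b)
{
    return std::tie(a.x1, a.y1, a.x2, a.y2) < std::tie(b.x1, b.y1, b.x2, b.y2);
}

// sort file B once, then each exact-match lookup costs O(log m)
void sortB(std::vector<Box>& fileB)
{
    std::sort(fileB.begin(), fileB.end(), lessByKey);
}

// returns the range of entries in the sorted fileB whose four coordinates all equal
// those of 'fromA' (equal_range covers the "sweep forward" over duplicates for free)
// note: doubles are compared exactly here, which is fine if both files were parsed
// from identical text; otherwise a tolerance would be needed
auto findExact(const std::vector<Box>& fileB, const Box& fromA)
{
    return std::equal_range(fileB.begin(), fileB.end(), fromA, lessByKey);
}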

However, if you are trying to find bounding boxes in B which contain each bounding box from A, the benefit of sorting makes for a far different algorithm. Let me know if this latter concept is what you require.
@jonnin,
Both files are big (File A is around 127K lines, and File B is around 144K lines), and there are a lot of unrelated numbers and strings in the lines.
These are not the largest ones; there are files with over 1M lines each.
@Niccolo,
Thanks for the advice, I will try to load them into RAM and see what happens.
I am checking whether each bounding box from File B is inside any bounding box of File A (each B box must be inside exactly one A box, though an A box might be matched repeatedly, since each bounding box in File A is larger than each bounding box in File B).
Sorting is a great idea; I will try this as well. Thanks a lot for the great recommendations.
I remain unimpressed... a typical desktop has 32-64 GB of RAM; a million lines of text isn't what it used to be :)
I don't think memory is your problem; use as much as you want with whatever algorithm makes sense.
It looks like the first 4 numbers are all you care about and the rest of the line is fluff, in which case I stand by reading the numbers as values, not text. Yes, you get to ignore the () characters and do a little minor parsing, but it's still going to be more efficient, I think.
Read file A, extract the numbers, divide by 2000, and store them in a vector.
Sort the vector by left edge.
Now read file B. For each box in B:
- Do a binary search for the left edge of the B box in the A vector. Call this index L.
- Do another binary search for the right edge of the B box. Call this index R.

For B to be contained within a box in A, the index of the A box must be between L and R. Compare against those boxes.
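Here is a rough sketch along those lines, assuming the File A boxes are already parsed, divided by 2000, and sorted by left edge; in this version a single upper_bound bounds the candidates whose left edge is at or before the B box's left edge, and the remaining coordinates are checked directly:

#include <algorithm>
#include <string>
#include <vector>

struct Box { double x1, y1, x2, y2; std::string label; };   // label = the "V: ... H: ..." part of the file A line

bool contains(const Box& outer, const Box& inner)
{
    return outer.x1 <= inner.x1 && outer.y1 <= inner.y1 &&
           inner.x2 <= outer.x2 && inner.y2 <= outer.y2;
}

// boxesA must be sorted ascending by x1 (left edge)
// returns the index of a containing box in A, or -1 if none is found
long findContainer(const std::vector<Box>& boxesA, const Box& b)
{
    // binary search: first A box whose left edge is past b's left edge
    auto it = std::upper_bound(boxesA.begin(), boxesA.end(), b.x1,
                               [](double x, const Box& a) { return x < a.x1; });
    // only boxes before 'it' can contain b (their left edge is <= b.x1);
    // walk backwards over that candidate range checking the other coordinates
    // (worst case this walk is linear; how quickly it stops depends on how the data clusters)
    while (it != boxesA.begin()) {
        --it;
        if (contains(*it, b))
            return it - boxesA.begin();
    }
    return -1;
}

For each B box, findContainer gives an index into the sorted A vector (or -1 if nothing contains it), and that A box's label can then be appended to the B line.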
And you may be able to cut the binary search down to an O(1) lookup if that isn't good enough.
And you may be able to thread it to get a flat multiplier speedup per core, so 4 to 8 times faster on a PC, or many more on a corporate server.
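A sketch of the threading idea, assuming the Box struct and findContainer from the sketch above; the A data is read-only and each thread writes only its own slice of the result vector, so no locking is needed:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// split the B boxes into one contiguous chunk per hardware thread
void matchAllThreaded(const std::vector<Box>& boxesA,      // sorted by left edge
                      const std::vector<Box>& boxesB,
                      std::vector<long>& result)           // result[i] = index into boxesA, or -1
{
    result.assign(boxesB.size(), -1);
    const unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (boxesB.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < nThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(boxesB.size(), begin + chunk);
        if (begin >= end) break;
        pool.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                result[i] = findContainer(boxesA, boxesB[i]);
        });
    }
    for (auto& th : pool) th.join();
}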
I don't know your data, but there may also be a way to make meta-boxes so you do less work. E.g. if you have 1000 boxes and 500 of them are all inside one big box, and the other 500 are split into 5 or 10 boxes, that cuts it way down. Basically, there may be a way to eliminate redundant searching if the data is very clustered. But I am not sure how much clustering is needed to justify the extra processing to set this up and work it... sometimes you can exploit knowledge about YOUR data to do better than a generic algorithm that has to work for ALL data.
Topic archived. No new replies allowed.