Pulling text info from picture?

At work we have an NFS that stores various things (most everything). Anyway, we also store some diagrams there. Lately, I've noticed that these diagrams don't get updated often so they tend to have several names that are no longer valid. I'd like to automate the process of finding which of these names are no longer valid. If I can get them to a file, I'll have it (just ping down the list, record which ones don't give a response). But getting them to the file seems like a challenge. We have a JPEG version of these in the NFS, but we also have a master Visio document for them. What I'd need done is to just extract all text from the file, get rid of the irrelevant stuff (ie the legend, title, etc) and store them in a file. Would this be possible?
Sure, no problem, if I understood what you're saying.

By NFS I assume you mean network file server.
Do you access it from a Windows client or what?

I'd like to automate the process of finding which of these names are no longer valid.


I don't understand how you can have a file name that is not valid. Do you mean a shortcut, where the file has been moved, so the shortcut is not valid?

Or maybe you have filename001 and 002, and the older one is not valid?

(just ping down the list, record which ones don't give a response)

This makes no sense to me, I work on file systems, I'm not aware of any way to ping a file.

What I'd need done is to just extract all text from the file, get rid of the irrelevant stuff (ie the legend, title, etc) and store them in a file.

What file type do you want to extract text from?

What I think I understand is you want to read some kind of file and save any text into a dumpfile, that you can review... You probably also want the path and filename for the data in the dumpfile. Does that sound right?

*edit* I dumped a .jpg and the output looked like
 ╪ ß ■Exif MM * ►☺☼ ☻ ↕ ╨☺► ☻ ♀ Σ☺↕ ♥ ☺ ☺ ☺


Not sure anything there is helpful. Maybe yours looks different?

Maybe you can give me an example.
do you access from a windows client or what?

Windows side, yea. Though I'm sure it could also be accessed from *nix, Windows is all I need.

I don't understand how you can have a file name that is not valid. Do you mean a shortcut, where the file has been moved, so the shortcut is not valid?

Sorry I may not have been clear. I have a few diagrams that have the layout and logical design of our network. These diagrams get used often to know where we need to fix something, log in to a switch, etc. Anyway, on these diagrams they have the name of each network device (switch, router, firewall, L3 switch, etc) but sometimes we change the hostname of the device and it doesn't always get changed on the diagrams. I just noticed earlier today actually that probably 25% of the devices were now labeled wrong on the diagrams, and I had to go to secondary measures just to find the name. This is kind of annoying at times because sometimes those diagrams are the only thing available.

What I'm trying to do is pull from the diagrams the names of each device, store in a temp file, then just go down the list attempting to ping them or do a NS lookup, and record whichever fail. This part I can do just fine, it's the whole pulling text out of a jpeg I'm having trouble thinking of a way to do.

Just in case I'm still not being clear, I'll describe one diagram:

The network layout is represented on the diagram as a tree with the root node in the center (core router), and leaving the root node are various other routers and switches, also represented as nodes. On each node is a label that consists of the hostname of the device. I'm trying to pull out this label. The "publicly" available diagrams are in JPEG format, but there is a master file where all the editing actually happens and that's a Visio file.
Now that you have explained, I don't know that I can help, but I'll think on it a bit more.

I was hoping that the text you wanted was not part of a jpg. Yes, I have the same mind fog on how to go about doing that. I'm sure someone has done it, but I'm just not sure how. Some .exe and other types of files have readable text that you can pull out, but I don't think a .JPG has that unless it was added after the fact.

Sometimes jpgs are made by hand, but often there may be a program that made them. Or at least someone somewhere had the data to make the jpgs. While jpgs are a good visual representation, it sounds like what you need going forward is a network mapping program that can store the data you need to a file rather than a jpg.

For now, it may just be a manual process. I would make a file list of all the jpgs, create a text/data file with as much of the network info as possible, and then try to link the jpgs for each network/subnet to the text data that you have. An HTML format comes to mind as something with clickable links.
Usually if you get it into a txt file it's easy to convert to HTML.

You might be able to do that with Visio, I'm just not certain of its abilities.

So you might have 10.10.10.1 and several jpgs under it:
10.10.10.1
c:\temp\1.jpg
c:\temp\2.jpg
c:\temp\3.jpg
10.10.20.1
c:\temp\13.jpg
c:\temp\23.jpg
c:\temp\33.jpg

The other way would be to use the jpg filename and all the subnets that it includes:
c:\temp\13.jpg
10.20.10.1
10.20.30.1
10.20.40.1

In other words, create a little database of which jpgs show which networks. It would then be easier to search the database file and find all related jpgs for a given subnet/ip range.
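A rough sketch of that little database in Python, in case it helps. The dictionary contents are just the made-up addresses and paths from the listings above, and `invert_index` is a name invented for this illustration. It shows that you only need to maintain one of the two layouts, since the other can be derived from it.

```python
from collections import defaultdict


def invert_index(subnet_to_images):
    """Turn {subnet: [image, ...]} into {image: [subnet, ...]}."""
    image_to_subnets = defaultdict(list)
    for subnet, images in subnet_to_images.items():
        for image in images:
            image_to_subnets[image].append(subnet)
    return dict(image_to_subnets)


# Hypothetical data, mirroring the first listing above.
index = {
    "10.10.10.1": [r"c:\temp\1.jpg", r"c:\temp\2.jpg", r"c:\temp\3.jpg"],
    "10.10.20.1": [r"c:\temp\13.jpg", r"c:\temp\23.jpg", r"c:\temp\33.jpg"],
}

# Derive the second layout (jpg -> subnets) from the first.
by_image = invert_index(index)
```

Looking up `index["10.10.10.1"]` gives the jpgs for a subnet, and `by_image[r"c:\temp\13.jpg"]` gives the subnets shown on a jpg.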

You might even add some codes to know if it's a hub, router, switch, or firewall.
http://nmap.org/book/man.html looks interesting, and it's open source.
I think you want an OCR library, I believe OpenCV can do that but I have never tried.
Edit: Would it be possible to upload a sample of the diagrams?
@SamuelAdams, I believe we're still on different pages here.

@naraku, I may be able to recreate an example but I can't upload any of the production diagrams for security reasons.

After googling OCR, I do believe that's what I would need. Not sure if this would be easier than just making people stick to policy when renaming devices lol.

Either way, I'll look into OpenCV and see if I learn something. Would be cool to get it implemented, though. I could see this helping the whole IT department here.
If the diagrams are rasterized from text files, it'd be much easier (and more reliable) to just search for whatever text you're looking for in these files, do the modifications, and then regenerate the diagrams. OCR technology is nowhere near a point where I would trust it to automatically update software documentation. Documents with drawings and basically anything that isn't pure text are particularly good at confusing OCR algorithms.
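For the Visio master specifically, here is a minimal sketch of reading the labels from the source file instead of the JPEG. It rests on assumptions not confirmed anywhere in this thread: that the master is (or can be saved as) a .vsdx file, which is a zip archive of XML parts, that page content lives under `visio/pages/pageN.xml`, and that shape labels sit in `Text` elements in the namespace shown. The older binary .vsd format would not work this way. `extract_shape_text` is a made-up helper name.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace of the .vsdx page XML.
VISIO_NS = "{http://schemas.microsoft.com/office/visio/2012/main}"


def extract_shape_text(vsdx_file):
    """Return the non-empty text labels found on every page of a .vsdx file.

    vsdx_file may be a path or a binary file-like object.
    """
    labels = []
    with zipfile.ZipFile(vsdx_file) as archive:
        for name in sorted(archive.namelist()):
            # Only look at the per-page parts (assumed layout).
            if not re.fullmatch(r"visio/pages/page\d+\.xml", name):
                continue
            root = ET.fromstring(archive.read(name))
            for text_elem in root.iter(VISIO_NS + "Text"):
                # Join the text pieces, skipping inline formatting tags.
                label = "".join(text_elem.itertext()).strip()
                if label:
                    labels.append(label)
    return labels
```

The result would still contain the legend, title and so on, so those would need filtering out before the list is useful.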
That's what I was afraid of. Maybe this isn't feasible. I may just manually create the text file from the diagrams each month or so and run it through the script.
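For reference, a minimal sketch of what such a script could look like, assuming the text file holds one hostname per line and that a failed DNS lookup is a good enough signal for a stale name. The function names are invented for this example, and it does a name lookup rather than an actual ping.

```python
import socket


def read_hostnames(path):
    """Read one hostname per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def find_unresolved(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that fail to resolve.

    resolve is injectable so the logic can be tried without a network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Usage would be something like `find_unresolved(read_hostnames("devices.txt"))`, where `devices.txt` is a hypothetical file name.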