I need help - urgently

Hello,
I am new to this forum, and here is my problem.
Topic: accountancy program

I am currently writing a C++ program that will compare a large number of PDF files. It should compare numbers and dates and find matching pairs. The problem is that the dates and numbers vary from PDF to PDF.

My idea so far is:
Folder 1 - one set of PDF files
is compared with
Folder 2 - a second set of PDF files

But I am struggling to get it to compare dates AND numbers.


Could anyone help with ideas on how to code it?
This is for showing high school students the benefits of digitalization.
I hope someone responds, since I need this to be done.
Please?
What are these numbers/dates? How do you obtain them?

What is the difficulty when you try to compare?
I have 2 folders:

Folder 1
Contains maybe 1000 PDF Files

Folder 2
Contains Maybe 1040 PDF files

Here the program needs to report that the two folders contain an unequal number of PDF files.

2nd.
The PDF files will have the date at a random place in the file, and there are a lot of filler words. I need the program to pick out only the largest number in PDF 1 of folder 1 and compare it to the largest number in PDF 1 of folder 2.

So it should come up
PDF 1 Date & Largest Amount = PDF 2 Date & largest Amount

The program has now read the PDF file and ignored the "filler" data.

3rd.

The program will not care about the name of each PDF file, since the names can vary.



My problem is that I can't make the program read and compare two pieces of data (date and amount) for each PDF file.
I am not sure if it is an impossible task :(

I know it is really advanced, but I hope some of you might know what could be done.
It certainly is not impossible.
Can you upload one sample document so we can see the structure?
Here is one article that might be helpful.
https://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file


Folder 1
http://www.responsive.co.nz/ledgerplus/images/printcustomeraccount.gif

Folder 2 will just contain PDFs with
Date - Amount.

I don't know if this is any help.


Edit:

I am unsure how to upload PDF files to this forum.
You can't upload files here in the forum.
You could upload it to Dropbox, Google Drive, OneDrive...
I can send you a shared folder by PM if you enable that option.
I can pay if that is what I have to do.
I think it has nothing to do with getting paid.
It's a rather complicated task in C++, and not many people here have experience with PDF libraries.
I could only do it in C# or maybe Java, but this might not be an option for you.
What is this project about - homework?
You can do it in Java if you feel that is easier.
It is not homework. It is to show high school kids how quickly digitalization can change stable jobs such as accounting and controller jobs, and that they should keep up with the digitalization trend since it will change most jobs.

I already have a deal with a GUI programmer who will put it in a format so it does not look so "boring"/advanced.
Are you comparing the files (byte by byte), the OS-level data (OS timestamp and file name), the data in the files, or something else?

Comparing the data in the files, regardless of bytes or file name.

It does not matter to me which programming language is used; I just thought C++ would be the best match for this.
But if you think that Java or C# is better, that is fine with me.
Comparing the data in a PDF is going to be tricky: you have to extract the text and images and compare those. Java and C++ are more or less interchangeable for this. Java has some odd "you can't do that" limitations that get in the way of some things (math code, for example, is hard to write cleanly without operator overloading) and of pointer work (which is falling by the wayside in C++, but you CAN do it where you need to). Both languages can solve this problem, though. Java is more portable; C++ is usually a little faster (often by too little to even measure).

I would find a pdf library that you like and use a language that can interact with it.
All major languages can do the rest of the work around the library.
I don't know how to make it.
Could anyone help?
In .NET, iTextSharp would be a good option.
https://sourceforge.net/projects/itextsharp/
or
http://www.pdfsharp.net/

In what language would the GUI be written?
As with any complicated problem, break it down into parts:

- iterate over PDF files in a directory
- open a PDF file
- extract the text from a PDF file (presumably, using some kind of existing library, as others have said)
- identify the date in the text from a PDF file
- identify the largest number in the text from a PDF file
- compare dates
- compare numbers

Once you have those building blocks in place, you can hopefully put them together to get the functionality you want.
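
The last four building blocks can be sketched in standard C++17 once the text has been extracted. The sketch below does not read PDFs; it assumes some PDF library has already produced the text of each file as a std::string. Everything else in it is an assumption made for illustration: the names Record, extractRecord and unmatched, dates written day-first as DD.MM.YYYY (or with '/' or '-'), and amounts written with exactly two decimals and optional thousands commas, such as 1,500.00. The patterns would need adjusting to the real documents.

```cpp
#include <algorithm>
#include <iterator>
#include <optional>
#include <regex>
#include <string>
#include <tuple>
#include <vector>

// One record per PDF: its date and the largest amount found in its text.
struct Record {
    std::string date;     // normalized to YYYY-MM-DD
    long long cents = 0;  // amount in cents, to avoid comparing floating-point values
};

bool operator<(const Record& a, const Record& b) {
    return std::tie(a.date, a.cents) < std::tie(b.date, b.cents);
}
bool operator==(const Record& a, const Record& b) {
    return a.date == b.date && a.cents == b.cents;
}

// Assumed date format: day, month, four-digit year separated by '.', '/' or '-'.
const std::regex kDatePattern(R"(\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b)");
// Assumed amount format: optional thousands commas and exactly two decimals.
const std::regex kAmountPattern(R"((\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})(?!\d))");

// Pads a one- or two-digit string (guaranteed by kDatePattern) to two digits.
std::string twoDigits(const std::string& s) {
    return std::string(2 - s.size(), '0') + s;
}

// Returns the first date and the largest amount in `text`,
// or std::nullopt if either one is missing.
std::optional<Record> extractRecord(const std::string& text) {
    std::smatch m;
    if (!std::regex_search(text, m, kDatePattern)) return std::nullopt;
    Record rec;
    rec.date = m[3].str() + "-" + twoDigits(m[2].str()) + "-" + twoDigits(m[1].str());

    // Blank out dates so that "12.03" inside "12.03.2018" is not read as an amount.
    const std::string withoutDates = std::regex_replace(text, kDatePattern, " ");

    bool found = false;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(withoutDates.begin(), withoutDates.end(), kAmountPattern);
         it != end; ++it) {
        std::string whole = (*it)[1].str();
        whole.erase(std::remove(whole.begin(), whole.end(), ','), whole.end());
        // std::stoll throws std::out_of_range for absurdly long digit runs.
        const long long cents = std::stoll(whole) * 100 + std::stoll((*it)[2].str());
        if (!found || cents > rec.cents) {
            rec.cents = cents;
            found = true;
        }
    }
    if (!found) return std::nullopt;
    return rec;
}

// Records in `a` that have no equal partner in `b`; duplicates are counted.
// File names play no role, which matches the requirement in the thread.
std::vector<Record> unmatched(std::vector<Record> a, std::vector<Record> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<Record> onlyInA;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(onlyInA));
    return onlyInA;
}
```

The idea is to call extractRecord on the text of every PDF in each folder, collect the results into two vectors, and then call unmatched in both directions: whatever comes back is a date/amount pair that exists in one folder but not in the other. Storing amounts as whole cents keeps the comparison exact.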
Has anyone programmed this before?