This is a program built with Java (J2SDK 1.4.2) in support of Qsearch. Its purpose was to extract data from MS Word files stored in a very large folder-based database.
The main problem was that this database of Word documents did not enforce a strict format. Files that were out of format therefore had to be detected and then either fixed or added as a new pattern to the search.
The extracted data is output to text files in CSV (comma-separated values) format, as this was a format that MS Access could import.
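As a rough illustration of the CSV output step, here is a minimal sketch of building one CSV line. It is not taken from Qextractor: the class name `CsvLineSketch`, its methods and the sample values are all hypothetical, and it is written for a current JDK rather than 1.4.2. It assumes the usual CSV convention of quoting a field only when it contains a comma, a double quote or a line break; whether the original program quoted its fields is not known.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the original Qextractor code.
class CsvLineSketch {

    // Quote a field only when it contains a comma, a double quote or a line break,
    // doubling any embedded double quotes.
    static String escape(String field) {
        boolean needsQuotes = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped fields with commas to form one CSV record.
    static String toCsvLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(fields.get(i)));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Sample values are made up.
        System.out.println(toCsvLine(Arrays.asList("Q-1001", "Acme, Inc.", "Blue \"gloss\" paint")));
    }
}
```

Quoting matters here because company names and addresses often contain commas, which would otherwise shift the columns on import.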
An example of the Word document it was made to scan:
The program in action, with some results in the output text files.
The challenges I had while writing this program were:
- Teaching Java to skip the gibberish headers that appear when a Word document is opened as a plain text file. You can see these by renaming any .doc to .txt.
- Looking for common characteristics in the Word documents. Formatting was weakly enforced, which resulted in many versions of Qextractor, each with some "fixing characteristic" added to find the common characteristics and to output the names of the files that were out of line. If fewer than 10 files were out of line, they were fixed manually. Any more than that were handled by writing a subprogram to bring them into line, or by teaching Qextractor to accept a wider range of input. For example, a search for "colour" would scan for "colo" instead, because it is sometimes spelled "color".
- Traversing all directories to find every Word document.
- One large problem that was not overcome was scanning for table-based data in the Word documents. It was impossible, or at least hard, to add this ability to Qextractor, and as fewer than 3% of the files had this characteristic, these files were skipped and entered into the database manually instead.
- Characters had to be scanned by their appropriate ASCII numbers to maintain integrity, as Word had lots of formatting headers and character-based formatting.
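Several of the challenges above can be sketched together: walking the directory tree, keeping only characters in the printable ASCII range, and matching on a shortened stem such as "colo". This is a minimal sketch under assumptions, not the original Qextractor code. The class `DocScanSketch` and its method names are hypothetical, it targets a current JDK rather than 1.4.2, and it does not parse the Word file format at all; it only filters raw bytes by their ASCII codes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch for illustration only; not the original Qextractor code.
class DocScanSketch {

    // Recursively collect every file whose name ends in ".doc" below root.
    static void collectDocs(File root, List<File> found) {
        File[] entries = root.listFiles();
        if (entries == null) {
            return; // root is not a directory, or cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collectDocs(entry, found);
            } else if (entry.getName().toLowerCase(Locale.ROOT).endsWith(".doc")) {
                found.add(entry);
            }
        }
    }

    // Keep printable ASCII (codes 32..126) plus tab, LF and CR;
    // every other byte is replaced with a space.
    static String printableAscii(byte[] raw) {
        StringBuilder text = new StringBuilder(raw.length);
        for (byte b : raw) {
            int code = b & 0xFF;
            boolean keep = (code >= 32 && code <= 126) || code == 9 || code == 10 || code == 13;
            text.append(keep ? (char) code : ' ');
        }
        return text.toString();
    }

    // Case-insensitive search for a stem, so "colo" matches both "colour" and "color".
    static boolean mentions(String text, String stem) {
        return text.toLowerCase(Locale.ROOT).contains(stem.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> docs = new ArrayList<File>();
        collectDocs(root, docs);
        for (File doc : docs) {
            String text = printableAscii(Files.readAllBytes(doc.toPath()));
            // Files without the expected stem are listed so they can be checked by hand.
            String status = mentions(text, "colo") ? "ok" : "OUT OF FORMAT";
            System.out.println(doc.getPath() + " : " + status);
        }
    }
}
```

Running it with a folder path as the only argument prints one line per .doc file. Matching on a stem is deliberately loose, so it can also hit unrelated words that happen to contain "colo".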
Some other characteristics (those I can remember, hehe):
- Extracts quotation data from Word documents and saves it in the respective text files in CSV format.
- Gives an error if an expected value is not found in a file.
- Separates contact details, company and customer (people in that company) and does NOT repeat them.
I guess that is all I can comment on the program for now. Please have a look at the sample data output and the sample Word document the program was scanning.
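The "error if missing" and "no repeats" behaviour can be sketched as follows. This is only an assumption about how such checks might look, not the original code: the class `QuoteFieldsSketch`, the `label: value` line layout and the sample labels "Company" and "Contact" are all hypothetical, and the sketch targets a current JDK rather than 1.4.2.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch for illustration only; not the original Qextractor code.
class QuoteFieldsSketch {

    // Return the text after "label:" on the first line that starts with that label
    // (case-insensitive), or throw if no such line exists in the document text.
    static String requireField(String text, String label, String fileName) {
        String prefix = label.toLowerCase(Locale.ROOT) + ":";
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                return trimmed.substring(prefix.length()).trim();
            }
        }
        throw new IllegalStateException("Expected \"" + label + "\" not found in " + fileName);
    }

    // Remember a value under a normalised key; returns true only the first time it is seen,
    // so the caller writes each company or contact to its CSV file once.
    static boolean addIfNew(Set<String> seen, String value) {
        return seen.add(value.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Made-up document text standing in for two extracted quotations.
        String[] docs = {
            "Company: Acme Ltd\nContact: J. Smith\n",
            "Company: ACME LTD\nContact: A. Jones\n"
        };
        Set<String> companies = new LinkedHashSet<String>();
        for (int i = 0; i < docs.length; i++) {
            String company = requireField(docs[i], "Company", "sample" + i + ".doc");
            if (addIfNew(companies, company)) {
                System.out.println("new company: " + company);
            }
        }
    }
}
```

Normalising the key (trimming and lower-casing) before the set lookup is a design choice that treats "Acme Ltd" and "ACME LTD" as the same company. A stricter or looser rule may suit the real data better.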
Qextractor10 .java & .class, with some test documents it was made to scan: [Download]
*Check out the Qsearch made in MS Access, the main reason this program was made.