User:Dallan/SourceExtractor

What is Source Extractor?

Source Extractor is a (free) program to help you extract sources from notes in a GEDCOM file. Early versions of PAF didn't support Sources, so people often entered their source information in the Notes area. If you have a GEDCOM file with source information in the Notes area and you want to generate Sources for that information, Source Extractor will help you do this.

How does it work?

Source Extractor works as a three-step process:

First, run the Notes Extractor over your GEDCOM file. This creates an alphabetically sorted list of all of the note lines in your GEDCOM file.

Second, edit this file so that the lines are grouped, with each group representing one source. You should leave blank lines between the groups. In the next step, a source will be added to the gedcom file for each group, and a citation to that source will be added every time a note matches one of the lines in the group. For example, if you had:

 
Allen County Marriage Records
Allen Co Marr Recs
Allen Cty Marriage Recs
 
Allen County Birth Records

in your grouped notes file, then two sources would be added to your gedcom file in the next step: one for Allen County Marriage Records and one for Allen County Birth Records. For every individual with a note line of "Allen County Marriage Records" or "Allen Co Marr Recs" or "Allen Cty Marriage Recs" we would add a citation to the first source, and for every individual with a note line of "Allen County Birth Records" we would add a citation to the second source.

Third, run the Sources Extractor over your GEDCOM file, giving it the grouped notes file you just created. Sources Extractor will generate an updated GEDCOM file with new sources and source citations based upon your grouped notes file.
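To make the grouping idea concrete, here is a minimal Python sketch of how a grouped notes file could be read into groups and turned into a lookup from note line to source. It is not the tool's actual Java code. The function names are hypothetical, and using the first line of each group as the source title is an assumption.

```python
def parse_groups(text):
    """Split grouped-notes text into groups of lines separated by blank lines."""
    groups, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def build_lookup(groups):
    """Map every note line to its group's title (assumed: the group's first line)."""
    return {line: group[0] for group in groups for line in group}


grouped = """\
Allen County Marriage Records
Allen Co Marr Recs
Allen Cty Marriage Recs

Allen County Birth Records
"""
groups = parse_groups(grouped)
lookup = build_lookup(groups)
print(len(groups))                   # 2
print(lookup["Allen Co Marr Recs"])  # Allen County Marriage Records
```

Every spelling in a group points at the same source, which is why variant abbreviations of one record set belong together in one group.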

How can I get it?

You can get the program from http://www.quass.org/pafutils.jar. Instructions for running it appear below.

How do I run it?

Warning: In its current state, Source Extractor is not for the faint of heart. It requires installing Java onto your computer (a ~20MB download) and entering commands in a Command Prompt window. Also, I've tested it with Java 5.0 but not 6.0, although 6.0 should work.

1. Download JRE 5.0 or 6.0 from Sun:

  • go to http://java.sun.com/javase/downloads/index.jsp and click on the button to download the "Java Runtime Environment (JRE) 6u1".
  • accept the license, then click on the "Windows online installation" link, then follow the instructions and install it.

2. Make sure you can run java

  • go into accessories from the Start Menu and open a Command Prompt window
  • type: java -version
  • you should see something like java version 1.5... or 1.6...

3. Save pafutils.jar (available from http://www.quass.org/pafutils.jar) and patterns.txt (available from http://www.quass.org/patterns.txt) to someplace easy to get to from the Command Prompt window. I'd suggest doing the following, but you can save them anywhere.

  • from the command prompt window, make a new directory: mkdir \gedcom
  • save pafutils.jar and patterns.txt to c:\gedcom
  • from the command prompt window: cd \gedcom
  • from the command prompt window: dir
  • the listing should include pafutils.jar and patterns.txt

4. Create a gedcom file from your pedigree and save it in the directory you just created
5. Extract the notes. Let's say your gedcom file is named mypedigree.ged

  • from the command prompt window: java -cp pafutils.jar org.werelate.pafutils.NotesExtractor patterns.txt mypedigree.ged notes.txt
  • remember the number it prints out at the end. You'll need this in step 9.
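For readers curious what this extraction amounts to, the following Python sketch collects note text from a GEDCOM file and sorts it alphabetically. It is an illustration only, not the NotesExtractor code. It assumes notes are written inline as NOTE lines with CONT continuation lines, skips note pointers such as @N1@, and does not apply the patterns file or compute the number the real program prints.

```python
def extract_note_lines(gedcom_text):
    """Return the unique text of inline NOTE and CONT lines, sorted alphabetically."""
    notes = set()
    for raw in gedcom_text.splitlines():
        # A GEDCOM line has the shape: level, tag, optional value.
        parts = raw.strip().split(" ", 2)
        if len(parts) == 3 and parts[1] in ("NOTE", "CONT"):
            value = parts[2].strip()
            if value and not value.startswith("@"):  # skip pointers to note records
                notes.add(value)
    return sorted(notes)


sample = """\
0 @I1@ INDI
1 NAME John /Smith/
1 NOTE Allen County Marriage Records
2 CONT Allen Co Marr Recs
0 @I2@ INDI
1 NAME Mary /Jones/
1 NOTE Allen County Birth Records
2 CONT Allen County Marriage Records
"""
for line in extract_note_lines(sample):
    print(line)
# Allen Co Marr Recs
# Allen County Birth Records
# Allen County Marriage Records
```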

6. Review the notes file you just created

  • go into accessories from the Start Menu and run Notepad
  • open c:\gedcom\notes.txt
  • you should see a list of all note lines in your gedcom file
  • you're eventually going to group these lines together into lines that should be converted into sources
  • some of the note lines will have citation information or noise in them, which we'll try to remove in the next step

7. Review the patterns file you downloaded in step 3

  • open c:\gedcom\patterns.txt in Notepad
  • the patterns are written in a language called "regular expressions." In a nutshell, you specify patterns that you want to match, and the NotesExtractor removes whatever matches those patterns from the note lines before writing them to the file. You can see that I've added patterns to match birth/christening/etc. and numbers at the beginnings of lines (^ means match at the beginning of the line), and page numbers at the ends of lines ($ means match at the end of the line). You'll likely want to add additional patterns to this file. If you do, would you please send them to me? I'm trying to create a fairly comprehensive list of noise and citation patterns.
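As a rough illustration of how such patterns strip noise, here is a small Python sketch. The three patterns are made up for this example and are not the contents of patterns.txt. The real tool uses Java regular expressions, although ^ and $ mean the same thing there.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = [
    r"^(birth|christening|marriage|death)\s*[:\-]\s*",  # event label at the start
    r"^\d+[.)]\s*",                                      # leading number such as "2) "
    r",?\s*(p|pg|page)\.?\s*\d+$",                       # page number at the end
]


def clean(line, patterns=PATTERNS):
    """Remove every match of every pattern, then trim surrounding whitespace."""
    for pattern in patterns:
        line = re.sub(pattern, "", line, flags=re.IGNORECASE)
    return line.strip()


print(clean("Birth: Allen County Birth Records, p. 12"))  # Allen County Birth Records
print(clean("2) Allen Co Marr Recs page 4"))              # Allen Co Marr Recs
```

After cleaning, note lines that differ only in their citation details collapse to the same text, which leaves fewer lines to group by hand.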

8. Group the note lines into sets of lines, one set for each source to create

  • once you're happy that the citation information and noise have been removed from the note lines, you'll need to group the lines together into groups of lines for each source. You should leave blank lines between the groups. A source will be added to the gedcom file for each group, and a citation to that source will be added every time a note matches one of the lines in the group. For example, if you had:

Allen County Marriage Records
Allen Co Marr Recs
Allen Cty Marriage Recs

Allen County Birth Records

in your grouped notes file, then two sources would be added to your gedcom file in the next step: one for Allen County Marriage Records and one for Allen County Birth Records. For every individual with a note line of "Allen County Marriage Records" or "Allen Co Marr Recs" or "Allen Cty Marriage Recs" we will add a citation to the first source, and for every individual with a note line of "Allen County Birth Records" we will add a citation to the second source.
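Before moving on, it may be worth checking the grouped file for a line that ended up in more than one group, since it would then be unclear which source it belongs to. How Source Extractor itself treats that case is not stated here, so the Python sketch below is only an optional sanity check, and the function name is hypothetical.

```python
def find_conflicts(grouped_text):
    """Return note lines that appear in more than one blank-line-separated group."""
    seen = {}        # note line -> index of the first group that contains it
    conflicts = []
    group_index = 0
    in_group = False
    for raw in grouped_text.splitlines():
        line = raw.strip()
        if not line:
            if in_group:             # a blank line closes the current group
                group_index += 1
                in_group = False
            continue
        in_group = True
        if line in seen and seen[line] != group_index and line not in conflicts:
            conflicts.append(line)
        seen.setdefault(line, group_index)
    return conflicts


text = """\
Allen County Marriage Records
Allen Co Recs

Allen County Birth Records
Allen Co Recs
"""
print(find_conflicts(text))  # ['Allen Co Recs']
```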

9. Once you have grouped the notes file, you need to create the source citations

  • in the line below, N is the number from step 5.
  • from the command prompt window: java -cp pafutils.jar org.werelate.pafutils.SourcesExtractor patterns.txt mypedigree.ged notes.txt mypedigree2.ged N
  • import mypedigree2.ged into PAF
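The Python sketch below shows one way the result of this step could look in GEDCOM terms: a source record (0 @S1@ SOUR with a TITL line) per group, and a 1 SOUR @S1@ citation on each individual whose notes match a line in that group. This is a guess at the shape of the output, not the SourcesExtractor code. The @S1@ numbering, the first line as title, citations at the individual level, and leaving the original notes in place are all assumptions, and the input is assumed to end with a 0 TRLR line.

```python
def add_sources(gedcom_lines, groups):
    """Return new GEDCOM lines with one SOUR record per group plus citations."""
    # Assumed ID scheme; a real file may already use @S1@, @S2@, ...
    source_ids = {}
    for number, group in enumerate(groups, start=1):
        for note_line in group:
            source_ids[note_line] = f"@S{number}@"

    out, pending = [], []   # pending: citations owed to the current individual
    in_indi = False

    def flush():
        for sid in pending:
            out.append(f"1 SOUR {sid}")
        pending.clear()

    for line in gedcom_lines:
        parts = line.split(" ", 2)
        if parts[0] == "0":
            flush()   # the previous record ends here
            in_indi = len(parts) == 3 and parts[2] == "INDI"
            if len(parts) > 1 and parts[1] == "TRLR":
                # Write the source records just before the trailer.
                for number, group in enumerate(groups, start=1):
                    out.append(f"0 @S{number}@ SOUR")
                    out.append(f"1 TITL {group[0]}")
        elif in_indi and len(parts) == 3 and parts[1] in ("NOTE", "CONT"):
            sid = source_ids.get(parts[2].strip())
            if sid and sid not in pending:
                pending.append(sid)
        out.append(line)
    return out


gedcom = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 NOTE Allen Co Marr Recs",
    "0 @I2@ INDI",
    "1 NAME Mary /Jones/",
    "1 NOTE Allen County Birth Records",
    "0 TRLR",
]
groups = [
    ["Allen County Marriage Records", "Allen Co Marr Recs", "Allen Cty Marriage Recs"],
    ["Allen County Birth Records"],
]
result = add_sources(gedcom, groups)
print("\n".join(result))
```

In this sample, John Smith gets a citation to @S1@ and Mary Jones gets a citation to @S2@, and the two source records appear just before the trailer line.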

Who wrote it?

The program was originally written by Josh Monson, with modifications by Dallan Quass and guidance from Don Snow.

I know Java - how can I help?

Leave me a message. If other people are interested in contributing I'll put it on SourceForge.