How To Create a NabJet Submission
The NabJet engine indexes people-related data from web sites, allowing it to be searched from the NabJet home page. To index a web site intelligently, NabJet needs two things: the URL of the web site and a Template definition.
The following directions will explain how to create a submission record and how to edit the template that tells NabJet how to index the web site.
Register With NabJet
The first thing you need to do is go to the My Account page and create a Username and Password. This will create an area in the NabJet site for you to create and edit your submissions.
The email address field is not required when registering. However, it is helpful to have an email address on file in case you lose your username and password. We do not release email addresses to third party companies. Please see our Privacy Policy.
Create a Submission
Once you are logged in, click the "My Submissions" link on the left. From here, click the "Add" button. You will be taken to a form with the following information:
- URL ID - This is a system generated number. It is provided for reference only
- Name - enter a short name that describes the site you are submitting.
- Web URL - this is the URL to the site you want to index
- Public/Private - This determines whether users can search on the content from your submission. When you first create a submission, this will be set to Private. Once you have indexed the web site and verified that the data looks good in the Index Results tab, set this to Public and click the "Save and Index" button. This will let other users see the data from your submitted web site.
- Source Year - if all the data from the web site you are submitting is from a particular year, enter it here. For example, you could enter the arrival year for a passenger list. If you do enter a Source Year, NabJet has a nice feature that will convert Age values to birth years. For example, if the Source Year is 1820 and the data lists an age of 18, NabJet can estimate the birth year as 1802. See the Transform function below.
- City, County, State, Country - If any of these fields are common across all of the records from your submission, enter them here. For example, enter the city, county, state and country fields if you are indexing cemetery records since they are all from one location (the cemetery). This is typically easier than trying to extract this data from each record from the source web site.
- Type of Data - this is the type of data being extracted. This lets users search, for example, just on Military or Cemetery records.
- Template Text - this is where you describe to NabJet how the data from your submission is structured. For a complete explanation of the template format, see The Template Format section below. When you create a new submission, NabJet will automatically fill in an example template for you to modify.
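The Source Year conversion mentioned above comes down to simple subtraction. The short Python sketch below illustrates the arithmetic only; it is not NabJet's code, and the function name birth_year_from_age is made up for this example.

```python
def birth_year_from_age(source_year: int, age: int) -> int:
    """Estimate a birth year by subtracting the listed age from the source year."""
    return source_year - age


# Example from the text: Source Year 1820, listed age 18.
print(birth_year_from_age(1820, 18))  # 1802
```

Treat the result as an estimate. Someone listed as 18 in an 1820 record could have been born in 1801 or 1802, depending on whether their birthday had passed when the record was made.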
Save Your Work
It is very important to know that if you enter information into the submission form and navigate away from the page, your changes will be lost. This includes clicking on the "Secondary URLs" or "Index Results" tabs.
To make sure your data is saved, always click the "Save and Index" button.
Using One Template for Multiple URLs
Many web sites have multiple pages of information, all in the same format. Let's say a particular web site had 5 pages of passenger records, all in the same format. First, you would create a new submission and enter one of the URLs in the Web URL field. Once you have the template text set up, click the "Save and Index" button, then review the results in the Index Results tab.
If everything looks good, go to the Secondary URLs tab. Continuing with our example, enter the remaining 4 URLs into the "List of URLs" field. Important: make sure you set the Type of Data, Source Year, City, County, State and Country fields. These URLs might contain different information than the original URL, even if it's in the same format. For example, the first URL might be for a cemetery in one town, and the remaining 4 might be for another town.
Once you've entered the additional URLs and other information on the form, click the "Add URLs" button. This will create new secondary submissions below the form. You can then select these records and index them individually or all together. To index secondary submissions, click the check box on the left for those you want to index, then click the "Index" button at the bottom of the page. Once indexed, the page will be re-displayed, and the number of records indexed will be shown to the right of each secondary submission. You can click this number to review the data.
Remember that secondary submissions will use the template from the original submission in the Primary URL tab.
Hint: if you have multiple URLs to submit and each is from a different location, enter one URL at a time along with the geographic information, then click the Add URLs button. Then repeat for each remaining URL. That will allow you to enter different geographic information for each secondary submission.
The Template Format
A Template is a set of commands that describes the contents of a particular web page. This description is necessary for NabJet to know where the data begins and ends, and know what all the different values mean. A template is divided into four sections:
- [START] - where on the page the data starts
- [END] - where on the page the data ends
- [RECORD] - what the format is for each record. For example, is it separated by commas? Is it in a table?
- [FIELD] - lists the data fields to pull. For example, you would have one [FIELD] section for each field you want to pull
Each section may have a number of commands to further explain how to index the data.
Probably the easiest way to understand a template is to look at an example. Let's say a web site contains a section that looks like this:
The following headstones were found in Main Line Cemetery:
Smith, John, b 1832, d 12 Oct 1876
Smith, Betty, b. Nov 1835, d. 1885
Jones, Paul, 1818, 1891
For more information, contact Ed at ed@hotmail.com
By looking at the web page, it's pretty easy to see there are cemetery records for three people. For each person, the record contains the last name, first name, date of birth and date of death, all separated by commas. This will be a piece of cake for NabJet to index! Let's look at a template that could be used to index this:
| Template: | Explanation: |
| --- | --- |
| [START] | |
| skippast Cemetery: | The data starts just after the word Cemetery: |
| [END] | |
| skipto For more information | The data ends before the For more information line |
| [RECORD] | |
| Type Separator | These records have separator characters between them |
| Separator , | The separator character is a comma |
| [FIELD] | |
| Fieldname LastName | The first part is the last name |
| Column 1 | |
| [FIELD] | |
| Fieldname Firstname | The second part is the first name |
| Column 2 | |
| [FIELD] | |
| Fieldname Birthyear | The third part contains birth year |
| Column 3 | |
| [FIELD] | |
| Fieldname Deathyear | And the fourth part contains the death year |
| Column 4 | |
That's it! These 25 lines describe everything that NabJet needs to index these cemetery records. That may seem like a lot of work for just three records, but what if there were 1,000 records on that page? The template wouldn't need to change at all.
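To make the template's meaning concrete, here is a rough Python sketch of the steps it describes, applied to the sample page above. It is only an illustration of the idea, not how NabJet is implemented; the names PAGE, FIELDS and parse are invented for this example, and the values are left as raw text rather than reduced to years.

```python
PAGE = """The following headstones were found in Main Line Cemetery:
Smith, John, b 1832, d 12 Oct 1876
Smith, Betty, b. Nov 1835, d. 1885
Jones, Paul, 1818, 1891
For more information, contact Ed at ed@hotmail.com"""

# One name per [FIELD] section, in Column 1..4 order.
FIELDS = ["LastName", "Firstname", "Birthyear", "Deathyear"]


def parse(page: str) -> list[dict[str, str]]:
    # [START] skippast Cemetery: -> begin just after the marker
    start = page.index("Cemetery:") + len("Cemetery:")
    # [END] skipto For more information -> stop just before the marker
    end = page.index("For more information", start)
    records = []
    for line in page[start:end].splitlines():
        if not line.strip():
            continue  # skip blank lines left around the data
        # [RECORD] Type Separator / Separator ,
        columns = [part.strip() for part in line.split(",")]
        records.append(dict(zip(FIELDS, columns)))
    return records


for record in parse(PAGE):
    print(record)
```

Running this yields three records, one per headstone line. The same function would handle 1,000 lines without changes, which is the point the template makes.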
It is important to note that the field names listed, like LastName, are considered keywords. That means that there is a set number of field name keywords that you can use. The field names that NabJet currently recognizes are:
- Firstname
- Lastname
- MiddleName
- BirthYear
- DeathYear
- OtherYear
- City
- County
- State
- Country
- Gender - must be M or F.
- Nothing
That should give you a basic understanding of what should go into a template. What follows is a detailed explanation of each of the possible template commands for each section.
[START] section
The [START] section describes where to find the start of the data to be indexed. This section is optional. If it is not included in the template, the start is assumed to be the beginning of the web page. The following commands can be used in the [START] section:
Skipto some text
This command skips from the current position to the string found after the
Skipto command. In this example, it will skip until it finds the words
"some text". Note that quotes are not necessary. However, if you
have a string that starts with spaces, you can enclose the string in double
quotes.
Skippast some text
The same as the Skipto command, but sets the current position just past the text
found.
Note that you can enter multiple skipto and skippast
commands in the [START] and [END] sections.
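The difference between Skipto and Skippast is easiest to see as positions in a string. The following Python sketch is an assumption about the behaviour described above, not NabJet's source; the function names are hypothetical, and raising an error when the text is missing is a choice made for this example only.

```python
def skipto(text: str, pos: int, target: str) -> int:
    """Return the position where target begins, searching from pos."""
    found = text.find(target, pos)
    if found == -1:
        # Assumption: what NabJet does when the text is missing is not documented.
        raise ValueError(f"{target!r} not found")
    return found


def skippast(text: str, pos: int, target: str) -> int:
    """Return the position just after target, searching from pos."""
    return skipto(text, pos, target) + len(target)


page = "Header Cemetery: Smith, John"
print(page[skipto(page, 0, "Cemetery:"):])    # Cemetery: Smith, John
print(page[skippast(page, 0, "Cemetery:"):])  #  Smith, John
```

Because each function takes a starting position and returns a new one, several commands can be chained by feeding one result into the next call.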
[END] section
The [END] section describes where to find the end of the data to be indexed. You should use the same skipto and skippast commands as described above. The first skipto and skippast will start from the position defined in the [START] section.
This section is optional. If it is not included in the template, the end is assumed to be the end of the web page.
[RECORD] section
The [RECORD] section describes how each record is formatted. It is important for NabJet to know if all the fields are in fixed columns, if they are separated by some character, or if they are in HTML tables. The following are the commands that can be used in this section:
Type typevalue
This command tells what type of record to index. Replace typevalue with
one of the following:
- FIXED - if the data is in fixed columns. Used often on pages with <PRE> and </PRE> tags
- SEPARATOR - used when the fields are separated by a particular character, like a comma
- TABLE - used if the data is contained in HTML tables
- HCARD - used to index hCard values from a page. For more information on the hCard format, please see http://microformats.org/wiki/hcard.
Special Note on hCard Data
If you specify the HCARD type, you do not need to include [START], [END] or [FIELD] sections. By default, NabJet will index every hCard from a web page. The minimum Template for an hCard would include just two lines:
[RECORD]
Type HCARD
You can still use [FIELD] definitions for HCARD types if you need to exclude some records. For example, if you want to index only hCard records with a City, you could add the following:
[FIELD]
FieldName City
IgnoreIfBlank
NabJet will extract the following standard hCard fields: 'n' and 'fn' (name fields), 'bday', and 'adr' ('locality', 'region', 'country-name'), plus 'gender' and 'dday', which are proposed extensions to the hCard format. The program will also extract 'county-name' from the address if it exists, even though it is a non-standard field.
Separator separatorvalue
If you use type SEPARATOR, you should specify what character separates the
fields. For example "Separator ,". This line is
optional and defaults to a comma separator.
Linebreak linebreakstring
This defines where Nabjet should assume the line breaks are. For example "Linebreak
<br>" will assume that each record (or line) starts after a <br> tag.
This is optional, and defaults to a new line character.
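As a rough picture of what Linebreak does, the Python sketch below cuts a block of data into records at a chosen string. The function name split_records is invented here, and dropping empty pieces is an assumption rather than documented NabJet behaviour.

```python
def split_records(data: str, linebreak: str = "\n") -> list[str]:
    """Split data into records at each occurrence of linebreak."""
    pieces = data.split(linebreak)
    # Assumption: empty pieces (for example after a trailing <br>) are dropped.
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_records("Smith, John, 1832<br>Jones, Paul, 1818<br>", "<br>"))
```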
IgnoreBetween string1 string2
There may be special cases where a section of text should be ignored for every
record. For example, everything between the tags <H3> and </H3>.
This command will remove everything between those two strings, as well as
string1 and string2.
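The effect of IgnoreBetween can be pictured with a few lines of Python. This is a sketch of the behaviour described above, under the assumption that every matching pair in the record is removed; the function name ignore_between is hypothetical.

```python
def ignore_between(record: str, string1: str, string2: str) -> str:
    """Remove every span from string1 through string2, markers included."""
    if not string1 or not string2:
        return record  # nothing sensible to remove without both markers
    result = record
    while True:
        start = result.find(string1)
        if start == -1:
            return result
        end = result.find(string2, start + len(string1))
        if end == -1:
            return result  # assumption: an unmatched string1 is left alone
        result = result[:start] + result[end + len(string2):]


print(ignore_between("<H3>Section A</H3>Smith, John", "<H3>", "</H3>"))  # Smith, John
```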
[FIELD] section
The [FIELD] section is different from the other sections in that you will usually have more than one of them. A separate [FIELD] section is required for each data field to be indexed.
Fieldname fieldname
There is a fixed set of field names that NabJet will index. See above for
the complete list.
Note that NabJet does not store full dates. Instead, it only stores years. In general, that is sufficient for searching. It also makes it easier to index since there are so many date formats out there. Notice in our example, one of the death dates was "d 12 Oct 1876" and another was "d. 1885". NabJet is able to pull out the four-digit year from each of these.
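One plausible way to pull a four-digit year out of free-form date text is a regular expression. The Python sketch below shows that idea; it is an assumption about the general approach, not NabJet's actual rule, and the name extract_year is made up for the example.

```python
import re
from typing import Optional

# Exactly four digits, not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def extract_year(value: str) -> Optional[str]:
    """Return the first four-digit number in value, or None if there is none."""
    match = YEAR.search(value)
    return match.group(0) if match else None


for text in ["d 12 Oct 1876", "d. 1885", "b. Nov 1835", "1818"]:
    print(text, "->", extract_year(text))
```

The day number "12" is skipped because it is not four digits long, so only the year survives.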
You should know there is a special field called nothing. Its primary purpose is to provide a way to ignore records that could not be ignored any other way. For example, let's say we have the following data:
BARTON, James McG.      b. d. 1849
BAUM, Elizabeth         b, d. 11-Aug-1867
BEATTY, James           b. d. 1827
  (Cumberland Co. Militia - Rev. War)
BEATTY, Thomas          b. d. 1830
BEECH, Charles          b. d. 23-Jul-1965 - 90 yrs
This is a FIXED format data file, so you would most likely pull the name fields from the first 24 characters. But line #4 would cause a problem, with the program assuming " (Cumberland Co. Militi" is the name field. By using the following field definition, you can tell NabJet to skip any line where there isn't anything in the first two characters of the line:
[FIELD]
FieldName Nothing
start 1
length 2
IgnoreIfBlank
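The Python sketch below shows the logic behind this field definition: read a fixed-width slice of each line and drop the line when the slice is blank. It assumes Start counts from 1, as the "start 1" line suggests; the helper names field and keep_record are invented for this illustration.

```python
def field(line: str, start: int, length: int) -> str:
    """Return a fixed-column field; start is assumed to be 1-based."""
    return line[start - 1:start - 1 + length]


def keep_record(line: str) -> bool:
    """Mimic 'start 1 / length 2 / IgnoreIfBlank': keep lines whose first two characters are not blank."""
    return field(line, 1, 2).strip() != ""


lines = [
    "BEATTY, James b. d. 1827",
    "  (Cumberland Co. Militia - Rev. War)",
    "BEATTY, Thomas b. d. 1830",
]
kept = [line for line in lines if keep_record(line)]
print(kept)
```

Only the two BEATTY lines remain; the indented note line is skipped because its first two characters are spaces.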
Column col_num
For record types of SEPARATOR or TABLE, this is the column number of the data
for each record.
AfterColumn col_num
For record types of SEPARATOR, this extracts everything after the column number given to use
for the field. You would typically use this when a field separator is used
in the middle of the data you want to extract. Let's look at the following
example:
Borkon, Louis Yale, 04 Jul 1895 - 20 Feb 1975, (contributed by Rich Boyer)
Borkon, Ruth Ashinsky, 18 Jan 1899 - 21 May 1990, (contributed by Rich Boyer)
Caplan, Jacob, (view 2 , 3), d. 26 Jan 1939, age 35Y, (contributed by Ellis Michaels)
For this example, we can use a separator of a comma, but notice that on the 3rd line, there are extra commas before the date range. To get around this problem, we can use an "AfterColumn 2" command, combined with the BirthYearFromRange and DeathYearFromRange transform commands. The following field definitions would correctly index this example:
[FIELD]
FieldName LastName
Column 1
IgnoreIfBlank
[FIELD]
FieldName FirstName
Column 2
IgnoreIfBlank
[FIELD]
FieldName BirthYear
AfterColumn 2
Transform BirthYearFromRange
[FIELD]
FieldName DeathYear
AfterColumn 2
Transform DeathYearFromRange
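To see why AfterColumn helps here, the Python sketch below contrasts taking a single column with taking everything after a column. It is a simplified illustration with hypothetical function names, and it deliberately leaves out the BirthYearFromRange and DeathYearFromRange steps, whose exact rules are not described in this document.

```python
def column(record: str, number: int, separator: str = ",") -> str:
    """Return the text of one 1-based column, or an empty string if it is missing."""
    parts = record.split(separator)
    return parts[number - 1].strip() if number <= len(parts) else ""


def after_column(record: str, number: int, separator: str = ",") -> str:
    """Return everything after the given column, separators included."""
    parts = record.split(separator, number)
    return parts[number].strip() if len(parts) > number else ""


record = "Caplan, Jacob, (view 2 , 3), d. 26 Jan 1939, age 35Y, (contributed by Ellis Michaels)"
print(column(record, 1))        # Caplan
print(column(record, 2))        # Jacob
print(after_column(record, 2))
```

The last call returns the rest of the line in one piece, extra commas and all, so a transform can then look for the dates anywhere in it.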
Start position
This command is only used if the record type is FIXED. It defines the
starting position of this field.
Length length
This command is only used if the record type is FIXED. It defines the
length of the data for this field.
IgnoreIfEmpty
If the data for this field is empty, ignore the whole record and don't index it.
This is useful in cases where you may have a first name but no last name.
Put the ignoreifempty command on the Last Name field and only records with last
names will be indexed.
Transform transform_option
This is a special command to transform the data for this field into something
else. The following transform options are available:
- FirstWord - takes the first word of the string
- FirstWordNumber - takes the first word of the string, but leaves numeric characters like periods and minus signs
- SecondWord - takes the second word of the string
- SecondWordNumber - takes the second word of the string, but leaves numeric characters like periods and minus signs
- LastWord - takes the last word of the string
- AfterFirstWord - takes everything after the first word of the string
- BirthYearFromAge - assumes this field is an age, subtracts it from the source year to calculate birth year
- BirthYearFromRange - if this field includes a range of dates, this will extract the birth year. Note that it is smart enough to extract birth dates from strings like "b. 1/1/1958 - d. 1990", and knows that a string like "d. 1-Jan-2000" is a death date rather than a birth date.
- DeathYearFromRange - if this field includes a range of dates, this will extract the death year. Works like BirthYearFromRange
- Value - sets the field to a specific value. For example 'Transform Value XYZ' will set the value to XYZ
- RemoveSpaces - removes all spaces from the string.
- Append - append a string at the end of the data
- Prepend - prepend a string at the beginning of the data
There are many cases where the Transform command is necessary. For example, let's say you had the following data from a web page:
The following headstones were found in Main Line Cemetery:
John Smith , b 1832, d 12 Oct 1876
Betty Smith, b. Nov 1835, d. 1885
George P Jones, 1818, 1891
For more information, contact Ed at ed@hotmail.com
Notice that the first and last names are not separated by commas. Instead, use the transform command to pull the first word into the firstname field, and the last word into the lastname field. The field sections might look like this:
[FIELD]
FieldName Firstname
Column 1
Transform Firstword
[FIELD]
Fieldname Lastname
Column 1
Transform Lastword
[Field]
Fieldname BirthYear
Column 2
[Field]
Fieldname DeathYear
Column 3
Notice that firstname and lastname both use column 1 for their data, but use the transform command to pull either the first word or last word.
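The word-based transforms used above can be pictured as simple string operations. The following Python sketch assumes that words are separated by whitespace, which this document does not state explicitly; the function names are hypothetical stand-ins for the FirstWord, LastWord and AfterFirstWord options.

```python
def first_word(value: str) -> str:
    """Return the first whitespace-separated word, or an empty string."""
    words = value.split()
    return words[0] if words else ""


def last_word(value: str) -> str:
    """Return the last whitespace-separated word, or an empty string."""
    words = value.split()
    return words[-1] if words else ""


def after_first_word(value: str) -> str:
    """Return everything after the first word, or an empty string."""
    parts = value.split(None, 1)
    return parts[1].strip() if len(parts) > 1 else ""


for name in ["John Smith ", "Betty Smith", "George P Jones"]:
    print(first_word(name), "|", last_word(name))
```

For "George P Jones" the middle initial is simply skipped, since only the first and last words are taken.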
That's about it for now. As more commands are made available, I'll update this documentation.
