Solr Schema for Extracted Features v2.0
=======================================
The Solr search interface to the 5.9+ billion page Part-of-Speech
(POS) and language tagged HTRC Extracted Features dataset can be found
at:
https://solr2.htrc.illinois.edu/solr-ef20/index.html
A developer's version is available through:
https://solr2.htrc.illinois.edu/solr-ef20/index.html
(The developer's version is where new features are tested, and also
where more detailed technical information can be found.)
The raw (underlying) Solr API can be accessed through:
https://solr2.htrc.illinois.edu/robust-solr8
While the majority of the admin API is password-restricted,
searching is publicly open, for example:
https://solr2.htrc.illinois.edu/robust-solr8/solr345678-faceted-htrc-full-ef2-shards24x2/select?q=title_t%3A*
This README document details the Solr Schema developed for the
Workset Builder 2.0 interface operating with the Extracted
Features 2.0 JSON format. In shortened form, this interface
is at times referred to as the Solr-EF 2.0 Search interface.
The schema makes heavy use of dynamic fields. First we describe
the page-level full-text indexing side of searching, and then move
on to how volume-level metadata is blended in with this.
1. Indexing the language and POS tagged full text
=================================================
In terms of how the Solr index is structured, we take the default
schema that ships with Solr 7/8, and make the additions/changes
detailed below.
At the base level we have the types:
Given the volume of page text being pushed through, we have opted
not to store these fields.
On top of this, we then have the dynamicField:
No stemming, but text is case-folded.
For every page in the JSON Extracted Features files, the text is
tagged with language (determined by an OpenNLP pass earlier in the
production of these JSON files). In the Solr index built, we use this
to map to language specific fields:
fr_htrctokentext
es_htrctokentext
...
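A minimal sketch of querying one of these language fields (the search term is hypothetical; the collection path is taken from the example select URL shown earlier):

```python
from urllib.parse import urlencode

# Public Solr select handler, from the example URL earlier in this README.
SOLR_SELECT = ("https://solr2.htrc.illinois.edu/robust-solr8/"
               "solr345678-faceted-htrc-full-ef2-shards24x2/select")

# Search English page text for a hypothetical term, using the
# <lang>_htrctokentext field pattern described above.
params = urlencode({"q": "en_htrctokentext:whale", "wt": "json"})
query_url = SOLR_SELECT + "?" + params
print(query_url)
```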
For 6 of the languages, OpenNLP models existed for Part-of-Speech
tagging; these were also run against the pages of text. The languages
were:
English, French, German, Spanish/Castilian, Arabic, and Mainland Chinese
For JSON Extracted Feature files with pages in the above 6 languages,
the individual words are further tagged with POS information. This
information is mapped into language+POS fields such as:
en_NOUN_htrctokentext
en_ADJ_htrctokentext
...
de_NOUN_htrctokentext
...
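A small helper sketch of our own (not part of the schema itself) for composing these dynamic-field names:

```python
def token_field(lang, pos=None):
    """Compose a language (and optionally POS) tagged page-text field
    name, following the <lang>_htrctokentext and
    <lang>_<POS>_htrctokentext patterns described above."""
    if pos is not None:
        return "%s_%s_htrctokentext" % (lang, pos)
    return "%s_htrctokentext" % lang

print(token_field("en", "NOUN"))  # en_NOUN_htrctokentext
print(token_field("fr"))          # fr_htrctokentext
```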
The OpenNLP language tagger uses the 2-letter ISO 639-1 standard for
language codes:
https://en.wikipedia.org/wiki/ISO_639-1
In our fields we make the 2-letter codes lowercase. The OpenNLP
language models used extend these codes (slightly), giving zh-cn and
zh-tw for Mainland China and Taiwan respectively. From what we can
tell, this actually indicates whether the classified text is
simplified or traditional Chinese.
We follow the universal POS tag set developed by Google:
https://github.com/slavpetrov/universal-pos-tags
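For reference, this universal tag set consists of 12 coarse tags, listed here from the project above (the "." tag covers punctuation):

```python
# The 12 coarse tags of the Google universal POS tag set
# (Petrov, Das & McDonald); "." covers punctuation marks.
UNIVERSAL_POS_TAGS = {
    "VERB", "NOUN", "PRON", "ADJ", "ADV", "ADP",
    "CONJ", "DET", "NUM", "PRT", "X", ".",
}
print(len(UNIVERSAL_POS_TAGS))  # 12
```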
2. Volume-level Metadata
========================
The Extracted Features JSON files also include volume level metadata.
This is mapped to a variety of fields to support various features in
the Solr-EF search interface. The key definitions are:
So:
'_t' when we need things to be tokenized and stored
'_txt' when we need things to be tokenized, but not stored
'_s' and '_ss' when we do not want the value tokenized (see facets
below); the former when there is a single value, the latter for
multiple values.
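These conventions can be summarized in a small lookup sketch (our own illustration, not taken from the Solr configuration):

```python
# Behaviour of each dynamic-field suffix, as described above.
SUFFIX_BEHAVIOUR = {
    "_t":   "tokenized, stored",
    "_txt": "tokenized, not stored",
    "_s":   "string (untokenized), single-valued",
    "_ss":  "string (untokenized), multi-valued",
}
print(SUFFIX_BEHAVIOUR["_txt"])  # tokenized, not stored
```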
2.1. Straightforward volume-level metadata search
=================================================
To support volume-level search, we map metadata to fields such as:
title_t
pubPlaceName_t
pubDate_t
...
In the case of "pubDate", the value is also indexed as an integer,
using the dynamic integer field:
pubDate_i
so that numeric range searches can be issued.
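For example, a numeric range restriction (hypothetical date bounds) can be URL-encoded as:

```python
from urllib.parse import urlencode

# Hypothetical range search on the integer pubDate_i field described
# above: volumes published between 1800 and 1850 inclusive.
params = urlencode({"q": "pubDate_i:[1800 TO 1850]"})
print(params)  # q=pubDate_i%3A%5B1800+TO+1850%5D
```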
These fields are stored in the index, so the interface can retrieve
these values and present them to the user in the results set that is
produced.
2.2. Combined volume-level metadata and page-level text search
==============================================================
To support this sort of combined search, every page of a volume also
indexes that volume's metadata. We map these to non-stored
fields:
volumetitle_txt
volumepubPlaceName_txt
...
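A sketch of such a combined query (the terms are hypothetical), with page-text and per-page volume-metadata clauses joined in a single q parameter:

```python
from urllib.parse import urlencode

# Hypothetical combined query: English page text mentioning "ocean",
# restricted to pages whose parent volume's title mentions "whale".
q = "en_htrctokentext:ocean AND volumetitle_txt:whale"
params = urlencode({"q": q})
print(params)
```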
2.3. Facets
===========
To implement faceted search (e.g., filter by genre), we want
non-tokenized volume-level fields. These go by the form:
rightsAttribute_s
genre_ss
The '_s' is for when there can be only one value (such as its
copyright status), and '_ss' when there can be more than one value (as
in genre). We want it to be a string rather than tokenized text so,
for example, restricting a facet to 'fiction' doesn't trigger matches
in items tagged as 'non fiction'.
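A sketch of a faceted request (field values are hypothetical), faceting on genre_ss and then filtering to an exact, untokenized value:

```python
from urllib.parse import urlencode

# Hypothetical faceted search: facet matching pages by genre, then
# restrict results to the exact string value "fiction" via a filter
# query -- since genre_ss is untokenized, items tagged "non fiction"
# are not matched.
params = urlencode({
    "q": "en_htrctokentext:whale",
    "facet": "true",
    "facet.field": "genre_ss",
    "fq": 'genre_ss:"fiction"',
})
print(params)
```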
You will find some examples of direct calls to the Solr Search API
that pull all this together at:
https://solr2.ischool.illinois.edu/solr-ef/solr-ef20-query-api.txt