Solr Schema for Extracted Features v2.0
=======================================

The Solr search interface to the 5.9+ billion page Part-of-Speech (POS)
and language tagged HTRC Extracted Features dataset can be found at:

  https://solr2.htrc.illinois.edu/solr-ef20/index.html

A developer's version is available through:

  https://solr2.htrc.illinois.edu/solr-ef20/index.html

(The developer's version is where new features are tested, and also where
more detailed technical information can be found.)

The raw (underlying) Solr API can be accessed through:

  https://solr2.htrc.illinois.edu/robust-solr8

While the majority of the admin API is password restricted, searching is
publicly open, for example:

  https://solr2.htrc.illinois.edu/robust-solr8/solr345678-faceted-htrc-full-ef2-shards24x2/select?q=title_t%3A*

This README details the Solr schema developed for the Workset Builder 2.0
interface, which operates over the Extracted Features 2.0 JSON format. In
shortened form, this interface is at times referred to as the Solr-EF 2.0
Search interface.

The schema makes heavy use of dynamic fields. First we describe the
page-level full-text indexing part of searching, and then move on to how
volume-level metadata is blended in with this.


1. Indexing the language and POS tagged full text
=================================================

In terms of how the Solr index is structured, we take the default schema
that ships with Solr 7/8 and make the additions/changes detailed below.

At the base level are the field types used for the tagged token text.
Given the volume of text per page being pushed through, we have opted not
to have these fields stored. On top of this we then have a dynamicField
for the token text: no stemming is applied, but the text is case-folded.

For every page in the JSON Extracted Features files, the text is tagged
with a language (determined by an OpenNLP pass earlier in the production
of these JSON files). In the Solr index that is built, we use this to map
the text to language-specific fields:

  fr_htrctokentext
  es_htrctokentext
  ...

For 6 languages, OpenNLP Part-of-Speech tagging models existed and were
also run against the pages of text. These languages were: English,
French, German, Spanish/Castilian, Arabic, and Mainland Chinese.

For JSON Extracted Features files with pages in the above 6 languages,
the individual words are further tagged with POS information. This
information is mapped into language+POS fields such as:

  en_NOUN_htrctokentext
  en_ADJ_htrctokentext
  ...
  de_NOUN_htrctokentext
  ...

The OpenNLP language tagger labels each page with a 2-letter language
code, which we make lowercase in our field names. The OpenNLP language
models used extend these codes slightly with 2-letter country codes
(https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), using zh-cn and zh-tw
for Mainland Chinese and Taiwanese Chinese respectively. From what we can
tell, this is actually indicative of whether the classified text is
simplified or traditional Chinese.

For POS tags we follow the universal POS tag set developed by Google:

  https://github.com/slavpetrov/universal-pos-tags


2. Volume-level Metadata
========================

The Extracted Features JSON files also include volume-level metadata.
This is mapped to a variety of fields to support various features in the
Solr-EF search interface. The key dynamic field suffixes are used as
follows:

  '_t'   when we need the value to be tokenized and stored
  '_txt' when we need the value to be tokenized, but not stored
  '_s' and '_ss' when we do not want the value tokenized (see Facets
         below); the former when there is a single term, the latter when
         there are multiple terms.
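By way of illustration, the sketch below shows one way the behaviour
described above could be expressed in a Solr managed-schema. It is not a
copy of the production Solr-EF 2.0 schema: the field type name and
analyzer chain are assumptions chosen to match the description
(case-folded, unstemmed token text that is indexed but not stored), and
the '_t'/'_txt'/'_s'/'_ss' definitions follow the stock Solr dynamic
field conventions with the stored/tokenized settings noted above.

  <!-- Illustrative sketch only; type names and analyzer details are
       assumptions, not the production Solr-EF 2.0 schema -->

  <!-- Token text: case-folded, no stemming -->
  <fieldType name="htrctokentext_type" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Page-level token text fields (per language, and per language+POS),
       e.g. en_htrctokentext, en_NOUN_htrctokentext: indexed, not stored -->
  <dynamicField name="*_htrctokentext" type="htrctokentext_type"
                indexed="true" stored="false" multiValued="true"/>

  <!-- Volume-level metadata suffixes -->
  <dynamicField name="*_t"   type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_txt" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="*_s"   type="string"  indexed="true" stored="true"/>
  <dynamicField name="*_ss"  type="strings" indexed="true" stored="true" multiValued="true"/>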
2.1. Straightforward volume-level metadata search
=================================================

To support volume-level search, we map metadata to fields such as:

  title_t
  pubPlaceName_t
  pubDate_t
  ...

In the case of "pubDate", the value is also indexed as an integer, using
the dynamic integer field:

  pubDate_i

so number range searches can be issued.

These fields are stored in the index, so the interface can retrieve their
values and present them to the user in the result set that is produced.


2.2. Combined volume-level metadata and page-level text search
==============================================================

To support this sort of combined search, every page of a volume also
indexes its volume-level metadata. We map these to non-stored fields:

  volumetitle_txt
  volumepubPlaceName_txt
  ...


2.3. Facets
===========

To implement faceted search (e.g., filtering by genre), we want
non-tokenized volume-level fields. These take the form:

  rightsAttribute_s
  genre_ss

The '_s' is for when there can be only one value (such as the volume's
copyright status), and '_ss' for when there can be more than one value
(as with genre). We want these to be strings rather than tokenized text
so that, for example, restricting a facet to 'fiction' does not also
match items tagged as 'non fiction'.

You will find some examples of direct calls to the Solr Search API that
pull all this together at:

  https://solr2.ischool.illinois.edu/solr-ef/solr-ef20-query-api.txt
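To give a flavour of how the pieces above combine, here are two
hypothetical query shapes against the collection select URL shown at the
top of this README
(https://solr2.htrc.illinois.edu/robust-solr8/solr345678-faceted-htrc-full-ef2-shards24x2/select).
The field values and date range are invented purely for illustration, and
the query parameters would need to be URL-encoded before being issued.

A volume-level metadata search with a publication-date range and a genre
facet (cf. 2.1 and 2.3):

  ?q=title_t:whale
   &fq=pubDate_i:[1850 TO 1900]
   &fq=genre_ss:fiction
   &facet=true&facet.field=genre_ss

A combined page-level text and volume-level metadata search (cf. 2.2):

  ?q=en_NOUN_htrctokentext:whale AND volumetitle_txt:moby

The query-api document linked above gives actual worked examples.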