Aspell Lookup Table

    All data records stored in this lookup table are kept in memory, so enough memory must be available to hold all of them. If data records are loaded into the Aspell lookup table from a data file, the available memory should be at least about 7 times the size of that file. The exact multiplier depends on the types of data records stored in the data file.
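
    As a rough illustration of the sizing rule above, the following Python sketch estimates the memory needed for a given data file. The function name and the idea of passing the multiplier as a parameter are assumptions made for this example; only the factor of 7 comes from the text, and it is an approximation rather than a guarantee.

```python
# Rule of thumb from the text: about 7x the data file size.
# The real multiplier depends on the record types, so it is a parameter here.
DEFAULT_MULTIPLIER = 7


def estimated_memory_bytes(file_size_bytes, multiplier=DEFAULT_MULTIPLIER):
    """Return a rough estimate of the memory needed to hold the lookup table."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    return int(file_size_bytes * multiplier)


# Example: a 50 MiB data file.
MIB = 1024 * 1024
estimate = estimated_memory_bytes(50 * MIB)
print(estimate // MIB, "MiB")  # prints "350 MiB"
```

    Treat the result as a lower bound for planning purposes and measure actual memory use with your own data.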

    If you are working with data records that are similar but not fully identical, you should use this type of lookup table. For example, you can use Aspell lookup table for addresses.

    Aspell lookup table allows you to have multiple records with the same key value.
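
    Conceptually, allowing duplicate keys means that one key maps to a list of records rather than to a single record. The Python sketch below only illustrates that idea; the record layout and the build_index helper are hypothetical and say nothing about how the lookup table is implemented internally.

```python
from collections import defaultdict


def build_index(records, key_field):
    """Group records by key so that one key can hold several records."""
    index = defaultdict(list)
    for record in records:
        index[record[key_field]].append(record)
    return index


# Hypothetical address records; two of them share the same key value.
records = [
    {"street": "Main Street", "city": "Springfield"},
    {"street": "Main Street", "city": "Shelbyville"},
    {"street": "High Street", "city": "Springfield"},
]
index = build_index(records, "street")
print(len(index["Main Street"]))  # prints 2
```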

    Creating Aspell Lookup Table

    In the Aspell lookup table wizard, you set up the required properties: give the lookup table a Name, select the corresponding Metadata, and select the Lookup key field that should be used to look up data records in the table. The lookup key field must be of the string data type.

    You can also specify the Data file URL where the data records of the lookup table will be stored, and the charset of the data file (Data file charset). The default charset is UTF-8.

    You can set the threshold that should be used by the lookup table (Spelling threshold). It must be higher than 0, and its default value is 230. The threshold is the maximum edit_distance value allowed between the query and a result: words whose edit_distance value is higher than the specified limit are not included in the results. The higher the threshold, the more tolerant the lookup table is of spelling errors.

    You can also change the default costs of individual operations (Edit costs):

    • Case cost

      Used when the case of one character is changed.

    • Transpose cost

      Used when one character is transposed with another in the string.

    • Delete cost

      Used when one character is deleted from the string.

    • Insert cost

      Used when one character is inserted into the string.

    • Replace cost

      Used when one character is replaced by another one.
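
    To show how the five edit costs listed above can combine into a single distance value, here is a minimal Python sketch of a weighted edit distance (the "optimal string alignment" variant of Damerau-Levenshtein). It is an illustration only: the cost values are made up for the example and are not the wizard defaults, the EditCosts class and the function name are hypothetical, and the actual Aspell algorithm may compute its scores differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditCosts:
    # Illustrative values only; NOT the defaults used by the wizard.
    case: int = 1
    transpose: int = 90
    delete: int = 95
    insert: int = 95
    replace: int = 100


def weighted_edit_distance(query, candidate, costs=EditCosts()):
    """Cheapest total cost of turning `query` into `candidate`."""
    n, m = len(query), len(candidate)
    # dist[i][j] = cheapest way to turn query[:i] into candidate[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * costs.delete
    for j in range(1, m + 1):
        dist[0][j] = j * costs.insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = query[i - 1], candidate[j - 1]
            if a == b:
                substitution = 0
            elif a.lower() == b.lower():
                substitution = costs.case      # same letter, different case
            else:
                substitution = costs.replace
            best = min(
                dist[i - 1][j] + costs.delete,
                dist[i][j - 1] + costs.insert,
                dist[i - 1][j - 1] + substitution,
            )
            # Two adjacent characters swapped, e.g. "form" vs "from".
            if (i > 1 and j > 1 and a != b
                    and a == candidate[j - 2] and query[i - 2] == b):
                best = min(best, dist[i - 2][j - 2] + costs.transpose)
            dist[i][j] = best
    return dist[n][m]


print(weighted_edit_distance("Street", "street"))  # 1  (one case change)
print(weighted_edit_distance("form", "from"))      # 90 (one transposition)
print(weighted_edit_distance("stret", "street"))   # 95 (one insertion)
```

    With costs like these, a single typo stays well below a threshold of 230, while three or more independent typos would usually exceed it.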

    You also need to decide whether letters with diacritic marks are considered identical to those without them. To do that, set the value of the Remove diacritic marks attribute. If it is set to true, diacritic marks are removed before the edit_distance value is computed, so letters with diacritic marks are considered equal to their Latin equivalents. The default value is false, which means that letters with diacritic marks are considered different from those without.
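
    The effect of removing diacritic marks before comparison can be sketched in Python with the standard unicodedata module. This shows one common technique (Unicode decomposition followed by dropping combining marks); it is an assumption that the lookup table behaves similarly, and the remove_diacritics helper is hypothetical.

```python
import unicodedata


def remove_diacritics(text):
    """Strip combining marks so that accented letters match their base letters."""
    # NFD splits a letter such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics("Dvořák"))  # prints "Dvorak"
print(remove_diacritics("café"))    # prints "cafe"
```

    Note that this technique leaves letters without a canonical decomposition, such as "ø" or "ł", unchanged.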

    If you want best guesses to be included in the results, set Include best guesses to true. The default value is false. Best guesses are words whose edit_distance value is higher than the Spelling threshold but for which no better match exists.
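
    The interplay of the Spelling threshold and the Include best guesses option can be sketched as follows. This Python example rests on one reading of the description above, namely that a best guess is the closest candidate returned when nothing falls within the threshold. That reading, the select_matches function, and the sample distances are assumptions for illustration only.

```python
def select_matches(scored, threshold=230, include_best_guesses=False):
    """Filter (word, edit_distance) pairs by the spelling threshold.

    Assumption: if no candidate is within the threshold and best guesses
    are enabled, the closest candidate(s) are returned instead.
    """
    if threshold <= 0:
        raise ValueError("threshold must be higher than 0")
    scored = list(scored)
    within = [(word, dist) for word, dist in scored if dist <= threshold]
    if within or not include_best_guesses or not scored:
        return sorted(within, key=lambda pair: pair[1])
    best = min(dist for _, dist in scored)
    return [(word, dist) for word, dist in scored if dist == best]


# Hypothetical candidates with made-up distances.
close = [("Main Street", 95), ("Main Road", 400), ("Maine Street", 190)]
far = [("Main Road", 400), ("Park Lane", 600)]

print(select_matches(close))                           # the two entries within 230
print(select_matches(far))                             # []
print(select_matches(far, include_best_guesses=True))  # [('Main Road', 400)]
```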

    Then click OK and Finish.

    Figure 34.11. Aspell Lookup Table Wizard


    If you want to know the edit distance between the values in the lookup table and the values arriving on the edge, you must add another field of a numeric type to the lookup table metadata. Set this field to Autofilling (default_value).

    Select this field in the Edit distance field combo.

    When you use an Aspell lookup table in LookupJoin, you can map this lookup table field to the corresponding field on output port 0.

    This way, the values stored in the specified Edit distance field of the lookup table are sent to the specified field on the output.