public abstract class TextConversion
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
static java.lang.String | TEXT_SPLIT_REGEXP - Pattern used to split words in a string |
Constructor and Description |
---|
TextConversion() |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | findReplace(java.lang.CharSequence string, java.util.regex.Pattern findPattern, java.lang.String... replacement) - Finds all occurrences of the given pattern in the string and replaces each with the replacement string according to the matched group. |
static java.lang.String | getAllFieldsData(StringFieldDataProvider textFieldDataProvider, java.lang.String fieldSeparatorString, boolean addNullFields) - Creates a text from all fields of the StringFieldDataProvider. |
static java.lang.String[] | identifiersToWords(IntStorageIndexed<java.lang.String> wordIndex, int[] ids) - Converts the given array of word identifiers to words using the given storage. |
static java.lang.String | join(java.lang.String[] words, java.lang.String separator) - Returns the string created by joining the words using the given separator. |
static java.util.Set<java.lang.String> | loadDatabaseWords(java.lang.String dbConnUrl, java.lang.String tableName, java.lang.String columnName) - Loads a set of words from a given table. |
static java.util.List<java.lang.String> | matchingWords(java.util.Collection<java.lang.String> words, java.util.regex.Pattern pattern) - Creates a list of words that match the given regular expression. |
static StringFieldDataProvider | metaobjectToTextProvider(MetaObject metaObject) - Converts the given MetaObject to a StringFieldDataProvider using the encapsulated objects that implement the StringDataProvider. |
static java.lang.String[] | normalizeAndSplitString(java.lang.String string) - The string is normalized and then split into separate words by any sequence of non-alphanumeric characters. |
static java.lang.String[] | normalizeAndSplitString(java.lang.String string, java.lang.String stringSplitRegexp) - The string is normalized and then split into separate words by the given stringSplitRegexp. |
static java.lang.String | normalizeString(java.lang.String string) - Normalizes the given string by lower-casing, replacing the diacritic characters with their Latin counterparts, and removing/replacing other unwanted character sequences. |
static java.util.Set<java.lang.String> | stemWords(java.util.Collection<java.lang.String> words, Stemmer stemmer) - Processes the given collection of words by stemming. |
static int[][] | textsToWordIdentifiersMultiIndex(java.lang.String[] strings, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, IntStorageIndexed<java.lang.String> writableWordIndex, IntStorageIndexed<java.lang.String>[] readonlyWordIndexes) - Transforms multiple strings of words into a two-dimensional array of addresses. |
static int[] | textToWordIdentifiers(java.lang.String string, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex) - Transforms a string of words into an array of addresses. |
static int[] | textToWordIdentifiers(java.lang.String string, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, WordExpander expander, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex) - Transforms a string of words into an array of addresses. |
static int[] | textToWordIdentifiersMultiIndex(java.lang.String string, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, IntStorageIndexed<java.lang.String> writableWordIndex, IntStorageIndexed<java.lang.String>[] readonlyWordIndexes) - Transforms a string of words into an array of addresses. |
static java.lang.String | unifyWord(java.lang.String keyWord, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, boolean normalize) - Returns a stemmed, non-ignored word. |
static java.util.Collection<java.lang.String> | unifyWords(java.lang.String[] keyWords, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, boolean normalize) - Returns a collection of stemmed, non-ignored words. |
static int[] | wordsToIdentifiers(java.lang.String[] words, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, WordExpander expander, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex, boolean normalize) - Transforms a list of words into an array of addresses. |
static int[] | wordsToIdentifiers(java.lang.String[] words, java.util.Set<java.lang.String> ignoreWords, WordExpander expander, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex, boolean normalize) - Transforms a list of words into an array of addresses. |
static int | wordsToIdentifiersRead(java.util.Collection<java.lang.String> words, IntStorageIndexed<java.lang.String> wordIndex, int[] identifiers, int index) - Transforms a list of words into an array of addresses by reading the given word index. |
static int | wordsToIdentifiersStore(java.util.Collection<java.lang.String> words, IntStorageIndexed<java.lang.String> wordIndex, int[] identifiers, int index) - Transforms a list of words into an array of addresses by storing the words into the given word index and retrieving the generated identifiers. |
public static final java.lang.String TEXT_SPLIT_REGEXP
Pattern used to split words in a string
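
As a rough orientation for the text-to-identifier methods summarized above, the following sketch strings together the steps their descriptions mention: split the text, drop stop words, stem, skip already seen words, and map the rest to integer identifiers. It does not use the documented classes. `WordIdPipelineSketch`, `store`, and `textToIds` are hypothetical names, the map stands in for an `IntStorageIndexed<String>`, the stemmer is a plain function, and the processing order and the updating of `ignoreWords` are assumptions rather than the library's confirmed behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Illustrative sketch only; none of these names belong to the documented library.
final class WordIdPipelineSketch {

    // Hypothetical stand-in for a writable word index: word -> integer identifier.
    private final Map<String, Integer> index = new HashMap<>();

    // Returns the identifier of the word, inserting the word first if it is unknown.
    int store(String word) {
        Integer id = index.get(word);
        if (id == null) {
            id = index.size() + 1; // identifiers are assigned sequentially from 1
            index.put(word, id);
        }
        return id;
    }

    // Assumed order: lower-case, split, drop stop words, stem, skip ignored words, store.
    int[] textToIds(String text, String splitRegexp, Set<String> ignoreWords,
                    Set<String> stopWords, UnaryOperator<String> stemmer) {
        List<Integer> ids = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split(splitRegexp)) {
            if (word.isEmpty() || (stopWords != null && stopWords.contains(word))) {
                continue; // stop words are skipped and the stop-word set is never modified
            }
            String stem = stemmer.apply(word);
            // Assumption: ignoreWords collects accepted words, so repeats are dropped;
            // a null ignoreWords set means every word is kept.
            if (ignoreWords != null && !ignoreWords.add(stem)) {
                continue;
            }
            ids.add(store(stem));
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        WordIdPipelineSketch sketch = new WordIdPipelineSketch();
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        // Toy stemmer that only strips a trailing "s".
        UnaryOperator<String> stemmer =
                w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        int[] ids = sketch.textToIds("The cats and the cat", "[^a-z0-9]+",
                new HashSet<>(), stopWords, stemmer);
        System.out.println(Arrays.toString(ids)); // prints [1]
    }
}
```

In this toy run, "cats" and "cat" reduce to the same stem, so only one identifier is produced; passing `null` as `ignoreWords` would keep both occurrences.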
public static java.lang.String findReplace(java.lang.CharSequence string, java.util.regex.Pattern findPattern, java.lang.String... replacement)
Finds all occurrences of the given pattern in the string and replaces each with the replacement string according to the matched group.
Parameters:
string - the string to apply the find-and-replace on
findPattern - the pattern to find; it must have the same number of groups as the replacement array
replacement - the list of replacement strings

public static java.lang.String join(java.lang.String[] words, java.lang.String separator)
Returns the string created by joining the words using the given separator.
Parameters:
words - the strings to join
separator - the separator to join the strings with

public static java.lang.String normalizeString(java.lang.String string)
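Normalizes the given string by lower-casing, replacing the diacritic characters with their Latin counterparts, and removing/replacing other unwanted character sequences.
Parameters:
string - the string to normalize
See Also:
NORMALIZER_REPLACE_PATTERN, NORMALIZER_REPLACE_STRINGS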
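
The kind of normalization described for normalizeString can be approximated with the JDK alone. The sketch below is not the library's implementation: the contents of NORMALIZER_REPLACE_PATTERN and NORMALIZER_REPLACE_STRINGS are not shown in this page, so the whitespace collapsing used here is only an assumed stand-in for "other unwanted character sequences", and `NormalizeSketch` is a hypothetical name.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative sketch only; the documented method may apply different replacements.
final class NormalizeSketch {

    static String normalize(String input) {
        // Lower-case independently of the default locale.
        String lower = input.toLowerCase(Locale.ROOT);
        // Decompose accented letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        // Drop the combining marks, leaving the plain Latin base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: collapse runs of whitespace as an example of an unwanted sequence.
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Café  CRÈME")); // prints cafe creme
    }
}
```

Letters that have no canonical decomposition, such as "ł" or "ø", pass through this sketch unchanged, which is one reason a dedicated replacement table may be needed in practice.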
public static java.lang.String[] normalizeAndSplitString(java.lang.String string, java.lang.String stringSplitRegexp)
The string is normalized and then split into separate words by the given stringSplitRegexp.
Parameters:
string - the string with the words to normalize and split
stringSplitRegexp - the regular expression used to split the string into words

public static java.lang.String[] normalizeAndSplitString(java.lang.String string)
The string is normalized and then split into separate words by any sequence of non-alphanumeric characters.
Parameters:
string - the string with the words to normalize and split

public static java.util.List<java.lang.String> matchingWords(java.util.Collection<java.lang.String> words, java.util.regex.Pattern pattern)
Creates a list of words that match the given regular expression.
Parameters:
words - the source words to match
pattern - the regular expression matching valid words

public static java.util.Collection<java.lang.String> unifyWords(java.lang.String[] keyWords, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, boolean normalize) throws TextConversionException
Returns a collection of stemmed, non-ignored words. If ignoreWords is not null, the resulting collection does not contain duplicate words.
Parameters:
keyWords - the list of keywords to transform
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
stopWords - set of words to ignore but not update
stemmer - a Stemmer for word transformation
normalize - if true, each keyword is first normalized
Throws:
TextConversionException - if there was an error stemming the word

public static java.lang.String unifyWord(java.lang.String keyWord, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, boolean normalize) throws TextConversionException
Returns a stemmed, non-ignored word. If ignoreWords is not null, the resulting collection does not contain duplicate words.
Parameters:
keyWord - the keyword to transform
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
stopWords - set of words to ignore but not update
stemmer - a Stemmer for word transformation
normalize - if true, the keyword is first normalized
Throws:
TextConversionException - if there was an error stemming the word

public static java.util.Set<java.lang.String> stemWords(java.util.Collection<java.lang.String> words, Stemmer stemmer) throws TextConversionException
Processes the given collection of words by stemming.
Parameters:
words - the words to process
stemmer - the stemmer to use
Throws:
TextConversionException - if a stemming error occurred

public static java.lang.String[] identifiersToWords(IntStorageIndexed<java.lang.String> wordIndex, int[] ids) throws TextConversionException
Converts the given array of word identifiers to words using the given storage.
Parameters:
wordIndex - the index used to transform the integers to words
ids - the array of integers to convert
Throws:
TextConversionException - if there was an error reading a word with a given identifier from the index

public static int wordsToIdentifiersRead(java.util.Collection<java.lang.String> words, IntStorageIndexed<java.lang.String> wordIndex, int[] identifiers, int index) throws TextConversionException
Transforms a list of words into an array of addresses by reading the given word index. The identifiers are put into the identifiers array starting at index.
Parameters:
words - the words to transform
wordIndex - the index for translating words to addresses
identifiers - the destination array where the word identifiers will be put
index - the starting index of the identifiers array where the identifiers will be put
Returns:
the index of the identifiers array where the next identifier should be put; it is equal to the length of the identifiers array if and only if all the words were processed and the words array is empty
Throws:
TextConversionException - if there was an error reading the index

public static int wordsToIdentifiersStore(java.util.Collection<java.lang.String> words, IntStorageIndexed<java.lang.String> wordIndex, int[] identifiers, int index) throws TextConversionException
Transforms a list of words into an array of addresses by storing the words into the given word index and retrieving the generated identifiers.
Parameters:
words - the words to transform
wordIndex - the index for translating words to addresses
identifiers - the destination array where the word identifiers will be put
index - the starting index of the identifiers array where the identifiers will be put
Returns:
the index of the identifiers array where the next identifier should be put; it is equal to the length of the identifiers array if and only if all the words were processed and the words array is empty
Throws:
TextConversionException - if there was an error reading the index

public static int[] wordsToIdentifiers(java.lang.String[] words, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, WordExpander expander, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex, boolean normalize) throws TextConversionException
Transforms a list of words into an array of addresses.
Parameters:
words - the list of words to transform
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
stopWords - set of words to ignore but not update
expander - instance for expanding the list of words
stemmer - a Stemmer for word transformation
wordIndex - the index for translating words to addresses
normalize - if true, each word is first normalized
Throws:
TextConversionException - if there was an error stemming the word or reading the index

public static int[] wordsToIdentifiers(java.lang.String[] words, java.util.Set<java.lang.String> ignoreWords, WordExpander expander, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex, boolean normalize) throws TextConversionException
Transforms a list of words into an array of addresses.
Parameters:
words - the list of words to transform
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
expander - instance for expanding the list of words
stemmer - a Stemmer for word transformation
wordIndex - the index for translating words to addresses
normalize - if true, each word is first normalized
Throws:
TextConversionException - if there was an error stemming the word or reading the index

public static int[] textToWordIdentifiers(java.lang.String string, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex) throws TextConversionException
Transforms a string of words into an array of addresses. The string is normalized and split first, then the words are converted to identifiers. Note that unknown words are added to the index.
Parameters:
string - the string with the words to transform
stringSplitRegexp - the regular expression used to split the string into words
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
stopWords - set of words to ignore but not update
stemmer - a Stemmer for word transformation
wordIndex - the index for translating words to addresses
Throws:
java.lang.IllegalStateException - if there was a problem reading the index
TextConversionException - if there was an error stemming the word

public static int[] textToWordIdentifiers(java.lang.String string, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, WordExpander expander, Stemmer stemmer, IntStorageIndexed<java.lang.String> wordIndex) throws TextConversionException
Transforms a string of words into an array of addresses. The string is normalized and split first, then the words are expanded, and finally all the expanded words are converted to identifiers. Note that unknown words are added to the index.
Parameters:
string - the string with the words to transform
stringSplitRegexp - the regular expression used to split the string into words
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
stopWords - set of words to ignore but not update
expander - instance for expanding the list of words
stemmer - a Stemmer for word transformation
wordIndex - the index for translating words to addresses
Throws:
java.lang.IllegalStateException - if there was a problem reading the index
TextConversionException - if there was an error expanding or stemming the words

public static java.util.Set<java.lang.String> loadDatabaseWords(java.lang.String dbConnUrl, java.lang.String tableName, java.lang.String columnName) throws java.sql.SQLException
Loads a set of words from a given table.
Parameters:
dbConnUrl - the database JDBC connection URL
tableName - the table name to get the words from
columnName - the table column name to get the words from
Throws:
java.sql.SQLException - if there was a problem communicating with the database

public static int[] textToWordIdentifiersMultiIndex(java.lang.String string, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> ignoreWords, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, IntStorageIndexed<java.lang.String> writableWordIndex, IntStorageIndexed<java.lang.String>[] readonlyWordIndexes) throws TextConversionException
Transforms a string of words into an array of addresses. The string is normalized and split first, then the readonly word indexes are sequentially used to transform the words to identifiers. The remaining words are stemmed and read from the writable word index, and if not found, they are inserted.
Parameters:
string - the string with the words to transform
stringSplitRegexp - the regular expression used to split the string into words
ignoreWords - set of words to ignore (e.g. the previously added keywords); if null, all keywords are added
stopWords - set of words to ignore but not update
stemmer - a Stemmer for word transformation
writableWordIndex - the index for translating words to addresses where the unknown words can be inserted
readonlyWordIndexes - the indexes for translating words to addresses
Throws:
java.lang.IllegalStateException - if there was a problem reading the index
TextConversionException - if there was an error expanding or stemming the words

public static int[][] textsToWordIdentifiersMultiIndex(java.lang.String[] strings, java.lang.String stringSplitRegexp, java.util.Set<java.lang.String> stopWords, Stemmer stemmer, IntStorageIndexed<java.lang.String> writableWordIndex, IntStorageIndexed<java.lang.String>[] readonlyWordIndexes) throws TextConversionException
Transforms multiple strings of words into a two-dimensional array of addresses. See the textToWordIdentifiersMultiIndex method.
Parameters:
strings - the multiple strings of words to transform
stringSplitRegexp - the regular expression used to split the string into words
stopWords - set of words to ignore but not update
stemmer - a Stemmer for word transformation
writableWordIndex - the index for translating words to addresses where the unknown words can be inserted
readonlyWordIndexes - the indexes for translating words to addresses
Throws:
java.lang.IllegalStateException - if there was a problem reading the index
TextConversionException - if there was an error expanding or stemming the words

public static java.lang.String getAllFieldsData(StringFieldDataProvider textFieldDataProvider, java.lang.String fieldSeparatorString, boolean addNullFields)
Creates a text from all fields of the StringFieldDataProvider.
Parameters:
textFieldDataProvider - the text field data provider the text of which to combine
fieldSeparatorString - the separator inserted between the text of the respective fields
addNullFields - if true, an empty string is added for null textual fields; otherwise the null fields are skipped

public static StringFieldDataProvider metaobjectToTextProvider(MetaObject metaObject)
Converts the given MetaObject to a StringFieldDataProvider using the encapsulated objects that implement the StringDataProvider. Every encapsulated object represents a separate textual field; however, null is returned for the encapsulated objects that do not implement the StringDataProvider interface. The combined text from all fields uses newline as a concatenation separator and the null fields are skipped.
Parameters:
metaObject - the metaobject to convert
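
The way field texts are combined, as described for getAllFieldsData and for the combined text of metaobjectToTextProvider, can be illustrated with plain strings. This is a minimal sketch under stated assumptions: `FieldJoinSketch` and `joinFields` are hypothetical names, and a list of nullable strings stands in for the fields of a StringFieldDataProvider.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; it does not use the documented provider interfaces.
final class FieldJoinSketch {

    // Joins the field texts with the separator; a null entry models a null textual field.
    static String joinFields(List<String> fields, String separator, boolean addNullFields) {
        List<String> parts = new ArrayList<>();
        for (String field : fields) {
            if (field != null) {
                parts.add(field);
            } else if (addNullFields) {
                parts.add(""); // a null field contributes an empty string
            }
            // otherwise the null field is skipped entirely
        }
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("title", null, "body");
        System.out.println(joinFields(fields, "|", false)); // prints title|body
        System.out.println(joinFields(fields, "|", true));  // prints title||body
    }
}
```

Calling `joinFields(fields, "\n", false)` corresponds to the newline-separated, null-skipping combination that the metaobjectToTextProvider description mentions.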