saos-tm.extractor.common
conv-html-to-text
(conv-html-to-text s)
Removes all HTML tags from the text in string s. Uses the Tika HTML parser under the hood.
get-closest-regex-match
(get-closest-regex-match regexes following-text-regex s)
Takes every regex in regexes and joins it with following-text-regex to form a combined regex. Then looks for the match among these combined regexes that appears first in the text s.
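The description above can be illustrated with a minimal sketch. The name closest-match-sketch and the shape of the returned map are assumptions made for illustration; the docstring does not say what the real function returns.

```clojure
;; Hypothetical sketch, not the library's implementation.
(defn closest-match-sketch
  "Joins each regex in regexes with following-text-regex and returns
  the earliest match in s as {:start .. :match ..}, or nil if none match."
  [regexes following-text-regex s]
  (let [first-hit (fn [re]
                    ;; str on a regex yields its pattern source
                    (let [joined (re-pattern (str re following-text-regex))
                          ^java.util.regex.Matcher m (re-matcher joined s)]
                      (when (.find m)
                        {:start (.start m) :match (.group m)})))]
    (->> regexes
         (keep first-hit)   ; drop regexes that do not match at all
         (sort-by :start)   ; earliest position first
         first)))

(closest-match-sketch [#"art\." #"sec\."] #" \d+" "see sec. 2 and art. 5")
;; => {:start 4, :match "sec. 2"}
```

Because sort-by is stable, a tie on the start position is resolved in favour of the regex listed first in this sketch.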
get-closest-regex-match-case-ins
(get-closest-regex-match-case-ins regexes following-text-regex s)
Case-insensitive version of get-closest-regex-match.
get-closest-regex-match-case-sen
(get-closest-regex-match-case-sen regexes following-text-regex s)
Case-sensitive version of get-closest-regex-match.
get-regex-matches-with-starts-ends-maps
(get-regex-matches-with-starts-ends-maps re s)
Returns matches for regex re in string s with their start and end positions. Returns a list of maps with keys:
:start - starting position of match
:end - end position of match
:match - the match itself
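A minimal sketch of how such maps could be built with java.util.regex.Matcher follows. The name matches-with-positions is hypothetical, and it is an assumption that :end is the exclusive end offset, as Matcher reports it.

```clojure
;; Hypothetical sketch, not the library's implementation.
(defn matches-with-positions
  "Returns a vector of {:start :end :match} maps for every match of re in s."
  [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (loop [acc []]
      (if (.find m)
        (recur (conj acc {:start (.start m)   ; index of first matched char
                          :end   (.end m)     ; index just past the match
                          :match (.group m)}))
        acc))))

(matches-with-positions #"\d+" "ab12cd345")
;; => [{:start 2, :end 4, :match "12"} {:start 6, :end 9, :match "345"}]
```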
get-regex-matches-with-starts-ends-sorted
(get-regex-matches-with-starts-ends-sorted re s)
The same as get-regex-matches-with-starts-ends-maps, but sorted.
preprocess
(preprocess s)
Text preprocessing function:
- rejoins words split across lines
- removes html tags
- removes hard spaces
- converts newlines to spaces
- converts double spaces to single
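The steps listed above could be chained roughly as in the following sketch. Several details are assumptions: split words are taken to be marked by a hyphen at the end of a line, hard spaces are replaced by regular spaces rather than deleted, and a naive regex stands in for the Tika-based tag removal. The name preprocess-sketch is hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch, not the library's implementation.
(defn preprocess-sketch
  [s]
  (-> s
      (str/replace #"-\r?\n" "")    ; rejoin words hyphenated across lines (assumed convention)
      (str/replace #"<[^>]*>" "")   ; naive tag stripping, standing in for Tika
      (str/replace "\u00A0" " ")    ; hard (non-breaking) spaces become regular spaces
      (str/replace #"\r?\n" " ")    ; newlines become spaces
      (str/replace #" {2,}" " ")))  ; runs of spaces collapse to one

(preprocess-sketch "<p>judg-\nment of\u00A0the  court</p>\nfinal")
;; => "judgment of the court final"
```

Collapsing whole runs of spaces, rather than only pairs, is a design choice of this sketch; a single pass that replaces exactly two spaces with one would leave longer runs partly intact.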
replace-several
(replace-several content & replacements)
Replaces several elements in a string. The replacements are given as alternating pattern and replacement arguments.
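One way such a variadic replacement could work is to fold clojure.string/replace over pattern/replacement pairs, in the order given. This is a minimal sketch under that assumption; replace-several-sketch is a hypothetical name and the real implementation may differ.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch, not the library's implementation.
(defn replace-several-sketch
  [content & replacements]
  (reduce (fn [acc [pattern replacement]]
            (str/replace acc pattern replacement))
          content
          ;; group the flat argument list into [pattern replacement] pairs
          (partition 2 replacements)))
```

Since the pairs are applied sequentially in this sketch, a later pattern also sees text produced by an earlier replacement.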
Example:
(replace-several "aaabbbccc" #"a" "" #"b" "d")
"dddccc"
sort-regexes
(sort-regexes regexes end-indicator)
Sorts the collection of regex matches in regexes, as extracted by the function get-regex-matches-with-starts-ends-maps. end-indicator can have the value :start or :end, depending on which end of the match we want to sort by.
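Assuming the collection holds maps with :start and :end keys and that the sort is ascending, the behaviour can be sketched with sort-by. The name sort-matches-sketch is hypothetical.

```clojure
;; Hypothetical sketch, not the library's implementation.
(defn sort-matches-sketch
  [matches end-indicator]
  ;; end-indicator is :start or :end and doubles as the sort key function
  (sort-by end-indicator matches))

(sort-matches-sketch [{:start 6 :end 9 :match "345"}
                      {:start 2 :end 4 :match "12"}]
                     :start)
;; => ({:start 2, :end 4, :match "12"} {:start 6, :end 9, :match "345"})
```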