saos-tm.extractor.common
conv-html-to-text
(conv-html-to-text s)
Removes all HTML tags from the text in string s. Uses the Tika HTML parser under the hood.
get-closest-regex-match
(get-closest-regex-match regexes following-text-regex s)
Joins each regex in regexes with following-text-regex to form a combined regex, then looks for the match among these combined regexes that appears first in the text s.
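The lookup described above can be sketched roughly as follows. This is only an illustration: closest-match-sketch is a hypothetical name, and the shape of the returned map (:start, :match, :regex) is an assumption, since the documentation does not say what the real function returns.

```clojure
(defn closest-match-sketch
  "Hypothetical sketch, not the library's code. Appends following-text-regex
  to every regex, finds the first match of each combined regex in s, and
  returns the match that starts earliest (nil when nothing matches)."
  [regexes following-text-regex s]
  (->> regexes
       (keep (fn [re]
               ;; (str re) yields the pattern source, so the two can be concatenated
               (let [m (re-matcher (re-pattern (str re following-text-regex)) s)]
                 (when (.find m)
                   {:start (.start m) :match (.group m) :regex re}))))
       (sort-by :start)
       first))

(closest-match-sketch [#"art\." #"ust\."] #"\s*\d+" "zgodnie z ust. 2 oraz art. 5")
```

In this call the returned map has :match "ust. 2" and :start 10, because the "ust." variant occurs earlier in the text than the "art." variant.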
get-closest-regex-match-case-ins
(get-closest-regex-match-case-ins regexes following-text-regex s)
Case-insensitive version of get-closest-regex-match.
get-closest-regex-match-case-sen
(get-closest-regex-match-case-sen regexes following-text-regex s)
Case-sensitive version of get-closest-regex-match.
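The difference between the two variants can be illustrated with plain Clojure regexes. This is a sketch only: with-case-insensitivity is a hypothetical helper, and whether the library sets the flag this way is an assumption.

```clojure
(defn with-case-insensitivity
  "Hypothetical helper: returns a copy of re with the (?i) flag prepended."
  [re]
  (re-pattern (str "(?i)" re)))

;; Case-sensitive search: the lowercase pattern does not match uppercase text.
(re-find #"art\." "ART. 5")

;; Case-insensitive search: the same pattern now matches.
(re-find (with-case-insensitivity #"art\.") "ART. 5")
```

The first call returns nil and the second returns "ART.".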
get-regex-matches-with-starts-ends-maps
(get-regex-matches-with-starts-ends-maps re s)
Returns the matches of regex re in string s together with their start and end positions. Returns a list of maps with keys:
- :start - start position of the match
- :end - end position of the match
- :match - the match itself
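A minimal sketch of how such maps could be collected with java.util.regex.Matcher is shown below. The name matches-with-positions is hypothetical, and the sketch returns a vector in match order, which may differ from what the real function returns.

```clojure
(defn matches-with-positions
  "Hypothetical sketch, not the library's code. Walks all matches of re in s
  and records each one with its start (inclusive) and end (exclusive) index."
  [re s]
  (let [m (re-matcher re s)]
    (loop [acc []]
      (if (.find m)
        (recur (conj acc {:start (.start m) :end (.end m) :match (.group m)}))
        acc))))

(matches-with-positions #"\d+" "ab12cd345")
;; => [{:start 2, :end 4, :match "12"} {:start 6, :end 9, :match "345"}]
```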
get-regex-matches-with-starts-ends-sorted
(get-regex-matches-with-starts-ends-sorted re s)
The same as get-regex-matches-with-starts-ends-maps, but with the matches sorted.
preprocess
(preprocess s)
Text preprocessing function:
- unsplits words across lines
- removes HTML tags
- removes hard spaces
- converts newlines to spaces
- converts double spaces to single
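The steps listed above could be approximated as in the sketch below. Several details are assumptions made for illustration: preprocess-sketch is a hypothetical name, words are assumed to be split with a hyphen at the end of a line, tags are stripped with a crude regex rather than with Tika, and hard spaces (U+00A0) are replaced by ordinary spaces so that neighbouring words do not fuse.

```clojure
(require '[clojure.string :as str])

(defn preprocess-sketch
  "Hypothetical sketch of the listed steps, not the library's code."
  [s]
  (-> s
      (str/replace #"-\r?\n" "")    ; rejoin words hyphenated across a line break (assumed convention)
      (str/replace #"<[^>]*>" "")   ; crude tag stripping; the real function may rely on Tika instead
      (str/replace "\u00A0" " ")    ; hard space -> ordinary space (assumption)
      (str/replace #"\r?\n" " ")    ; newlines -> spaces
      (str/replace #" {2,}" " ")))  ; runs of spaces -> single space

(preprocess-sketch "<p>exam-\nple text</p>\nnext\u00A0 line")
;; => "example text next line"
```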
replace-several
(replace-several content & replacements)
Replaces several elements in the string content. The replacements are passed as alternating pattern and replacement arguments.
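One way a function with this signature could be written is as a fold over the pattern and replacement pairs. This is a sketch under assumptions: replace-several-sketch is a hypothetical name, and applying the pairs left to right is a guess about the real behaviour.

```clojure
(require '[clojure.string :as str])

(defn replace-several-sketch
  "Hypothetical sketch, not the library's code. Applies each
  pattern/replacement pair to content in turn, left to right."
  [content & replacements]
  (reduce (fn [acc [pattern replacement]]
            (str/replace acc pattern replacement))
          content
          (partition 2 replacements)))

(replace-several-sketch "aaabbbccc" #"a" "" #"b" "d")
;; => "dddccc"
```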
Example:
(replace-several "aaabbbccc" #"a" "" #"b" "d")
"dddccc"
sort-regexes
(sort-regexes regexes end-indicator)
Sorts the collection regexes of regex matches extracted by get-regex-matches-with-starts-ends-maps. end-indicator is either :start or :end, depending on which end of the match should be used as the sort key.
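Because each match is a map, sorting by either end can be sketched with sort-by, using the keyword itself as the key function. The name sort-matches-sketch and the ascending order are assumptions for illustration.

```clojure
(defn sort-matches-sketch
  "Hypothetical sketch, not the library's code. Sorts match maps in
  ascending order of the position stored under end-indicator."
  [matches end-indicator]
  (sort-by end-indicator matches))

(def sample-matches
  [{:start 2 :end 4 :match "b"}
   {:start 0 :end 10 :match "a"}])

(map :match (sort-matches-sketch sample-matches :start)) ; => ("a" "b")
(map :match (sort-matches-sketch sample-matches :end))   ; => ("b" "a")
```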