saos-tm.extractor.common

conv-html-to-text

(conv-html-to-text s)

Removes all HTML tags from the text in string s. Uses the Apache Tika HTML parser under the hood.

conv-str-to-regex

(conv-str-to-regex s)

csv-delimiter

find-first

(find-first f coll)

get-closest-regex-match

(get-closest-regex-match regexes following-text-regex s)

Joins each regex in regexes with following-text-regex to form a combined regex, then looks for the match among these combined regexes that appears first in s.
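A minimal sketch of the idea, not the library's implementation. The name closest-match, the returned map shape, and the sample text are assumptions made for illustration; the real function may return something different.

```clojure
;; Hypothetical sketch: combine each regex with the trailing regex,
;; then keep the match that starts earliest in s.
(defn closest-match
  [regexes following-text-regex s]
  (->> regexes
       ;; (str pattern) yields the pattern source, so concatenation joins them
       (map #(re-pattern (str % following-text-regex)))
       (keep (fn [re]
               (let [^java.util.regex.Matcher m (re-matcher re s)]
                 (when (.find m)
                   {:start (.start m) :match (.group m)}))))
       (sort-by :start)
       first))

(closest-match [#"art\." #"ust\."] #"\s*\d+" "zgodnie z ust. 2 oraz art. 5")
;; => {:start 10, :match "ust. 2"}
```

When no combined regex matches, this sketch returns nil.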

get-closest-regex-match-case-ins

(get-closest-regex-match-case-ins regexes following-text-regex s)

Case-insensitive version of get-closest-regex-match.

get-closest-regex-match-case-sen

(get-closest-regex-match-case-sen regexes following-text-regex s)

Case-sensitive version of get-closest-regex-match.

get-file-paths

(get-file-paths dir re)

get-regex-matches-with-starts-ends-maps

(get-regex-matches-with-starts-ends-maps re s)

Returns the matches of regex re in string s together with their start and end positions, as a list of maps with keys:

  • :start - starting position of match
  • :end - end position of match
  • :match - the match itself
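A minimal sketch of how such maps could be collected with a java.util.regex.Matcher. The name matches-with-positions is hypothetical, and the exclusive :end index (the Java convention) is an assumption; the library's own indexing is not stated here.

```clojure
;; Hypothetical sketch: walk the matcher and record each match with its span.
(defn matches-with-positions
  [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (loop [acc []]
      (if (.find m)
        (recur (conj acc {:start (.start m)   ; index of first matched char
                          :end   (.end m)     ; index just past the match
                          :match (.group m)}))
        acc))))

(matches-with-positions #"b+" "aabbbcb")
;; => [{:start 2, :end 5, :match "bbb"} {:start 6, :end 7, :match "b"}]
```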

get-regex-matches-with-starts-ends-sorted

(get-regex-matches-with-starts-ends-sorted re s)

Same as get-regex-matches-with-starts-ends-maps, but with the results sorted.

indices

(indices pred coll)

matches?

(matches? s re)

not-matches?

not-nil?

pl-big-diacritics

pl-diacritics

preprocess

(preprocess s)

Text preprocessing function:

  • rejoins words split across lines
  • removes html tags
  • removes hard spaces
  • converts newlines to spaces
  • converts double spaces to single
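The steps above can be sketched as a threaded pipeline. This is an illustration only: preprocess-sketch is a hypothetical name, the regex-based tag stripping is a crude stand-in (not the parser the library uses), and treating a hyphen before a line break as a split word and a hard space as U+00A0 replaced by a plain space are assumptions.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch of the listed steps, applied in the listed order.
(defn preprocess-sketch
  [s]
  (-> s
      (str/replace #"-\r?\n" "")      ; rejoin words hyphenated across a line break
      (str/replace #"<[^>]*>" "")     ; crude HTML tag removal (assumption)
      (str/replace "\u00A0" " ")      ; hard (non-breaking) space -> plain space
      (str/replace #"\r?\n" " ")      ; newlines -> spaces
      (str/replace #" {2,}" " ")))    ; runs of spaces -> single space

(preprocess-sketch "<p>hyphen-\nated\u00A0word,\nnew  line</p>")
;; => "hyphenated word, new line"
```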

remove-double-spaces

(remove-double-spaces s)

remove-hard-spaces

(remove-hard-spaces s)

remove-html-tags-other-than-span

(remove-html-tags-other-than-span s)

replace-several

(replace-several content & replacements)

Replaces several elements in a string. The replacements are given as alternating pattern and replacement arguments.
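One way such a function could be written is to fold over the pattern/replacement pairs. This is a sketch under that assumption, with the hypothetical name replace-several-sketch; it is not taken from the library source.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: apply each pattern/replacement pair in order.
(defn replace-several-sketch
  [content & replacements]
  (reduce (fn [acc [pattern replacement]]
            (str/replace acc pattern replacement))
          content
          (partition 2 replacements)))   ; pairs: [pattern replacement]
```

Because the pairs are applied left to right, a later pattern sees the output of the earlier replacements.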

Example:

(replace-several "aaabbbccc" #"a" "" #"b" "d")

"dddccc"

sort-regexes

(sort-regexes regexes end-indicator)

Sorts the collection regexes of regex matches extracted by get-regex-matches-with-starts-ends-maps. end-indicator is either :start or :end, depending on which end of each match should be used as the sort key.
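Since end-indicator is a key of the match maps, the sort can be sketched with sort-by. The name sort-matches and the sample data are hypothetical, and ascending order is an assumption.

```clojure
;; Hypothetical sketch: the keyword itself serves as the sort key function.
(defn sort-matches
  [matches end-indicator]
  (sort-by end-indicator matches))

(def sample
  [{:start 4 :end 6 :match "bb"}
   {:start 0 :end 9 :match "aaaaaaaaa"}])

(map :match (sort-matches sample :start))
;; => ("aaaaaaaaa" "bb")
(map :match (sort-matches sample :end))
;; => ("bb" "aaaaaaaaa")
```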

substring?

(substring? sub st)

system-newline