npm install compromise
don't you find it strange how easy text is to make, and how hard it is to actually parse and use?
import nlp from 'compromise'
let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
// 'she sold seashells by the seashore.'
if (doc.has('simon says #Verb')) {
  return true
}
let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
// "the blurst of times?"
and get data:
import plg from 'compromise-speech'
nlp.extend(plg)
let doc = nlp('Milwaukee has certainly had its share of visitors..')
doc.compute('syllables')
doc.places().json()
/*
[{
"text": "Milwaukee",
"terms": [{
"normal": "milwaukee",
"syllables": ["mil", "wau", "kee"]
}]
}]
*/
avoid the problems of brittle parsers:
let doc = nlp("we're not gonna take it..")
doc.has('gonna') // true
doc.has('going to') // true (implicit)
// transform
doc.contractions().expand()
doc.text()
// 'we are not going to take it..'
and whip stuff around like it's data:
let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(20)
doc.text()
// 'ninety five thousand and seventy two'
-because it actually is-
let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
// 'the purple dinosaurs'
Use it on the client-side:
<script src="https://unpkg.com/compromise"></script>
<script>
var doc = nlp('two bottles of beer')
doc.numbers().minus(1)
document.body.innerHTML = doc.text()
// 'one bottle of beer'
</script>
or likewise:
import nlp from 'compromise'
var doc = nlp('London is calling')
doc.verbs().toNegative()
// 'London is not calling'
compromise is ~250kb (minified).
it's pretty fast - it can run on keypress.
it works mainly by conjugating all forms of a basic word list. The final lexicon is ~14,000 words.
you can read more about how it works here - it's weird.
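as a rough sketch of what that means (the example sentences here are ours):
// 'walk' is in the base word-list; its conjugations are generated ahead of time,
// so inflected forms are recognized without being listed one-by-one
nlp('she walked to work').has('#PastTense') // true
nlp('she walks to work').has('#PresentTense') // true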
okay -
compromise/one
A tokenizer of words, sentences, and punctuation.
import nlp from 'compromise/one'
let doc = nlp("Wayne's World, party time")
let data = doc.json()
/* [{
normal:"wayne's world party time",
terms:[{ text: "Wayne's", normal: "wayne" },
...
]
}]
*/
compromise/one splits your text up, wraps it in a handy API, and does nothing else.
/one is quick - most sentences take a 10th of a millisecond.
It can do ~1mb of text a second - or 10 wikipedia pages.
Infinite Jest takes 3s.
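as a rough sketch of what /one gives you (the sample text is ours) - terms, whitespace, and optional character offsets, with the original text reproducible:
import nlp from 'compromise/one'
let doc = nlp('Wayne met Garth in 1992.')
doc.json({ offset: true }) // terms, with whitespace and character offsets kept
doc.text() // 'Wayne met Garth in 1992.' - the input comes back unchanged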
compromise/two
A part-of-speech tagger, and grammar-interpreter.
import nlp from 'compromise/two'
let doc = nlp("Wayne's World, party time")
let str = doc.match('#Possessive #Noun').text()
// "Wayne's World"
this is more useful than people sometimes realize.
Light grammar helps you write cleaner templates, and get closer to the information.
compromise has 83 tags, arranged in a handsome graph.
#FirstName → #Person → #ProperNoun → #Noun
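so a match on a parent tag also matches its children - a quick sketch, with our own example:
let doc = nlp('John waved')
doc.has('#FirstName') // true
doc.has('#Person') // true
doc.has('#Noun') // true - a #FirstName is also a #Noun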
you can see the grammar of each word by running doc.debug()
you can see the reasoning for each tag with nlp.verbose('tagger').
if you prefer Penn tags, you can derive them with:
let doc = nlp('welcome thrillho')
doc.compute('penn')
doc.json()
compromise/three
Phrase and sentence tooling.
import nlp from 'compromise/three'
let doc = nlp("Wayne's World, party time")
let str = doc.people().normalize().text()
// "wayne"
compromise/three is a set of tooling to zoom into and operate on parts of a text.
.numbers() grabs all the numbers in a document, for example - and extends it with new methods, like .subtract().
When you have a phrase, or group of words, you can see additional metadata about it with .json()
let doc = nlp('four out of five dentists')
console.log(doc.fractions().json())
/*[{
text: 'four out of five',
terms: [ [Object], [Object], [Object], [Object] ],
fraction: { numerator: 4, denominator: 5, decimal: 0.8 }
}
]*/
let doc = nlp('$4.09CAD')
doc.money().json()
/*[{
text: '$4.09CAD',
terms: [ [Object] ],
number: { prefix: '$', num: 4.09, suffix: 'cad'}
}
]*/
(match methods use the match-syntax.)
(these methods are on the main nlp object)
nlp.tokenize(str) - parse text without running POS-tagging
nlp.lazy(str, match) - scan through a text with minimal analysis
nlp.plugin({}) - mix in a compromise-plugin
nlp.parseMatch(str) - pre-parse any match statements into json
nlp.world() - grab or change library internals
nlp.model() - grab all current linguistic data
nlp.methods() - grab or change internal methods
nlp.hooks() - see which compute methods run automatically
nlp.verbose(mode) - log our decision-making for debugging
nlp.version - current semver version of the library
nlp.addWords(obj, isFrozen?) - add new words to the lexicon
nlp.addTags(obj) - add new tags to the tagSet
nlp.typeahead(arr) - add words to the auto-fill dictionary
nlp.buildTrie(arr) - compile a list of words into a fast lookup form
nlp.buildNet(arr) - compile a list of matches into a fast match form
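a small sketch of two of these (the word-list and text are ours):
// parse text without running the tagger
let doc = nlp.tokenize("we're off to Milwaukee")
// compile a word-list once, then scan documents with it quickly
let trie = nlp.buildTrie(['madison', 'milwaukee', 'green bay'])
doc.lookup(trie).found // true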
selections and transformations on a document include:
.nouns().toPlural() - 'football captain' → 'football captains'
.nouns().toSingular() - 'turnovers' → 'turnover'
.verbs().toPastTense() - 'will go' → 'went'
.verbs().toPresentTense() - 'walked' → 'walks'
.verbs().toFutureTense() - 'walked' → 'will walk'
.verbs().toInfinitive() - 'walks' → 'walk'
.verbs().toGerund() - 'walks' → 'walking'
.verbs().toPastParticiple() - 'drive' → 'had driven'
.verbs().toNegative() - 'went' → 'did not go'
.verbs().toPositive() - "didn't study" → 'studied'
.numbers().toNumber() - '5'
.numbers().toText() - 'five'
.numbers().toOrdinal() - 'fifth' or '5th'
.numbers().toCardinal() - 'five' or '5'
.money() - things like '$2.50'
.sentences().toPastTense() - 'he walks' → 'he walked'
.sentences().toPresentTense() - 'he walked' → 'he walks'
.sentences().toFutureTense() - 'he walks' → 'he will walk'
.sentences().toInfinitive() - 'he walks' → 'he walk'
.sentences().toNegative() - 'he walks' → "he didn't walk"
.sentences().toQuestion() - set the ending to '?'
.sentences().toExclamation() - set the ending to '!'
.questions() - grab sentences that ask a question
.adjectives() - things like 'quick'
.hyphenated() - terms like 'wash-out'
.phoneNumbers() - things like '(939) 555-0113'
.hashTags() - things like '#nlp'
.emails() - things like 'hi@compromise.cool'
.emoticons() - things like ':)'
.emojis() - things like '💋'
.atMentions() - things like '@nlp_compromise'
.urls() - things like 'compromise.cool'
.pronouns() - things like 'he'
.conjunctions() - things like 'but'
.prepositions() - things like 'of'
.abbreviations() - things like 'Mrs.'
.topics() - people() + places() + organizations()
.adverbs() - things like 'quickly'
.acronyms() - things like 'FBI'
.possessives() - things like "Spencer's"
This library comes with a considerate, common-sense baseline for english grammar.
You're free to change, or lay-waste to any settings - which is the fun part actually.
the easiest part is just to suggest tags for any given words:
let myWords = {
  kermit: 'FirstName',
  fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)
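with that lexicon in place, those words now match as #FirstName (the sample text below is ours):
let doc2 = nlp('kermit and fozzie got into showbiz', myWords)
doc2.match('#FirstName').out('array')
// ['kermit', 'fozzie']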
or make heavier changes with a compromise-plugin.
import nlp from 'compromise'
nlp.extend({
  // add new tags
  tags: {
    Character: {
      isA: 'Person',
      notA: 'Adjective',
    },
  },
  // add or change words in the lexicon
  words: {
    kermit: 'Character',
    gonzo: 'Character',
  },
  // change inflections
  irregulars: {
    get: {
      pastTense: 'gotten',
      gerund: 'gettin',
    },
  },
  // add new methods to compromise
  api: View => {
    View.prototype.kermitVoice = function () {
      this.sentences().prepend('well,')
      this.match('i [(am|was)]').prepend('um,')
      return this
    }
  },
})
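then call the new method like any other view method (the output comment below is illustrative):
let doc = nlp('i am going to the dentist')
doc.kermitVoice().text()
// something like 'well, i um, am going to the dentist'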
| Concepts | API | Plugins |
| ------------- | :-----------------: | --------: |
| Accuracy | Accessors | Adjectives |
| Caching | Constructor-methods | Dates |
| Case | Contractions | Export |
| Filesize | Insert | Hash |
| Internals | Json | Html |
| Justification | Character Offsets | Keypress |
| Lexicon | Loops | Ngrams |
| Match-syntax | Match | Numbers |
| Performance | Nouns | Paragraphs |
| Plugins | Output | Scan |
| Projects | Selections | Sentences |
| Tagger | Sorting | Syllables |
| Tags | Split | Pronounce |
| Tokenization | Text | Strict |
| Named-Entities | Utils | Penn-tags |
| Whitespace | Verbs | Typeahead |
| World data | Normalization | Sweep |
| Fuzzy-matching | Typescript | Mutation |
| Root-forms | | |
These are some helpful extensions:
npm install compromise-dates
.dates() - things like 'June 8th' or '03/03/18'
.durations() - things like '2 weeks' or '5mins'
.times() - things like '4:30pm' or 'half past five'
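a rough sketch of using it (the text is ours, and the output shape is abbreviated):
import nlp from 'compromise'
import dates from 'compromise-dates'
nlp.extend(dates)

let doc = nlp('the meeting is on June 8th at 4:30pm')
doc.dates().json()
// [{ text: 'June 8th at 4:30pm', ... }] - with parsed start/end information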
npm install compromise-stats
.tfidf({}) - rank words by frequency and uniqueness
.ngrams({}) - list all repeating sub-phrases, by word-count
.unigrams() - n-grams with one word
.bigrams() - n-grams with two words
.trigrams() - n-grams with three words
.startgrams() - n-grams including the first term of a phrase
.endgrams() - n-grams including the last term of a phrase
.edgegrams() - n-grams including the first or last term of a phrase
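for example, a quick sketch of pulling repeated phrases (our own text):
import nlp from 'compromise'
import stats from 'compromise-stats'
nlp.extend(stats)

let doc = nlp('you scream, I scream, we all scream for ice cream')
doc.ngrams({ min: 2 })
// sub-phrases of two-or-more words, with their counts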
npm install compromise-syllables
npm install compromise-wikipedia
we're committed to typescript/deno support, both in main and in the official-plugins:
import nlp from 'compromise'
import stats from 'compromise-stats'
const nlpEx = nlp.extend(stats)
nlpEx('This is type safe!').ngrams({ min: 1 })
slash-support:
We currently split slashes up as different words, like we do for hyphens, so things like this don't work:
nlp('the koala eats/shoots/leaves').has('koala leaves') // false
inter-sentence match:
By default, sentences are the top-level abstraction.
Inter-sentence, or multi-sentence matches aren't supported without a plugin:
nlp("that's it. Back to Winnipeg!").has('it back')//false
nested match syntax:
the dangerous beauty of regex is that you can recurse indefinitely.
Our match syntax is much weaker. Things like this are not (yet) possible:
doc.match('(modern (major|minor))? general')
complex matches must be achieved with successive .match() statements.
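a rough sketch of that workaround, with our own phrasing:
let doc = nlp('I am the very model of a modern major general')
// grab the wider phrase when it exists, otherwise fall back to the required word
let wider = doc.match('modern (major|minor) general')
let result = wider.found ? wider : doc.match('general')
result.text()
// 'modern major general'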
dependency parsing: Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do. We should! Help wanted with this.
en-pos - very clever javascript pos-tagger by Alex Corvi
naturalNode - fancier statistical nlp in javascript
winkJS - POS-tagger, tokenizer, machine-learning in javascript
dariusk/pos-js - fastTag fork in javascript
compendium-js - POS and sentiment analysis in javascript
nodeBox linguistics - conjugation, inflection in javascript
reText - very impressive text utilities in javascript
superScript - conversation engine in js
jsPos - javascript build of the time-tested Brill-tagger
spaCy - speedy, multilingual tagger in C/python
Prose - quick tagger in Go by Joseph Kato
TextBlob - python tagger
MIT