
ScalaMUD – Consuming Java from Scala and NLP Tagging

by Kevin Hoffman on February 15th, 2012

Last night I upgraded ScalaMUD’s POM file to point to the recently released Akka 2.0-RC1. I was previously using M3 and was happy to note that all of my Akka 2.0 code continued working just fine without change from M3 to RC1. If the Akka RC is like most other RCs, there should be no further API changes, only fixes and tightening.

While I had the MUD code open I decided to start working on the problem of accepting player input. Sure, I have a socket reader that accepts text from players but what does one do with this text?

In the old days, I would’ve tokenized the string. By tokenized here I mean just splitting it blindly on spaces. Then I would have considered the first word in the array to be the verb and then dispatched the remaining parameters to some function in the MUD code that knows how to respond to that verb. For example, if I typed kill dragon then I would’ve tagged kill as the verb and dragon as the “rest”. This would have eventually found its way to some kill() method on a player that takes an array of strings as parameters.
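
Just for illustration, here’s a minimal sketch of that old-style approach; the Player class and its kill method are hypothetical stand-ins, not actual MUD code:

class Player {
	// Hypothetical verb handler: receives whatever followed the verb as raw tokens
	def kill(args: Array[String]) = println("Attacking: " + args.mkString(" "))
}

object OldSchoolParser {
	def dispatch(player: Player, input: String) = {
		val tokens = input.split(" ") // blind split on spaces
		val verb = tokens.head        // first word is assumed to be the verb
		verb match {
			case "kill" => player.kill(tokens.tail) // the "rest" goes along untouched
			case _      => println("Huh?")
		}
	}
}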

Thankfully this isn’t the old days.

Instead what I did was declare a Maven dependency on the Stanford NLP (Natural Language Processing) project. To be specific, I wanted to use the Stanford non-linear Parts of Speech tagger. Why should I deal with parsing strings in a dumb way when someone else has spent years creating a powerful, well-trained NLP engine that can tag every sentence my player types with parts of speech?

This way, instead of relying on forcing players to type in pidgin dialects (e.g. kill dragon or cast spell or move north) I can let them type complete English sentences (if they want). I will then tag those sentences with the appropriate parts of speech and infer what they wanted to do from that.

I want this parsing to take place in the background. Once the player’s sentence has been enriched with parts of speech, I want to send the enriched sentence back to whatever typed it so that the command can be dispatched. To do this, I created a new Actor called Commander:

package com.kotancode.scalamud.core

import akka.actor._
import akka.routing._

import com.kotancode.scalamud.core.lang.EnrichedWord
import java.util.ArrayList
import edu.stanford.nlp.ling.Sentence
import edu.stanford.nlp.ling.TaggedWord
import edu.stanford.nlp.ling.HasWord
import edu.stanford.nlp.tagger.maxent.MaxentTagger
import scala.collection.JavaConverters._

class Commander extends Actor {
	def receive = {
		case s: String => {
			val words = s.split(" ")
			val wordList = new java.util.ArrayList[String]()
			for (elem <- words) wordList.add(elem)

			val sentence = Sentence.toWordList(wordList)
			val taggedSentence = Commander.tagger.tagSentence(sentence).asScala.toList

			val enrichedWords = new ArrayList[EnrichedWord]
			for (tw <- taggedSentence) {
				val ew = EnrichedWord(tw)
				println(ew)
				enrichedWords.add(ew)
			}
		}
	}
}

object Commander {
	val tagger = new MaxentTagger("models/english-bidirectional-distsim.tagger")
}

At this point I’m just building the list of enriched words; I’m not actually sending the command back to the player yet (I’ll do that tonight or tomorrow, time permitting... as always, you can check out the GitHub repo for the latest changes to the MUD). One of the interesting bits here is how I’m using a Java library from Scala. This is usually a pretty painless task, but sometimes there are issues. In this case, the Stanford NLP class Sentence has a bunch of overloads of the toWordList method. Java code picks the right overload without trouble, but Scala can’t if I just rely on type inference and default Scala types. To get it to pick the right toWordList method I had to manually construct an ArrayList[String], because passing an Array[String] doesn’t give Scala enough information to choose an overload. It’s a little annoying, but if I can keep Scala-to-Java bridge points like this to a minimum it’s not bad.
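
As a rough sketch of that pending step, here’s how the enriched words might get sent back to whatever actor sent the raw text; EnrichedSentence, ReplyingCommander, and tagAndEnrich are all hypothetical names, not code from the repo:

import akka.actor._
import com.kotancode.scalamud.core.lang.EnrichedWord

// Hypothetical message type wrapping the tagged words
case class EnrichedSentence(words: List[EnrichedWord])

class ReplyingCommander extends Actor {
	def receive = {
		case s: String =>
			val enriched = tagAndEnrich(s)
			// In Akka 2.0, sender is the ActorRef that sent the current message,
			// so the socket-handling actor would get the enriched sentence back
			sender ! EnrichedSentence(enriched)
	}

	// Stand-in for the tagging logic already shown in Commander
	def tagAndEnrich(s: String): List[EnrichedWord] = Nil
}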

The flip side of this is that the tagger hands me back a plain Java list (from the tagSentence call), which doesn’t support pretty Scala-native iteration because it has no foreach method, the underpinning for all of Scala’s iteration syntactic sugar. To deal with that, I imported the JavaConverters implicits so I could call asScala, and from there toList, which gives me a nice Scala list I can iterate over easily.
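
Here is the same conversion in isolation, just to show the JavaConverters pattern on its own (nothing MUD-specific here):

import scala.collection.JavaConverters._

object ConvertersDemo extends App {
	val javaList = new java.util.ArrayList[String]()
	javaList.add("kill")
	javaList.add("dragon")

	// asScala wraps the java.util.List as a scala.collection.mutable.Buffer;
	// toList then copies it into an immutable Scala List with foreach, map, etc.
	val scalaList = javaList.asScala.toList
	scalaList.foreach(println)
}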

Here’s some sample output when I connect to the MUD and enter a sample sentence, which is then tokenized and tagged with parts of speech:

[EnrichedWord: word=attack, tag=VB, pos=Verb]
[EnrichedWord: word=the, tag=DT, pos=DontCare]
[EnrichedWord: word=green, tag=JJ, pos=Adjective]
[EnrichedWord: word=dragon, tag=NN, pos=Noun]
[EnrichedWord: word=with, tag=IN, pos=DontCare]
[EnrichedWord: word=the, tag=DT, pos=DontCare]
[EnrichedWord: word=yellow, tag=JJ, pos=Adjective]
[EnrichedWord: word=sword, tag=NN, pos=Noun]

The real goal here is that I will be taking the tagged nouns in the sentence and scanning through the player’s inventory and the environment in which the player stands for objects which have names that match the nouns and then using adjectives to disambiguate them if collisions occur. That way, when I type “kick blue bottle” I will be able to scan the surroundings for objects called “bottle” and if I find more than one, I’ll only gather up the ones that are blue.
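
A rough sketch of how that lookup might work, with a hypothetical GameObject type standing in for whatever the MUD’s items and creatures end up being:

// Hypothetical: anything in the room or the player's inventory that can be targeted
case class GameObject(name: String, adjectives: Set[String])

object TargetResolver {
	// Keep objects whose name matches a tagged noun; if more than one remains,
	// narrow the list further using the tagged adjectives.
	def resolve(candidates: Seq[GameObject], nouns: Seq[String], adjectives: Seq[String]): Seq[GameObject] = {
		val byNoun = candidates.filter(obj => nouns.contains(obj.name))
		if (byNoun.size <= 1) byNoun
		else byNoun.filter(obj => adjectives.exists(obj.adjectives.contains))
	}
}

For “kick blue bottle” the nouns would be Seq("bottle") and the adjectives Seq("blue"), so two bottles in the same room collapse down to just the blue one.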

In case you’re wondering, here’s what the EnrichedWord class looks like; it includes some helper code that identifies only the parts of speech I care about:

package com.kotancode.scalamud.core.lang

import edu.stanford.nlp.ling.TaggedWord

sealed trait PartOfSpeech
case object Noun extends PartOfSpeech
case object Verb extends PartOfSpeech
case object Adjective extends PartOfSpeech
case object DontCare extends PartOfSpeech

class EnrichedWord(value:String, tag:String, val pos:PartOfSpeech) extends TaggedWord(value, tag) {

	override def toString = "[EnrichedWord: word=" + value +", tag=" + tag + ", pos=" + pos + "]"
}

object EnrichedWord {
	def apply(hw: TaggedWord) = {
		val ew = new EnrichedWord(hw.value, hw.tag, rootTypeOf(hw.tag))
		ew
	}

	def rootTypeOf(s:String) = {
		s match {
			case "VB" | "VBD" | "VBG" | "VBN" | "VBP" | "VBZ" => Verb
			case  "NN" | "NNS" | "NNP" | "NNPS" => Noun
			case "JJ" | "JJR" | "JJS" => Adjective
			case _ => DontCare
		}
	}
}
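
A quick usage example of the companion object’s apply method, just to show the mapping in action (EnrichedWordDemo is only a name for this snippet):

import edu.stanford.nlp.ling.TaggedWord
import com.kotancode.scalamud.core.lang.EnrichedWord

object EnrichedWordDemo extends App {
	val ew = EnrichedWord(new TaggedWord("dragon", "NN"))
	println(ew) // prints [EnrichedWord: word=dragon, tag=NN, pos=Noun]
}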

Note that the EnrichedWord Scala class inherits from the Stanford TaggedWord Java class.

I really, really love the syntax of the string pattern matcher I use to obtain the POS root (adjective, verb, noun, don’t care) from the Penn Treebank tags used by the Stanford NLP POS tagger.

The takeaway I got from this exercise is further reinforcement of my rule to never re-invent the wheel: there are wheel experts out there who have dedicated their lives and careers to building wheels more awesome than I could ever hope to build. Hence I declare a Maven dependency on the NLP library and, in a single night, I’ve got a MUD that can intelligently POS-tag player sentences, which I can then use to identify potential targets of player commands. In addition, I don’t have to create a pidgin dialect for interacting with the MUD. English works, and the MUD should be able to deal with “I kill the dragon with the blue sword because I am the shizzle” with the same ease as “kill dragon with sword”.


  • George Moschovitis

    Thanks for the pointer to that NLP library, great article series btw!

  • http://cygal.myopenid.com/ Quentin

    I started doing this.

    Then I said to myself: “What if I use another word which also means kill?”. It was OK, since I could add a list of verbs with the same meanings.

    Then I wondered: “What if the meaning of the verb used is different?” (e.g. another meaning of kill.v, see http://wordnetweb.princeton.edu/perl/webwn?s=kill, or its synonyms). And what if I want to allow more elaborate constructs such as “Don’t kill the pony but destroy that orc!”?

    Then I started a PhD on Word Sense Disambiguation and Semantic Role Labeling.

  • Michael

    Has Stanford NLP fixed their threading issues?

    I had to discount it a couple of years ago when looking for a JVM NLP library, as the models weren’t thread-safe, which meant I had to serialize access to a loaded model; in the end I went with OpenNLP.

    • Kevin Hoffman

      Their threading issues, if they still exist, don’t really bother me because each text command is processed by a separate queue on a separate thread by virtue of Akka message passing semantics.