KijiChopsticks provides a simple data analysis language using KijiSchema and Scalding.
KijiChopsticks requires Apache Maven 3 to build. It may be built by running the command
mvn clean package
from the root of the KijiChopsticks repository. This will create a release in the target directory.
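Before moving on, it can help to confirm that the build produced a jar. The following is a minimal sketch and not part of KijiChopsticks: `find_release_jar` is a hypothetical helper, and the assumption that the jar sits directly under the target directory with a `kiji-chopsticks-` prefix is inferred from the jar path used in the commands later in this document.

```shell
# Hypothetical helper: print the first kiji-chopsticks jar found in a
# directory (default: target), or fail with a message if none exists.
# The file name pattern is an assumption, not something the build guarantees.
find_release_jar() {
  dir="${1:-target}"
  for jar in "$dir"/kiji-chopsticks-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -f fails.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no kiji-chopsticks jar found in $dir; did 'mvn clean package' succeed?" >&2
  return 1
}
```

Running `find_release_jar target` from the repository root should print the jar path if the build succeeded, and return a non-zero status otherwise.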
The following instructions assume that a functional KijiBento minicluster has been set up and is running. This example uses the 20Newsgroups dataset.
First, create and populate the 'words' table:
kiji-schema-shell --file=words.ddl
kiji jar target/kiji-chopsticks-0.1.0-SNAPSHOT.jar org.kiji.chopsticks.NewsgroupLoader \
kiji://.env/default/words <path/to/newsgroups/root/>
Run the word count, writing the output to HDFS:
kiji jar target/kiji-chopsticks-0.1.0-SNAPSHOT.jar \
com.twitter.scalding.Tool org.kiji.chopsticks.NewsgroupWordCount \
--input kiji://.env/default/words --output ./wordcounts.tsv --hdfs
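To get a feel for what a word count computes, the same idea can be sketched locally with standard Unix tools. This is only an illustration and is not how NewsgroupWordCount is implemented. It assumes that words are separated by whitespace, which may differ from the job's actual tokenization, and `count_words` is a hypothetical name.

```shell
# Hypothetical local word count: split stdin on whitespace, then count
# how often each token occurs. Output format: "<token><TAB><count>".
count_words() {
  # tr puts one token per line, grep drops empty lines, and
  # sort + uniq -c prefixes each distinct token with its count.
  tr -s '[:space:]' '\n' |
    grep -v '^$' |
    sort |
    uniq -c |
    awk '{ print $2 "\t" $1 }'
}

# Small made-up input: "foo" appears twice, "bar" and "baz" once each.
printf 'foo bar\nfoo baz\n' | count_words
```

The real job reads its input from the Kiji table and runs on the cluster, so this sketch is useful only as a mental model for the shape of the output.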
Check the results of the job:
hadoop fs -cat ./wordcounts.tsv/part-00000 | grep "\<foo\>"
You should see something similar to:
"'foo'\''bar'". 1
"foo"); 1
"foo'bar", 1
"foo.txt 1
"foo.txt" 1
"foo:0", 1
<foo> 1
<foo@cs.rice.edu> 1
>foo 1
`foo' 1
bar!foo!frotz 1
foo 2
foo%bar.bitnet@mitvma.mit.edu 1
foo-boo 1
foo/file 1
foo: 1
foo@mhfoo.pc.my 1
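The `\<foo\>` pattern in the grep command above matches foo only where it stands as a whole word, which is why entries such as `foo-boo` and `<foo>` appear in the results while a token like `foobar` would not. The sketch below shows this on a few made-up lines. The sample data and the function names are hypothetical, and `\<` and `\>` are GNU grep extensions that may behave differently in other grep implementations.

```shell
# Hypothetical sample lines in the job's output format: a token, a tab, a count.
sample_counts() {
  printf '%s\t%s\n' 'foo' 2 'foo-boo' 1 'foobar' 1 'food' 1 '<foo>' 1
}

# Keep only lines where the given word appears with a word boundary on both
# sides. Letters, digits and underscore count as word characters, so
# punctuation such as '-' or '<' ends a word.
match_whole_word() {
  grep "\<$1\>"
}

# "foobar" and "food" are filtered out; the other three lines are kept.
sample_counts | match_whole_word foo
```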