This project is pretty chaotically developed, as this is the easiest flow for me.
None of this code is meant for "general consumption".
Hadoop release archive: http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/
Extracting images from video using ffmpeg: http://linuxers.org/tutorial/how-extract-images-video-using-ffmpeg
For installing on Debian from packages: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_4.html
Hadoop JobTracker UI: http://10.240.57.179:50030/jobtracker.jsp
HBase Master UI: http://10.240.57.179:60010/master-status
Install Java 7 on Debian
su -
echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886
apt-get update
apt-get install oracle-java7-installer
exit
Hadoop 1.1.2
http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/hadoop-1.1.2.tar.gz
HBase 0.94.10
http://ftp.ps.pl/pub/apache/hbase/hbase-0.94.10/hbase-0.94.10.tar.gz
hbase(main):002:0> create 'frames', {NAME => 'youtube'}
0 row(s) in 8.2020 seconds

hbase(main):003:0> scan 'frames'
ROW                                COLUMN+CELL
0 row(s) in 0.2590 seconds
hbase(main):005:0> put 'frames', 'asdf', 'youtube:id', 'asdf'
0 row(s) in 0.1150 seconds
hbase(main):006:0> put 'frames', 'asdf', 'youtube:meta', "{metadata: jsonstuff}"
0 row(s) in 0.0180 seconds
hbase(main):007:0> scan 'frames'
ROW COLUMN+CELL
asdf column=youtube:id, timestamp=1375312436843, value=asdf
asdf column=youtube:meta, timestamp=1375312452591, value={metadata: jsonstuff}
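The same rows can be written from code as well; here's a minimal sketch against the HBase 0.94 client API (table, family and qualifier names as in the shell session above; error handling omitted):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object PutFrameMeta extends App {
  // picks up hbase-site.xml from the classpath
  val conf = HBaseConfiguration.create()
  val table = new HTable(conf, "frames")

  val put = new Put(Bytes.toBytes("asdf")) // row key
  put.add(Bytes.toBytes("youtube"), Bytes.toBytes("id"), Bytes.toBytes("asdf"))
  put.add(Bytes.toBytes("youtube"), Bytes.toBytes("meta"), Bytes.toBytes("""{metadata: jsonstuff}"""))

  table.put(put)
  table.close()
}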
$ hadoop fs -ls /oculus/source
Found 6 items
-rw-r--r-- 3 kmalawski supergroup 1161390621 2013-08-01 00:32 /oculus/source/-A1e_vS5gn4.webm.seq
-rw-r--r-- 3 kmalawski supergroup 78396304 2013-08-01 00:45 /oculus/source/0a78kzAffb4.webm.seq
-rw-r--r-- 3 kmalawski supergroup 3319289461 2013-08-01 00:36 /oculus/source/G8dMlGq6CBE.webm.seq
-rw-r--r-- 3 kmalawski supergroup 1852768320 2013-08-01 00:25 /oculus/source/cj2uhyfVNmQ.webm.seq
-rw-r--r-- 3 kmalawski supergroup 5868144017 2013-08-01 00:28 /oculus/source/fPAaYvL5Vpw.webm.seq
-rw-r--r-- 3 kmalawski supergroup 128484760 2013-08-01 00:34 /oculus/source/gVi_2lHBVhQ.webm.seq
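To sanity-check one of these .seq files from Scala, the Hadoop 1.x SequenceFile.Reader can be used roughly as below. The Text key / BytesWritable value types are an assumption; substitute whatever types the writer actually emitted:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object PeekSeqFile extends App {
  val conf = new Configuration()
  conf.set("fs.default.name", "hdfs://10.240.57.179:9000") // Hadoop 1.x style

  val fs = FileSystem.get(conf)
  val reader = new SequenceFile.Reader(fs, new Path("/oculus/source/0a78kzAffb4.webm.seq"), conf)

  // assumed types: frame id as Text, raw frame bytes as BytesWritable
  val key = new Text()
  val value = new BytesWritable()
  var n = 0
  while (n < 10 && reader.next(key, value)) { // peek at the first 10 records
    println(s"$key -> ${value.getLength} bytes")
    n += 1
  }
  reader.close()
}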
sbt shell
> project scalding
> run pl.project13.scala.oculus.job.VideoToPicturesJob --hdfs --input hdfs://10.240.57.179:9000/oculus/source/cj2uhyfVNmQ.webm --output ignore.out
sbt shell
> project scalding
> run pl.project13.scala.oculus.job.WordCountJob --hdfs --input hdfs://10.240.57.179:9000/demotext.txt --output hdfs://10.240.57.179:9000/wordcount.out
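The actual jobs live in the repo under pl.project13.scala.oculus.job; for orientation, a Scalding job of the word-count shape looks roughly like this (fields API):

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                  // one tuple per line, in field 'line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }                             // count occurrences per word
    .write(Tsv(args("output")))
}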
Remember how to get results:
hadoop fs -getmerge /ignore.out ignore.out
Examples are in: /usr/share/doc/libphash0-dev/examples
g++ -g -O3 -I. -pthread -I /usr/local/include image_hashes.cpp -L/usr/local/lib -lpHash -o image_hashes
Compute mh_hash and dct_hash on a given image:
vagrant@vagrant-ubuntu-precise-32:/tmp$ ./image_hashes compr/jasper_johns.jpg
mh_hash: 86 15 7 203 128 57 28 14 7 177 209 106 140 33 192 224 115 100 15 87 96 225 126 7 35 96 108 241 142 7 31 0 56 29 44 109 31 171 255 31 143 199 227 96 45 3 206 48 252 126 63 31 13 45 224 122 56 159 142 50 227 101 100 60 40 243 30 14 1 31 128 104
dct_hash: 10510202026326186564
Use these values to build keys in your db.
Differences between hashes are calculated as Hamming Distance ( http://en.wikipedia.org/wiki/Hamming_distance ).
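For the 64-bit dct_hash that comparison boils down to counting the bits on which two hashes differ; a minimal sketch in Scala:

// pHash's ulong64 maps bit-for-bit onto a JVM Long; XOR and bitCount
// operate on the raw bits, so (un)signedness doesn't matter here.
def hammingDistance(h1: Long, h2: Long): Int =
  java.lang.Long.bitCount(h1 ^ h2)

// identical frames => 0; visually similar frames => small distances

For the mh_hash (an array of bytes) the same idea applies across the whole byte array.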
This calculates pHashes of all the seq files (well, here just one):
run pl.project13.scala.oculus.job.HashHistVideoSeqFilesJob --hdfs --input hdfs://10.240.57.179:9000/oculus/source/0a78kzAffb4.webm.seq --output hdfs://10.240.57.179:9000/ignore.out
Required: Scala and sbt (simple build tool); a convenient sbt launcher is https://github.com/paulp/sbt-extras
For downloading movies:
sbt
project downloader
run
For computing metrics from movies on HDFS:
sbt
project scalding
run pl.project13.scala.oculus.job.VideoToPicturesJob --hdfs --output hdfs://oculus-one.project13.pl:54310/oculus/target/Pst9a8tZbkk.webm.img-size.tsv --input hdfs://oculus-one.project13.pl:54310/oculus/source/Pst9a8tZbkk.webm.seq
For testing the cluster, the Vagrant + Puppet setup prepared here should be used: https://github.com/ktoso/vagrant-cascading-hadoop-cluster
pHash itself is packaged:
sudo apt-get install libphash0-dev libphash0