This project is pretty chaotically developed, as this is the easiest flow for me.
None of this code is meant for "general consumption".
Hadoop release archive: http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/
Extracting images from video using ffmpeg: http://linuxers.org/tutorial/how-extract-images-video-using-ffmpeg
For installing on Debian from packages: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_4.html
Hadoop JobTracker UI: http://10.240.57.179:50030/jobtracker.jsp
HBase Master UI: http://10.240.57.179:60010/master-status
Install Java 7 on Debian
su -
echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886
apt-get update
apt-get install oracle-java7-installer
exit
Hadoop 1.1.2
http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/hadoop-1.1.2.tar.gz
HBase 0.94.10
http://ftp.ps.pl/pub/apache/hbase/hbase-0.94.10/hbase-0.94.10.tar.gz
hbase(main):002:0> create 'frames', {NAME => 'youtube'}
0 row(s) in 8.2020 seconds

hbase(main):003:0> scan 'frames'
ROW                                COLUMN+CELL
0 row(s) in 0.2590 seconds
hbase(main):005:0> put 'frames', 'asdf', 'youtube:id', 'asdf'
0 row(s) in 0.1150 seconds
hbase(main):006:0> put 'frames', 'asdf', 'youtube:meta', "{metadata: jsonstuff}"
0 row(s) in 0.0180 seconds
hbase(main):007:0> scan 'frames'
ROW COLUMN+CELL
asdf column=youtube:id, timestamp=1375312436843, value=asdf
asdf column=youtube:meta, timestamp=1375312452591, value={metadata: jsonstuff}
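The same rows can be written from code as well; here's a minimal sketch against the HBase 0.94 client API (table, family and qualifier names as in the shell session above; error handling omitted):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object PutFrameMeta extends App {
  // picks up hbase-site.xml from the classpath
  val conf = HBaseConfiguration.create()
  val table = new HTable(conf, "frames")

  val put = new Put(Bytes.toBytes("asdf")) // row key
  put.add(Bytes.toBytes("youtube"), Bytes.toBytes("id"), Bytes.toBytes("asdf"))
  put.add(Bytes.toBytes("youtube"), Bytes.toBytes("meta"), Bytes.toBytes("""{metadata: jsonstuff}"""))

  table.put(put)
  table.close()
}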
$ hadoop fs -ls /oculus/source
Found 6 items
-rw-r--r-- 3 kmalawski supergroup 1161390621 2013-08-01 00:32 /oculus/source/-A1e_vS5gn4.webm.seq
-rw-r--r-- 3 kmalawski supergroup 78396304 2013-08-01 00:45 /oculus/source/0a78kzAffb4.webm.seq
-rw-r--r-- 3 kmalawski supergroup 3319289461 2013-08-01 00:36 /oculus/source/G8dMlGq6CBE.webm.seq
-rw-r--r-- 3 kmalawski supergroup 1852768320 2013-08-01 00:25 /oculus/source/cj2uhyfVNmQ.webm.seq
-rw-r--r-- 3 kmalawski supergroup 5868144017 2013-08-01 00:28 /oculus/source/fPAaYvL5Vpw.webm.seq
-rw-r--r-- 3 kmalawski supergroup 128484760 2013-08-01 00:34 /oculus/source/gVi_2lHBVhQ.webm.seq
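To sanity-check one of these .seq files from Scala, the Hadoop 1.x SequenceFile.Reader can be used roughly as below. The Text key / BytesWritable value types are an assumption; substitute whatever types the writer actually emitted:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object PeekSeqFile extends App {
  val conf = new Configuration()
  conf.set("fs.default.name", "hdfs://10.240.57.179:9000") // Hadoop 1.x style

  val fs = FileSystem.get(conf)
  val reader = new SequenceFile.Reader(fs, new Path("/oculus/source/0a78kzAffb4.webm.seq"), conf)

  // assumed types: frame id as Text, raw frame bytes as BytesWritable
  val key = new Text()
  val value = new BytesWritable()
  var n = 0
  while (n < 10 && reader.next(key, value)) { // peek at the first 10 records
    println(s"$key -> ${value.getLength} bytes")
    n += 1
  }
  reader.close()
}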
sbt shell
> project scalding
> run pl.project13.scala.oculus.job.VideoToPicturesJob --hdfs --input hdfs://10.240.57.179:9000/oculus/source/cj2uhyfVNmQ.webm --output ignore.out
sbt shell
> project scalding
> run pl.project13.scala.oculus.job.WordCountJob --hdfs --input hdfs://10.240.57.179:9000/demotext.txt --output hdfs://10.240.57.179:9000/wordcount.out
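The actual jobs live in the repo under pl.project13.scala.oculus.job; for orientation, a Scalding job of the word-count shape looks roughly like this (fields API):

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                  // one tuple per line, in field 'line
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }                             // count occurrences per word
    .write(Tsv(args("output")))
}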
Remember how to get results:
hadoop fs -getmerge /ignore.out ignore.out
Examples are in: /usr/share/doc/libphash0-dev/examples
g++ -g -O3 -I. -pthread -I /usr/local/include image_hashes.cpp -L/usr/local/lib -lpHash -o image_hashes
Compute mh_hash and dct_hash on a given image:
vagrant@vagrant-ubuntu-precise-32:/tmp$ ./image_hashes compr/jasper_johns.jpg
mh_hash: 86 15 7 203 128 57 28 14 7 177 209 106 140 33 192 224 115 100 15 87 96 225 126 7 35 96 108 241 142 7 31 0 56 29 44 109 31 171 255 31 143 199 227 96 45 3 206 48 252 126 63 31 13 45 224 122 56 159 142 50 227 101 100 60 40 243 30 14 1 31 128 104
dct_hash: 10510202026326186564
Use these values to build keys in your db.
Differences between hashes are calculated as Hamming Distance ( http://en.wikipedia.org/wiki/Hamming_distance ).
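For the 64-bit dct_hash that comparison boils down to counting the bits on which two hashes differ; a minimal sketch in Scala:

// pHash's ulong64 maps bit-for-bit onto a JVM Long; XOR and bitCount
// operate on the raw bits, so (un)signedness doesn't matter here.
def hammingDistance(h1: Long, h2: Long): Int =
  java.lang.Long.bitCount(h1 ^ h2)

// identical frames => 0; visually similar frames => small distances

For the mh_hash (an array of bytes) the same idea applies across the whole byte array.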
This calculates pHashes of all the seq files (well, here just one):
run pl.project13.scala.oculus.job.HashHistVideoSeqFilesJob --hdfs --input hdfs://10.240.57.179:9000/oculus/source/0a78kzAffb4.webm.seq --output hdfs://10.240.57.179:9000/ignore.out
Required: Scala and sbt (simple build tool); a convenient sbt launcher is https://github.com/paulp/sbt-extras
For downloading movies:
sbt
project downloader
run
For computing metrics from movies on HDFS:
sbt
project scalding
run pl.project13.scala.oculus.job.VideoToPicturesJob --hdfs --output hdfs://oculus-one.project13.pl:54310/oculus/target/Pst9a8tZbkk.webm.img-size.tsv --input hdfs://oculus-one.project13.pl:54310/oculus/source/Pst9a8tZbkk.webm.seq
For testing the cluster, the Vagrant + Puppet setup prepared here should be used: https://github.com/ktoso/vagrant-cascading-hadoop-cluster
pHash itself is packaged:
sudo apt-get install libphash0-dev libphash0