GitHub - vladaionescu/MapReduce-Thorn: MapReduce implementation for Thorn

MapReduce
---------


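Example usage (word count):

Run the system with five worker nodes (w1 to w5):

    make SMR_WORKER_NODES="w1 w2 w3 w4 w5" -j run

Then submit a job from the Erlang shell: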

MapFun = fun ({_Name, Document}) ->
                  dict:to_list(lists:foldl(fun (Word, Dict) ->
                                                   dict:update_counter(Word, 1, Dict)
                                           end, dict:new(),
                                           string:tokens(Document, " ")))
         end.

ReduceFun = fun ({_Word, CountList}) -> lists:sum(CountList) end.

{ok, Id} = smr_master:new_job(MapFun, ReduceFun, []).

smr_master:add_input(Id, [{abc, "Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map."}]).
smr_master:add_input(Id, [{def, "MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a check to ensure that there are not parallel conflicting threads running. When files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects)."},
                          {xyz, "A common performance measurement of a network file system is the amount of time needed to satisfy service requests. In conventional systems, this time consists of a disk-access time and a small amount of CPU-processing time. But in a network file system, a remote access has additional overhead due to the distributed structure. This includes the time to deliver the request to a server, the time to deliver the response to the client, and for each direction, a CPU overhead of running the communication protocol software. The performance of a network file system can be viewed as one dimension of its transparency; to be fully equivalent, it would need to be comparable to that of a local disk."}]).

ok = smr_master:do_job(Id).
Result = smr_master:whole_result(Id).

lists:foreach(fun ({Word, Count}) -> io:format("~s -> ~p~n", [Word, Count]) end, Result).
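
As a rough local cross-check of the counts, the tokenization used by MapFun
(split on spaces only, punctuation left attached to words) can be approximated
with standard Unix tools. This is only a sketch: it counts across everything on
stdin, also treats newlines as separators, and does not involve smr_master at
all. count_words is a hypothetical helper name.

```shell
#!/bin/sh
# Count space-separated words, keeping punctuation attached
# (so "values." and "values" are counted as different words).
count_words() {
    tr -s ' ' '\n' |                      # one word per line
        grep -v '^$' |                    # drop empty lines
        LC_ALL=C sort |                   # group identical words together
        uniq -c |                         # prefix each word with its count
        awk '{ print $2 " -> " $1 }'      # same "Word -> Count" layout as above
}

printf '%s\n' "a list of values and a list of keys" | count_words
```

To compare against a job, feed the same document strings to count_words and
check the lines against the io:format output.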


------------------------------- Thorn Integration ------------------------------

To use the adapted interpreter:

- Create fisher.jar using the makefile in lib/fisher.
- Copy fisher.jar over classes/fisher.jar in your local Thorn interpreter.
- Add REPOSITORY_PATH/lib/fisher/jinterface-1.5.3.jar to your THORNJARS
  environment variable, for example:

    export THORNJARS=$THORNJARS:REPOSITORY_PATH/lib/fisher/jinterface-1.5.3.jar

where REPOSITORY_PATH is the location of the repository on your machine.
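
The export above leaves a leading colon in THORNJARS when the variable was
previously unset, and adds a duplicate entry each time it is run. A minimal
shell sketch that avoids both follows. The default checkout location
($HOME/MapReduce-Thorn) is a hypothetical placeholder, and whether the
interpreter is sensitive to empty or duplicate entries is not something this
README states.

```shell
#!/bin/sh
# Assumption: REPOSITORY_PATH defaults to a hypothetical checkout location.
REPOSITORY_PATH="${REPOSITORY_PATH:-$HOME/MapReduce-Thorn}"
JINTERFACE_JAR="$REPOSITORY_PATH/lib/fisher/jinterface-1.5.3.jar"

# Append the jar only if it is not already listed, and avoid a leading
# colon when THORNJARS is empty or unset.
case ":${THORNJARS:-}:" in
    *":$JINTERFACE_JAR:"*) ;;
    *) THORNJARS="${THORNJARS:+$THORNJARS:}$JINTERFACE_JAR" ;;
esac
export THORNJARS

echo "THORNJARS=$THORNJARS"
```

Source the snippet from your current shell rather than executing it as a
separate script, so that the exported variable persists.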
        
You can then check that you are using the modified interpreter by running the
example program:
    th -f src/thorn/integration.th -m src/thorn/MAPREDUCE.thm
    
For this to work, an Erlang node running the Erlang test module must be up on
your machine, with the same node name and cookie as those specified in:
    REPOSITORY_PATH/lib/fisher/src/fisher/runtime/mapreduce/MapReduce.java
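
One way to start a node with a given name and cookie is with the standard erl
flags -sname (or -name for a fully qualified name) and -setcookie. In the
sketch below, thorn_test and changeme are placeholders, not the values from
MapReduce.java. Which name form the Java side expects, and how the test module
is loaded, are not covered here.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with the node name and cookie
# found in MapReduce.java.
NODE_NAME="${NODE_NAME:-thorn_test}"
NODE_COOKIE="${NODE_COOKIE:-changeme}"

# -sname starts a node with a short name; switch to -name if the Java
# side connects using a fully qualified node name.
start_test_node() {
    erl -sname "$NODE_NAME" -setcookie "$NODE_COOKIE"
}

# Print the command for review; call start_test_node to launch the node.
echo "erl -sname $NODE_NAME -setcookie $NODE_COOKIE"
```

Once the node is up, running erlang:get_cookie() and node() in its shell shows
the cookie and node name actually in effect.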
    
This is just a proof of concept. It will be extended and implemented properly in
future commits.


To add functionality to the interpreter, edit the MapReduce.java file and the
src/thorn/MAPREDUCE.thm file, then rebuild fisher.jar.