This repository has been archived by the owner on Oct 14, 2020. It is now read-only.

rwachtler/ITM13TermStatistics


ITM13TermStatistics

This is a project for the Practical Software Engineering course of ITM13

Contribution guidelines

  • This repository represents the whole project; it is not a container for partial solutions. Your implementations should go into Java Resources/src.
  • Please do not create unnecessary packages. If an existing package fits, use it.
  • Expect code reviews: submitted code should be reusable, easy to understand, and well documented.
  • Try your best to produce "clean" code.
  • Always attach a description to your pull request, so that @MICSTI or I know what it aims to do without having to dig through your code or commit messages.

Package Structure

  • The root path for Java files is src/at/fhj/itm/pswe
  • Package PageCrawler: contains all files and the entire functionality of the page crawler
    • Subpackages LinkCrawler and WordAnalyzer, according to the different tasks of the algorithm
      • Every subpackage contains related packages such as Model, Business, Helper, ...
  • Package Database: contains everything related to database access
    • Examples would be DAOs or connection classes.
  • Package REST: like the standard WildFly package, contains the endpoints
    • May contain Helper and Business packages to format data

"Filename too long" problem under Windows

If you get the "Filename too long" error message while cloning the repository on a Windows system, the following git config command fixes it.

  • git config --system core.longpaths true

How to work with gulp

When editing front-end files (/WebContent directory), please use the gulp automation toolkit. Please do not edit the .min files; they are generated by gulp.

What you need to do

  • Install gulp globally using the node package manager
    • $ npm install --global gulp
  • Run gulp inside the /WebContent directory
    • $ gulp

Result-File

  • Location: Wildfly-Path/bin/result/crawl
  • Naming: subdomain_domain_tld_MM_DD_YYYY-HH_MM.txt

Example for pswengi.bamb.at, with crawling started on 23.11.2015 at 14:21:

pswengi_bamb_at_11_23_2015-14_21.txt

The filename can be obtained from the Init_LinkCrawler object with .getFilename();
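The naming scheme above can be reproduced roughly as follows. This is only a sketch: the class ResultFileName and its build method are hypothetical names for illustration, not the actual code behind Init_LinkCrawler, and it assumes the host part of the URL is what gets turned into subdomain_domain_tld.

```java
import java.net.URI;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the project: derives a result-file name
// from the crawled URL and the crawl start time.
final class ResultFileName {
    // MM_DD_YYYY-HH_MM from the naming rule, written as a Java date pattern.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("MM_dd_yyyy-HH_mm");

    static String build(String url, LocalDateTime crawlStart) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        // Assumption: every dot in the host becomes an underscore.
        return host.replace('.', '_') + "_" + crawlStart.format(STAMP) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(
                build("http://pswengi.bamb.at", LocalDateTime.of(2015, 11, 23, 14, 21)));
    }
}
```

In real code, prefer .getFilename() on the Init_LinkCrawler object so that the name always matches what the crawler wrote.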

Structure of the result file generated by the crawler

Line 1: URL

http://pswengi.bamb.at

Line 2: Date of crawling the page (in dd:MM:yyyy)

13:12:2015

Lines 3/5/7...: URL of the text that is located in the next line

http://pswengi.bamb.at/article1.html

Lines 4/6/8...: Text gathered from the "current" URL. Repeats for each visited URL

Here are some random generated Word :P

Last line: How long the crawler was running (in hh:mm:ss)

0:0:43
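A reader for this layout could look roughly like the sketch below. The class CrawlResult and its fields are hypothetical names, not part of the project. It assumes the file contains exactly the lines described above, with no blank lines in between, and it keeps the date and duration as plain text rather than parsing them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one parsed result file.
final class CrawlResult {
    final String startUrl;   // line 1
    final String crawlDate;  // line 2, kept as text (dd:MM:yyyy)
    final List<String[]> pages = new ArrayList<>(); // each entry: {url, text}
    final String duration;   // last line, kept as text

    private CrawlResult(String startUrl, String crawlDate, String duration) {
        this.startUrl = startUrl;
        this.crawlDate = crawlDate;
        this.duration = duration;
    }

    static CrawlResult parse(List<String> lines) {
        // URL + date + duration, plus an even number of url/text lines.
        if (lines.size() < 3 || (lines.size() - 3) % 2 != 0) {
            throw new IllegalArgumentException("Unexpected number of lines: " + lines.size());
        }
        CrawlResult result =
                new CrawlResult(lines.get(0), lines.get(1), lines.get(lines.size() - 1));
        // Lines 3/5/7... hold a URL, lines 4/6/8... the text gathered from it.
        for (int i = 2; i + 1 < lines.size() - 1; i += 2) {
            result.pages.add(new String[] {lines.get(i), lines.get(i + 1)});
        }
        return result;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on the result file; the file's character encoding is not stated here, so it would need to be checked first.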

Subsites

  • Subsites (word/site overview) are called via a servlet
  • URL: TermStatistics/SiteOverview/{idOfSite}

Input Validation

  • Due to CORS, the URL check may fail even if the URL is valid
  • The crawl depth has to be at least 1
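A purely syntactic check does not depend on CORS, because it never contacts the target site. The sketch below shows one possible form of such a check together with the crawl-depth rule. The class CrawlInput is a hypothetical name, and limiting URLs to http and https is an assumption; the project's actual validation may work differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical validator, not part of the project.
final class CrawlInput {
    static boolean isValid(String url, int crawlDepth) {
        // Rule from the list above: the crawl depth has to be at least 1.
        if (crawlDepth < 1 || url == null) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            // Assumption: only absolute http/https URLs with a host are crawlable.
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }
}
```

This only checks the form of the URL; it says nothing about whether the site is reachable.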

REST-Calls

  • /rest/action/crawler/{crawlerid} -> restart the crawler for a website already saved in the database
  • /rest/article/{articleid}/words -> get all words of one article, identified by its id, with their amounts
  • /rest/article/{articleid}/words/{num} -> get a limited number of words (num) of one article, identified by its id, with their amounts
  • /rest/website -> all websites in the database
  • /rest/website/{websiteid}/articles -> all articles on this website
  • /rest/website/{websiteid}/words -> all words and their amounts on this website
  • /rest/website/{websiteid}/words/{num} -> all words and their amounts on this website, limited to num words
  • /rest/website/{websiteid}/period/10.11.2015/30.11.2015 -> words of one site in the given period
  • /rest/word/{word}/websites -> all websites containing a specific word, with the corresponding amounts
  • /rest/word/{word}/period/10.11.2015/30.11.2015 -> one word with all dates and amounts in the given period
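The two period calls take their dates in dd.MM.yyyy form. As a sketch, the paths could be assembled like this; the class RestPaths and its methods are hypothetical names, and treating the website id as an integer is an assumption. The host and context root are left out because they depend on the deployment.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for building the period paths listed above.
final class RestPaths {
    // Date format used in the period examples, e.g. 10.11.2015.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("dd.MM.yyyy");

    static String websitePeriod(int websiteId, LocalDate from, LocalDate to) {
        return "/rest/website/" + websiteId + "/period/"
                + from.format(DAY) + "/" + to.format(DAY);
    }

    static String wordPeriod(String word, LocalDate from, LocalDate to) {
        // URLEncoder targets form data, so "+" is rewritten to "%20" for use in a path.
        String encoded = URLEncoder.encode(word, StandardCharsets.UTF_8).replace("+", "%20");
        return "/rest/word/" + encoded + "/period/"
                + from.format(DAY) + "/" + to.format(DAY);
    }

    public static void main(String[] args) {
        System.out.println(
                websitePeriod(3, LocalDate.of(2015, 11, 10), LocalDate.of(2015, 11, 30)));
    }
}
```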
