jan-niestadt/BlackLab

What is BlackLab?

BlackLab is a corpus retrieval engine built on top of Apache Lucene. It allows fast, complex searches with accurate hit highlighting on large bodies of tagged and annotated text. It was developed at the Institute of Dutch Lexicology (INL) to provide a fast and feature-rich search interface on our historical and contemporary text corpora.

We're also working on BlackLab Server, a web service interface to BlackLab, so you can access it from any programming language. BlackLab Server is included in the repository as well.

BlackLab and BlackLab Server are licensed under the Apache License 2.0.

To learn how to index and search your data, see the official project site.

Learn about BlackLab's structure and internals (work in progress).

Changed: 'main' branch

The branch that corresponds to BlackLab's latest release is now called main instead of master.

You can either delete your local clone and re-clone the repository, or rename the local branch with these commands:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

Please note that dev, not main, is the default branch. This is the development branch, which should be considered unstable.

Using BlackLab with Docker

An experimental Docker setup is now provided. It is likely to change in the future.

We assume here that you are familiar with the BlackLab indexing process; see indexing with BlackLab to learn more.

Create a file named test.env with your indexing configuration:

IMAGE_VERSION=latest
BLACKLAB_FORMATS_DIR=/path/to/my/formats
INDEX_NAME=my-index
INDEX_FORMAT=my-file-format
INDEX_INPUT_DIR=/path/to/my/input-files
JAVA_OPTS=-Xmx10G

To index your data:

docker-compose --env-file test.env run --rm indexer

Now start the server:

docker-compose up -d

Your index should now be accessible at http://localhost:8080/blacklab-server/my-index.
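
To check that the server responds, you could query it from the command line. The following is a minimal sketch, not part of the BlackLab distribution: the helper names (index_url, check_index, search_hits) are hypothetical, and it assumes that curl is installed and that BlackLab Server accepts the outputformat and patt query parameters on the index and hits endpoints. Consult the BlackLab Server documentation for the parameters your version supports.

```shell
#!/bin/sh
# Base URL taken from the step above; adjust host, port and index name to your setup.
BLS_URL="http://localhost:8080/blacklab-server"

# Print the URL of an index (hypothetical helper, not part of BlackLab).
index_url() {
    printf '%s/%s' "$BLS_URL" "$1"
}

# Fetch index metadata as JSON; -f makes curl exit non-zero on HTTP errors.
# Assumption: the server honours the outputformat=json parameter.
check_index() {
    curl -sf "$(index_url "$1")?outputformat=json"
}

# Search an index for a Corpus Query Language pattern, e.g. '"the"'.
# Assumption: the hits endpoint takes the pattern in the patt parameter.
# -G sends the URL-encoded data as a query string on a GET request.
search_hits() {
    curl -sfG "$(index_url "$1")/hits" \
        --data-urlencode "patt=$2" \
        --data-urlencode "outputformat=json"
}
```

Once the containers are running, check_index my-index should print the index metadata, and search_hits my-index '"the"' should return matches for the word "the". A non-zero exit status from either function suggests the server is not up yet or the index name is wrong.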

See the Docker README for more details.

Special thanks
