# Crunching Apache Parquet Files with Apache Flink

This repo includes sample code to set up Flink dataflows that process Parquet files. The CSV datasets under `resources/` are the Restaurant Score datasets downloaded from SF OpenData. For more information, please see this post.

### Generating the Avro Model Classes

If you make any changes to the Avro schema files (`*.avsc`) under `resources/`, you should regenerate the model classes:

```sh
./compile_schemas.sh
```
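
For reference, an Avro schema file is a JSON document along the following lines. This is only an illustrative sketch: the namespace, record, and field names below are assumptions, and the actual definitions are in the schemas under `resources/`.

```json
{
  "namespace": "yigitbasi.nezih.model",
  "type": "record",
  "name": "Business",
  "fields": [
    {"name": "business_id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "address", "type": ["null", "string"], "default": null}
  ]
}
```

The generated Java classes land in the package given by the `namespace` field, which is why they must be regenerated whenever a schema changes.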

### Step 1: Converting the CSV Data Files to the Parquet Format

The command below converts the CSV files under `resources/` to the Parquet format and writes the results to the `/tmp/business`, `/tmp/violations`, and `/tmp/inspections` directories:

```sh
mvn clean package exec:java -Dexec.mainClass="yigitbasi.nezih.ConvertToParquet"
```
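
For context, here is a minimal sketch of how rows can be written out as Parquet with parquet-avro's `AvroParquetWriter`. This is not the repo's actual `ConvertToParquet` code: the inline schema, record values, and output path are illustrative, and the sketch assumes the `org.apache.parquet` artifact coordinates (older builds may use the pre-Apache `com.twitter` parquet packages).

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative inline schema; the real schemas live under resources/*.avsc.
        Schema schema = new Schema.Parser().parse(
                "{\"type\": \"record\", \"name\": \"Business\", \"fields\": ["
                + "{\"name\": \"business_id\", \"type\": \"string\"},"
                + "{\"name\": \"name\", \"type\": \"string\"}]}");

        // Write Avro records into a Parquet file under /tmp/business.
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/business/part-0.parquet"))
                             .withSchema(schema)
                             .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("business_id", "10");
            record.put("name", "Example Cafe");
            writer.write(record); // one record per CSV row in the real converter
        }
    }
}
```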

### Step 2: Running the Flink Dataflow

Build a self-contained jar with all the dependencies and run the dataflow:

```sh
mvn clean compile assembly:single
java -jar target/FlinkParquet-0.1-SNAPSHOT-jar-with-dependencies.jar
```
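
To give an idea of what the dataflow does, here is a minimal sketch of reading the Parquet files from Step 1 back into a Flink `DataSet` through Flink's Hadoop compatibility wrapper and parquet-avro's `AvroParquetInputFormat`. This sketch makes assumptions and is not the repo's actual dataflow: it reads generic Avro records, whereas the repo can use the generated model classes directly.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class FlinkParquetReadSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Wrap the Hadoop AvroParquetInputFormat so Flink can use it as a source.
        Job job = Job.getInstance();
        HadoopInputFormat<Void, GenericRecord> input = new HadoopInputFormat<>(
                new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);
        AvroParquetInputFormat.addInputPath(job, new Path("/tmp/business"));

        // Each element is a (key, record) pair; the key is always null for Parquet.
        DataSet<Tuple2<Void, GenericRecord>> businesses = env.createInput(input);
        businesses.first(5).print();
    }
}
```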
