An experimental project, handling millions of data with Dropwizard and H2.
To handle millions of views per hour with a portable database.
- There are web pages for each users.
- There are 5 millions of page views per hour.
- There are pages that shows ID and the accessed dates of the recent 10 visitors of each user.
When there are 5 millions of data per hour
- 3.6 billions / month
- 120 millions / day
- 83 thousands / min
- 1,400 / sec
If a row takes up 20 bytes,
- 72G / month
- 2.4G / day
- 100M / hour
- 1.6M / min
- 26KB / sec
Let's check the reason why each row takes up 20 bytes.
CREATE TABLE view (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
host_id INT,
visitor_id INT,
date TIMESTAMP
);
CREATE INDEX ON view (host_id);
- Used BIGINT(8 bytes) for ID because the maximum value of INT is only about 2 billions.
- We have 3.6 billion views per month.
- Used INT(4 bytes) for user IDs as it's hardly possible to surpass 2 billion users.
- Used TIMESTAMP(4 bytes) for date instead of DATETIME(8 bytes) to save space.
- We can change to DATETIME before 2038.
- Given the schema, the size of each row would be 20 bytes (8+4+4+4).
- Refer to Data Type Storage Requirements.
- The size of the indices aren't counted in the calculation.
Yes, because
- they take up too much storage.
- the performance of queries is getting inefficient as the number of rows grows.
As we need only 10 recent view records for each user, we delete other than those.
At this point, we need approximate the user count. Assuming a user visits ten times a day,
- we have about 10 million users
- 100 million rows(2G) when each user has 10 view records
Given the rows, it's not possible to manage the data well even with periodic deletion. So I chose to split the table horizontally.
Because deletion is very inefficient with limited functions of H2, I chose 1,000 for user count per table to make the deletion fast. Then we need 10,000 tables for 10 million users. The numbers can be adjusted depends of the environment or use cases.
- The deletion need to be very fast because a table is locked while deleting.
- Technically, there's no limit for maximum number of tables.
- In practice, we don't need to delete often if we assume that a user visits ten times a day.
- In the experiment, I'm going to create a lot of views for restricted users and delete redundant data every 10 seconds.
Cons of splitting tables:
- It's not easy to manage tables. e.g. changing schema, getting count of all rows, etc.
- Cannot create foreign keys.
- Foreign keys are luxury when we need to handle such big data.
- It would be slow and complex to search with fields other than
host_id
, likevisitor_id
ordate
.
Right, this is similar to Partitioning. But H2 doesn't support partitioning, it's in the roadmap though, and no other local file based databases support partitioning.
- Run the web application server by executing the below command in the project home directory.
java -jar target/portable-db-experiment-1.0-SNAPSHOT.jar server database.yml
- When the server starts, you can access to users page which has view records.
- You can see who visited where through the URLs in the server logs.
- You can create views using
curl
.
- e.g.
curl -X POST "http://localhost:8080/user/1?visitor_id=2"
Two batch scripts in script/
directory.
clear.sh
callsdelete
andselect
APIs. In this script:
- it deletes redundant data every 10 seconds.
- it shows current number of records of each tables.
stress.sh
callscreate
API. In this script:
- it keeps visiting random users pages.
- it shows the number of records of each tables.
- although we postulated that we get 1,400 requests per second, it's not easy to make that number of requests in a single machine.
For a faster and visible test, I've set values like below
- Table count: 100
- User(Host) count per table: 1,000
- User ID range: 1~10,000
With this settings,
- IDs between 1 and 100,000 (100 * 1,000) are acceptable to the server.
- Tables from 1st to 10th (10,000 / 1,000) are going to be filled.
- Record count of each table converges to 10,000 (1,000 * 10).