This is a course project for ECE4721J: Methods and Tools for Big Data at the UM-SJTU Joint Institute. We built a song recommendation system on the Million Song Dataset, using several big-data tools and algorithms for the analysis.
You can find the source code in my GitHub repo.
Project Goals
- Work with Hadoop, Drill and Spark
- Compare MapReduce and Spark
- Retrieve specific information in big data
- Perform advanced data analysis on big data
Table of contents:
- Milestone 0: HDF5 Data Pre-process
- Milestone 1: Drill Database Query
- Milestone 2: Advanced Data Analysis
Milestone 0: HDF5 Data Pre-process
0.1 Compact small hdf5 files into a larger one

    python3 create_aggregate_file.py <IN> <OUT>

Input: a directory containing hdf5 song files
Output: an aggregate hdf5 song file
0.2 Read an hdf5 file and extract information

    python3 display_song.py [FLAGS] <HDF5> <idx> <field>

Input: an hdf5 song file
Output: the content of the specified field
0.3 Convert hdf5 to Avro with Apache Avro™

    hdf5_to_avro.py [-h] -s <SCHEMA> -i <HDF5> -o <AVRO>

Input: an Avro schema file and an hdf5 song file to be converted
Output: an Avro song file

Sample schema file in JSON format:
    {
        "namespace": "song.avro",
        "type": "record",
        "name": "Song",
        "fields": [
            {
                "name": "artist_name",
                "type": ["string", "null"]
            },
            {
                "name": "title",
                "type": ["string", "null"]
            }
        ]
    }
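As a rough sketch of what the conversion script has to do per record, the snippet below loads the sample schema with the standard-library json module and shapes a song dict to it: unknown fields are dropped and missing fields become None, which satisfies the ["string", "null"] unions. `shape_record` and the sample data are illustrative, not the project's actual code, and the real Avro serialization (done by an Avro library) is omitted:

```python
import json

# The sample schema above, parsed with the stdlib json module. The actual
# binary serialization would be done by an Avro library and is omitted here.
SCHEMA = json.loads("""
{
    "namespace": "song.avro",
    "type": "record",
    "name": "Song",
    "fields": [
        {"name": "artist_name", "type": ["string", "null"]},
        {"name": "title", "type": ["string", "null"]}
    ]
}
""")

def shape_record(song, schema=SCHEMA):
    """Keep only the fields declared in the schema; a missing field becomes
    None, which is valid for the ["string", "null"] union types."""
    return {f["name"]: song.get(f["name"]) for f in schema["fields"]}

# "tempo" is not in the schema, so it is dropped from the shaped record.
record = shape_record({"artist_name": "Led Zeppelin",
                       "title": "Immigrant Song",
                       "tempo": 112.0})
```

Because every field type is a union with "null", a song missing any metadata still produces a valid record.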
Milestone 1: Drill Database Query
Query the Million Song Dataset (MSD) with Drill:
1.1 Find the range of dates covered by the songs in the dataset
SQL:
    -- Age of the youngest song
    SELECT 2022 - MAX(year) AS Age
    FROM hdfs.`/pj/m0/output.avro`;

    -- Age of the oldest song
    SELECT 2022 - MIN(year) AS Age
    FROM hdfs.`/pj/m0/output.avro`
    WHERE year > 0;
Results:
    +--------+
    | Age    |
    +--------+
    | 12     |
    +--------+
    1 row selected (0.864 seconds)

    +--------+
    | Age    |
    +--------+
    | 96     |
    +--------+
    1 row selected (0.642 seconds)
The oldest song’s age is 96 and the youngest is 12.
As a result, the range of dates covered by the songs is 84 years.
1.2 Find the hottest songs that are the shortest and show the highest energy with the lowest tempo
+------------------------------------------------------+
| title |
+------------------------------------------------------+
| b'Immigrant Song (Album Version)' |
| b"Nothin' On You [feat. Bruno Mars] (Album Version)" |
| b'This Christmas (LP Version)' |
| b'If Today Was Your Last Day (Album Version)' |
| b'Harder To Breathe' |
| b'Blue Orchid' |
| b'Just Say Yes' |
| b'They Reminisce Over You (Single Version)' |
| b'Exogenesis: Symphony Part 1 [Overture]' |
| b'Inertiatic Esp' |
+------------------------------------------------------+
10 rows selected (0.471 seconds)
1.3 Find the name of the album with the most tracks
SQL:
    SELECT release, COUNT(release) AS NumTrack
    FROM hdfs.`/pj/m0/output.avro`
    GROUP BY release
    ORDER BY NumTrack DESC
    LIMIT 1;
Results:
    +------------------+----------+
    | release          | NumTrack |
    +------------------+----------+
    | b'Greatest Hits' | 21       |
    +------------------+----------+
    1 row selected (0.695 seconds)
1.4 Find the name of the band who recorded the longest song
SQL:
    SELECT artist_name, duration
    FROM hdfs.`/pj/m0/output.avro`
    ORDER BY duration DESC
    LIMIT 1;
Results:
    +-------------+-----------+
    | artist_name | duration  |
    +-------------+-----------+
    | b'UFO'      | 1819.7677 |
    +-------------+-----------+
    1 row selected (0.27 seconds)
Milestone 2: Advanced Data Analysis
2.1 BFS with MapReduce - A Simple Example
Let’s say we want to find artists similar to A within distance 3, given the following relationship graph (each edge has distance 1):
With n MapReduce BFS iterations, we can find all similar artists within distance n.
2.1.1 Initialize Graph File with Target Artist
Format: each line contains Node | Distance | Neighbours
2.1.2 Generate Distance Pairs in Mapper
Mapper: for each node, emit (Neighbour, Distance + 1) for every neighbour other than the node itself
2.1.3 Merge Distance Pairs in Reducer
Reducer: merge the pairs for the same node, keeping the minimum distance
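The mapper/reducer logic of 2.1.1–2.1.3 can be sketched in plain Python over Hadoop-streaming-style lines of the form Node|Distance|Neighbours. The exact line format, the INF marker, and the driver loop below are illustrative, not the project's actual code:

```python
INF = float("inf")

def mapper(line):
    """Re-emit the node's own record, plus (Neighbour, Distance + 1)
    for every neighbour of an already-reached node (except itself)."""
    node, dist, neighbours = line.split("|")
    dist = INF if dist == "INF" else int(dist)
    yield node, (dist, neighbours)              # keep the adjacency list
    if dist != INF:
        for nb in filter(None, neighbours.split(",")):
            if nb != node:
                yield nb, (dist + 1, "")        # discovered at distance + 1

def reducer(node, values):
    """Merge all records for the same node: keep the minimum distance
    and the (unique) non-empty adjacency list."""
    dist = min(d for d, _ in values)
    neighbours = next((n for _, n in values if n), "")
    return f"{node}|{dist if dist != INF else 'INF'}|{neighbours}"

# One iteration on a toy graph: start from A (distance 0), rest unreached.
lines = ["A|0|B,C", "B|INF|A,D", "C|INF|A", "D|INF|B"]
merged = {}
for line in lines:
    for key, value in mapper(line):
        merged.setdefault(key, []).append(value)
result = sorted(reducer(n, vs) for n, vs in merged.items())
# B and C are now at distance 1; D stays unreached until the next iteration.
```

Running this iteration again (feeding `result` back in) would reach D at distance 2, which is how the Iteration 2 steps below proceed.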
2.1.4 Iteration 2: Mapper
2.1.5 Iteration 2: Reducer
2.2 BFS with Spark
Same algorithm as proposed before, implemented in Python with PySpark.
For the implementation, we:
- Convert the data into an RDD
- Use Spark's sortByKey() to sort the RDD aggregated by node index, then combine the neighbours
- Use Spark's reduce() to pick the minimum distance from the different neighbours to the central node
Spark is expected to be faster since intermediate results are kept in memory across iterations, at the cost of much higher memory consumption.
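The per-node merge at the heart of the Spark version can be sketched as an ordinary binary function; it is shown in plain Python so it runs without a cluster, and in PySpark it would be the function applied when combining the records gathered for each node, as described above. The (distance, neighbours) record layout and the sample data are illustrative:

```python
from functools import reduce

INF = float("inf")

def merge(a, b):
    """Combine two (distance, neighbours) records for the same node:
    keep the minimum distance and whichever adjacency list is non-empty."""
    return (min(a[0], b[0]), a[1] or b[1])

# Records gathered for one node after the map step: the original entry
# (still unreached, but carrying the adjacency list) plus two candidates
# proposed by its neighbours.
records = [(INF, "A,D"), (2, ""), (3, "")]
best = reduce(merge, records)   # (2, "A,D")
```

Because `merge` is associative and commutative in its distance component, Spark can apply it within each partition first and then across partitions, which is what keeps the shuffle small.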
2.3 Benchmark
Data size: around 20GB
Server: SJTU cluster with three machines
- CPU: Dual-core Intel Xeon Processor (Skylake, IBRS)
- Memory: 4GB
| MapReduce   | Spark      |
|-------------|------------|
| around 250s | around 45s |
References
- Million Song Dataset. http://millionsongdataset.com
- Apache Avro Documentation. https://avro.apache.org/docs/current/index.html
- Apache Spark Documentation. https://spark.apache.org/docs/latest/
Thanks for your attention!