Background
At present, hundreds of terabytes of data are processed in the Momo big data cluster every day. However, most of that data is read from and written to disk repeatedly, which is inefficient. In order to speed up data processing and provide a better user experience, we investigated several options and found that Alluxio may fit our needs. Alluxio provides a unified, memory-speed distributed storage layer for various jobs. Since I/O in memory is much faster than on hard disk, hot data in Alluxio can be served at memory speed, just like a memory cache. The more frequently data is read and written through Alluxio, the greater the benefit. To better understand the value Alluxio brings to our ad-hoc service, which uses Spark SQL as its execution engine, we designed a series of experiments with Alluxio and Spark SQL.
Experiment Design
The experiment design makes a few choices aimed at taking advantage of Alluxio:
- Firstly, we use a decoupled compute and storage architecture, because a mixed deployment would leave a heavy I/O burden on Alluxio; the DataNode is therefore not co-deployed with the Alluxio worker. Since the Alluxio cluster is decoupled from HDFS storage, it reads data from remote HDFS nodes on the first execution.
- Secondly, in order to mimic the online environment, we use the YARN node label feature to carve an Alluxio cluster out of the production cluster. This means the Alluxio cluster shares the same NameNode and ResourceManager with the production cluster and may be affected by production load.
- Thirdly, there is only one copy of each piece of data stored in Alluxio, so it cannot guarantee high availability. Moreover, persisting data to a second storage tier such as HDFS is inefficient and wastes space. Considering stability and efficiency, we chose to use Alluxio as a read-only cache in our experiment.
The figure below shows the deployment of Alluxio cluster with production cluster.
The experiment environment of the Alluxio cluster is the same as production except that there is no DataNode process, so the first run incurs a data transfer cost. Besides, we use Spark Thrift Server to provide the ad-hoc analysis service, so all the SQL tests are issued through Spark Thrift Server.
Environment Preparation
Basically, we followed the official instructions for Running Apache Hive with Alluxio and Running Spark on Alluxio to deploy Alluxio with Spark SQL.
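For reference, wiring the decoupled Alluxio cluster to the remote HDFS looks roughly like the following site configuration (hostnames, port, and memory size are illustrative values, not our actual production settings):

```properties
# alluxio-site.properties (illustrative values only)
# Point Alluxio's under file system at the remote production HDFS NameNode
alluxio.underfs.address=hdfs://namenode-host:8020/
# A single memory tier on each worker; size depends on available RAM
alluxio.worker.memory.size=100GB
```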
Here is the software and hardware environment on each node.
Name | Configuration |
---|---|
Software | Spark 2.2.1, Hadoop 2.6.0(HDFS is 2.6.0 while YARN is 2.8.3), Alluxio 1.6.1, Hive 0.14.0 |
Hardware | 32 core, 192G Mem, 5.4T*12 HDD |
And the configuration of Spark Thrift Server is:
/opt/spark-2.2.1-bin-2.6.0/sbin/start-thriftserver.sh \
--master yarn-client \
--name adhoc_STS_Alluxio_test \
--driver-memory 15g \
--num-executors 132 \
--executor-memory 10G \
--executor-cores 3 \
--conf spark.yarn.driver.memoryOverhead=768m \
--conf spark.yarn.executor.memoryOverhead=1024m \
--conf spark.sql.adaptive.enabled=true
Performance Test
1) Test background
In the production environment, we provide the ad-hoc service with Spark SQL, which offers better performance and convenience than MR and Tez.
2) Test case
1) Small data test
First we ran a job of approximately 5 minutes that takes no more than 10 GB of input data. However, the average time cost was close to that in the production environment, showing no obvious improvement. After searching for the reason, we found a post on the Alluxio user list explaining that the test job should be I/O bound, because of the OS's buffer cache; Spark also temporarily persists the input data during the test. So when we ran the small-data-input test job, Spark on Alluxio showed no difference from Spark alone: the data size is too small to fully utilize Alluxio's caching ability.
The content marked by the red line shows the job has only about 5 GB of input data, which is small enough for the OS's buffer cache to hold it all. We therefore proposed another test.
2) Large data test
To better evaluate the performance of Alluxio, we picked 4 different SQLs from production, with input data sizes ranging from 300 GB to 5.5 TB. Besides, we designed four test groups: Alluxio, Spark, Yarn and Alluxio on Disk.
SQL NO. | Input data size |
---|---|
SQL1 | 300G |
SQL2 | 1T |
SQL3 | 1.5T |
SQL4 | 5.5T |
We executed each SQL four times to eliminate the caching deviation of the first run, and calculated the average running time of the last three runs.
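As a minimal sketch, discarding the first (cache-warming) run and averaging the remaining runs works like this (run times are example values in seconds, not our measured results):

```python
def average_excluding_first(run_times):
    """Average the run times after dropping the first (cache-warming) run."""
    if len(run_times) < 2:
        raise ValueError("need at least two runs to discard the first")
    warm_runs = run_times[1:]  # keep every run except the first
    return sum(warm_runs) / len(warm_runs)

# Example: four runs of one SQL, in seconds (illustrative numbers)
print(average_excluding_first([420, 180, 175, 185]))  # -> 180.0
```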
Here is the explanation of the test groups.
Test Group | Comment |
---|---|
Alluxio | Spark on Alluxio on Alluxio cluster |
Spark | Spark without Alluxio on Alluxio cluster |
Yarn | Spark without Alluxio on production cluster |
Alluxio on Disk | Spark on Alluxio with only one HDD tier on Alluxio cluster |
We then drew the following chart to make figure 3 easier to interpret.
Conclusion
From the experiment we can draw some interesting conclusions.
- The large and small data tests show that Alluxio is not suitable for small-data jobs on machines with large memory; it should be used in large-data scenarios.
- In figure 4, the maximum time cost is usually the first run. In general, when fetching data for the first time, reading through Alluxio may be slower than reading directly from HDFS, because data goes from HDFS into Alluxio first and then into the Spark process. In practice this depends on the SQL's characteristics: in SQL 2, SQL 3 and SQL 4, the Alluxio group still outperforms the Spark group.
- Reading data from cache is generally faster than from disk. However, comparing the Alluxio and Alluxio on Disk groups, the speed of reading from cache is similar to reading from disk in SQL 2 and SQL 3, so the performance improvement depends on the SQL workload.
- Generally speaking, Spark on Alluxio achieves a good speedup: 3x - 5x over Spark in the production environment and 1.5x - 3x over Spark without Alluxio.
All in all, Alluxio has an obvious benefit for our ad-hoc analysis service, and SQLs with different characteristics see different degrees of acceleration. The reason our tests do not reach the up-to-10x improvement claimed on the official website may be that our test SQLs were selected from production jobs, which are more complicated and involve plenty of compute as well as I/O cost.
What We Have Done
- The memory in the Alluxio cluster is limited: if all the tables in use were loaded into Alluxio, the first memory tier would fill up quickly and spill the excess data to the second tier, which would severely hurt Alluxio performance. To prevent this from happening, we developed a whitelist feature to decide which tables are loaded into Alluxio for caching.
- Since Alluxio keeps only one copy of data in memory and does not guarantee high availability, we use it as a read-only cache in our scenario. We developed a feature to read from Alluxio and write directly to HDFS: if a query involves a table on the whitelist, the scheme of the table's path is transformed to the Alluxio scheme, so applications read data from Alluxio but write to HDFS.
- The official ad-hoc usage scenario of Alluxio is to use Presto or Hive as the query engine, and this approach is widely used. In this article, we showed that Spark SQL can also serve ad-hoc queries with Alluxio. Compared to Presto, Spark has better fault tolerance while still keeping good performance, so Alluxio with Spark SQL is another good technical option for an ad-hoc service.
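A minimal sketch of the whitelist-driven scheme rewrite described above might look like the following (the table names, hostnames, and ports are hypothetical, and our production implementation hooks into the query path rather than rewriting strings like this):

```python
ALLUXIO_PREFIX = "alluxio://alluxio-master:19998"  # hypothetical Alluxio master address
HDFS_PREFIX = "hdfs://namenode-host:8020"          # hypothetical NameNode address

# Tables allowed to be cached in Alluxio (hypothetical names)
CACHE_WHITELIST = {"ads_click_log", "user_profile"}

def resolve_read_path(table_name, hdfs_path):
    """Rewrite an HDFS table path to the Alluxio scheme if the table is whitelisted.

    Reads go through Alluxio only for whitelisted tables; writes always
    target HDFS directly, since Alluxio serves as a read-only cache.
    """
    if table_name in CACHE_WHITELIST and hdfs_path.startswith(HDFS_PREFIX):
        return ALLUXIO_PREFIX + hdfs_path[len(HDFS_PREFIX):]
    return hdfs_path

# Whitelisted table: read through Alluxio
print(resolve_read_path("ads_click_log", "hdfs://namenode-host:8020/warehouse/ads_click_log"))
# -> alluxio://alluxio-master:19998/warehouse/ads_click_log

# Non-whitelisted table: read from HDFS as usual
print(resolve_read_path("tmp_table", "hdfs://namenode-host:8020/tmp/tmp_table"))
# -> hdfs://namenode-host:8020/tmp/tmp_table
```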
Future Work
As Alluxio does speed up our service remarkably, we would like to apply more frameworks such as Hive and Spark MLlib to Alluxio and make it the uniform data ingestion interface for the computing layer. Besides, more effort on security, stability and job monitoring is on the way.
References
www.alluxio.org
Alluxio users
How Alluxio is Accelerating Apache Spark Workloads
About MOMO
MOMO Inc (Nasdaq: MOMO) is a leading mobile pan-entertainment social platform in China. It has reached approximately 100 million monthly active users and over 300 million users worldwide. Currently the MOMO big data platform has over 25,000 cores and 80 TB of memory, supporting approximately 20,000 jobs every day. The MOMO data infrastructure team works on providing stable and efficient batch and streaming solutions to support services like personalized recommendation, ads and business data analysis. By using Alluxio, the efficiency of the ad-hoc service has improved significantly. This blog provides some performance tests to evaluate how much benefit Alluxio can contribute to our ad-hoc service, as well as a new scenario of using Spark SQL on Alluxio.