Spark is a large-scale data processing engine. SparkSQL, one of its key components, can access the Hive metastore service to handle Hive tables directly. Furthermore, SparkSQL also provides ways to access data from other external datasources (JDBC to RDBMS, MongoDB, HBase, etc.).
Original Target
In my work, I need to handle data from different datasources (mostly MySQL & MongoDB) to generate final OLAP query results. Our goal is to establish a universal data platform that can access multiple datasources and, in particular, process JOIN operations across schemas that live in different datasources.
Approach-1: Pandas ETL engine
We originally used pandas to load the required schemas as (pandas) DataFrames and then processed all data operations in memory. This approach, however, is
- Time-consuming: it takes great effort to load the DataFrames into memory.
- Lacking in scalability: it cannot handle large-scale data well, since the entire platform resides on a single node.
- Difficult to access: all data operations must go through pandas APIs. There are ways to run SQL against pandas DataFrames (e.g., sql4pandas), but the supported SQL syntax is limited.
At last, we came to Spark. In SparkSQL, the basic operational data unit is also the DataFrame, no matter whether it represents a table in an RDBMS, a collection in MongoDB, or a document in ElasticSearch. Moreover, the lazy evaluation of DataFrames defers the ETL job until the moment we actually need to access the data, which makes data handling efficient and keeps the platform aware of changes in the external datasources.
Approach-2: PySpark Jupyter Notebook
The idea is very simple: we first register all DataFrames as temporary tables. Then we can use SQL via the Spark SQLContext to operate on multiple datasources directly. It's easy to set up the Jupyter notebook environment using PySpark. You can check the demo notebook at my GitHub repository (here). I post the source code as follows.
Initialize pySpark Environment
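Below is a minimal sketch of this step, assuming a Spark 1.x-style SQLContext; the connector package coordinates and versions are only examples and should be matched to your own installation.

```python
import os

# Ask spark-submit to pull in the external datasource connectors.
# Package coordinates/versions here are examples; adjust them to your setup.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages mysql:mysql-connector-java:5.1.38,'
    'com.stratio.datasource:spark-mongodb_2.10:0.11.2 '
    'pyspark-shell'
)

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('universal-data-platform').setMaster('local[*]')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
```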
Initialize Data Access Drivers (Mysql/Mongo/…)
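A sketch of loading one DataFrame per datasource, assuming the built-in JDBC datasource for MySQL and the Stratio spark-mongodb connector for MongoDB; hosts, credentials, and table/collection names are placeholders.

```python
# MySQL: load a table through the JDBC datasource.
mysql_df = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/testdb',
    driver='com.mysql.jdbc.Driver',
    dbtable='mysql_table',
    user='root',
    password='secret'
).load()

# MongoDB: load a collection through the Stratio spark-mongodb datasource.
mongo_df = sqlContext.read.format('com.stratio.datasource.mongodb').options(
    host='localhost:27017',
    database='testdb',
    collection='mongo_collection'
).load()
```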
Register Temporary Tables from datasources (Mysql/Mongo/…)
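Registering the DataFrames only records their names; thanks to lazy evaluation, no data is pulled from the sources yet.

```python
# Register each DataFrame as a temporary table so it becomes visible to SQL.
mysql_df.registerTempTable('mysql_table')
mongo_df.registerTempTable('mongo_table')
```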
Then we can use SparkSQL as follows:
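For example, a cross-datasource JOIN (the column names below are hypothetical):

```python
# Join the MySQL table with the Mongo collection using plain SQL;
# the result is again a (lazy) DataFrame.
result = sqlContext.sql("""
    SELECT m.id, m.name, g.value
    FROM mysql_table AS m
    JOIN mongo_table AS g ON m.id = g.id
""")
result.show()
```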
Approach-3: OLAP SQL Database on SparkSQL Thrift
We take this one step further: we want to turn our platform into a database, so that we can access it from our programs via a JDBC driver and support various legacy BI applications (e.g., Tableau, QlikView).
As mentioned above, SparkSQL can use the Hive metastore directly. Thus, we want to start the SparkSQL thriftserver together with the Hive metastore service, and set up the environment with some SparkSQL DDL statements that create symbol-links to the external datasources.
The work is also very easy: just share the same hive-site.xml between the Hive metastore service and the SparkSQL thriftserver. We post the content of hive-site.xml as follows. It is only a toy setting without any Hadoop/HDFS/MapReduce stuff, in order to highlight the key points; you can adapt it quickly for production use.
Config hive-site.xml
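A minimal sketch of such a hive-site.xml, assuming a local MySQL database as the metastore backend and a local filesystem warehouse directory; all hosts, paths, and credentials below are placeholders.

```xml
<configuration>
  <!-- Backend database that stores the metastore's metadata. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <!-- The thriftserver reaches the shared metastore service through this URI. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <!-- Keep the warehouse on the local filesystem to avoid any Hadoop/HDFS setup. -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>file:///tmp/hive/warehouse</value>
  </property>
</configuration>
```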
Start the SparkSQL thriftserver with required jars
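Something along these lines, assuming the connector jars have already been downloaded; jar names, versions, and the port are examples.

```bash
# Start the Hive metastore service first (it reads the shared hive-site.xml).
hive --service metastore &

# Then start the SparkSQL thriftserver with the datasource connector jars.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --jars mysql-connector-java-5.1.38.jar,spark-mongodb_2.10-0.11.2.jar \
  --hiveconf hive.server2.thrift.port=10000
```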
OK, everything is done! Now you can do the same thing as in approach-2 and create a symbol-link to an external Mongo table in your beeline client as follows:
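A sketch of the DDL, using Spark SQL's CREATE TABLE ... USING ... OPTIONS syntax with the Stratio Mongo datasource; the host, database, and collection values are placeholders.

```sql
CREATE TABLE mongo_table
USING com.stratio.datasource.mongodb
OPTIONS (
  host 'localhost:27017',
  database 'testdb',
  collection 'mongo_collection'
);
```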
Then you can access it via a normal query statement:
```
0: jdbc:hive2://localhost:10000> show tables;
+--------------+--------------+--+
|  tableName   | isTemporary  |
+--------------+--------------+--+
| mongo_table  | false        |
+--------------+--------------+--+
1 row selected (0.108 seconds)

0: jdbc:hive2://localhost:10000> select * from mongo_table;
+------+----+---------------------------+--+
|  x   | y  |            _id            |
+------+----+---------------------------+--+
| 1.0  | a  | 5715f227d2f82889971df7f1  |
| 2.0  | b  | 57170b5e582cb370c48f085c  |
+------+----+---------------------------+--+
2 rows selected (0.38 seconds)
```