The DataFrame concept comes from R/Pandas. However, R/Pandas runs on a single machine, whereas Spark's DataFrame is distributed, with a simple and easy-to-use API.
- Lower barrier to entry: compare the Spark RDD API with the MapReduce API
- Single machine only: R/Pandas
An excerpt from the official documentation
(http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#datasets-and-dataframes):
- A Dataset is a distributed collection of data.
- A DataFrame is a Dataset organized into named columns (essentially an RDD with a schema): a distributed dataset structured as columns (column name, column type, column values), with each column given its own name.
- It is an abstraction for selecting, filtering, aggregating and plotting structured data.
- It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
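In the Scala API this relationship is visible in the types: DataFrame is simply a type alias for Dataset[Row]. A minimal sketch (the `people.json` path here is a placeholder, not from the original example):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object DatasetAliasApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("DatasetAliasApp")
      .master("local[2]")
      .getOrCreate()

    val df: DataFrame = spark.read.json("people.json")

    // In Spark, DataFrame is defined as: type DataFrame = Dataset[Row],
    // so this assignment compiles without any conversion.
    val ds: Dataset[Row] = df

    // Both names refer to the same distributed collection of Rows.
    ds.printSchema()

    spark.stop()
  }
}
```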
RDD vs. DataFrame:
- An RDD program's speed depends on the language it is written in, because each language runs on its own runtime:
  - java/scala ==> JVM
  - python ==> Python runtime
- A DataFrame program runs at the same speed regardless of language, because every program is first compiled into the same logical plan:
  - java/scala/python ==> logical plan
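One way to observe this (a sketch, assuming a hypothetical `people.json` with an `age` column) is to call `explain()` on a query: Spark prints the logical and physical plans it generated, and an equivalent PySpark query produces the same plans.

```scala
import org.apache.spark.sql.SparkSession

object ExplainPlanApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("ExplainPlanApp")
      .master("local[2]")
      .getOrCreate()

    val df = spark.read.json("people.json")

    // explain(true) prints the parsed, analyzed, and optimized logical plans
    // plus the physical plan; the same query written in Python compiles to
    // the same plans, which is why execution speed does not depend on language.
    df.filter(df.col("age") > 19).explain(true)

    spark.stop()
  }
}
```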
The example below, adapted from the official documentation, walks through the basic DataFrame operations:
import org.apache.spark.sql.SparkSession

/**
 * Basic DataFrame API operations
 */
object DataFrameApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("DataFrameApp")
      .master("local[2]")
      .getOrCreate()

    // Load a JSON file into a DataFrame
    val peopleDF = spark.read.json("C:\\Users\\Administrator\\IdeaProjects\\SparkSQLProject\\spark-warehouse\\people.json")

    // Print the schema to the console in a nice tree format
    peopleDF.printSchema()

    // Show the first 20 rows of the dataset
    peopleDF.show()

    // Select all values of one column: select name from table
    peopleDF.select("name").show()

    // Select several columns and compute on one of them:
    // select name, age+10 as age2 from table
    peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show()

    // Filter rows by a column's value: select * from table where age > 19
    peopleDF.filter(peopleDF.col("age") > 19).show()

    // Group by a column, then aggregate: select age, count(1) from table group by age
    peopleDF.groupBy("age").count().show()

    spark.stop()
  }
}
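Each call above has the SQL equivalent shown in its comment, and Spark can also run that SQL directly once the DataFrame is registered as a temporary view. A minimal sketch (the `people.json` path is a placeholder; `SqlViewApp` is a hypothetical object name):

```scala
import org.apache.spark.sql.SparkSession

object SqlViewApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SqlViewApp")
      .master("local[2]")
      .getOrCreate()

    val peopleDF = spark.read.json("people.json")

    // Register the DataFrame as a temporary view, queryable by name in SQL
    peopleDF.createOrReplaceTempView("people")

    // Equivalent to peopleDF.filter(peopleDF.col("age") > 19).show()
    spark.sql("select * from people where age > 19").show()

    // Equivalent to peopleDF.groupBy("age").count().show()
    spark.sql("select age, count(1) from people group by age").show()

    spark.stop()
  }
}
```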