DataFrame Style functions
Sedona SQL functions can be used in a DataFrame style API similar to Spark functions.
The following objects contain the exposed functions: org.apache.spark.sql.sedona_sql.expressions.st_functions
, org.apache.spark.sql.sedona_sql.expressions.st_constructors
, org.apache.spark.sql.sedona_sql.expressions.st_predicates
, and org.apache.spark.sql.sedona_sql.expressions.st_aggregates
.
Every function can take all Column
arguments. Additionally, overloaded forms can commonly take a mix of String
and other Scala types (such as Double
) as arguments.
In general the following rules apply (although check the documentation of specific functions for any exceptions):
- Every function returns a
Column
so that it can be used interchangeably with Spark functions as well asDataFrame
methods such asDataFrame.select
orDataFrame.join
. - Every function has a form that takes all
Column
arguments. These are the most versatile of the forms. - Most functions have a form that takes a mix of
String
arguments with other Scala types.
Column
type arguments are passed straight through and are always accepted.str
type arguments are always assumed to be names of columns and are wrapped in aColumn
to support that. If an actual string literal needs to be passed then it will need to be wrapped in aColumn
usingpyspark.sql.functions.lit
.- Any other types of arguments are checked on a per function basis. Generally, arguments that could reasonably support a python native type are accepted and passed through. 4. Shapely
Geometry
objects are not currently accepted in any of the functions.
The exact mixture of argument types allowed is function specific.
However, in these instances, all String
arguments are assumed to be the names of columns and will be wrapped in a Column
automatically.
Non-String
arguments are assumed to be literals that are passed to the sedona function. If you need to pass a String
literal then you should use the all Column
form of the sedona function and wrap the String
literal in a Column
with the lit
Spark function.
A short example of using this API (uses the array_min
and array_max
Spark functions):
val values_df = spark.sql("SELECT array(0.0, 1.0, 2.0) AS values")
val min_value = array_min("values")
val max_value = array_max("values")
val point_df = values_df.select(ST_Point(min_value, max_value).as("point"))
from pyspark.sql import functions as f
from sedona.spark import *
df = spark.sql("SELECT array(0.0, 1.0, 2.0) AS values")
min_value = f.array_min("values")
max_value = f.array_max("values")
df = df.select(ST_Point(min_value, max_value).alias("point"))
The above code will generate the following dataframe:
+-----------+
|point |
+-----------+
|POINT (0 2)|
+-----------+
Some functions will take native python values and infer them as literals. For example:
from sedona.spark import *
df = df.select(ST_Point(1.0, 3.0).alias("point"))
This will generate a dataframe with a constant point in a column:
+-----------+
|point |
+-----------+
|POINT (1 3)|
+-----------+