```python
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset
model.transform(test_df).show()
```
```python
# Read a CSV file of account data, using the first row as the header
df = spark.read.csv("accounts.csv", header=True)

# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

# Generate summary statistics
filtered_df.summary().show()
```
Run now:

```shell
$ docker run -it --rm spark /opt/spark/bin/spark-sql
```
The most widely-used engine for scalable computing

Thousands of companies, including 80% of the Fortune 500, use Apache Spark™. More than 2,000 contributors from industry and academia have contributed to the open source project.
Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and Machine learning
SQL analytics and BI
Storage and Infrastructure
Spark SQL engine: under the hood
Apache Spark™ is built on an advanced distributed SQL engine for large-scale data.