SparkToPandas package

Submodules

SparkToPandas.SparkToPandas module

SparkToPandas Documentation

SparkToPandas is a small helper library that runs alongside Spark. It is designed to work with PySpark while offering a syntax closer to that of pandas.

class SparkToPandas.SparkToPandas.Spark_pandas(spark)[source]

Bases: object

A collection of supporting functions for PySpark with a syntax similar to pandas.
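
As a minimal sketch of how the class might be instantiated: the import path follows the module name given above, and the assumption (not confirmed by this page) is that the `spark` argument is a `SparkSession`. The helper name `build_helper` and the application name are hypothetical.

```python
def build_helper(app_name="spark-to-pandas-demo"):
    """Create a SparkSession and wrap it in Spark_pandas (illustrative only)."""
    # Imported lazily so the sketch can be defined without Spark installed.
    from pyspark.sql import SparkSession
    from SparkToPandas.SparkToPandas import Spark_pandas

    # Assumption: the constructor's `spark` argument is a SparkSession.
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark, Spark_pandas(spark)
```

With both packages installed, `spark, sp = build_helper()` would give an object whose methods are the ones documented below.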

barChart(df, x, y, hue, title, aspect='horizontal')[source]

Plots a bar chart using the seaborn module.

Parameters
  • df – dataframe

  • x – str

  • y – str

  • hue – str

  • title – str

  • aspect – str

Returns

None

column_creator(df, primary_column, new_column_name, user_func)[source]

Creates a new column by applying a user-defined function to an existing column, and returns the resulting dataframe.

Parameters
  • df – dataframe

  • primary_column – str

  • new_column_name – str

  • user_func – function

Returns

dataframe
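
The package's own implementation is not shown on this page. As a rough sketch of the same idea in plain PySpark, a Python function can be wrapped with `pyspark.sql.functions.udf` and attached with `DataFrame.withColumn`. The name `column_creator_sketch` and the extra `return_type` parameter are hypothetical and not part of the documented signature.

```python
def column_creator_sketch(df, primary_column, new_column_name, user_func,
                          return_type="int"):
    """Add `new_column_name` by applying `user_func` to `primary_column`."""
    # Imported lazily so the sketch can be defined without Spark installed.
    from pyspark.sql import functions as F

    # Assumption: the caller knows the result type; "int" is a DDL type string.
    wrapped = F.udf(user_func, returnType=return_type)
    return df.withColumn(new_column_name, wrapped(F.col(primary_column)))
```

Python UDFs are evaluated row by row, so for simple arithmetic a built-in column expression is usually faster than a wrapped function.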

conditional_func(x)[source]

A sample function that returns x + 1.

Parameters

x – int

Returns

int
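
Based on the description above, the behaviour can be reproduced in a single line of plain Python. This is a stand-alone sketch, not the package's source; a function of this shape is the kind of callable one would pass as `user_func` to `column_creator`.

```python
def conditional_func(x: int) -> int:
    """Return x + 1, mirroring the documented sample behaviour."""
    return x + 1


print([conditional_func(v) for v in (0, 1, 41)])  # prints [1, 2, 42]
```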

drop_na(df, col_name=None)[source]

Drops rows containing null values. If col_name is omitted, rows with a null in any column are dropped; otherwise only rows with a null in the given column are dropped.

Parameters
  • df – dataframe

  • col_name – str

Returns

dataframe
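
For comparison, the two modes described above map naturally onto PySpark's `DataFrame.dropna` and its `subset` argument. The sketch below assumes `df` is a PySpark DataFrame; `drop_na_sketch` is a hypothetical name, and the package may implement this differently.

```python
def drop_na_sketch(df, col_name=None):
    """Drop rows with nulls anywhere, or only where `col_name` is null."""
    if col_name is None:
        # No column given: drop rows that contain a null in any column.
        return df.dropna()
    # Column given: only nulls in that column cause a row to be dropped.
    return df.dropna(subset=[col_name])
```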

fillna(df, value, col_name=None)[source]

Fills null values with the given value, either in every column or only in the column named by col_name.

Parameters
  • df – dataframe

  • value – int/str/float

  • col_name – str

Returns

dataframe
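
A comparable effect in plain PySpark comes from `DataFrame.fillna`, which also accepts a `subset` of columns. The sketch assumes `df` is a PySpark DataFrame; `fillna_sketch` is a hypothetical name rather than the package's code.

```python
def fillna_sketch(df, value, col_name=None):
    """Replace nulls with `value`, optionally restricted to one column."""
    if col_name is None:
        # Fill every column whose type is compatible with `value`.
        return df.fillna(value)
    # Restrict the fill to the named column.
    return df.fillna(value, subset=[col_name])
```

In PySpark, columns whose type does not match the type of `value` are left unchanged, so a string fill value will not touch numeric columns.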

head(df, n)[source]

Prints the head and tail of the dataframe depending on user’s choice.

Parameters
  • df – dataframe

  • n – int

Returns

None

read_csv(file_location, header=True)[source]

Reads a CSV file into a Spark RDD.

Parameters
  • file_location – str

  • header – bool

Returns

rdd
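
In plain PySpark, CSV files are usually loaded through `spark.read.csv`, which returns a DataFrame (its `.rdd` attribute exposes the underlying RDD). The sketch below assumes `spark` is a `SparkSession`; `read_csv_sketch` is a hypothetical name, and whether the package converts the result further is not stated on this page.

```python
def read_csv_sketch(spark, file_location, header=True):
    """Load a CSV file, treating the first row as column names if `header`."""
    # Assumption: `spark` is a SparkSession, so `spark.read` is a DataFrameReader.
    return spark.read.csv(file_location, header=header)
```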

sort_df(df, col_name, ascending=True)[source]

Sorts the dataframe in ascending or descending order by the given columns.

Parameters
  • df – dataframe

  • col_name – list

  • ascending – bool

Returns

dataframe
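
The equivalent operation in plain PySpark is `DataFrame.orderBy`, which takes column names and an `ascending` keyword. The sketch assumes `df` is a PySpark DataFrame and `col_name` is a list of column names, as documented above; `sort_df_sketch` is a hypothetical name.

```python
def sort_df_sketch(df, col_name, ascending=True):
    """Sort `df` by the columns listed in `col_name`."""
    # Unpack the list so each column name is passed as its own argument.
    return df.orderBy(*col_name, ascending=ascending)
```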

subset_columns(column_names, df)[source]

Returns a dataframe containing only the user-specified columns.

Parameters
  • column_names – list

  • df – dataframe

Returns

dataframe
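
Column selection of this kind corresponds to `DataFrame.select` in plain PySpark. The sketch keeps the documented argument order (column names first, then the dataframe) and assumes `df` is a PySpark DataFrame; `subset_columns_sketch` is a hypothetical name.

```python
def subset_columns_sketch(column_names, df):
    """Return `df` restricted to the columns listed in `column_names`."""
    # Unpack the list so each column name is passed as its own argument.
    return df.select(*column_names)
```

This is the PySpark counterpart of the pandas idiom `df[["a", "b"]]`.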

Module contents