Data Formats

We encounter many different data formats while dealing with big data. Some popular formats include:

  1. delimiter-separated values (the delimiter can be a comma (CSV), a tab (TSV), etc.)
  2. XML
  3. JSON
  4. Avro
  5. Parquet

Reading these formats always brings challenges, because each format (and often each dataset) needs its own configuration, as sketched below.
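
As a quick illustration of how much of this configuration is format-specific, here is a minimal sketch using Spark's DataFrameReader. It assumes a Spark 2.x SparkSession named spark and hypothetical file paths (the REPL session later in this post uses the older SQLContext API).

// Delimiter-separated values: delimiter, header and schema handling
// all have to be configured explicitly.
val tsv = spark.read
  .option("delimiter", "\t")     // tab-separated instead of the default comma
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // sample the data to guess column types
  .csv("/datasets/example.tsv")

// JSON: the schema is inferred from the documents themselves.
val json = spark.read.json("/datasets/example.json")

// Parquet: the schema and types are stored in the file footer,
// so no extra configuration is needed.
val parquet = spark.read.parquet("/datasets/example.parquet")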

Spark provides two main types of distributed data structures:

  • DataFrames
  • Datasets

DataFrames

Let us read a simple JSON file, as shown in the REPL session below.


scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df = sqlContext.read.json("/user//datasets/curr.json")
df: org.apache.spark.sql.DataFrame = [base: string, date: string, rates: struct<AUD:double,BGN:double,BRL:double,CAD:double,CHF:double,CNY:double,CZK:double,DKK:double,GBP:double,HKD:double,HRK:double,HUF:double,IDR:double,ILS:double,INR:double,JPY:double,KRW:double,MXN:double,MYR:double,NOK:double,NZD:double,PHP:double,PLN:double,RON:double,RUB:double,SEK:double,SGD:double,THB:double,TRY:double,USD:double,ZAR:double>]
scala> df.show()
+----+----------+--------------------+
|base| date| rates|
+----+----------+--------------------+
| EUR|2016-04-22|[1.457,1.9558,4.0...|
+----+----------+--------------------+
scala> df.printSchema()
root
 |-- base: string (nullable = true)
 |-- date: string (nullable = true)
 |-- rates: struct (nullable = true)
 |    |-- AUD: double (nullable = true)
 |    |-- BGN: double (nullable = true)
 |    |-- BRL: double (nullable = true)
 |    |-- CAD: double (nullable = true)
 |    |-- CHF: double (nullable = true)
 |    |-- CNY: double (nullable = true)
 |    |-- CZK: double (nullable = true)
 |    |-- DKK: double (nullable = true)
 |    |-- GBP: double (nullable = true)
 |    |-- HKD: double (nullable = true)
 |    |-- HRK: double (nullable = true)
 |    |-- HUF: double (nullable = true)
 |    |-- IDR: double (nullable = true)
 |    |-- ILS: double (nullable = true)
 |    |-- INR: double (nullable = true)
 |    |-- JPY: double (nullable = true)
 |    |-- KRW: double (nullable = true)
 |    |-- MXN: double (nullable = true)
 |    |-- MYR: double (nullable = true)
 |    |-- NOK: double (nullable = true)
 |    |-- NZD: double (nullable = true)
 |    |-- PHP: double (nullable = true)
 |    |-- PLN: double (nullable = true)
 |    |-- RON: double (nullable = true)
 |    |-- RUB: double (nullable = true)
 |    |-- SEK: double (nullable = true)
 |    |-- SGD: double (nullable = true)
 |    |-- THB: double (nullable = true)
 |    |-- TRY: double (nullable = true)
 |    |-- USD: double (nullable = true)
 |    |-- ZAR: double (nullable = true)
scala> df.select(df("rates")("USD")).show()
+----------+
|rates[USD]|
+----------+
| 1.1263|
+----------+

Other operations on the DataFrame can be found in the Spark documentation.
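
For instance, here are a few more operations on the df loaded above; this is a sketch that only assumes the column names from the schema printed earlier.

// project a subset of the top-level columns
df.select("base", "date").show()

// pull nested fields out of the rates struct and rename them
df.select(df("rates")("USD").alias("usd"), df("rates")("INR").alias("inr")).show()

// filter rows on a nested value
df.filter(df("rates")("USD") > 1.0).show()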

Datasets

Datasets are distributed collections of data (similar to RDDs) with an enhancement: they leverage encoders for serialization and can perform operations such as filtering, sorting, and hashing without fully deserializing the objects.

A minimal example is sketched below.
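
This sketch assumes the same sqlContext as above (on Spark 2.x you would import spark.implicits._ from a SparkSession instead) and a hypothetical Currency case class with illustrative values.

case class Currency(code: String, rate: Double)

import sqlContext.implicits._ // brings encoders for case classes into scope

// build a small typed Dataset from local, illustrative data
val ds = Seq(Currency("USD", 1.1263), Currency("INR", 75.1)).toDS()

// typed operations: the lambdas work on Currency objects, while the
// encoder handles serialization behind the scenes
ds.filter(_.rate > 1.0).show()
ds.map(_.code).show()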

More examples will follow, covering other formats and their benefits for different applications.
