File systems

Hadoop

To facilitate use of the Vulcan Cluster, each user is granted a personal HDFS home directory with 30 GB of usable storage. Below are some common commands for interacting with HDFS. Additional information can be found in the Apache Hadoop FileSystem Shell documentation: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

# list files in the root directory
hadoop fs -ls /

# list files in your HDFS home directory
hadoop fs -ls /user/<boothId>

# dot-slash is a shortcut to your HDFS home directory
hadoop fs -ls ./

# view a text file (-text also decompresses common formats such as .gz)
hadoop fs -text ./file.txt.gz

# copy a file from HDFS to the local file system
hadoop fs -get /user/<boothId>/myfile.txt ./local/file/path

# upload a local file to HDFS
hadoop fs -put ~/myfile.txt ./hdfs/file/path

# view disk quota and usage
hadoop fs -count -q -v -h ./

Note

HDFS home directories are limited to 30 GB of usable disk space per user.
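
These HDFS paths can also be read directly from Spark. A minimal sketch, assuming the spark session provided by the pyspark shell and a hypothetical myfile.txt in your HDFS home directory:

# read a text file from your HDFS home directory into a DataFrame
df = spark.read.text("/user/<boothId>/myfile.txt")

# preview the first few lines
df.show(5, truncate=False)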

Data sets hosted on HDFS

The Apache Parquet file format is a column-oriented data storage format that is well suited for big data applications. We have converted the original *.csv files into *.parquet format to facilitate file input in your Spark applications.

  • TransUnion

fileName = "/dataset/TransUnion_Data/2015/dgebooth_PubRec*.pq"

df = spark.read.parquet(fileName)
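
Once loaded, the DataFrame can be inspected before any further processing. A brief usage sketch (these calls only display metadata and a small sample; nothing here is specific to the TransUnion schema):

# print the column names and types recorded in the Parquet metadata
df.printSchema()

# preview the first five rows
df.show(5)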

Booth Cloud

Booth project shares can be mounted on Vulcan by request. File paths should be prefixed with file:// to distinguish them from hdfs:// paths.

fileName = "file:///project/<projectName>/<filename.csv>"

df = spark.read.load(fileName, format="csv")
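
spark.read.load also accepts the standard CSV reader options. Whether a given file has a header row is an assumption here, so adjust these options to match your data:

# treat the first row as column names and infer column types
# (header/inferSchema are illustrative; verify them against your file)
df = spark.read.load(fileName, format="csv", header=True, inferSchema=True)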

Booth home directories are auto-mounted upon login, so they are only available on the node where you log in. For this reason, files located in Booth home directories are only accessible when submitting Spark applications in client deploy mode, where the driver runs on that node.

fileName = "file:///home/<BoothID>/<filename.csv>"

df = spark.read.load(fileName, format="csv")
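
Results can be written back to your HDFS home directory in the same way. A minimal sketch with a hypothetical output path (note that writes count against the 30 GB quota):

# write the DataFrame to HDFS as Parquet, replacing any previous output
df.write.parquet("/user/<boothId>/output.parquet", mode="overwrite")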