File systems
============

Hadoop
------

To facilitate usage of the Vulcan Cluster, individual users are granted their own HDFS home directory. Each HDFS home directory includes 100 GB of usable storage. Below are some common commands for interacting with HDFS. Additional information can be found in the Apache Hadoop FileSystem Shell documentation:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

.. code-block:: bash

    # list files in the root directory
    hadoop fs -ls /

    # list files in your HDFS home directory
    hadoop fs -ls /user/<username>

    # dot-slash is a shortcut to your HDFS home directory
    hadoop fs -ls ./

    # view a text file (compressed files are decompressed automatically)
    hadoop fs -text ./file.txt.gz

    # retrieve a file from HDFS onto the local file system
    hadoop fs -get /user/<username>/myfile.txt ./local/file/path

    # upload a local file to HDFS
    hadoop fs -put ~/myfile.txt ./hdfs/file/path

    # view disk quota and usage
    hadoop fs -count -q -v -h ./

.. note:: HDFS home directory quotas are limited to 100 GB of usable disk space per user.

Data sets hosted on HDFS
!!!!!!!!!!!!!!!!!!!!!!!!

The Apache Parquet file format is a column-oriented data storage format that is well suited for big data applications. We have converted the original \*.csv files into \*.parquet format to facilitate file input in your Spark applications.

* TransUnion

  .. code-block:: python

      fileName = "/dataset/TransUnion_Data/2015/dgebooth_PubRec*.pq"
      df = spark.read.parquet(fileName)

Booth Cloud
-----------

Booth project shares can be mounted on Vulcan by request. File paths should be prefixed with :code:`file://` to differentiate them from :code:`hdfs://` paths.

.. code-block:: python

    fileName = "file:///project/<project_name>/<file_name>"
    df = spark.read.load(fileName, format="csv")

Booth home directories are auto-mounted upon login. For this reason, files located in Booth home directories are only available when submitting Spark applications in :code:`client` deploy mode.

.. code-block:: python

    fileName = "file:///home/<username>/<file_name>"
    df = spark.read.load(fileName, format="csv")
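
As a concrete illustration of the deploy-mode restriction above, the sketch below shows how a PySpark application might be submitted so that it can still read from a Booth home directory. The script name :code:`my_app.py` is hypothetical, and :code:`--master yarn` assumes the cluster's YARN resource manager is the default; check the cluster's Spark documentation for the exact defaults.

.. code-block:: bash

    # client deploy mode: the driver runs on the node where you logged in,
    # where Booth home directories are mounted, so file:///home/... paths
    # used by the application remain visible
    # (my_app.py is a hypothetical script name)
    spark-submit --master yarn --deploy-mode client my_app.py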
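
Conversely, if an application must run in :code:`cluster` deploy mode, a common pattern is to stage input files into HDFS first so that they are visible from every node. The following is a sketch under that assumption; the file and script names are again hypothetical.

.. code-block:: bash

    # stage a local input file into your HDFS home directory
    hadoop fs -put ~/mydata.csv ./mydata.csv

    # submit in cluster deploy mode; the application should now read the
    # HDFS copy, e.g. spark.read.load("mydata.csv", format="csv")
    spark-submit --master yarn --deploy-mode cluster my_app.py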