- kafka and zookeeper installed successfully following README_gemini.md (O)
- firewall.json imported into kafka successfully (O)
- confirmed that the data exists in kafka (O)
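a quick way to re-check this from python; kafka-python, the topic name "firewall", and port 9092 are assumptions, not part of the original setup:

from kafka import KafkaConsumer  # kafka-python, assumed installed separately

consumer = KafkaConsumer(
    "firewall",                          # hypothetical topic name
    bootstrap_servers="docker-kafka:9092",
    auto_offset_reset="earliest",        # read from the start of the topic
    consumer_timeout_ms=5000)            # stop waiting after 5s of silence

for i, msg in enumerate(consumer):
    print(msg.value[:200])               # peek at the first few records
    if i >= 4:
        break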
- pyspark installed with pipenv (O)
- manually added the missing spark-streaming-kafka assembly jar into the pyspark env (O)
- consume kafka with pyspark (O)
(the broker presumably advertises itself as docker-kafka, so map that hostname to localhost)
vi /etc/hosts
127.0.0.1 docker-kafka
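a minimal consumer sketch matching the 0-8 assembly jar used below; the topic name "firewall", the 5s batch interval, and port 9092 are assumptions:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # provided by the 0-8 assembly jar

sc = SparkContext(appName="ps_consumer")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# docker-kafka resolves to localhost via the /etc/hosts entry above
stream = KafkaUtils.createDirectStream(
    ssc, ["firewall"], {"metadata.broker.list": "docker-kafka:9092"})
stream.map(lambda kv: kv[1]).pprint()  # records arrive as (key, value) pairs

ssc.start()
ssc.awaitTermination()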
- convert the DStream to a DataFrame (O)
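one common pattern, assuming the kafka values are JSON strings (firewall.json suggests so) and the `stream` DStream from the sketch above:

from pyspark.sql import SparkSession

def to_df(time, rdd):
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    # parse the JSON value of each (key, value) pair into a DataFrame
    df = spark.read.json(rdd.map(lambda kv: kv[1]))
    df.show(5)

stream.foreachRDD(to_df)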
- run SQL queries on the DataFrame (O)
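continuing inside to_df() after df is built; the column names src_ip/hits are hypothetical:

df.createOrReplaceTempView("firewall")
counts = spark.sql(
    "SELECT src_ip, COUNT(*) AS hits FROM firewall GROUP BY src_ip")
counts.show(5)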
- write the DataFrame out to a parquet file (O)
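still inside the callback; append mode so every micro-batch adds part files to the same dir:

counts.write.mode("append").parquet("data/firewall.parquet")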
- execute the commands under the project dir
cd <project dir>
pipenv shell
spark-submit --jars /root/.local/share/virtualenvs/gemini_task-p6OkMWYi/lib/python3.7/site-packages/pyspark/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.1.jar ps_consumer.py > ps_consumer.err 2>&1
ls data/firewall.parquet
- the parquet output dir must first be removed every time the spark app restarts
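a possible cleanup step at the top of ps_consumer.py:

import shutil

# wipe the previous run's output so restarts start from a clean dir
shutil.rmtree("data/firewall.parquet", ignore_errors=True)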
- change the DataFrame schema
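one way to do it: declare an explicit schema instead of relying on inference; the field names and types here are hypothetical:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("src_ip", StringType()),
    StructField("dst_port", IntegerType()),
    StructField("action", StringType())])

# inside the callback, replaces the inferred-schema read
df = spark.read.schema(schema).json(rdd.map(lambda kv: kv[1]))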
- make the spark app output parquet continuously
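with the DStream approach this mostly falls out of append mode; a consolidated callback sketch:

from pyspark.sql import SparkSession

def write_batch(time, rdd):
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json(rdd.map(lambda kv: kv[1]))
    # every micro-batch appends new part files under data/firewall.parquet
    df.write.mode("append").parquet("data/firewall.parquet")

stream.foreachRDD(write_batch)

note: Structured Streaming's writeStream with a parquet sink would also do this natively, but it needs the spark-sql-kafka-0-10 connector rather than the 0-8 assembly used here.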
- Spark
- Parquet