The Big Data Show
Interview Question on Cache v/s Persist - Part 1
During a Data Engineering interview, you may be asked about concepts related to #apachespark.
When working with large-scale data processing in Apache Spark, efficient data management is key to achieving optimal performance. Two commonly used methods for improving the efficiency of Spark jobs are cache() and persist(). While they might seem similar at first glance, they serve slightly different purposes and have distinct use cases. Let’s delve into the differences between these two methods and their use cases, along with the #interviewquestions
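As a quick, hedged illustration of the difference (this is not the exact code from the video; the DataFrame below is synthetic):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-vs-persist-demo").getOrCreate()

    # Synthetic DataFrame standing in for any expensive, reused result
    df = spark.range(1_000_000).withColumnRenamed("id", "order_id")

    # cache() = persist() with the default storage level
    # (MEMORY_AND_DISK for the DataFrame API in recent Spark versions)
    even_df = df.filter("order_id % 2 = 0").cache()
    even_df.count()                      # the first action materialises the cache

    # persist() additionally lets you pick the storage level explicitly
    odd_df = df.filter("order_id % 2 = 1").persist(StorageLevel.DISK_ONLY)
    odd_df.count()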
𝐀𝐫𝐭𝐢𝐜𝐥𝐞 & 𝐌𝐂𝐐 𝐋𝐢𝐧𝐤:
ua-cam.com/users/postUgkxtHWo5HNdQomArpA1qOeXi_szZzVfhJHP
In the video (Part 1 & Part 2), you will learn:
🔅 Definition
🔅 Cache vs Persist
🔅 Use case in Spark Optimization
🔅 Storage Level of Persist
🔅 Serialization vs Deserialization
🔅 Demo using Spark UI
🔅 MCQ Question
𝐂𝐡𝐚𝐩𝐭𝐞𝐫𝐬:
- 0:00 Introduction
- 1:00 Definition
- 6:34 Persist Storage Level
- 9:51 Difference between cache and persist
- 11:37 Real-time use case & other important questions
🔅 For scheduling a call for mentorship, mock interview preparation, 1:1 connect, collaboration - topmate.io/ankur_ranjan
🔅 LinkedIn - www.linkedin.com/in/thebigdatashow/
🔅 Instagram - ranjan_anku
🔅 Nisha's LinkedIn profile -
www.linkedin.com/in/engineer-nisha/
🔅 Ankur's LinkedIn profile - www.linkedin.com/in/thebigdatashow/
#dataengineering #datascience #bigdata #pyspark #dataanalytics #spark #interviewquestions #interview
Views: 580

Videos

IPL Final 2024 Data Analysis: Building the Ultimate Scorecard with Pyspark
Views: 1.8K · 28 days ago
Have you ever wondered how crucial data analysis is for a cricket team's success? Thousands of Data Engineers, Data Analysts, and Data Scientists work tirelessly behind the scenes to craft winning strategies. In this session, we'll dive into an exciting IPL dataset and perform a transformation to build the Scorecard of the IPL Final 2024 featuring #SRHvKKR. In this video, you'll learn how to pe...
Salting in Apache Spark - Part II
Views: 644 · 1 month ago
In this video, we dive deep into the salting technique, a powerful method to tackle data skew issues in Spark. Data skew can significantly impact the performance of your Spark jobs by creating bottlenecks during data processing. Salting helps to evenly distribute the data across partitions, ensuring a smoother and more efficient processing flow. What You’ll Learn: 🔹 What is data skewness 🔹 How ...
Salting in Apache Spark - Part I
Views: 1.4K · 1 month ago
During a Data Engineering interview, you may be asked about concepts related to #apachespark. This video explains the Salting technique. We will go in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. The salting technique in Apache Spark is a method used to address data skew. Data skew happens when certain keys have more data than others...
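As a rough sketch of the idea (the DataFrames and column names below are placeholders, not the dataset used in the video): a random salt is appended to the skewed key on the large side, and the small side is exploded with every salt value before the join.

    from pyspark.sql import functions as F

    SALT_BUCKETS = 10  # salting factor: a tuning choice, not a fixed rule

    # large_df is skewed on "key"; small_df is the other side of the join (placeholders)
    salted_large = large_df.withColumn(
        "salted_key",
        F.concat_ws("_", F.col("key"), (F.rand() * SALT_BUCKETS).cast("int").cast("string"))
    )

    # Replicate each row of the small side once per salt value so every salted key has a match
    salted_small = (
        small_df
        .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
        .withColumn("salted_key", F.concat_ws("_", F.col("key"), F.col("salt").cast("string")))
    )

    # Join on the salted key: the hot key is now spread across SALT_BUCKETS partitions
    joined = salted_large.join(salted_small, on="salted_key", how="inner")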
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
Views: 5K · 1 month ago
Data Engineering Mock Interview Join Nisha, an experienced Data Engineering professional with over 5 years of experience, and Sai Varun Kumar Namburi for an exciting and informative Data Engineering mock interview session. If you're preparing for a Data Engineering interview, this is the perfect opportunity to enhance your skills and increase your chances of success. The mock interview simulate...
How to read from APIs in PySpark codebase...
Views: 1.7K · 1 month ago
PySpark mini project: Dive into the world of big data processing with our PySpark Practice playlist. This series is designed for both beginners and seasoned data professionals looking to sharpen their Apache Spark skills through scenario-based questions and challenges. Not all the inputs come from storage files like JSON, CSV and other formats. There can be cases where you are given a scenario ...
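A minimal sketch of that pattern (the endpoint URL and the assumption of a flat JSON list are illustrative, not the API used in the video): fetch the payload on the driver with requests, then turn it into a DataFrame.

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-to-dataframe").getOrCreate()

    # Placeholder endpoint; a real job would also handle auth, retries and pagination
    response = requests.get("https://api.example.com/users", timeout=30)
    response.raise_for_status()
    records = response.json()            # assumed to be a list of flat JSON objects

    # For small payloads the list can be converted on the driver directly
    df = spark.createDataFrame(records)
    df.printSchema()
    df.show(5, truncate=False)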
Data Engineering Interview at top product based company | First Round
Views: 6K · 1 month ago
Data Engineering Mock Interview In top product-based companies like #meta #amazon #google #netflix etc, the first round of Data Engineering Interviews checks problem-solving skills. It mostly consists of screen-sharing sessions, where candidates are expected to solve multiple SQL and DSA problems, particularly in #python. We have tried to replicate the same things by asking multiple good SQL an...
What is topic, partition and offset in Kafka?
Views: 559 · 2 months ago
This is the third video of our "Kafka for Data Engineers" playlist. In this video, we have tried to understand topic, partition and offset in Apache Kafka in depth. Understanding and imagining Apache Kafka at its core is very important for grasping its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮: 🔅 Topmate (For collaboration and Scheduli...
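For intuition, here is a small consumer sketch (using the third-party kafka-python package; the topic name and broker address are assumptions) showing where topic, partition and offset appear on every record:

    from kafka import KafkaConsumer   # third-party package: kafka-python

    consumer = KafkaConsumer(
        "orders",                      # hypothetical topic name
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        group_id="demo-group",
    )

    for message in consumer:
        # Every record carries the topic it belongs to, the partition it was
        # written to, and its offset (position) within that partition
        print(message.topic, message.partition, message.offset, message.value)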
Brokers in Apache Kafka | Replication factor & ISR in Kafka
Views: 360 · 2 months ago
This is the fourth video of our "Kafka for Data Engineers" playlist. In this video, we have tried to understand brokers, the replication factor and ISR. Understanding and imagining Apache Kafka at its core is very important for grasping its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼𝗻 𝗦𝗼𝗰𝗶𝗮𝗹 𝗠𝗲𝗱𝗶𝗮: 🔅 Topmate (For collaboration and Scheduling calls) - t...
Job, Stage and Task in Apache Spark | PySpark interview questions
Views: 1.2K · 2 months ago
In this video, we explain the concept of Job, Stage and Task in Apache Spark or PySpark. We have gone in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. To reinforce your knowledge, we've created many problems for you to practice on the same topic in the community section of our UA-cam channel. You can find a link to all the questions ...
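As a small, hedged sketch of those terms (synthetic data, not the example from the video): one action triggers one job, a shuffle-inducing transformation starts a new stage, and each stage runs one task per partition.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("job-stage-task-demo").getOrCreate()

    df = spark.range(1_000_000)                              # synthetic data

    # Narrow transformation: no shuffle, stays in the same stage
    filtered = df.filter(F.col("id") % 2 == 0)

    # Wide transformation: groupBy forces a shuffle, i.e. a new stage
    aggregated = filtered.groupBy((F.col("id") % 10).alias("bucket")).count()

    # The action triggers one job; check the Spark UI to see its stages and tasks
    aggregated.collect()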
Unlocking Apache Kafka: The Secret Sauce of Event Streaming
Views: 722 · 2 months ago
This is the second video of our "Apache Kafka for Data Engineers" playlist. In this video, we have tried to understand Apache Kafka in brief, and then the real meaning of an event & event streaming. Understanding and imagining Apache Kafka at its core is very important for grasping its concepts deeply. Stay tuned to this playlist for all upcoming videos. 𝗝𝗼𝗶𝗻 𝗺𝗲 𝗼...
Unleashing #kafka Magic: What Data Engineers Do with Apache Kafka?
Views: 1.6K · 2 months ago
This is the first video of our "Apache Kafka for Data Engineers" playlist. In this video, we have tried discussing one real use case, a big data pipeline involving Kafka, which is often used in the e-commerce industry by companies like Amazon, Walmart etc. It is very important to understand some of the real use cases of Apache Kafka in the Data Engineering domain. I hope this video will set the tone for t...
Repartition vs. Coalesce in Apache Spark | PySpark interview questions
Views: 637 · 2 months ago
During a Data Engineering interview, you may be asked about concepts related to #apachespark. In this video, we explain the difference between Repartition and Coalesce in Apache Spark or PySpark. We go in-depth to help you understand the topic, but it's important to remember that theory alone may not be enough. To reinforce your knowledge, we've created over ten problems for you to practice on ...
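A tiny, hedged sketch of the distinction (synthetic data; the partition counts are arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

    df = spark.range(1_000_000)                     # synthetic data
    print(df.rdd.getNumPartitions())                # starting partition count

    # repartition() performs a full shuffle; it can increase or decrease the
    # number of partitions and rebalances rows evenly across them
    repartitioned = df.repartition(8)

    # coalesce() only merges existing partitions (no full shuffle), so it can
    # only reduce the partition count, which makes it cheaper for that case
    coalesced = df.coalesce(2)

    print(repartitioned.rdd.getNumPartitions(), coalesced.rdd.getNumPartitions())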
Apache Spark End-To-End Data Engineering Project | Apple Data Analysis
Views: 30K · 2 months ago
Sports Data Analysis using PySpark - Part 02
Views: 1.2K · 2 months ago
Narrow vs. Wide Transformation in Apache Spark | PySpark interview questions
Views: 756 · 2 months ago
Sports Data Analysis using PySpark - Part 01
Views: 1.5K · 2 months ago
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
Views: 6K · 3 months ago
Data Engineering Interview
Views: 4.9K · 3 months ago
Data Engineering Interview | PySpark Questions | Manager behavioural questions
Views: 7K · 3 months ago
Data Engineering Interview at top product based company | First Round
Views: 11K · 4 months ago
Big Data Mock Interview | Data Engineering Interview | First Round of Interview
Views: 8K · 4 months ago
Big Data Mock Interview | Data Engineering Interview
Views: 16K · 4 months ago
AWS Data Engineering Interview
Views: 25K · 4 months ago
Data Engineering Interview | System Design
Views: 23K · 4 months ago
System Design round of #dataengineering interview
Views: 15K · 4 months ago
First round of Big Data Engineering #interview
Views: 2.7K · 5 months ago
System Design round of Data Engineering #interview at top product-based company
Views: 41K · 5 months ago
Big Data Mock Interview | First Round
Views: 27K · 5 months ago
Data Engineering Mock Interview at Top Product Based Companies
Views: 10K · 6 months ago

COMMENTS

  • @ramsompura125
    @ramsompura125 21 hours ago

    Hi, thank you so much for sharing the project video. I am getting this error when I run the code in a Jupyter notebook on Windows: Py4JJavaError: An error occurred while calling o209.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 58.0 failed 1 times, most recent failure: Lost task 0.0 in stage 58.0 (TID 46) (host.docker.internal executor driver): java.io.IOException: Mkdirs failed to create file:/C:/pythonwork/sparkwork/data/Apple Analysis/outputirpodsAfterIphone/_temporary/0/_temporary/attempt_202408032042443838464405057664188_0058_m_000000_46... Can you please suggest?

  • @ganeshpai8095
    @ganeshpai8095 1 day ago

    For that Python question:

        input = 'thebigdatashow'
        start = -1 * k
        end = len(input) - k
        for i in range(start, end):
            print(input[i], end="")

  • @AshiChaudhary-lc8tk
    @AshiChaudhary-lc8tk 1 day ago

    -- Create table statement
    CREATE TABLE orders (
        order_id INT,
        product_id INT,
        quantity INT
    );

    -- Insert data into the table
    INSERT INTO orders (order_id, product_id, quantity) VALUES
        (1, 1, 12), (1, 2, 10), (1, 3, 5), (1, 3, 10),
        (2, 1, 4), (2, 1, 4), (2, 4, 4), (2, 5, 6),
        (3, 3, 5), (3, 4, 18), (4, 5, 2), (4, 6, 8),
        (5, 7, 9), (5, 8, 9), (3, 9, 20), (2, 9, 4);

    For anybody practicing :)

  • @PradipChavan-oz5dc
    @PradipChavan-oz5dc 2 days ago

    Everyone is talking about the candidate... She is excellent, no doubt. But the interviewer helped lead her to explain things in depth.

  • @connectwithanandsuresh
    @connectwithanandsuresh 3 days ago

    I really liked this mock interview because mock interviews generally don't go into this level of depth. Appreciate the effort put into this. Can we have more AWS data engineering interviews similar to this, please?

  • @user-dj4ht7rg2f
    @user-dj4ht7rg2f 4 days ago

    This is what an actual real-world problem should look like; love the video... This should reach a wider audience. PS: The strike rate of the batsman is still incorrect; we have to include wide balls faced by the batsman :)

    • @TheBigDataShow
      @TheBigDataShow 4 days ago

      @@user-dj4ht7rg2f Thank you for your kind words. Kindly share with your friends

  • @m04d10y1996
    @m04d10y1996 5 days ago

    Background music was unnecessary here.

    • @TheBigDataShow
      @TheBigDataShow 5 days ago

      @@m04d10y1996 We will improve. This is a very old video and we were experimenting. Try checking the new videos

    • @m04d10y1996
      @m04d10y1996 5 days ago

      @@TheBigDataShow content is really good and to the point. The community will definitely grow given the work being done.

  • @infotecsb3675
    @infotecsb3675 6 days ago

    Thanks

  • @omkarshinde4792
    @omkarshinde4792 7 days ago

    Hi, one question: why do we need to convert the DataFrame to a dictionary when we can pass the DataFrame directly to the transform function?

    • @TheBigDataShow
      @TheBigDataShow 7 days ago

      @@omkarshinde4792 The DataFrame is not converted into a dictionary; instead, we have created a dictionary of DataFrames. Using a dictionary of DataFrames is a much cleaner way to return multiple DataFrames.
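      As a hedged sketch of that pattern (the function names, keys and synthetic DataFrames below are illustrative, not the exact code from the video):

          from pyspark.sql import SparkSession

          def extract(spark):
              # Return several source DataFrames keyed by name (synthetic data here)
              return {
                  "transactions": spark.range(10).withColumnRenamed("id", "txn_id"),
                  "customers": spark.range(5).withColumnRenamed("id", "customer_id"),
              }

          def transform(inputs):
              # Look up whichever DataFrames are needed by key instead of passing
              # a fixed number of positional DataFrame arguments around
              return inputs["transactions"].crossJoin(inputs["customers"])

          spark = SparkSession.builder.appName("dict-of-dataframes").getOrCreate()
          transform(extract(spark)).show()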

    • @omkarshinde4792
      @omkarshinde4792 7 days ago

      @@TheBigDataShow Got it. Thanks.

  • @AK-zs3we
    @AK-zs3we 10 days ago

    Very informative ! 🤝

  • @SillyLittleMe
    @SillyLittleMe 10 days ago

    Does anybody have any idea what this error means: DBFS file browser StorageContext com.databricks.backend.storage.StorageContextType$DbfsRoot$@5c512926 for workspace 3667672304132597 is not set in the CustomerStorageInfo. I am not sure what this means. For context, I had uploaded some files yesterday on DBFS for practice purposes. Those files are still available if I try to find them through notebooks, however, the DBFS tab can't show them and throws this error. Any help will be much appreciated! EDIT: It was a Databricks global issue, has been fixed now.

  • @kolodacool
    @kolodacool 11 days ago

    Hey Manoj, great session on data extraction via APIs. A few points I'd like to share from my experience working on this: 1) While dealing with huge volumes of data from a source, it's crucial to use pagination to iteratively collect all the data. 2) Admins who manage these endpoint URLs usually discourage making multiple API calls within a certain timeframe, which could cause a deadlock on your batch ID. As you suggested, either have the bulk data pulled all at once or optimize the framework.
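    A rough sketch of the pagination point (the endpoint URL, page parameters and empty-page stop condition are assumptions about a generic REST API, not a specific one):

        import requests

        def fetch_all_pages(base_url, page_size=100):
            # Collect every page from a paginated REST endpoint (generic sketch)
            records, page = [], 1
            while True:
                resp = requests.get(
                    base_url,
                    params={"page": page, "per_page": page_size},
                    timeout=30,
                )
                resp.raise_for_status()
                batch = resp.json()
                if not batch:        # an empty page means there is no more data
                    break
                records.extend(batch)
                page += 1
            return records

        # Hypothetical endpoint; real APIs often rate-limit, so add sleep/backoff between calls
        all_rows = fetch_all_pages("https://api.example.com/orders")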

  • @footballalchemist
    @footballalchemist 12 days ago

    Just completed this amazing project 😍 Can I add this to my portfolio?

    • @anandbagate2347
      @anandbagate2347 9 days ago

      Hello, I have watched the whole video and coded it, but I am getting some errors. Do you have the entire code, so I can make changes accordingly?

    • @TheBigDataShow
      @TheBigDataShow 9 days ago

      Please check the description

  • @DharmajiD
    @DharmajiD 12 days ago

    I see this when trying to upload files using the DBFS file browser: "Missing credentials to access AWS bucket".

  • @shouviksharma7621
    @shouviksharma7621 13 days ago

    Thanks for the content, really beneficial.

  • @AnjaliH-wo4hm
    @AnjaliH-wo4hm 14 days ago

    Good effort; however, please minimize the usage of words like "perfect" and "well & good" after every sentence.

  • @AkshayBaishander
    @AkshayBaishander 18 days ago

    Great explanation thanks

  • @shobhitsharma2137
    @shobhitsharma2137 19 days ago

    Subtitles are not fully visible.

  • @MrTejasreddy
    @MrTejasreddy 20 days ago

    Super, Manoj... thanks for your time. As you said, it's a really tough use case; if possible, do it with an easier dataset next time so that everyone can understand it easily. Keep going....👌

    • @manojt3164
      @manojt3164 19 days ago

      Thanks a lot for your kind words. Sure I'll keep that in mind

  • @DataEngineerPratik
    @DataEngineerPratik 20 days ago

    My doubt is: in Spark we have cache and persist, used to save the RDD. As per my understanding, cache and persist(MEMORY_AND_DISK) both perform the same action for DataFrames. If this is the case, why should I prefer using cache at all? I could always use persist (with different parameters) and ignore cache. Could you please let me know when to use cache, or whether my understanding is wrong?

    • @manojt3164
      @manojt3164 19 days ago

      That's a good question. With cache(), data is stored in memory in deserialized form; the same is true for persist(MEMORY_ONLY). cache() is just a synonym for calling persist() with the default storage level (StorageLevel.MEMORY_ONLY), which means the data is cached in memory and not written to disk. So both of them store the data in memory in deserialized format.
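      As a small, hedged sketch of that point (synthetic data; note also that for the DataFrame API in Spark 2.x+ the default level used by cache() is reported as MEMORY_AND_DISK, while MEMORY_ONLY is the RDD default):

          from pyspark import StorageLevel
          from pyspark.sql import SparkSession

          spark = SparkSession.builder.appName("cache-persist-levels").getOrCreate()

          cached = spark.range(1_000)          # synthetic data
          cached.cache()                       # persist() with the default storage level
          cached.count()                       # an action materialises the cache
          print(cached.storageLevel)           # shows which level was actually used

          persisted = spark.range(1_000)
          persisted.persist(StorageLevel.DISK_ONLY)   # an explicit choice only persist() offers
          persisted.count()
          print(persisted.storageLevel)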

    • @DataEngineerPratik
      @DataEngineerPratik 18 days ago

      @@manojt3164 ok manoj thanks

  • @RaviShankarPoosaRaviKumar
    @RaviShankarPoosaRaviKumar 22 days ago

    WITH tab1 AS (
        SELECT order_id,
               MAX(quantity) AS max_value,
               AVG(quantity) AS avg_quantity
        FROM orders
        GROUP BY order_id
    )
    SELECT order_id
    FROM tab1
    WHERE max_value > ALL (SELECT avg_quantity FROM tab1)

  • @shafimahmed7711
    @shafimahmed7711 22 days ago

    Thank you for your time and effort. It's not an easy job 👏👏👏

    • @TheBigDataShow
      @TheBigDataShow 4 days ago

      @@shafimahmed7711 Thank you for your kind words and for appreciating our efforts. Please share it within your network on LinkedIn and Twitter. It will motivate us to make more videos like this.

  • @akhilsingh3801
    @akhilsingh3801 23 days ago

    important content is Questions asked 😅😅

  • @mohinraffik5222
    @mohinraffik5222 26 days ago

    Appreciate your great effort in sharing your knowledge, brother! 👍

  • @rationalthinker3706
    @rationalthinker3706 28 days ago

    please add the dataset

    • @manojt7012
      @manojt7012 28 days ago

      drive.google.com/drive/folders/1bH-38DLQWu46m0asGyaTqyspwJiwsxvH?usp=drive_link

    • @TheBigDataShow
      @TheBigDataShow 28 days ago

      Kindly check in the description.

  • @Aman-lv2ee
    @Aman-lv2ee 28 days ago

    can you add the dataset link

    • @manojt7012
      @manojt7012 28 days ago

      Thanks for the response. drive.google.com/drive/folders/1bH-38DLQWu46m0asGyaTqyspwJiwsxvH?usp=drive_link

  • @manishkumartiwari420
    @manishkumartiwari420 29 days ago

    Can you please help us with the dataset?

  • @rationalthinker3706
    @rationalthinker3706 29 days ago

    awesome sir

    • @TheBigDataShow
      @TheBigDataShow 23 days ago

      Thank you for your kind words. Keep learning :)

  • @payalbhatia6927
    @payalbhatia6927 29 days ago

    Which pen tablet/device is used for the video? Can you please share?

  • @AshishDukare-vr6xb
    @AshishDukare-vr6xb 1 month ago

    WITH cte AS (
        SELECT order_id, AVG(quantity) AS avg_quantity
        FROM t1
        GROUP BY order_id
    )
    SELECT DISTINCT t1.order_id
    FROM t1
    JOIN cte ON t1.order_id = cte.order_id
    WHERE t1.quantity > cte.avg_quantity;

  • @AshishDukare-vr6xb
    @AshishDukare-vr6xb 1 month ago

    How come the project discussion is happening in the first round? Usually, they ask Python and SQL questions in the first round to check the basic foundation. Correct me if I am wrong here.

  • @AshishDukare-vr6xb
    @AshishDukare-vr6xb 1 month ago

    Don't you think his intro was too long, and the interviewer had to cut him off in between to get to the questions quickly?

  • @atharvagaikwad9619
    @atharvagaikwad9619 1 month ago

    Why would you set up a Spark session when you already get one?

    • @manojt7012
      @manojt7012 28 days ago

      That's right. In a notebook, the Spark session is already created, so it could have been written without creating the SparkSession explicitly.

  • @maazahmedansari4334
    @maazahmedansari4334 1 month ago

    I replied to my previous question but it seems it is not visible, so I am posting it again. Getting this in the first pipeline: AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'. Any suggestion would be appreciated. Thank you. Please find the code I am trying to follow along with here: github.com/maaz-ahmed-ansari/apple-product-analysis/tree/main

    • @maazahmedansari4334
      @maazahmedansari4334 27 days ago

      The 2nd pipeline is working as expected. Still racking my brain over the 1st pipeline. Can someone suggest how to resolve the above error?

  • @maazahmedansari4334
    @maazahmedansari4334 1 month ago

    Getting this in the first pipeline: AnalysisException: Failed to merge fields 'customer_id' and 'customer_id'. Any suggestion would be appreciated. Thank you.

    • @TheBigDataShow
      @TheBigDataShow 1 month ago

      @@maazahmedansari4334 Please share some more of your code snippets for debugging. Have you created a GitHub repo for the same?

  • @yashbhosle3582
    @yashbhosle3582 1 month ago

    SELECT name,
           department_name,
           MAX(DATEDIFF(promotion_date, hire_date)) AS longest_time
    FROM employee
    JOIN department ON employee.dept_id = department.dept_id
    JOIN promotion ON employee.employee_id = promotion.employee_id
    GROUP BY name, department_name
    ORDER BY longest_time DESC;

  • @mufaddalrampurawala247
    @mufaddalrampurawala247 1 month ago

    This also increases the size of the second dataset, since we explode it. So is it still an optimization, given that the data scanned will increase a lot and a lot of shuffle will be involved?

    • @nishabansal2978
      @nishabansal2978 1 month ago

      While salting can increase the data size and shuffle overhead in Spark, its benefits in mitigating data skew and improving workload distribution often outweigh these drawbacks. The other important thing is to decide on the salting factor to use for your workload, as that will again impact the overall distribution.

  • @TheBigDataShow
    @TheBigDataShow 1 month ago

    A practical demonstration will be released tomorrow. Kindly watch this video to understand the theory in depth.

  • @rationalthinker3706
    @rationalthinker3706 1 month ago

    Thank you , waiting

    • @TheBigDataShow
      @TheBigDataShow 1 month ago

      A practical demonstration will be released tomorrow. Kindly watch this video to understand the theory in depth.

  • @arpanmitra1994
    @arpanmitra1994 1 month ago

    nums = [0, 0, 2, 3, 3, 3, 3, 5, 5]
    k = 2
    new_nums = []
    for i in nums:
        if nums.count(i) == k:
            if i not in new_nums:
                new_nums.append(i)
    print(new_nums)