Questions tagged [databricks]

For questions about Databricks, a unified analytics platform for use with the open source Apache Spark project.

0
votes
1 answer
4 views

Reset hadoop aws keys to upload to another s3 bucket under different username

Sorry for the horrible question title, but here is my scenario: I have a PySpark Databricks notebook in which I am loading other notebooks. One of these notebooks sets some Redshift configuration for ...
0
votes
1 answer
28 views

Import Data from Oracle using Spark

In Databricks I am using the following code to extract data from Oracle. %scala val empDF = spark.read .format("jdbc") .option("url", "jdbc:oracle:thin:username/password//hostname:port/sid")...
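A minimal PySpark sketch of the same JDBC read; the host, port, SID, table, and credentials below are placeholders, not values from the question:

    empDF = (spark.read
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/orclsid")  # hypothetical connection string
        .option("dbtable", "HR.EMPLOYEES")                         # hypothetical table
        .option("user", "username")
        .option("password", "password")
        .option("driver", "oracle.jdbc.driver.OracleDriver")
        .load())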
2
votes
2 answers
38 views

Cannot create a Table in Microsoft Azure Databricks based on a Microsoft Azure SQL Database Table

I want to connect to a Microsoft Azure SQL Server and a Microsoft Azure SQL Database from my Microsoft Azure Databricks notebook and do a SELECT and an INSERT. Let's assume I have a Microsoft SQL Server ...
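A minimal PySpark sketch of a JDBC round trip against Azure SQL Database; the server, database, tables, and credentials are placeholders:

    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # hypothetical
    props = {"user": "username", "password": "password",
             "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

    df = spark.read.jdbc(url=jdbc_url, table="dbo.MyTable", properties=props)             # SELECT
    df.write.jdbc(url=jdbc_url, table="dbo.MyTableCopy", mode="append", properties=props)  # INSERT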
1
vote
0 answers
24 views

DATEDIFF in Spark SQL

I am new to Spark SQL. We are migrating data from SQL Server to Databricks, and I am using Spark SQL. Can you please suggest how to achieve the functionality below in Spark SQL for the following date functions? ...
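Spark SQL ships built-in date functions that cover most T-SQL DATEDIFF cases; a sketch against a hypothetical my_dates table:

    diff_df = spark.sql("""
        SELECT datediff(end_date, start_date)       AS diff_days,    -- whole days between the dates
               months_between(end_date, start_date) AS diff_months,  -- fractional months
               year(end_date) - year(start_date)    AS diff_years
        FROM my_dates
    """)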
1
vote
2 answers
31 views

Unzip multiple *.gz files and make one CSV file in Spark Scala

I have multiple files in an S3 bucket and have to unzip these files and merge them into a single CSV file with a single header. All files contain the same header. The data files look like ...
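Spark decompresses .gz input transparently, so one common approach is to read all the files at once and write a single-partition CSV; the bucket and paths are placeholders:

    df = (spark.read
          .option("header", "true")
          .csv("s3a://my-bucket/input/*.gz"))      # hypothetical path; .gz is decompressed automatically

    (df.coalesce(1)                                # one output file; pulls all data to a single partition
       .write
       .option("header", "true")
       .csv("s3a://my-bucket/merged_output"))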
1
vote
0 answers
11 views

CosmosDB Gremlin: Resource with specified id or name already exists

I would like to create a graph using the Gremlin API in Cosmos DB. I'm getting data from a PySpark dataframe in Databricks. I have several IDs with filenames. The filenames are different although ...
0
votes
0 answers
33 views

Azure Databricks write JSON Data to Parquet file throws error: TypeError: Can not infer schema for type

In a Microsoft Azure Databricks notebook, I am downloading the following data from a web service with Python: { "Customers" : [ { "CustomID" : "106219-891457", "...
4
votes
1 answer
64 views

Pyspark SQL Pandas UDF: Returning an array

I'm trying to make a pandas UDF that takes in two columns with integer values and, based on the difference between these values, returns an array of decimals whose length is equal to the aforementioned ...
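A sketch of a scalar pandas UDF with an ArrayType return, where each output element is a list whose length is the difference of the two inputs; the column names are hypothetical:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, DoubleType

    @pandas_udf(ArrayType(DoubleType()), PandasUDFType.SCALAR)
    def range_between(start, end):
        # one list per row, length = end - start
        return pd.Series([[float(v) for v in range(s, e)] for s, e in zip(start, end)])

    df = spark.createDataFrame([(1, 4), (2, 5)], ["start", "end"])
    df.withColumn("vals", range_between("start", "end")).show(truncate=False)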
1
vote
0 answers
29 views

Connect to Azure SQL Data Warehouse from Azure Databricks Notebook

I am using [this link][1] to set up my Databricks notebook to connect to Azure SQL Data Warehouse. I'm trying to run a SQL query in the notebook but am getting an error at com.databricks.spark.sqldw.Utils$....
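For reference, a sketch of the documented read pattern for the com.databricks.spark.sqldw connector; the server, database, and staging container below are placeholders:

    df = (spark.read
        .format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")  # hypothetical
        .option("tempDir", "wasbs://temp@mystorage.blob.core.windows.net/tmp")               # hypothetical staging dir
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("query", "SELECT TOP 10 * FROM dbo.MyTable")
        .load())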
1
vote
1 answer
15 views

How can I stop a Databricks notebook referencing old versions of my egg files?

On Databricks on Azure, I follow these steps: create a library from a Python egg, say simon_1_001.egg, which contains a module simon; attach the library to a cluster and restart the cluster; attach a ...
0
votes
0 answers
19 views

Structured Streaming - access Azure Blob storage contents from blob names

I have a structured streaming dataframe of Azure file names and want to read the blob contents into a new dataframe. How can I do this? def somefun(row): d = spark.read.option(row['blob']) query ...
1
vote
1 answer
56 views

Storing data to database in PySpark (Azure - Databricks) is very slow

I am working on a big dataset which has around 6000 million records, and I have performed all calculations/operations successfully. At the end, while storing the data to the Databricks (DBFS) database using ...
-1
votes
0 answers
13 views

Azure Databricks writeStream is not working

I am using Azure Databricks. When I tried to write streaming data into blob storage after sentiment analysis, I got the error message 'job aborted'. Below is the error message; can you please advise? import org....
1
vote
1 answer
37 views

Cannot install Python packages through PyPI in Azure Databricks

I want to call a web service from a Databricks notebook through Python. The library needed for this seems to be http.client. I have already found a code snippet to test this, but when I try to execute ...
-2
votes
0 answers
39 views

Spark to handle optional XML tags using Databricks

I am trying to parse an XML file using Databricks in Spark. I am searching for a solution to parse XML in which some of the tags are optional (they may or may not appear under their parent tag). I tried ...
2
votes
1 answer
54 views

Azure Databricks vs ADLA for processing

Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing would be running jobs on these files to extract various ...
0
votes
1 answer
33 views

Connect to Blob storage “no credentials found for them in the configuration”

I'm working with a Databricks notebook backed by a Spark cluster and am having trouble connecting to Azure Blob storage. I used this link and tried the section Access Azure Blob Storage Directly - ...
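A sketch of the direct-access configuration with a hypothetical storage account, container, and secret scope:

    spark.conf.set(
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net",
        dbutils.secrets.get(scope="my-scope", key="storage-key"))  # or paste the raw account key

    df = spark.read.csv(
        "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data.csv",
        header=True)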
0
votes
0 answers
22 views

How to read a properties file in Databricks in Scala without using spark-shell? [duplicate]

I am able to read it as an RDD with val rdd = sc.textFile("dbfs:/mnt/abc/XYZ.properties"), but the file has very little data and I don't want to read it as an RDD, as that would split the data across different nodes, so ...
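One way to avoid an RDD entirely is the /dbfs FUSE mount, which exposes DBFS to plain driver-side file I/O; a Python sketch of the idea (in Scala, java.util.Properties over a FileInputStream on the same /dbfs path works similarly):

    props = {}
    with open("/dbfs/mnt/abc/XYZ.properties") as f:   # /dbfs mirrors dbfs:/
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()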
0
votes
0 answers
10 views

Failed to write stream data from Event Hub to blob using Azure Databricks

I tried to write stream data from Event Hub into a blob on Azure Databricks. It failed after some time. Below is the error message. Can you please advise on the cause? val query = streamingDataFrame ....
0
votes
0 answers
25 views

XML to DataFrame using PySpark

I am trying to scrape an XML file and create a dataframe from the tags in the XML file. I am working on Databricks using PySpark. XML file: <?xml version="1.0" encoding="UTF-8"?> <note> <...
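The spark-xml library (com.databricks:spark-xml, attached to the cluster) reads XML straight into a dataframe; rowTag matches the question's sample, the path is hypothetical:

    df = (spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "note")              # repeating element that becomes one row
          .load("dbfs:/mnt/data/notes.xml"))     # hypothetical path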
-1
votes
0 answers
19 views

Download URL for Databricks /FileStore contents not working. Will the CLI work if I use Community Edition?

I am experimenting with the Databricks cloud-deployed Spark service. I created some data and would like to download it to my machine rather than lose it. This post: Databricks: Download a dbfs:/...
0
votes
0 answers
13 views

How to reference same column with Databricks Spark Lead Windows function

I have a dataframe with a column that is only populated for the first row in a sequence of rows: +----------------+----------+----------+ | Activity | date| StartID| +----------------+-----...
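One common way to carry the first StartID down the sequence is last(..., ignorenulls=True) over a running window rather than lead; a sketch assuming the rows are ordered by date:

    from pyspark.sql import Window, functions as F

    w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
    df = df.withColumn("StartID_filled",
                       F.last("StartID", ignorenulls=True).over(w))  # forward-fill the first value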
0
votes
0 answers
24 views

Print an R dataframe with format in databricks

I have the following code in R: (taken from https://cran.r-project.org/web/packages/formattable/vignettes/formattable-data-frame.html) library(formattable) sign_formatter <- formatter("span", ...
1
vote
3 answers
52 views

How can I use NiFi to read/write directly from ADLS without HDInsight

We would like to use NiFi to connect with ADLS (using PutHDFS and FetchHDFS) without having to install HDInsight. Subsequently we want to use Azure Databricks to run Spark jobs, and we are hoping that it ...
0
votes
0 answers
10 views

Lime Framework Explanations

I am facing a strange issue when running the LIME framework. When I run it on a single node, my results are drastically different from those I get running on parallel nodes on Spark. I am trying to run ...
1
vote
1 answer
57 views

Can I use Hive on Azure Databricks without Hadoop/HDInsight?

The docs say "Every Databricks deployment has a central Hive metastore..." besides an external metastore for existing Hive installations. I have an Azure Databricks workspace with an underlying ...
0
votes
0 answers
11 views

Why is the training and test data split different in a VM with 1 CPU and 2 CPUs?

I am working on a Cloudera VM using only 2 CPUs for one of my projects, and found that when I used randomSplit([0.8, 0.2], seed=13234) to generate training and test data, I got an output of ...
0
votes
0 answers
17 views

Accessing the AWS Athena service from Databricks using the Athena JDBC driver (Simba JDBC jar)

I created a Java application to connect to Athena using the AthenaJDBC jar (v4.2) and am running that jar from a Databricks notebook to execute queries. It works fine, but I need to pass the IAM user ...
1
vote
0 answers
22 views

Azure Databricks SparkException: Job aborted due to stage failure [closed]

I am using Azure Databricks to write streaming data into Azure Blob storage. The script is as follows. However, I got the error message "Job aborted due to stage failure". Please see below for the ...
0
votes
1 answer
48 views

Merging two parquet files with different schemas

I have two parquet files, Parquet A has 137 columns and Parquet B has 110 columns. Parquet A file has the entire history of the table. So Parquet A has all the fields for the entire history of the ...
0
votes
0 answers
4 views

Training a model on Databricks using scipy sparse arrays

I have a pickled file that contains a scipy sparse array X and the labels y. How do I train a logistic regression model on it using Azure Databricks?
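scikit-learn's LogisticRegression accepts scipy sparse matrices directly, so one option is to unpickle and train on the driver node; the path is hypothetical:

    import pickle
    from sklearn.linear_model import LogisticRegression

    with open("/dbfs/mnt/models/data.pkl", "rb") as f:  # /dbfs FUSE path to the pickled file
        X, y = pickle.load(f)

    model = LogisticRegression().fit(X, y)   # runs on the driver only, not distributed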
0
votes
0 answers
28 views

Wordcloud using matplotlib is not showing

For my code, please see below: # tf-idf words word cloud import matplotlib.pyplot as plt from wordcloud import WordCloud import pandas as pd tf = pd.DataFrame(columns=['word']) tf['word'] = ['...
1
vote
0 answers
28 views

Consume Secure Kafka from databricks spark cluster

I am trying to consume from a secure Kafka topic (using SASL_PLAINTEXT, SCRAM login method). Spark version 2.3.1, Scala 2.11, latest Kafka. I am using Spark Structured Streaming to construct the ...
0
votes
0 answers
22 views

How to unit test a function in Python which sends email through AWS SES

I want to perform unit testing on an email-sending function in Python. It uses AWS SES to send the email through Databricks. I spent many hours exploring Python libraries like mock and moto ...
1
vote
1 answer
35 views

How to kill job in Databricks

I have a long-running job, and if certain conditions are met, I would like to kill the job. This is traditionally done in Python like: if some_condition: exit('job failed!') This works on ...
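On Databricks, the documented way to end a notebook run early is dbutils.notebook.exit; a sketch:

    if some_condition:
        dbutils.notebook.exit("job failed!")  # stops the run and returns this string to any caller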
0
votes
1 answer
24 views

Databricks Delta Update

How can we update multiple records in a table from another table using Databricks Delta? I want to achieve something like: update ExistingTable set IsQualified = updates.IsQualified from updates ...
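Databricks Delta supports exactly this through MERGE INTO; a sketch assuming a hypothetical join key Id:

    spark.sql("""
        MERGE INTO ExistingTable AS target
        USING updates
        ON target.Id = updates.Id
        WHEN MATCHED THEN
          UPDATE SET target.IsQualified = updates.IsQualified
    """)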
0
votes
0 answers
8 views

Not able to access an outside variable in a PySpark UDF [duplicate]

I have a dataframe which has N columns. I am iterating over all the columns because I want to derive a new column from each of them. To create the new column I need to pass two additional external variables ...
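A common workaround is to create the UDF inside a closure so the external values are captured at definition time; the variable and column names here are hypothetical:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    def make_udf(extra1, extra2):
        # extra1/extra2 are baked into the returned UDF
        return F.udf(lambda v: "{}-{}-{}".format(v, extra1, extra2), StringType())

    for c in df.columns:
        df = df.withColumn(c + "_derived", make_udf("a", "b")(F.col(c)))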
2
votes
0 answers
36 views

Spark - Mixed case sensitivity in Spark DataFrame, Spark SQL, and/or Databricks Table

I have data from SQL Server that I need to manipulate in Apache Spark (Databricks). In SQL Server, three of this table's key columns use a case-sensitive COLLATION option, so that these specific ...
1
vote
0 answers
20 views

Partially update Document with Pyspark in CosmosDB with MongoDB API

I'm using Azure Databricks with PySpark and a Cosmos DB with the MongoDB API. The following PySpark command is being used to store a data_frame in Cosmos DB, and it works fine: def storeCollection(...
0
votes
1 answer
22 views

Spark Scala FPGrowth without any results?

I'm trying to get some frequent item sets and association rules out of Spark MLlib using Scala. But I actually don't get anything, not even an error. The code (a Spark/Databricks notebook) and the data ...
2
votes
0 answers
24 views

Replace main Jar in existing Spark Job in Databricks

I'm trying to replace the task jar on an existing Spark job in Databricks via the Databricks REST API or the Databricks CLI (which internally uses the REST API). I have been going through the ...
0
votes
0 answers
34 views

.csv not a SequenceFile Failed with exception java.io.IOException:java.io.IOException

While creating an external table with a partition in Hive using Spark in CSV format (com.databricks.spark.csv), it works fine, but I am not able to open the table created in Hive, which is in .csv format, from ...
0
votes
0 answers
10 views

'module' object has no attribute '41531' Python Notebook Databricks error

I am trying to run a simple unit test in a Python notebook in Azure Databricks. import unittest class KnownValues(unittest.TestCase): known_values = ((1, 'I'), (2, 'II'), ...
0
votes
2 answers
62 views

Intersection of 2 dataframes with count in PySpark Databricks

I want the intersection value of two dataframes' columns on a unique_ID match, storing the intersection value in new_column-1, and I also want the count of the intersecting data in new_column_3. The dataframes I have given ...
2
votes
1 answer
102 views

Possible to handle multi-character delimiter in Spark [duplicate]

I have [~] as my delimiter for some CSV files I am reading. 1[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~] I have tried this val rddFile = sc.textFile("file.csv") val rddTransformed = rddFile....
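In older Spark versions the CSV reader's sep option is a single character, so one approach is to read the lines as text and split on the escaped pattern; the column count matches the sample row:

    from pyspark.sql import functions as F

    raw = spark.read.text("file.csv")
    parts = F.split(F.col("value"), r"\[~\]")          # regex-escaped [~] delimiter
    df = raw.select(*[parts.getItem(i).alias("c" + str(i)) for i in range(14)])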
1
vote
1 answer
34 views

On Azure Databricks how can I tell which blob store is mounted

I have inherited a notebook which writes to mounted Azure blob storage, using the syntax: instrumentDf.write.json('/mnt/blobdata/cosmosdata/instrumentjson') How can I find the name of the Azure blob ...
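dbutils.fs.mounts() lists every mount point together with its backing source URL, which includes the storage account and container:

    for m in dbutils.fs.mounts():
        print(m.mountPoint, "->", m.source)   # look for the /mnt/blobdata entry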
-1
votes
0 answers
7 views

How can we upload a folder to SharePoint from Python or Databricks?

I have a few folders containing files to upload to SharePoint. I am using Databricks for this, and the language I am using is Python. So, in short, how can I upload a folder to SharePoint through ...
1
vote
0 answers
19 views

Executing multiple Pyspark scripts in parallel

How can I initiate execution of multiple PySpark scripts from one notebook, in parallel? Note: I'm currently using Azure Databricks (enterprise edition).
-1
votes
0 answers
49 views

PicklingError: Could not serialize object: TypeError: 'JavaPackage' object is not callable

from pyspark.sql.types import * from pyspark.sql.functions import udf def lookup_skill_pos(pos_id): match = (position_skill_sql['POSITION_ID'] == pos_id) skills = ...
0
votes
1 answer
25 views

PySpark: aggregating every n rows

I am new to PySpark and am trying to recreate code I wrote in Python. I am trying to create a new dataframe that has the averages of every 60 observations from the old dataframe. Here is the code I ...
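A sketch of one way to do it: number the rows, bucket them in groups of 60, and average per bucket; the ordering and value columns are hypothetical:

    from pyspark.sql import Window, functions as F

    w = Window.orderBy("timestamp")                      # hypothetical ordering column
    df_avg = (df
        .withColumn("rn", F.row_number().over(w) - 1)
        .withColumn("grp", (F.col("rn") / 60).cast("int"))   # 0-based bucket of 60 rows
        .groupBy("grp")
        .agg(F.avg("value").alias("avg_value"))          # hypothetical value column
        .orderBy("grp"))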